Technology / Models & Benchmarks

Gemini 2.5, Claude 3.7, and GPT-4.1: I Ran 100 Code-Generation Prompts Against All Three. Here Are the Results.
Jun 13, 2026 · Article
I got tired of reading vendor benchmarks. I ran 100 real code-generation tasks from my own work — 50 Python, 50 TypeScri
Technology / Models & Benchmarks
The MMLU Trap: Why Your Benchmark-Topping Model Is Failing in Production
May 10, 2026 · Article
A Fortune 100 insurer selected a model ranked first on MMLU for an adjudication assistant, and within six weeks p95 late
Technology / Models & Benchmarks
Fine-Tuning vs Prompting in 2026: I Tried Both on the Same Real Product Feature
Apr 16, 2026 · Article
I took one concrete feature — a Git-style commit message generator — and implemented it three ways: pure prompting, few-
dev poc benchmark
Technology / Models & Benchmarks
Stop Treating Leaderboards as Architecture Guidance: Designing Evaluation for Your Own Stack
Apr 14, 2026 · Article
A team blindly chose the top model from public leaderboards and watched latency, cost, and quality collapse in productio
architecture models benchmarks

Gemini 2.5, Claude 3.7, and GPT-4.1: I Ran 100 Code-Generation Prompts Against All Three. Here Are the Results.