The meeting is always the same. Someone proposes an AI assistant for intent classification, entity extraction, document routing. Tasks that are bounded, well-defined, deterministic.
The solution? The newest frontier model. Sonnets and GPTs of the day. Whatever made headlines that week.
It happened again recently. The use case was straightforward: route requests, extract fields, summarize. The proposed solution was a top-tier model for every call.
When someone suggested smaller models with verification, the response was predictable: “we want the best.”
“Best” meant most expensive. The tasks didn’t need it.
## The Golden Hammer
There’s a name for this in software engineering: the Golden Hammer antipattern. When you’ve got a powerful tool you trust, you reach for it first. The question isn’t “what does this task need?” — it’s “how do I make my tool fit this problem?”
Enterprise AI meetings follow the pattern. A new model launches with impressive benchmarks. Someone proposes using it. The discussion focuses on capabilities, not requirements.
We’ve all been in these meetings. We’ve all seen the splash headlines. We’ve all watched the default become “start with the most powerful option.”
## The Pattern Has Numbers
Stanford’s FrugalGPT paper showed something most companies miss: on question-answering benchmarks, cascade routing — trying small models first, escalating to frontier only when needed — achieved up to 98% cost reduction while matching GPT-4 quality (source).
LeanLM’s analysis puts the waste at 50–90% of enterprise LLM inference spend (source). Not future spend. Current spend. Money already being burned.
The Cake.ai team frames it cleanly: frontier models are overkill for most enterprise workloads (source).
| Task | Appropriate Model | What Companies Use |
|---|---|---|
| Intent classification | 7B-13B fine-tuned | GPT-4 / Claude Opus |
| Entity extraction | Small model + validation | Frontier model |
| Document routing | Rule-based or 7B | Frontier model |
| Creative synthesis | Frontier model | Frontier model ✓ |
The last row is the only one where frontier earns its cost. Everything above it is paying premium prices for commodity inference.
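The cascade idea behind the FrugalGPT result is simple enough to sketch. This is a minimal illustration, not FrugalGPT’s code: the tier names, the prices, and the confidence heuristic are placeholder assumptions you’d replace with your own.

```python
# Cascade routing sketch: try a small model first, escalate to the frontier
# model only when the cheap answer fails a confidence check.
# The tiers, prices, and call_model() helper are illustrative assumptions,
# not any particular vendor's API or FrugalGPT's implementation.

from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    cost_per_million_output: float  # USD per million output tokens (illustrative)

SMALL = Tier("small-7b-finetuned", 0.40)
FRONTIER = Tier("frontier-model", 15.00)

def call_model(tier: Tier, prompt: str) -> tuple[str, float]:
    """Placeholder for a real inference call.

    Returns (answer, confidence). In practice, confidence might come from
    log-probs, a self-check prompt, or whether the output parses against a schema.
    """
    raise NotImplementedError

def cascade(prompt: str, threshold: float = 0.85) -> str:
    answer, confidence = call_model(SMALL, prompt)
    if confidence >= threshold:
        return answer            # most requests stop here, at ~$0.40/M tokens
    answer, _ = call_model(FRONTIER, prompt)
    return answer                # only the hard residue pays ~$15/M tokens
```

The threshold is the part that deserves real engineering: derive it from an eval set of your own traffic rather than intuition, and the escalation rate becomes a dial you can measure.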
## Why This Happens
The overspend isn’t irrational. There are real reasons teams default to frontier:
Evaluation is hard. Knowing whether a smaller model is “good enough” requires building evals, collecting representative data, running comparisons. That’s engineering overhead that doesn’t ship features.
Cost is invisible until it’s not. At early scale, LLM spend rounds to zero. The problem becomes visible at $10K/month, and by then the patterns are baked into production code.
Risk asymmetry. A quality regression from switching models is visible and blamed on the engineer who made the change. A 3x higher-than-necessary cost is invisible and blamed on “AI being expensive.” The incentives favor over-modeling.
The result: companies ship fast by reaching for the most capable model. The first call works. The pattern sticks. Six months later, they’re routing classification tasks through a model that costs $15 per million output tokens when a purpose-built alternative would cost $0.40 and produce identical results.
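To put rough numbers on that gap, assume a hypothetical 50 million output tokens of classification traffic per month (the volume is an assumption; the per-token prices are the ones above):

```python
# Back-of-envelope monthly spend at an assumed 50M output tokens/month.
tokens_millions = 50                       # assumed monthly volume
frontier_cost = 15.00 * tokens_millions    # $750/month at $15 per 1M output tokens
small_cost = 0.40 * tokens_millions        # $20/month at $0.40 per 1M output tokens
print(frontier_cost, small_cost)           # 750.0 vs 20.0: a 37.5x gap on identical traffic
```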
## The Multi-Agent Alternative
My suggestion in the meeting wasn’t “use smaller models and hope.” It was “use smaller models and verify.”
The adversarial pattern I’ve written about before applies here too. Judge, Challenger, Fact-Checker. Three agents running on smaller models, each validating the other’s output.
What this costs: The exact math depends on your task and tokens, but the pattern holds: three smaller model calls often cost a fraction of one frontier call. If a 7B model runs at $0.40/M output tokens and a frontier model at $15/M, even tripling the calls leaves you ahead.
What this buys: Verification. Each agent checks the other’s work. The Judge produces an assessment. The Challenger finds gaps. The Fact-Checker verifies claims. A human synthesizes.
You’re not hoping the model gets it right. You’re designing a system that catches when it doesn’t. Whether this beats frontier quality depends on your task — but for structured work with clear success criteria, the verification pattern often produces more reliable outputs than a single frontier call.
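Here is roughly how the three roles could be wired on small models. The role prompts and the call_small_model helper are hypothetical sketches of the pattern, not code from the earlier pieces:

```python
# Adversarial-review sketch: three small-model calls, each checking the work
# from a different angle, with a human (or a final rule) synthesizing.
# call_small_model() is a hypothetical helper around whichever 7B-13B model
# you deploy; the role prompts are illustrative, not a prescribed template.

def call_small_model(system: str, user: str) -> str:
    """Placeholder for one small-model inference call."""
    raise NotImplementedError

def adversarial_review(task_output: str, source_text: str) -> dict:
    judge = call_small_model(
        "You are the Judge. Assess whether the output answers the task "
        "and follows the required format.",
        task_output,
    )
    challenger = call_small_model(
        "You are the Challenger. List gaps, missing cases, and weak claims "
        "in the output. Be adversarial.",
        task_output,
    )
    fact_check = call_small_model(
        "You are the Fact-Checker. Flag any claim in the output that is not "
        "supported by the source text.",
        f"OUTPUT:\n{task_output}\n\nSOURCE:\n{source_text}",
    )
    # A human synthesizes the three assessments before anything ships.
    return {"judge": judge, "challenger": challenger, "fact_checker": fact_check}
```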
## Where Frontier Earns Its Cost
Not everything should run on small models. The research is clear about which tasks justify frontier pricing:
| Task | Why Frontier |
|---|---|
| Multi-step reasoning with ambiguity | Requires working through incomplete inputs |
| Synthesis requiring broad knowledge | Need the training data breadth |
| Creative generation where novelty matters | Smaller models regress to averages |
| Problems where correctness can’t be defined upfront | Need the model to figure it out |
Most enterprise AI pipelines have one task like this. Maybe two. The rest — classification, extraction, routing, summarization — are pattern recognition with clear success criteria. Smaller models don’t just match frontier quality on these. They sometimes exceed it, because frontier models introduce unnecessary “creativity” on deterministic tasks.
## The Production Evidence
This isn’t theoretical. Companies running smaller models in production:
| Company | Approach | Result |
|---|---|---|
| Checkr | Llama-3-8B fine-tuned (replaced GPT-4) | 5× cost reduction, 30× faster (source) |
| E-commerce unicorn | Mistral-7B fine-tuned (via Airtrain) | 94% cost reduction, improved accuracy (source) |
| Convirza | LoRA-fine-tuned Llama-3-8B | 10× cost reduction vs OpenAI, +8% F1 (source) |
The e-commerce case is instructive. Product categorization — structured, bounded, clear success criteria. A fine-tuned 7B model improved accuracy from 47% to 94% while cutting costs dramatically compared to GPT-4.
## The Fix Isn’t Hard
The companies seeing results follow a pattern:
1. Classify your tasks. Which are pattern recognition? Which need reasoning breadth?
2. Start small. Try a 7B or 13B model on each task. Measure quality, not just cost.
3. Add verification. Multi-agent patterns or explicit validation steps catch hallucinations.
4. Route by complexity. Use model routing (RouteLLM, semantic caching) to escalate to frontier only when needed.
5. Measure. Track task type, latency, retries, cost per call. You can’t optimize what you can’t see (a sketch of one such per-call record follows below).
The first three steps cost almost nothing. The savings compound fast.
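For step 5, here is a minimal sketch of the per-call record worth keeping. The field names, the price table, and the crude token count are illustrative assumptions; the point is that every call leaves behind enough data to break spend down by task type.

```python
# Per-call measurement sketch: log task type, model, latency, retries, and an
# estimated cost, so spend can be broken down by task type later.
# The field names, price table, and word-count token estimate are illustrative.

import time
from dataclasses import dataclass, asdict

PRICE_PER_MILLION_OUTPUT = {"small-7b": 0.40, "frontier": 15.00}  # USD, illustrative

CALL_LOG: list[dict] = []

@dataclass
class CallRecord:
    task_type: str        # e.g. "intent_classification", "document_routing"
    model: str
    latency_s: float
    retries: int
    output_tokens: int
    est_cost_usd: float

def record_call(task_type: str, model: str, infer, prompt: str, retries: int = 0) -> str:
    start = time.monotonic()
    output = infer(prompt)                     # the actual inference call
    latency = time.monotonic() - start
    out_tokens = len(output.split())           # crude stand-in for a real token count
    cost = PRICE_PER_MILLION_OUTPUT[model] * out_tokens / 1_000_000
    CALL_LOG.append(asdict(CallRecord(task_type, model, latency, retries, out_tokens, cost)))
    return output
```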
## The Takeaway
Frontier models are incredible. They’re also overkill for most of what enterprise teams use them for.
The pattern I’ve seen across domains is the same here: we reach for the most powerful tool because evaluating whether we need it is harder than just using it. The cost shows up later, in the invoice, and by then it’s someone else’s problem.
Same pattern. Different story.
The companies getting this right aren’t being cheap. They’re being precise. They use frontier models for frontier tasks. Everything else gets what it needs.
Series: This is part of the pattern recognition series. See also:
- I See Patterns — the meta-pattern
- From Vinny’s Courtroom to Editor’s Desk — adversarial review in document review
The Golden Hammer has a way of making every problem look like it needs the same solution. In enterprise AI, that solution is increasingly expensive.