Why Model Selection Matters More Than You Think
Choosing the wrong AI model for your production system is expensive in ways that do not show up immediately. You pick a model, build your application around it, tune your prompts, and ship. Six months later you realize the model is too expensive for your margins, too slow for your users, or too unreliable for your use case. Migrating to a different model means rewriting prompts, re-evaluating quality, and potentially redesigning parts of your architecture.
The evaluation process I describe here takes a week of focused work. That week saves months of technical debt and thousands in unnecessary costs. I have been through this process multiple times and these are the evaluation criteria that actually matter.
The Evaluation Framework
Most model evaluations focus exclusively on quality. Quality matters, but it is one of five dimensions you need to evaluate:
- Quality — Does the model produce good outputs for your specific task?
- Cost — What does it cost per request and how does that scale?
- Latency — How fast does the model respond, including time to first token?
- Reliability — How often does the model fail, timeout, or produce unusable output?
- Operability — How easy is the model to deploy, monitor, and maintain?
Evaluating all five dimensions prevents the common mistake of optimizing for one dimension at the expense of others.
Dimension 1: Quality Evaluation
Build a Test Suite, Not a Vibe Check
The biggest mistake in model evaluation is judging quality by running a handful of prompts and eyeballing the results. This is a vibe check. You need a test suite.
Step 1: Collect representative inputs. Gather one hundred to two hundred inputs that represent the actual distribution your system will see in production. Include:
- Common cases (the bread and butter of your application)
- Edge cases (unusual inputs that still need correct handling)
- Adversarial cases (inputs designed to confuse or break the model)
- Different lengths, formats, and complexity levels
Step 2: Define expected outputs. For each input, define what a good output looks like. This can be:
- An exact expected answer (for factual questions)
- A set of criteria the output must meet (for generative tasks)
- A reference output that serves as the quality standard
Step 3: Automate scoring. Create automated evaluation metrics:
- Exact match for factual tasks
- Format compliance for structured output tasks
- Semantic similarity for open-ended tasks
- LLM-as-judge for quality assessment (use a stronger model to evaluate outputs)
Step 4: Run every model against the same test suite. This gives you a comparable quality score across models.
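The four steps above can be sketched as a small test-suite runner. This is an illustrative skeleton, not a real harness: `call_model` is a placeholder for whatever client wraps your provider's API, and exact match is just one of the scoring functions you would plug in.

```python
# Sketch of a test-suite runner. call_model(model, prompt) is a stand-in
# for your real API client; exact_match is one example scorer.

def exact_match(output: str, expected: str) -> float:
    """Score 1.0 only if the output matches the expected answer exactly."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_suite(models, test_cases, call_model):
    """Run every model against the same cases; return {model: mean score}."""
    scores = {}
    for model in models:
        case_scores = [
            exact_match(call_model(model, case["input"]), case["expected"])
            for case in test_cases
        ]
        scores[model] = sum(case_scores) / len(case_scores)
    return scores

# Stub standing in for a real API call, so the sketch runs end to end:
def fake_call_model(model, prompt):
    return "4" if prompt == "2 + 2 = ?" else "unknown"

suite = [
    {"input": "2 + 2 = ?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
]
result = run_suite(["model-a"], suite, fake_call_model)
```

Because every model runs against the identical `suite` with the identical scorers, the resulting numbers are directly comparable.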
The LLM-as-Judge Approach
For many tasks, the best evaluation method is having a strong model rate the outputs of the models you are evaluating. The judge model receives:
- The original input
- The model's output
- The evaluation criteria
- A scoring rubric
This is not perfect, but it scales and is more consistent than human evaluation for most tasks. Validate the judge's assessments with human spot-checks.
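Concretely, the judge model's input can be assembled from those four pieces. The field labels, rubric wording, and "single integer score" instruction below are my assumptions, not a provider-specific format; adapt them to your task.

```python
# Illustrative judge-prompt builder. The labels and rubric wording are
# assumptions; the point is that the judge sees input, output, criteria,
# and rubric together, and returns a score you can parse.

def build_judge_prompt(task_input, model_output, criteria, rubric):
    return (
        "You are evaluating a model's output.\n\n"
        f"Input:\n{task_input}\n\n"
        f"Output to evaluate:\n{model_output}\n\n"
        f"Criteria:\n{criteria}\n\n"
        f"Rubric:\n{rubric}\n\n"
        "Respond with a single integer score from 1 to 5."
    )

prompt = build_judge_prompt(
    task_input="Summarize the refund policy.",
    model_output="Refunds are available within 30 days.",
    criteria="Accurate, concise, covers the key terms.",
    rubric="5 = fully meets criteria ... 1 = unusable",
)
```

Asking for a bare integer keeps parsing trivial; if you need justification for spot-checks, ask for a score plus a one-sentence rationale instead.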
Quality Across Categories
Do not aggregate quality into a single score. Break it down:
- How does each model perform on common cases vs. edge cases?
- Which categories of inputs does each model handle best?
- Where does each model fail most severely?
You might find that Model A is better overall but Model B handles your most important category better. This nuance matters.
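One way to surface that nuance is to tag each test case with a category and aggregate per category rather than overall. A minimal sketch, assuming your scored results carry a `category` label:

```python
# Break scored results down by category instead of one aggregate number.
# Assumes each result dict carries "category" and "score" keys.
from collections import defaultdict

def scores_by_category(results):
    """results: list of {"category": str, "score": float}, one per test case."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r["category"]].append(r["score"])
    return {cat: sum(s) / len(s) for cat, s in buckets.items()}

results = [
    {"category": "common", "score": 0.9},
    {"category": "common", "score": 0.7},
    {"category": "edge", "score": 0.4},
]
per_cat = scores_by_category(results)  # common averages ~0.8, edge 0.4
```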
Dimension 2: Cost Analysis
Calculate the Real Cost Per Request
Model pricing is usually per-token, but the real cost per request depends on your specific usage pattern:
- Input token count (your prompts)
- Output token count (the model's responses)
- System prompt length (often the largest cost driver)
- Average requests per user session
- Total request volume
Calculate the cost per request for each model using your actual prompt templates and expected output lengths. Do not use the average — use your specific numbers.
Project Monthly Costs at Scale
Multiply the per-request cost by your expected volume at different scale levels:
- Current volume
- Three-month projected volume
- Twelve-month projected volume
A model that costs a few cents per request becomes expensive at scale. A model that seems expensive per-request might be cheaper overall if it produces shorter, more efficient outputs.
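The arithmetic is simple enough to put in a worked sketch. The prices below are illustrative placeholders, not real rates; plug in the provider's current per-token pricing and your own measured token counts.

```python
# Worked cost sketch. Prices here are hypothetical placeholders; use the
# provider's current per-1M-token rates and your actual token counts.

def cost_per_request(in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    """Dollar cost of one request, given per-1M-token prices."""
    return (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1_000_000

def monthly_cost(per_request, requests_per_month):
    """Project monthly spend at a given request volume."""
    return per_request * requests_per_month

# Example: 1,500 input tokens (system prompt + user prompt) and 400 output
# tokens, at an assumed $3 / 1M input and $15 / 1M output:
per_req = cost_per_request(1500, 400, 3.0, 15.0)
at_current = monthly_cost(per_req, 100_000)
at_scale = monthly_cost(per_req, 1_000_000)
```

Running the projection at current, three-month, and twelve-month volumes makes the "cheap per request, expensive at scale" effect visible before you commit.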
Factor in Prompt Engineering Savings
Some models require longer prompts or more examples to achieve the same quality. This affects cost. A model that performs well with a short prompt may be cheaper overall than a model with a lower per-token rate that needs verbose prompts.
Dimension 3: Latency Testing
Measure What Users Experience
Latency has multiple components:
- Time to first token (TTFT): How long before the model starts generating? This determines perceived responsiveness in streaming applications.
- Tokens per second: How fast does the model generate once it starts?
- Total response time: How long until the complete response is available?
For streaming applications, TTFT matters most. For batch processing, total response time matters most. For API responses, both matter.
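All three components can be measured from a single streaming response. The sketch below assumes `stream` is any iterator of tokens or chunks (a real client would wrap your provider's streaming API); the timing logic itself is provider-agnostic.

```python
# Measure TTFT, generation speed, and total time from one streamed response.
# `stream` is any iterable of tokens/chunks; wrap your provider's streaming
# API to produce one. The fake_stream below only simulates timings.
import time

def measure_stream(stream):
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _token in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        n_tokens += 1
    total = time.perf_counter() - start
    if ttft is None:  # empty stream
        ttft = total
    # Generation rate over the tokens produced after the first one:
    tps = (n_tokens - 1) / (total - ttft) if n_tokens > 1 else 0.0
    return {"ttft": ttft, "tokens_per_second": tps, "total": total}

def fake_stream():
    time.sleep(0.05)   # simulated delay before the first token
    yield "first"
    for _ in range(20):
        time.sleep(0.005)
        yield "tok"

m = measure_stream(fake_stream())
```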
Test Under Load
Latency at low volume is different from latency at scale. Test with:
- Single requests (baseline)
- Concurrent requests matching your expected load
- Burst traffic patterns
Many models exhibit significant latency degradation under concurrent load. This is especially important for models served through shared infrastructure.
Measure Tail Latency
The average latency is not what your users experience. The p95 and p99 latency determine the experience of your most frustrated users. A model with good average latency but terrible p99 latency will generate complaints.
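A load test and a tail-latency report fit together in a few lines. Here `send_request` is a placeholder that simulates an API round trip; swap in a real timed call, and raise `n_requests` and `concurrency` to match your expected load.

```python
# Fire concurrent requests and report p50/p95/p99 rather than the mean.
# send_request is a stand-in that simulates a round trip; replace it with
# a timed call to your actual API.
import random
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def send_request(_i):
    t0 = time.perf_counter()
    time.sleep(random.uniform(0.001, 0.01))  # simulated API round trip
    return time.perf_counter() - t0

def latency_report(n_requests=200, concurrency=20):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(send_request, range(n_requests)))
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    q = statistics.quantiles(latencies, n=100)
    return {"p50": q[49], "p95": q[94], "p99": q[98]}

report = latency_report()
```

Comparing `p50` against `p99` in this report is exactly the average-versus-tail gap the paragraph above warns about.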
Dimension 4: Reliability Assessment
Error Rates
Track how often each model:
- Returns an error response
- Times out
- Returns malformed output (invalid JSON, truncated response, etc.)
- Returns output that is technically valid but useless (empty response, repetitive text, etc.)
Run enough requests to get statistically meaningful error rates. A hundred requests is not enough — aim for at least a thousand.
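The failure buckets above can be tallied with a small classifier. The categories and the JSON-output assumption below are illustrative; adjust the checks to whatever "usable output" means for your task.

```python
# Classify responses into failure buckets and tally the rates.
# Assumes the task expects JSON output; None models an error or timeout.
import json
from collections import Counter

def classify(response):
    if response is None:
        return "error_or_timeout"
    if not response.strip():
        return "empty"
    try:
        json.loads(response)
    except json.JSONDecodeError:
        return "malformed"
    return "ok"

responses = ['{"a": 1}', "", None, '{"broken": ', '{"b": 2}']
rates = Counter(classify(r) for r in responses)
```

Dividing each count by the total request volume gives the per-bucket error rate; with a thousand-plus requests the rates become stable enough to compare across models.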
Output Consistency
For the same input, how consistent is the model's output? Run the same inputs multiple times and measure variance. Some applications need high consistency (structured data extraction). Others tolerate variance (creative writing).
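One simple consistency metric: run the same prompt N times and measure what fraction of runs produce the modal output. A sketch, with `call_model` again standing in for a real client:

```python
# Consistency as the share of runs that produce the most common output.
# call_model(model, prompt) is a placeholder for your real API client.
from collections import Counter

def consistency(call_model, model, prompt, n=10):
    """Return 1.0 if every run is identical, approaching 1/n as runs diverge."""
    outputs = [call_model(model, prompt) for _ in range(n)]
    modal_count = Counter(outputs).most_common(1)[0][1]
    return modal_count / n

# A fully deterministic stub scores 1.0:
score = consistency(lambda m, p: "yes", "model-a", "Is the sky blue?", n=20)
```

For structured extraction you would want this near 1.0; for creative tasks a lower score may be acceptable or even desirable.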
Degradation Patterns
Does the model degrade gracefully or catastrophically? When it fails, does it produce slightly worse output or completely unusable garbage? Graceful degradation is much easier to handle in production.
Dimension 5: Operability
API Stability and Versioning
How does the provider handle model updates? Can you pin to a specific model version? Will your prompts break when the model is updated? Providers with clear versioning and deprecation policies are significantly easier to work with.
Monitoring and Observability
Does the provider offer usage dashboards, cost tracking, and quality metrics? Can you export logs for your own monitoring? The ability to debug production issues quickly depends on the observability tools available.
Rate Limits and Quotas
What are the rate limits for each model? Can you get higher limits if needed? Rate limit handling should be part of your evaluation — test what happens when you hit the limit.
Fallback Compatibility
If your primary model goes down, how easy is it to fall back to an alternative? Models with similar APIs and prompt formats make fallback strategies simpler.
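Rate-limit handling and fallback combine naturally into one wrapper. `RateLimitError` and the call signatures below are hypothetical; map them onto the exceptions and clients your provider actually exposes.

```python
# Retry with exponential backoff on rate limits, then fall back to a
# secondary model. RateLimitError and the callables are hypothetical;
# adapt them to your client library's real exceptions.
import time

class RateLimitError(Exception):
    pass

def call_with_fallback(primary, fallback, prompt, retries=3, base_delay=0.01):
    for attempt in range(retries):
        try:
            return primary(prompt)
        except RateLimitError:
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
        except Exception:
            break  # hard failure: skip remaining retries, go to fallback
    return fallback(prompt)

def always_rate_limited(prompt):
    raise RateLimitError()

def backup_model(prompt):
    return f"[backup] {prompt}"

result = call_with_fallback(always_rate_limited, backup_model, "hello")
```

The closer the two models' prompt formats are, the more likely the fallback path produces acceptable output without a separate prompt variant.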
The Evaluation Process
Here is the practical process I follow:
- Day 1: Define your evaluation criteria and build the test suite
- Day 2: Run quality evaluations across all candidate models
- Day 3: Run cost analysis and latency testing
- Day 4: Run reliability tests (higher volume, extended duration)
- Day 5: Analyze results, make decision, document rationale
Document everything. Your evaluation data becomes invaluable when you need to re-evaluate in six months.
FAQ
How often should I re-evaluate models?
Every three to six months, or whenever a major new model is released. The model landscape changes fast, and your optimal choice today may not be optimal in six months.
Should I use multiple models in production?
Yes, if different parts of your application have different requirements. A cheap, fast model for simple classification and a more capable model for complex generation is a common and cost-effective pattern.
What about open-source models?
Include them in your evaluation. Self-hosted open-source models can be significantly cheaper at scale and give you more control. The tradeoff is operational overhead. Evaluate whether your team can handle the infrastructure.
How do I handle model deprecation?
Always have a migration plan. Pin to specific model versions, maintain your evaluation test suite, and test new versions before switching. Providers typically give advance notice before deprecating a model.