Model Capability Picker
A five-question quiz that recommends the right model for your task, latency and budget.
≈ $51.75/month at 200 calls/day (8k in / 1.5k out per call). The runner-up is close — worth testing both on twenty real examples.
Why this model
- right capability tier for refactor work — no premium wasted
- 180 tok/s keeps a watching human happy
Runner-up: GPT-5.4 miniclose call
$77.62/mo- · right capability tier for refactor work — no premium wasted
- · 150 tok/s keeps a watching human happy
Prices checked 2026-06-11. Real workloads are a mix — routing easy calls to GPT-5.4 mini and hard ones to Gemini 3 Flash is the rule worth automating.
How it works
Five questions, one recommendation. This picker scores every model in our dataset against your answers — task type, latency tolerance, budget posture, context size and tool-use intensity — and returns a primary recommendation plus a runner-up, with the reasoning shown as bullets rather than hidden in a black box. A monthly cost estimate at your stated volume comes attached, priced from our model dataset (checked 2026-06-11).
The scoring logic is deliberately legible. Task type sets the capability floor: architecture and debugging work weights toward frontier-tier models, while boilerplate and documentation open the field to small ones. Latency need rewards tokens-per-second when you say a human is waiting. Budget posture scales the price penalty — cost-first answers punish expensive models hard, quality-first barely at all. Context size eliminates models whose window cannot fit your working set, and tool-use intensity favors models with strong function-calling track records in agentic loops. Each model gets a score; you see the top two and exactly why.
The most useful output is often the gap between the winner and the runner-up. When the runner-up is a mid-tier model within a few points of a frontier pick, that is the picker telling you the decision is close enough to test cheaply: run both on twenty real examples from your workload and let the results decide. When the gap is wide, trust it — sending architecture decisions to a nano-class model to save pennies is a false economy that costs you in rework, and the picker prices that in.
One model for everything is the wrong frame anyway. Real workloads are a mix — some calls need the strongest reasoning available, most do not — and the teams with the best cost-quality curves route per task rather than per project. Use this picker per workload: answer for your documentation pipeline, then again for your debugging agent, and you will usually land on two or three models with clear boundaries between them. Turning those boundaries into enforced routing rules instead of tribal knowledge is what the FORG rule engine is for.
Frequently asked questions
When is a frontier model actually necessary?
When the task involves multi-step reasoning, architectural judgment, debugging across files, or anything where a wrong answer costs more than the price difference. Frontier models earn their 5-20× price premium on hard problems — and waste it on easy ones. The honest test: if a mid-tier model gets the task right 95% of the time and failures are cheap to catch, the frontier premium buys you very little.
When do small models genuinely win?
High-volume, well-specified, low-stakes tasks: generating boilerplate, formatting, summarizing logs, writing commit messages, simple extraction. Small models are not just cheaper — they are 2-4× faster, which matters when a human is waiting, and at thousands of calls per day the cost gap compounds into real money. The pattern that fails is giving a small model an under-specified task and blaming the model for the ambiguity.
What is model routing and should my team do it?
Routing sends each request to the cheapest model that can handle it — docs to a small model, refactors to mid-tier, architecture to frontier — instead of one model for everything. Teams that route typically cut spend 40-70% with no measurable quality loss on the easy tier. The hard part is doing it consistently rather than relying on each developer to remember; codified rules beat good intentions, which is exactly what the FORG rule engine automates.
How often should I re-evaluate my model choice?
Quarterly at minimum, and immediately after any major model release. The price-performance frontier moves fast: today's mid-tier model often matches last year's frontier model at a tenth of the price, so a choice made six months ago is probably leaving money on the table. Keep a small eval set of your real tasks and re-run it against new models — an afternoon of evals can justify a permanent cost cut.
Turn this analysis into a live rule with the FORG rule engine — route models and enforce limits automatically.
Explore the rule engine