
AI Vendor Scorecard

Build a weighted comparison matrix for AI vendors — criteria, weights, scores, verdict.

100% client-side · exportable output · zero network calls

Weights normalize to 100% automatically. Score anchors: 1 fails, 3 meets, 5 exceeds.

Anthropic wins with a weighted score of 4.10 vs 4.05 for Google.

Weighted scorecard matrix
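| Criterion | Anthropic | OpenAI | Google |
| --- | --- | --- | --- |
| Capability (35%) | 5 | 4 | 4 |
| Price (25%) | 3 | 3 | 4 |
| Latency (15%) | 4 | 4 | 5 |
| Compliance (15%) | 4 | 4 | 4 |
| Support (10%) | 4 | 3 | 3 |
| Weighted total | 4.10 | 3.65 | 4.05 |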

Sensitivity: the winner flips if Price weight rises by 10 points — fragile verdict; treat the finalists as tied.

Markdown export, no lock-in
100% generated locally
0 signup walls
0 network requests per keystroke

How it works

This scorecard builds a weighted comparison matrix for up to four AI vendors. Define your criteria and their weights — the defaults cover capability, price, latency, compliance and support, but every row is editable and you can add your own — then score each vendor 1-5 per criterion. Weights normalize to 100% automatically, weighted totals compute live, and the leading vendor gets the verdict badge. Export the whole matrix as markdown for your decision doc. Everything runs locally in your browser.
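The arithmetic behind the weighted totals can be sketched in a few lines of Python. This is an illustration under assumptions, not the tool's own code: the function name `weighted_totals` and the dictionary layout are hypothetical, and the data is copied from the example matrix on this page.

```python
# Example data copied from the matrix shown on this page.
weights = {"Capability": 35, "Price": 25, "Latency": 15, "Compliance": 15, "Support": 10}
scores = {
    "Anthropic": {"Capability": 5, "Price": 3, "Latency": 4, "Compliance": 4, "Support": 4},
    "OpenAI":    {"Capability": 4, "Price": 3, "Latency": 4, "Compliance": 4, "Support": 3},
    "Google":    {"Capability": 4, "Price": 4, "Latency": 5, "Compliance": 4, "Support": 3},
}

def weighted_totals(weights, scores):
    """Return {vendor: weighted score}; weights are normalized by their sum."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return {
        vendor: sum(weights[c] * row[c] for c in weights) / total_weight
        for vendor, row in scores.items()
    }

totals = weighted_totals(weights, scores)
winner = max(totals, key=totals.get)  # the vendor with the highest total

for vendor, total in totals.items():
    print(f"{vendor}: {total:.2f}")
print("Verdict:", winner)
```

Dividing by the sum of the weights is what lets you enter weights on any scale: the totals stay on the 1-5 range whether the weights add up to 100 or to 20.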

The order of operations matters more than the arithmetic. Set weights before you score: a team that agrees capability is worth a third of the decision and compliance a tenth has already made most of the hard choices, and the scoring becomes data entry rather than debate. Weights set after scoring drift, consciously or not, toward whichever vendor someone already prefers — the matrix becomes a justification instead of an instrument. Anchor the 1-5 scale too: 1 fails the requirement, 3 meets it, 5 clearly exceeds it, with written definitions per criterion.

The sensitivity note under the verdict is the feature that keeps the matrix honest. It reports how stable the winner is against weight changes — whether a plausible shift in one criterion's importance would crown a different vendor. A fragile verdict is not a failure of the method; it is the method telling you the finalists are effectively tied and the decision should move to factors the grid cannot hold: contract terms, ecosystem maturity, or the results of a short paid pilot.
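One way to run a check like this is to move a single weight up or down by a fixed number of points, recompute the totals, and see whether the leader changes. The sketch below assumes a 10-point shift and hypothetical helper names (`leader`, `flips_at`); how the tool actually performs its check is not documented here, so treat this as an illustration only.

```python
# Example data copied from the matrix shown on this page.
weights = {"Capability": 35, "Price": 25, "Latency": 15, "Compliance": 15, "Support": 10}
scores = {
    "Anthropic": {"Capability": 5, "Price": 3, "Latency": 4, "Compliance": 4, "Support": 4},
    "OpenAI":    {"Capability": 4, "Price": 3, "Latency": 4, "Compliance": 4, "Support": 3},
    "Google":    {"Capability": 4, "Price": 4, "Latency": 5, "Compliance": 4, "Support": 3},
}

def leader(weights, scores):
    """Return the vendor with the highest normalized weighted total."""
    total_weight = sum(weights.values())
    totals = {
        vendor: sum(weights[c] * row[c] for c in weights) / total_weight
        for vendor, row in scores.items()
    }
    return max(totals, key=totals.get)

def flips_at(weights, scores, shift=10):
    """List (criterion, signed shift, new leader) for every single-weight
    change of +shift or -shift points that changes the leader."""
    base = leader(weights, scores)
    flips = []
    for criterion in weights:
        for delta in (shift, -shift):
            changed = dict(weights)
            changed[criterion] = max(0, changed[criterion] + delta)  # no negative weights
            new_leader = leader(changed, scores)
            if new_leader != base:
                flips.append((criterion, delta, new_leader))
    return flips

flips = flips_at(weights, scores)
print("fragile" if flips else "robust", flips)
```

With the example matrix, the returned list includes `("Price", 10, "Google")`, which is the flip the sensitivity note reports. An empty list would indicate a verdict that survives every 10-point shift.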

Two honest caveats. First, a scorecard is only as good as the evidence behind the scores — benchmark latency yourself, read the compliance docs, and use published pricing rather than impressions. Second, price deserves a reality check beyond the list number: what a vendor costs you depends on how your team actually uses it, which is measurable. FORG tracks real per-vendor and per-developer usage, so by renewal time the price column in this matrix can be your own observed cost rather than a brochure figure.

Frequently asked questions

Why use a weighted scorecard instead of just picking the vendor everyone likes?

Because unstructured vendor decisions are dominated by whoever demos last and whoever argues loudest. A weighted matrix forces the team to agree on what matters before scoring anything, which moves the argument from vendors to priorities — a far more productive fight. It also leaves a written record: when someone asks in a year why you chose vendor B, the scorecard answers in thirty seconds instead of a meeting.

How should I set the criterion weights?

Set them before you score, ideally before you see any demos, and set them as a team. Start from the defaults here — capability heaviest, then price, latency, compliance and support — and adjust for your context: a healthcare company should pull compliance up sharply, a latency-sensitive product should promote latency. The tool normalizes whatever you enter to 100%, so think in relative importance rather than precise percentages. If two stakeholders disagree on a weight, that disagreement is the real decision; surface it now.
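Thinking in relative importance works because normalization is a simple rescale. A minimal sketch, assuming a hypothetical `normalize_weights` helper and made-up relative weights on a rough 1-10 scale:

```python
def normalize_weights(raw):
    """Scale relative-importance numbers so they sum to 100."""
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: 100 * value / total for name, value in raw.items()}

# Hypothetical team input: only the ratios between the numbers matter.
relative = {"Capability": 7, "Price": 5, "Latency": 3, "Compliance": 3, "Support": 2}
percentages = normalize_weights(relative)

for name, pct in percentages.items():
    print(f"{name}: {pct:.0f}%")
```

These relative numbers rescale to 35%, 25%, 15%, 15% and 10%, the same split as the defaults in the example matrix. Doubling every input would give the same percentages.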

What does the sensitivity note tell me?

Whether your verdict is robust or fragile. The tool checks how the winner changes as each criterion's weight shifts, and flags the smallest weight change that would flip the result. A winner that survives large weight swings is a safe decision; a winner that flips when one weight moves a few points means the top vendors are effectively tied, and you should decide on something the matrix does not capture — ecosystem, contract terms, or a paid pilot.
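To make "smallest weight change that would flip the result" concrete, here is a sketch that searches in whole-point steps, one criterion at a time. The step size, the search order, the tie handling (a tie keeps the current leader) and the two-vendor data are all assumptions for illustration; the tool's own search may differ.

```python
# Hypothetical two-vendor, two-criterion scorecard.
weights = {"Quality": 60, "Price": 40}
scores = {
    "A": {"Quality": 4, "Price": 3},
    "B": {"Quality": 3, "Price": 4},
}

def leader(weights, scores):
    """Return the vendor with the highest normalized weighted total.
    On an exact tie the first vendor listed wins."""
    total_weight = sum(weights.values())
    totals = {
        vendor: sum(weights[c] * row[c] for c in weights) / total_weight
        for vendor, row in scores.items()
    }
    return max(totals, key=totals.get)

def smallest_flip(weights, scores, max_shift=50):
    """Return (criterion, signed shift, new leader) for the smallest
    whole-point change to one weight that changes the leader, or None."""
    base = leader(weights, scores)
    for size in range(1, max_shift + 1):
        for criterion, weight in weights.items():
            for delta in (size, -size):
                if weight + delta < 0:
                    continue  # weights cannot go negative
                changed = {**weights, criterion: weight + delta}
                if sum(changed.values()) == 0:
                    continue  # nothing left to weigh
                new_leader = leader(changed, scores)
                if new_leader != base:
                    return criterion, delta, new_leader
    return None

print(smallest_flip(weights, scores))  # ('Quality', -21, 'B')
```

Here vendor A leads until the Quality weight drops by more than 20 points, so this verdict would count as robust. A result of only a few points would signal finalists that are effectively tied.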

How do I score fairly on a 1-5 scale?

Anchor the scale before anyone scores: 1 means the vendor fails your requirement, 3 means it meets it, and 5 means it clearly exceeds it. Write down what 'meets' means for each criterion. Score from evidence where possible: published pricing for price, your own benchmark runs for latency, the vendor's compliance documentation for compliance. Have evaluators score independently before comparing, because group scoring converges on the first opinion voiced rather than the best one.
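If several evaluators score independently, their numbers still have to be combined into one cell of the matrix. The sketch below shows one possible approach, which is an assumption and not something the tool does for you: take the median, and flag cells where evaluators disagree by more than one point so the team discusses them. The name `combine_scores` and the one-point threshold are hypothetical.

```python
from statistics import median

def combine_scores(evaluator_scores, max_spread=1):
    """Combine independent 1-5 scores for one vendor and criterion.

    Returns (median score, needs_discussion), where needs_discussion is
    True when the highest and lowest scores differ by more than max_spread.
    """
    if not evaluator_scores:
        raise ValueError("need at least one score")
    if any(not 1 <= s <= 5 for s in evaluator_scores):
        raise ValueError("scores must be between 1 and 5")
    spread = max(evaluator_scores) - min(evaluator_scores)
    return median(evaluator_scores), spread > max_spread

# Hypothetical scores from three evaluators for two cells.
agreed = combine_scores([4, 4, 5])
disputed = combine_scores([2, 4, 5])
print(agreed, disputed)  # (4, False) (4, True)
```

Both cells end up with a score of 4, but only the second is flagged. A wide spread usually means the evaluators read the written anchor differently, which is worth resolving before the total is trusted.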

Should price be a scored criterion or a separate negotiation?

Both, in sequence. Score list price in the matrix so cost discipline shapes the shortlist, but remember the number you eventually pay is negotiated, especially at team and enterprise scale. A useful pattern: use the scorecard to get to two finalists, then negotiate both in parallel — vendors price differently when they know a scored competitor is one column away. And measure your actual usage first; per-seat list prices mean little until you know your real volume.

Built by FORG — AI cost observability for agentic coding. Free tools, no signup, nothing leaves your browser.
