Question 1

Why anchor the 1-5 scale with descriptions?

Accepted Answer

Because an unanchored number is a vibe, not a measurement. If two graders, human or LLM-as-judge, read 'correctness: 3' without a definition, they will disagree about what a 3 looks like, and your eval scores become noise. Writing concrete descriptions for scores 1, 3 and 5 forces you to define failure, mediocrity and excellence in observable terms, and graders can then interpolate 2 and 4 far more consistently from those anchors. It is one of the highest-leverage habits in rubric design.
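As an illustration, here is a minimal Python sketch of what an anchored criterion could look like as data. The `Criterion` class, its field names and the anchor wording are all hypothetical, not the tool's actual schema. The only idea taken from the answer above is that scores 1, 3 and 5 each carry an observable description.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Criterion:
    """One rubric criterion (hypothetical structure, for illustration only)."""

    name: str
    # Descriptions keyed by score. Only 1, 3 and 5 are anchored;
    # graders interpolate 2 and 4 from these.
    anchors: dict[int, str] = field(default_factory=dict)

    def missing_anchors(self, required: tuple[int, ...] = (1, 3, 5)) -> list[int]:
        """Return the required scores that have no non-empty description."""
        return [s for s in required if not self.anchors.get(s, "").strip()]


correctness = Criterion(
    name="correctness",
    anchors={
        1: "Contains at least one claim that contradicts the source.",
        3: "No contradictions, but omits or blurs a key fact from the source.",
        5: "Every claim is supported by the source and no key fact is missing.",
    },
)

# A criterion with a name but no anchors: the "vibe" case described above.
unanchored = Criterion(name="tone")

print(correctness.missing_anchors())
print(unanchored.missing_anchors())
```

A check like `missing_anchors` is one cheap way to refuse a rubric before any grading happens, instead of discovering grader disagreement afterwards.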

Question 2

How should I choose criterion weights?

Accepted Answer

Weight by consequence, not by ease of grading. A factual error in a medical summary matters more than a tone slip, so correctness should dominate; in a brand-voice copywriting eval the reverse may hold. A practical method: imagine two outputs, one failing only criterion A and one failing only criterion B, and ask which you would ship. The tool normalizes whatever you enter to 100%, so think in ratios — correctness twice as important as style — rather than exact percentages.
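To make the "think in ratios" advice concrete, here is a small sketch of normalizing raw weights so they sum to 100%. The function name `normalize_weights` and the example criteria are assumptions for illustration. The tool's own normalization code is not shown in this FAQ and may differ in details such as rounding.

```python
from __future__ import annotations


def normalize_weights(raw: dict[str, float]) -> dict[str, float]:
    """Scale raw, ratio-style weights so that they sum to 100 (percent)."""
    if any(w < 0 for w in raw.values()):
        raise ValueError("weights must be non-negative")
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: 100.0 * w / total for name, w in raw.items()}


# "Correctness is twice as important as style or brevity", entered as a ratio.
weights = normalize_weights({"correctness": 2, "style": 1, "brevity": 1})
print(weights)
```

Entering `2, 1, 1` or `50, 25, 25` yields the same normalized result, which is why only the ratios matter.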

Question 3

What is a sensible pass threshold?

Accepted Answer

Start at 70% of the maximum weighted score and calibrate against real outputs. Run the rubric over a dozen examples you have already judged informally: if outputs you would ship are failing, the threshold is too strict or a criterion's anchors are miscalibrated; if outputs you would reject are passing, tighten it. The threshold is a dial you tune empirically, not a constant you get right on the first try — which is exactly why it is exported as part of the rubric.
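The calibration loop described above could be sketched roughly as follows. It assumes the weighted score is expressed as a fraction of the maximum (every criterion scored 5), and that each example carries a hypothetical `would_ship` flag recording your informal judgment. The example data is made up for illustration.

```python
from __future__ import annotations


def weighted_fraction(scores: dict[str, int], weights: dict[str, float],
                      max_score: int = 5) -> float:
    """Weighted total as a fraction of the maximum weighted score."""
    earned = sum(weights[name] * scores[name] for name in weights)
    return earned / (max_score * sum(weights.values()))


def calibrate(examples: list[dict], weights: dict[str, float],
              threshold: float = 0.70) -> tuple[list[str], list[str]]:
    """Return (shippable outputs that fail, rejectable outputs that pass)."""
    false_fails, false_passes = [], []
    for ex in examples:
        passed = weighted_fraction(ex["scores"], weights) >= threshold
        if ex["would_ship"] and not passed:
            false_fails.append(ex["id"])
        elif not ex["would_ship"] and passed:
            false_passes.append(ex["id"])
    return false_fails, false_passes


weights = {"correctness": 2, "style": 1}
examples = [  # invented scores and judgments
    {"id": "a", "scores": {"correctness": 5, "style": 3}, "would_ship": True},
    {"id": "b", "scores": {"correctness": 3, "style": 4}, "would_ship": True},
    {"id": "c", "scores": {"correctness": 4, "style": 3}, "would_ship": False},
    {"id": "d", "scores": {"correctness": 2, "style": 2}, "would_ship": False},
]

false_fails, false_passes = calibrate(examples, weights)
print("would ship but fails:", false_fails)
print("would reject but passes:", false_passes)
```

A non-empty first list suggests the threshold is too strict or an anchor is miscalibrated. A non-empty second list suggests tightening. Note that on a 1-5 scale the fraction cannot drop below 0.2, so 70% is less lenient than it may first appear.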

Question 4

How do I use the JSON export in an eval pipeline?

Accepted Answer

The JSON is deliberately framework-neutral: a task description, an array of criteria with normalized weights and per-score anchors, and a pass threshold. Feed it to an LLM-as-judge prompt by interpolating the criteria and anchors into the judge's instructions, then have the judge emit one score per criterion; your harness computes the weighted total and compares it to the threshold. The same file drops into promptfoo, Braintrust or a homegrown runner with minimal glue.
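A rough sketch of that glue is shown below. The JSON field names (`task`, `criteria`, `weight`, `anchors`, `pass_threshold`) are guesses at the export's shape, not its documented schema, and the threshold is assumed to be a fraction of the maximum score. The call to the judge model is deliberately left out, so `judge_scores` stands in for its parsed reply.

```python
import json

# Hypothetical export; the real file's field names may differ.
RUBRIC_JSON = """
{
  "task": "Summarize a clinical note for a patient",
  "pass_threshold": 0.7,
  "criteria": [
    {"name": "correctness", "weight": 0.75,
     "anchors": {"1": "Contradicts the note.",
                 "3": "Accurate but omits a key fact.",
                 "5": "Accurate and complete."}},
    {"name": "tone", "weight": 0.25,
     "anchors": {"1": "Alarming or dismissive.",
                 "3": "Neutral but stiff.",
                 "5": "Clear, calm and respectful."}}
  ]
}
"""


def build_judge_prompt(rubric: dict, output_text: str) -> str:
    """Interpolate the task, criteria and anchors into judge instructions."""
    lines = [
        f"Task: {rubric['task']}",
        "Score the output from 1 to 5 on each criterion below.",
        "Use the anchors for 1, 3 and 5; interpolate 2 and 4.",
    ]
    for c in rubric["criteria"]:
        lines.append(f"\nCriterion: {c['name']}")
        for score in sorted(c["anchors"], key=int):
            lines.append(f"  {score}: {c['anchors'][score]}")
    example = json.dumps({c["name"]: 3 for c in rubric["criteria"]})
    lines.append(f"\nReply with JSON only, in this shape: {example}")
    lines.append(f"\nOutput to grade:\n{output_text}")
    return "\n".join(lines)


def verdict(rubric: dict, judge_scores: dict, max_score: int = 5):
    """Compute the weighted fraction of the maximum and compare to threshold."""
    total = sum(c["weight"] * judge_scores[c["name"]] for c in rubric["criteria"])
    fraction = total / (max_score * sum(c["weight"] for c in rubric["criteria"]))
    return fraction, fraction >= rubric["pass_threshold"]


rubric = json.loads(RUBRIC_JSON)
prompt = build_judge_prompt(rubric, "You have a mild chest infection...")
# In a real harness, judge_scores would come from parsing the judge's reply.
fraction, passed = verdict(rubric, {"correctness": 5, "tone": 3})
print(prompt)
print(round(fraction, 2), passed)
```

Keeping the arithmetic in the harness instead of asking the judge for a total means the weights and threshold can change without touching the prompt.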

Question 5

When should I use the markdown export instead?

Accepted Answer

Whenever a human is in the loop. The markdown renders the rubric as a table for code review, documentation sites or annotator guidelines, so human graders see exactly the same criteria and anchors the automated judge uses. Keeping the two exports generated from one source means your human and machine evaluations cannot silently drift apart — a depressingly common failure mode in teams that maintain grading docs and judge prompts separately.

LLM Eval Rubric Builder
