LLM Eval Rubric Builder

Build a structured evaluation rubric and export it as JSON or markdown for your evals.

100% client-side · data verified 2026-06-11 · zero network calls

Criteria

raw weights sum to 100 — normalized to 100% in exports
{
  "task": "Summarize a customer support ticket into a 3-bullet handoff note for the next agent.",
  "passThresholdPercent": 70,
  "scale": {
    "min": 1,
    "max": 5,
    "note": "Scores 2 and 4 interpolate between the written anchors."
  },
  "criteria": [
    {
      "name": "Correctness",
      "weightPercent": 50,
      "anchors": {
        "1": "Contains factual errors or fails the core task",
        "3": "Core task correct; minor omissions or imprecision",
        "5": "Fully correct, complete, and verifiable"
      }
    },
    {
      "name": "Format compliance",
      "weightPercent": 30,
      "anchors": {
        "1": "Output ignores the requested structure",
        "3": "Mostly follows structure; small deviations",
        "5": "Exactly matches the requested schema/format"
      }
    },
    {
      "name": "Conciseness",
      "weightPercent": 20,
      "anchors": {
        "1": "Padded with filler, repetition or hedging",
        "3": "Some redundancy but acceptable length",
        "5": "Every sentence earns its place"
      }
    }
  ]
}

18 models in the dataset · reference data verified 2026-06-11 · 100% of logic runs in your browser · 0 network requests per keystroke

How it works

Describe the task, add criteria rows with weights and concrete 1/3/5 score anchors, set a pass threshold, and watch the rubric assemble live. Weights normalize to 100% automatically, and the finished rubric exports two ways: JSON for your eval harness or LLM-as-judge prompt, and markdown for the humans who need to apply the same standard.

The structure encodes the lessons of teams who run evals at scale. Criteria are separated because a single "quality 1-10" score hides what actually went wrong — an output can be factually perfect and stylistically unusable, and you need to know which. Weights exist because criteria are never equally important, and pretending they are means your aggregate score rewards polishing trivia. The anchors are the heart of it: a score of 3 means nothing until you write down what a 3 looks like, in observable terms, for this specific criterion on this specific task. Graders interpolate the even scores from the anchors you define.

The pass threshold turns a score into a decision. Evals exist to answer "can we ship this change?", and a weighted score of 78% answers nothing until the bar, say 75%, is written down somewhere. Putting the threshold inside the exported rubric, rather than in a separate config or someone's head, means every consumer of the rubric (CI gate, dashboard, human reviewer) applies the same bar. When you tune the threshold after calibration runs, the rubric version changes with it, and history stays auditable.
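A minimal sketch of how a harness might turn per-criterion scores into that decision. It reuses the field names from the JSON preview above, and it assumes the percentage is the weighted mean score divided by the scale maximum; the page does not spell out the exact formula, so treat that as an assumption. The function names and the sample scores are hypothetical.

```python
# Rubric fragment with the same field names as the JSON export (anchors omitted).
RUBRIC = {
    "passThresholdPercent": 70,
    "scale": {"min": 1, "max": 5},
    "criteria": [
        {"name": "Correctness", "weightPercent": 50},
        {"name": "Format compliance", "weightPercent": 30},
        {"name": "Conciseness", "weightPercent": 20},
    ],
}


def weighted_percent(rubric, scores):
    """Weighted score as a percentage of the maximum possible score.

    Assumption: percent = weighted mean score / scale max * 100.
    """
    scale_max = rubric["scale"]["max"]
    total_weight = sum(c["weightPercent"] for c in rubric["criteria"])
    weighted_sum = sum(c["weightPercent"] * scores[c["name"]] for c in rubric["criteria"])
    return 100.0 * weighted_sum / (total_weight * scale_max)


def passes(rubric, scores):
    """Compare the weighted percentage to the threshold stored in the rubric."""
    return weighted_percent(rubric, scores) >= rubric["passThresholdPercent"]


# Hypothetical scores from one grader for one output.
sample_scores = {"Correctness": 4, "Format compliance": 3, "Conciseness": 5}
print(weighted_percent(RUBRIC, sample_scores))  # 78.0
print(passes(RUBRIC, sample_scores))            # True at the 70% bar
```

Because the threshold is read from the rubric itself, raising `passThresholdPercent` to 80 flips the same output to a fail without touching the harness.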

Start smaller than feels rigorous. Three or four criteria with sharp anchors beat eight vague ones, because every additional criterion dilutes the weight of the ones that matter and adds grading cost on every example forever. The builder makes adding rows easy precisely so you can feel the dilution in the live preview — watch the normalized weights shrink as you add — and decide whether each new criterion earns its share.

Frequently asked questions

Why anchor the 1-5 scale with descriptions?

Because an unanchored number is a vibe, not a measurement. If two graders — human or LLM-as-judge — read 'correctness: 3' without a definition, they will disagree about what a 3 looks like, and your eval scores become noise. Writing concrete descriptions for scores 1, 3 and 5 forces you to define failure, mediocrity and excellence in observable terms, and graders interpolate 2 and 4 consistently from those anchors. It is the single highest-leverage habit in rubric design.

How should I choose criterion weights?

Weight by consequence, not by ease of grading. A factual error in a medical summary matters more than a tone slip, so correctness should dominate; in a brand-voice copywriting eval the reverse may hold. A practical method: imagine two outputs, one failing only criterion A and one failing only criterion B, and ask which you would ship. The tool normalizes whatever you enter to 100%, so think in ratios — correctness twice as important as style — rather than exact percentages.
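To illustrate thinking in ratios, here is a small sketch of the normalization step. It assumes the tool simply divides each raw weight by the total; `normalize_weights` is a hypothetical helper, not the tool's own code.

```python
def normalize_weights(raw_weights):
    """Scale raw weights (any positive ratios) so they sum to 100."""
    if any(w < 0 for w in raw_weights.values()):
        raise ValueError("weights must not be negative")
    total = sum(raw_weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: 100.0 * w / total for name, w in raw_weights.items()}


# "Correctness is twice as important as style", entered as a 2:1 ratio.
two_criteria = normalize_weights({"Correctness": 2, "Style": 1})
print({k: round(v, 1) for k, v in two_criteria.items()})

# Adding a third criterion at weight 1 shrinks the share of the first two.
three_criteria = normalize_weights({"Correctness": 2, "Style": 1, "Conciseness": 1})
print({k: round(v, 1) for k, v in three_criteria.items()})
```

In the first call correctness ends up with two thirds of the total; in the second it drops to half, which is the dilution effect each added criterion has on the ones already present.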

What is a sensible pass threshold?

Start at 70% of the maximum weighted score and calibrate against real outputs. Run the rubric over a dozen examples you have already judged informally: if outputs you would ship are failing, the threshold is too strict or a criterion's anchors are miscalibrated; if outputs you would reject are passing, tighten it. The threshold is a dial you tune empirically, not a constant you get right on the first try — which is exactly why it is exported as part of the rubric.
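The calibration loop described above could be sketched roughly like this. The example data is made up for illustration, and `calibration_report` is a hypothetical helper: it takes weighted percentages you have already computed, paired with your informal ship-or-reject judgment, and lists the disagreements at a given threshold.

```python
def calibration_report(examples, threshold):
    """List the outputs where the threshold disagrees with your own judgment.

    examples: list of (weighted_percent, would_ship) pairs.
    """
    # Outputs you would ship that the rubric fails: threshold too strict,
    # or an anchor is miscalibrated.
    false_fails = [score for score, would_ship in examples if would_ship and score < threshold]
    # Outputs you would reject that the rubric passes: threshold too lenient.
    false_passes = [score for score, would_ship in examples if not would_ship and score >= threshold]
    return {"threshold": threshold, "false_fails": false_fails, "false_passes": false_passes}


# Made-up calibration set: (weighted percent, would you ship it?).
judged = [(90, True), (82, True), (74, True), (68, True), (71, False), (55, False)]

for candidate in (65, 70, 75):
    print(calibration_report(judged, candidate))
```

A threshold that produces both kinds of disagreement at once, as 70 does on this made-up set, usually points at the anchors rather than the bar: no single cutoff separates the two groups, so one criterion is probably scoring the wrong thing.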

How do I use the JSON export in an eval pipeline?

The JSON is deliberately framework-neutral: a task description, an array of criteria with normalized weights and per-score anchors, and a pass threshold. Feed it to an LLM-as-judge prompt by interpolating the criteria and anchors into the judge's instructions, then have the judge emit one score per criterion; your harness computes the weighted total and compares it to the threshold. The same file drops into promptfoo, Braintrust or a homegrown runner with minimal glue.
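One way the interpolation step might look, as a sketch only. The prompt wording, the reply format (a JSON object of criterion name to integer score) and both function names are assumptions, and the call to the judge model itself is left out because it depends on your provider.

```python
import json

# Trimmed rubric using the same field names as the JSON export.
RUBRIC = {
    "task": "Summarize a customer support ticket into a 3-bullet handoff note for the next agent.",
    "scale": {"min": 1, "max": 5, "note": "Scores 2 and 4 interpolate between the written anchors."},
    "criteria": [
        {
            "name": "Correctness",
            "weightPercent": 100,
            "anchors": {
                "1": "Contains factual errors or fails the core task",
                "3": "Core task correct; minor omissions or imprecision",
                "5": "Fully correct, complete, and verifiable",
            },
        }
    ],
}


def build_judge_prompt(rubric, candidate_output):
    """Interpolate the task, scale and anchors into judge instructions."""
    scale = rubric["scale"]
    lines = [
        "You are grading an output against a rubric.",
        f"Task: {rubric['task']}",
        f"Score each criterion from {scale['min']} to {scale['max']}. {scale.get('note', '')}".strip(),
    ]
    for criterion in rubric["criteria"]:
        lines.append("")
        lines.append(f"Criterion: {criterion['name']}")
        for score in sorted(criterion["anchors"], key=int):
            lines.append(f"  {score} = {criterion['anchors'][score]}")
    lines += [
        "",
        "Output to grade:",
        candidate_output,
        "",
        "Reply with only a JSON object mapping each criterion name to an integer score.",
    ]
    return "\n".join(lines)


def parse_judge_reply(rubric, reply_text):
    """Validate that the judge returned one in-range integer per criterion."""
    scores = json.loads(reply_text)
    low, high = rubric["scale"]["min"], rubric["scale"]["max"]
    for criterion in rubric["criteria"]:
        value = scores.get(criterion["name"])
        if not isinstance(value, int) or not low <= value <= high:
            raise ValueError(f"missing or out-of-range score for {criterion['name']}")
    return scores


prompt = build_judge_prompt(RUBRIC, "- Customer reports login loop\n- Cleared cookies, no change\n- Escalate to auth team")
print(prompt)
```

Validating the reply before computing the weighted total keeps a malformed judge response from silently turning into a score.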

When should I use the markdown export instead?

Whenever a human is in the loop. The markdown renders the rubric as a table for code review, documentation sites or annotator guidelines, so human graders see exactly the same criteria and anchors the automated judge uses. Keeping the two exports generated from one source means your human and machine evaluations cannot silently drift apart — a depressingly common failure mode in teams that maintain grading docs and judge prompts separately.
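For a homegrown setup, generating the table from the same JSON might look like the sketch below. The column layout is an assumption and may differ from the tool's actual markdown export; `rubric_to_markdown` is a hypothetical helper.

```python
def rubric_to_markdown(rubric):
    """Render criteria, weights and anchors as a markdown table."""
    # Collect every anchored score that appears in any criterion, in numeric order.
    anchored = sorted({s for c in rubric["criteria"] for s in c["anchors"]}, key=int)
    header = ["Criterion", "Weight"] + [f"Score {s}" for s in anchored]
    rows = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for criterion in rubric["criteria"]:
        cells = [criterion["name"], f"{criterion['weightPercent']}%"]
        cells += [criterion["anchors"].get(s, "") for s in anchored]
        # Escape pipes so anchor text cannot break the table layout.
        rows.append("| " + " | ".join(cell.replace("|", "\\|") for cell in cells) + " |")
    return "\n".join(rows)


# Trimmed rubric using the same field names as the JSON export.
rubric = {
    "criteria": [
        {
            "name": "Conciseness",
            "weightPercent": 20,
            "anchors": {
                "1": "Padded with filler, repetition or hedging",
                "3": "Some redundancy but acceptable length",
                "5": "Every sentence earns its place",
            },
        }
    ]
}
table = rubric_to_markdown(rubric)
print(table)
```

Rendering from the parsed JSON, rather than maintaining the table by hand, is what keeps the human-facing copy and the judge prompt on the same version of the rubric.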
