Skip to main content

System Prompt Linter

Grade your system prompt A to F: contradictions, vague verbs, bloat and missing specs.

100% client-side⛁ data verified 2026-06-11⌁ zero network calls
never leaves your browser

99 tokens (chars ÷ 4) · 9 lines

F

8 findings3 high-severity. Fix contradictions and placeholders first; they cost the most.

  • Contradiction · highL5: Line 2 says "always" and line 5 says "never" about "respond" — the model will resolve this unpredictably.
  • Contradiction · highL5: Line 6 says "always" and line 5 says "never" about "respond" — the model will resolve this unpredictably.
  • Vague verb · mediumL3: "handle" gives the model nothing testable — replace with a concrete behaviour ("on error, return …").
  • Vague verb · mediumL4: "deal with" gives the model nothing testable — replace with a concrete behaviour ("on error, return …").
  • Vague verb · mediumL8: "manage" gives the model nothing testable — replace with a concrete behaviour ("on error, return …").
  • Vague verb · mediumL9: "try to" gives the model nothing testable — replace with a concrete behaviour ("on error, return …").
  • Duplicate line · lowL6: Identical to line 2 — duplicated instructions waste context and signal copy-paste drift.
  • Placeholder · highL7: Leftover TODO/placeholder text — this was never meant to ship to the model.
18
models in the dataset
2026-06-11
reference data verified
100%
logic runs in your browser
0
network requests per keystroke

How it works

Paste a system prompt and get a grade from A to F plus a list of concrete findings, each tied to a line number. The linter runs six heuristic rules entirely in your browser — nothing is uploaded — and a deliberately flawed example is prefilled so you can see every rule fire before trying your own prompt.

The rules target the defects that actually break prompts in production. Contradictory absolutes are the worst offender: when one line says "always respond in JSON" and a later line says "never use JSON for errors", the model picks a side unpredictably, and you see it as flaky behaviour you cannot reproduce. Vague verbs — handle, ensure, manage, deal with — read fine to humans but give the model nothing testable; "handle errors gracefully" produces wildly different behaviour than "on error, return {"error": message} and nothing else". A missing output-format section is the most common omission of all: if you do not specify the shape of the answer, the model invents one, and it will invent a different one tomorrow.

The bloat check estimates tokens at one per four characters — the standard English approximation — and flags prompts past 2,000 tokens. That threshold is not arbitrary: long prompts cost you on every call, and instruction-adherence research consistently shows mid-prompt rules getting less attention than rules at the start and end. Duplicate-line detection catches the copy-paste residue that accumulates as prompts are edited by committee, and the placeholder check finds the TODO and [INSERT X] fragments that have genuinely shipped to production more times than anyone admits.

Treat the grade as a pre-flight check, not a substitute for evals. A prompt can pass every structural rule and still produce poor outputs for your task — that requires testing against real inputs. But the inverse holds reliably: a prompt with contradictions and shipped placeholders will misbehave no matter how good the underlying idea is. Lint first, fix what is mechanical, then spend your eval budget on the judgment calls that need a human.

Frequently asked questions

What rules does the linter check?

Six heuristic rules, all running locally in your browser: contradictory absolutes (an 'always X' and a 'never X' targeting the same verb), vague verbs that give the model no testable instruction (handle, ensure, deal with, be smart about), a missing output-format section, token bloat (prompts over roughly 2,000 tokens get penalised), duplicate lines that waste context, and leftover TODO/FIXME/placeholder text that was never meant to ship.

What does the grade actually mean?

It is a weighted score, not a judgment of your product. An A means no findings worth acting on; B-C means a handful of fixable issues like a vague verb or one duplicate; D-F means structural problems — contradictions or shipped placeholders — that measurably degrade model behaviour. Contradictions weigh the most because the model resolves them unpredictably, which shows up as inconsistent output between runs.

Is there a sweet spot for system prompt length?

For most production prompts, 200-800 tokens covers role, constraints and output format with room to spare. Beyond about 2,000 tokens, instruction-following degrades: the model attends less to mid-prompt rules and you pay the full token cost on every single call. If you genuinely need more, structure it with headings and put the non-negotiable rules at the start and end, where attention is strongest.

How should I test a system prompt beyond linting?

Linting catches structural defects; behaviour needs evals. Build a small set of representative inputs — including adversarial ones — and check outputs against expectations every time the prompt changes, exactly like a regression test suite. Even ten well-chosen cases catch most prompt regressions. Pin the model version while iterating, since provider model updates change behaviour underneath an unchanged prompt.

Does my prompt leave the browser?

No. All six rules are regex and heuristic checks implemented in client-side JavaScript — there is no network call, no logging and no storage. You can verify this in your browser's network tab while linting. This matters because production system prompts often contain proprietary product logic that should not transit a third-party server just to get a quality check.

Turn this analysis into a live rule with the FORG rule engine — route models and enforce limits automatically.

Explore the rule engine