How the score works.
PipelineScore is a deterministic 0–100 number, computed from six weighted categories that mirror real-world LLM workloads. Same test pack, same judge, same math — every model gets the same shot.
The formula
PipelineScore = Σ (category_score × weight)
category_score = normalized 0–100 against the v1 anchor
weights sum to 1.0
Deterministic tests (code execution, exact-match, schema validation) score pass/fail with a difficulty multiplier. Subjective tests (writing quality, summarization) are scored 0–10 against a rubric by a held-out judge model.
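As a rough illustration of the arithmetic, here is a minimal Python sketch. The category names and weights are hypothetical placeholders, since the real six categories and their weights are not listed here. The v1 anchor normalization is also not specified, so the sketch stands in a simple ratio (weighted points earned over weighted points possible) for deterministic tests and a scaled mean for rubric scores.

```python
# Hypothetical category names and weights, for illustration only.
WEIGHTS = {
    "coding": 0.25,
    "reasoning": 0.20,
    "extraction": 0.15,
    "writing": 0.15,
    "summarization": 0.15,
    "tool_use": 0.10,
}


def deterministic_score(results):
    """results: list of (passed, difficulty_multiplier) pairs.

    Assumed stand-in for normalization: weighted points earned divided by
    weighted points possible, scaled to 0-100.
    """
    possible = sum(mult for _, mult in results)
    earned = sum(mult for passed, mult in results if passed)
    return 100.0 * earned / possible if possible else 0.0


def subjective_score(rubric_scores):
    """rubric_scores: list of 0-10 judge scores; returns their mean on a 0-100 scale."""
    if not rubric_scores:
        return 0.0
    return 10.0 * sum(rubric_scores) / len(rubric_scores)


def pipeline_score(category_scores, weights=WEIGHTS):
    """Weighted sum of per-category 0-100 scores; weights must sum to 1.0."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    if set(category_scores) != set(weights):
        raise ValueError("category_scores must cover exactly the weighted categories")
    return sum(category_scores[name] * w for name, w in weights.items())
```

Because the weights sum to 1.0, a model that scores the same value in every category gets that value as its PipelineScore, and the result always stays within 0–100.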
The six categories
The tiers
Five tiers, named after the parts of a real industrial pipeline. A score maps to exactly one tier — no overlap, no ambiguity.
Config tags — same model, different setups
A vanilla claude-opus-4-7 and the same model wrapped in your custom system prompt or LoRA adapter are not the same thing. They should not collide on the board.
The CLI accepts a --config-tag flag — a short, free-form string like system-prompt-coder, lora-domain-finance, or temp-zero. The leaderboard shows it as a separate row from the base-model run, and you get a real apples-to-apples view of how much your customization actually moved the score.
Base-model submissions leave the tag blank. The "base" marker on the Users Leaderboard means a default, unmodified run.
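One way to picture why tagged and untagged runs never collide is to key each leaderboard row on the pair (model, config tag). The sketch below is an assumption about how such a key could work, not the actual board schema; `RowKey`, `row_key`, and `delta_vs_base` are hypothetical names.

```python
from typing import NamedTuple


class RowKey(NamedTuple):
    model: str
    config_tag: str  # empty string means a base, unmodified run


def row_key(model, config_tag=None):
    """Build a row key; a missing or blank tag is treated as the base run."""
    return RowKey(model, (config_tag or "").strip())


def is_base(key):
    return key.config_tag == ""


def delta_vs_base(board, model):
    """board: dict mapping RowKey -> score.

    Returns {config_tag: score - base_score} for every tagged run of `model`.
    Raises KeyError if the model has no base run on the board.
    """
    base = board[RowKey(model, "")]
    return {
        key.config_tag: score - base
        for key, score in board.items()
        if key.model == model and not is_base(key)
    }
```

With this keying, a run tagged temp-zero and the base run of the same model occupy two separate rows, and the difference between them is the effect of the customization.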
Anti-cheat & integrity
- Public taxonomy, private prompts. The six categories and their task types are open. The exact prompts rotate daily and are HMAC-signed per-day, so a cached pack from yesterday won't pass today's server-side check.
- Server-side re-judgment. Every submission is re-graded centrally by a held-out judge model (Claude Haiku 4.5). The CLI's local score is provisional — only the server's number lands on the board.
- Layered rate limits. 20 submissions per IP per hour, 100 per nickname per day, and 5 per (nickname, model) per hour. 429s return a stamped JSON error identifying which layer fired, with a Retry-After header.
- Lab-verified flag. A small set of runs published by Charles & Roe under controlled conditions are tagged Lab on the board. Community submissions stay on the leaderboard permanently: the score is the score.
- Nicknames aren't authenticated, by design. We don't verify that the person submitting as "karpathy" is Andrej. The leaderboard is signal-grade, not reputation-grade. If someone impersonates you, email privacy@pipelinescore.ai and we'll redact the entry on a good-faith basis.
- 30-day transcript retention. Raw prompt + model-output bodies are kept for 30 days for audit, then overwritten. The score row is permanent. See Privacy for the full policy.
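Two of the checks above, the per-day HMAC signature and the layered rate limits, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the server's implementation: the message layout (date, newline, pack bytes), the use of SHA-256, the sliding-window behavior, and the field names in the 429 payload are all guesses. Only the three limits themselves come from the text.

```python
import hashlib
import hmac
import math
import time
from collections import defaultdict, deque

# (layer name, max submissions, window in seconds); limits taken from the text.
LAYERS = [
    ("ip", 20, 3600),
    ("nickname", 100, 86400),
    ("nickname_model", 5, 3600),
]


def sign_pack(secret, pack_bytes, day):
    """HMAC-SHA256 over the date plus the pack (assumed message layout)."""
    message = day.encode("utf-8") + b"\n" + pack_bytes
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_pack(secret, pack_bytes, signature, today):
    """A pack signed for another day fails, because the date is part of the message."""
    expected = sign_pack(secret, pack_bytes, today)
    return hmac.compare_digest(expected, signature)


class LayeredLimiter:
    """Sliding-window limiter (assumption: the real window type is not stated)."""

    def __init__(self):
        self._hits = defaultdict(deque)  # (layer, key) -> accepted timestamps

    def check(self, ip, nickname, model, now=None):
        """Return None if allowed, else a 429-style dict naming the layer that fired."""
        now = time.time() if now is None else now
        keys = {"ip": ip, "nickname": nickname, "nickname_model": (nickname, model)}
        for layer, limit, window in LAYERS:
            hits = self._hits[(layer, keys[layer])]
            while hits and hits[0] <= now - window:
                hits.popleft()  # drop submissions that have left the window
            if len(hits) >= limit:
                retry_after = math.ceil(hits[0] + window - now)
                return {"status": 429, "layer": layer, "retry_after": retry_after}
        for layer, _, _ in LAYERS:
            self._hits[(layer, keys[layer])].append(now)  # record accepted runs only
        return None
```

In this sketch the `retry_after` value is the number of seconds until the oldest counted submission leaves the window, which is what a Retry-After header would carry. Rejected attempts are not counted against the limit; a stricter server could choose to count them.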