PipelineScore
Methodology

How the score works.

PipelineScore is a deterministic 0–100 number, computed from six weighted categories that mirror real-world LLM workloads. Same test pack, same judge, same math — every model gets the same shot.

The formula

PipelineScore = Σ (category_score × weight)

category_score = normalized 0–100 against the v1 anchor
weights sum to 1.0

Deterministic tests (code execution, exact-match, schema validation) score pass/fail with a difficulty multiplier. Subjective tests (writing quality, summarization) are scored 0–10 against a rubric by a held-out judge model.

The six categories

  • Code (25%). Code generation, debugging, refactoring, and test writing, all as runnable, gradeable code.
  • Reason (20%). Multi-step reasoning, math, logic, strict instruction following.
  • Write (15%). Drafting, summarization, style adherence, tone control.
  • Tool Use (15%). Function calling correctness, parameter selection, orchestration, graceful refusal.
  • RAG (12%). Grounded answering, citation accuracy, faithfulness, refusal to fabricate.
  • Speed (13%). Latency (p50, p95) and tokens/sec under a standardized load.
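
As a rough illustration of the formula, here is a minimal Python sketch of the weighted sum using the six published weights. The function name, the dictionary keys, and the example category scores are made up for illustration; the sketch also assumes each category score has already been normalized to 0–100, since the normalization against the v1 anchor is not spelled out here.

```python
# Weights copied from the category list; they sum to 1.0.
WEIGHTS = {
    "code": 0.25,
    "reason": 0.20,
    "write": 0.15,
    "tool_use": 0.15,
    "rag": 0.12,
    "speed": 0.13,
}


def pipeline_score(category_scores: dict[str, float]) -> float:
    """Weighted sum of category scores that are already normalized to 0-100."""
    missing = WEIGHTS.keys() - category_scores.keys()
    if missing:
        raise ValueError(f"missing category scores: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= category_scores[name] <= 100.0:
            raise ValueError(f"{name} score must be within 0-100")
    return sum(category_scores[name] * weight for name, weight in WEIGHTS.items())


# Illustrative inputs only, not real benchmark results.
example = {"code": 80, "reason": 70, "write": 90, "tool_use": 60, "rag": 75, "speed": 50}
print(round(pipeline_score(example), 1))  # 72.0
```

Because the weights sum to 1.0, the result stays on the same 0–100 scale as the inputs.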

The tiers

Five tiers, named after the parts of a real industrial pipeline. A score maps to exactly one tier — no overlap, no ambiguity.

  • TRUNK (90–100). Top of the heap: the main industrial line.
  • MAINLINE (75–89). Excellent and reliable service line.
  • FEEDER (60–74). Solid, capable secondary line.
  • TAP (40–59). Functional small-branch connection.
  • DRIP (0–39). Minimal flow; weak.
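
The score-to-tier mapping can be sketched as a lookup on lower bounds. This is a hedged sketch, not the board's actual code: the function name is hypothetical, and it assumes a fractional score such as 89.5 stays in the lower tier until it reaches the next lower bound, which the tier ranges do not state explicitly.

```python
# (lower bound, tier name), highest tier first.
TIERS = [
    (90, "TRUNK"),
    (75, "MAINLINE"),
    (60, "FEEDER"),
    (40, "TAP"),
    (0, "DRIP"),
]


def tier_for(score: float) -> str:
    """Return the single tier whose range contains the score."""
    if not 0 <= score <= 100:
        raise ValueError("score must be within 0-100")
    for lower_bound, name in TIERS:
        if score >= lower_bound:
            return name
    raise AssertionError("unreachable: the DRIP lower bound is 0")


print(tier_for(90))  # TRUNK
print(tier_for(74))  # FEEDER
```

Checking lower bounds from the top down means each score matches exactly one tier, with no gaps between ranges.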

Config tags — same model, different setups

A vanilla claude-opus-4-7 and the same model wrapped in your custom system prompt or LoRA adapter are not the same thing. They should not collide on the board.

The CLI accepts a --config-tag flag — a short, free-form string like system-prompt-coder, lora-domain-finance, or temp-zero. The leaderboard shows it as a separate row from the base-model run, and you get a real apples-to-apples view of how much your customization actually moved the score.

Base-model submissions leave the tag blank. The "base" marker on the Users Leaderboard means a default, unmodified run.
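
One way to picture the separation is a row key made of the model name plus the config tag, with a blank tag standing for the base run. The sketch below is an assumption about how such a key could work, not the leaderboard's real schema; `RowKey`, `row_key`, `group_rows`, and the scores are all hypothetical placeholders.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class RowKey(NamedTuple):
    model: str
    config_tag: str  # empty string stands for a base-model run


def row_key(model: str, config_tag: Optional[str] = None) -> RowKey:
    """Build a board key; a missing or blank tag maps to the base row."""
    return RowKey(model, (config_tag or "").strip())


def group_rows(submissions):
    """Group (model, config_tag, score) tuples into one score list per row."""
    rows = defaultdict(list)
    for model, tag, score in submissions:
        rows[row_key(model, tag)].append(score)
    return dict(rows)


# Placeholder scores for illustration only, not real results.
runs = [
    ("claude-opus-4-7", None, 70.0),
    ("claude-opus-4-7", "system-prompt-coder", 72.0),
    ("claude-opus-4-7", "system-prompt-coder", 74.0),
]
rows = group_rows(runs)
print(len(rows))  # 2
```

With this keying, the tagged runs never overwrite or mix with the base row, so the difference between the two rows reflects only the customization.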

Anti-cheat & integrity

  • Public taxonomy, private prompts. The six categories and their task types are open. The exact prompts rotate daily and are HMAC-signed per-day, so a cached pack from yesterday won't pass today's server-side check.
  • Server-side re-judgment. Every submission is re-graded centrally by a held-out judge model (Claude Haiku 4.5). The CLI's local score is provisional — only the server's number lands on the board.
  • Layered rate limits. 20 submissions per IP per hour, 100 per nickname per day, and 5 per (nickname, model) per hour. 429s return a stamped JSON error identifying which layer fired, with a Retry-After header.
  • Lab-verified flag. A small set of runs published by Charles & Roe under controlled conditions are tagged Lab on the board. Community submissions stay on the leaderboard permanently — the score is the score.
  • Nicknames aren't authenticated, by design. We don't verify that the person submitting as "karpathy" is Andrej. The leaderboard is signal-grade, not reputation-grade. If someone impersonates you, email privacy@pipelinescore.ai and we'll redact the entry on a good-faith basis.
  • 30-day transcript retention. Raw prompt + model-output bodies are kept for 30 days for audit, then overwritten. The score row is permanent. See Privacy for the full policy.
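
To show why a cached pack from yesterday fails today's check, here is a minimal sketch of per-day HMAC signing using Python's standard `hmac` module. The actual message layout, hash function, and key handling are not published, so everything here is an assumption: the functions `sign_pack` and `verify_pack`, the choice of SHA-256, and the idea of binding the ISO date into the signed message.

```python
import hashlib
import hmac
from datetime import date, timedelta


def sign_pack(secret: bytes, day: date, pack_bytes: bytes) -> str:
    """Sign a test pack together with its date (assumed message layout)."""
    message = day.isoformat().encode("utf-8") + b"\n" + pack_bytes
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_pack(secret: bytes, today: date, pack_bytes: bytes, signature: str) -> bool:
    """Server-side check: the signature must match today's date and this pack."""
    expected = sign_pack(secret, today, pack_bytes)
    # compare_digest avoids leaking match position through timing.
    return hmac.compare_digest(expected, signature)


secret = b"demo-secret-not-a-real-key"  # hypothetical key for the example
pack = b'{"prompts": ["..."]}'          # placeholder pack body
today = date(2025, 1, 2)
yesterday = today - timedelta(days=1)

stale_signature = sign_pack(secret, yesterday, pack)
print(verify_pack(secret, today, pack, stale_signature))                 # False
print(verify_pack(secret, today, pack, sign_pack(secret, today, pack)))  # True
```

Because the date is part of the signed message, the same pack bytes produce a different signature each day, and only the holder of the server-side secret can mint a valid one.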