Methodology

How the score works.

PipelineScore is a 0–100 number computed from five categories that mirror real-world LLM workloads, reported with a confidence band so that two statistically tied models read as tied. Every task is checked objectively, so the whole benchmark runs on your machine with no API key; the result is then uploaded to the board. Pick a weighting profile to rank for your use case.

The formula

category_score = mean(task scores, 0-100)  ± 95% confidence band
PipelineScore  = Σ (category_score × profile_weight)   (weights sum to 1.0)

Each category is the mean of its task scores with a 95% confidence band (Student-t for small samples), so a noisy category shows a wider band and the band narrows as you add tasks. Every task is scored deterministically on your machine: code is executed against hidden test cases, answers are exact-matched, and tool calls and extractions are JSON-matched. There is no judge model and no API key. Speed is throughput (tokens/sec), a rate, so it does not reward terse answers. The composite is computed per weighting profile (balanced, coding, agentic, local-first); the board leads with the per-category profile.
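To make the confidence band concrete, here is a minimal sketch of a category mean with a two-sided 95% Student-t band. It is an illustration under stated assumptions, not the benchmark's actual code: the function names are hypothetical, the critical values come from a standard t-table, and for degrees of freedom missing from the table the sketch falls back to the nearest smaller entry, which gives a slightly wider (more conservative) band.

```python
import math
import statistics

# Two-sided 95% Student-t critical values by degrees of freedom (standard table).
T_CRIT_95 = {
    1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
    6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228,
    15: 2.131, 20: 2.086, 30: 2.042,
}


def t_critical(df: int) -> float:
    """Return a 95% critical value, using the nearest tabulated df <= df.

    t shrinks as df grows, so rounding df down never understates the band.
    """
    if df < 1:
        raise ValueError("need at least 1 degree of freedom")
    return T_CRIT_95[max(k for k in T_CRIT_95 if k <= df)]


def category_band(task_scores: list[float]) -> tuple[float, float]:
    """Return (mean, half_width) for one category's task scores (0-100)."""
    n = len(task_scores)
    if n < 2:
        raise ValueError("a confidence band needs at least two task scores")
    mean = statistics.fmean(task_scores)
    # Standard error of the mean, from the sample standard deviation.
    sem = statistics.stdev(task_scores) / math.sqrt(n)
    return mean, t_critical(n - 1) * sem


# Hypothetical task scores, for illustration only.
few_tasks = [80.0, 90.0, 100.0, 70.0, 60.0]
more_tasks = few_tasks * 4  # same spread, four times as many tasks

mean_few, half_few = category_band(few_tasks)
mean_more, half_more = category_band(more_tasks)
print(f"{mean_few:.1f} ± {half_few:.1f} with {len(few_tasks)} tasks")
print(f"{mean_more:.1f} ± {half_more:.1f} with {len(more_tasks)} tasks")
```

With the same spread of scores, the second band is narrower than the first because both the standard error and the critical value shrink as the task count grows.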

The five categories

Code

28%

Code generation, debugging, refactoring — the model's code is executed against hidden test cases.

Reason

22%

Multi-step reasoning, math, logic — the final answer is exact-matched.

Tool Use

18%

Function-calling: the emitted tool call is JSON-matched against the expected structure.

RAG

17%

Grounded extraction and structured answering, JSON-matched against the context.

Speed

15%

Throughput (tokens/sec) measured on your hardware during the run.
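As a worked example of the composite, the sketch below applies the five percentages listed above as one profile's weights. Two things are assumptions rather than facts from this page: that these percentages belong to a single profile (the page does not say which), and that Speed has already been converted to a 0–100 category score (the page does not describe that conversion). The category scores are made up for illustration.

```python
# Weights taken from the percentages listed above; assumed to form one profile.
WEIGHTS = {
    "code": 0.28,
    "reason": 0.22,
    "tool_use": 0.18,
    "rag": 0.17,
    "speed": 0.15,
}


def pipeline_score(category_scores: dict[str, float],
                   weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of 0-100 category scores; weights must sum to 1.0."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("profile weights must sum to 1.0")
    if set(category_scores) != set(weights):
        raise ValueError("category scores and weights must cover the same categories")
    return sum(category_scores[name] * w for name, w in weights.items())


# Hypothetical category scores, for illustration only.
example = {"code": 80.0, "reason": 70.0, "tool_use": 90.0, "rag": 60.0, "speed": 50.0}
print(round(pipeline_score(example), 1))
# 80*0.28 + 70*0.22 + 90*0.18 + 60*0.17 + 50*0.15 = 71.7
```

Swapping in a different weights dictionary is all it takes to re-rank the same category scores under another profile.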

The tiers

Five tiers, named after the parts of a real industrial pipeline. A score maps to exactly one tier — no overlap, no ambiguity.

TRUNK

90 – 100

Top of the heap — main industrial line.

MAINLINE

75 – 89

Excellent and reliable service line.

FEEDER

60 – 74

Solid, capable secondary line.

TAP

40 – 59

Functional small-branch connection.

DRIP

0 – 39

Minimal flow — weak.
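The tier table can be read as a simple threshold lookup. The sketch below is one way to express it; the function name is hypothetical, and treating each tier's lower bound as an inclusive threshold (so a fractional score such as 89.5 lands in MAINLINE) is an assumption, since the table above only lists whole-number ranges.

```python
# (inclusive lower bound, tier name), highest tier first.
TIERS = [
    (90, "TRUNK"),
    (75, "MAINLINE"),
    (60, "FEEDER"),
    (40, "TAP"),
    (0, "DRIP"),
]


def tier_for(score: float) -> str:
    """Map a 0-100 PipelineScore to exactly one tier name."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for lower_bound, name in TIERS:
        if score >= lower_bound:
            return name
    raise AssertionError("unreachable: the last tier starts at 0")


for s in (100, 90, 89, 75, 74, 60, 59, 40, 39, 0):
    print(s, tier_for(s))
```

Because the thresholds are checked from the top down and the first match wins, every valid score resolves to one tier.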

Config tags — same model, different setups

A vanilla claude-opus-4-7 and the same model wrapped in your custom system prompt or LoRA adapter are not the same thing. They should not collide on the board.

The CLI accepts a --config-tag flag: a short, free-form string such as system-prompt-coder, lora-domain-finance, or temp-zero. The leaderboard shows a tagged run as a separate row from the base-model run, so you get an apples-to-apples view of how much your customization moved the score.

Base-model submissions leave the tag blank. The "base" marker on the Users Leaderboard means a default, unmodified run.
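One way to picture why tagged and base runs do not collide is to key each leaderboard row on the pair (model, config tag). The sketch below is illustrative only: the RowKey type, the helper names, and the scores are hypothetical and do not describe the board's real storage.

```python
from typing import NamedTuple


class RowKey(NamedTuple):
    """Hypothetical leaderboard key: one row per (model, config tag) pair."""
    model: str
    config_tag: str  # empty string means a base, unmodified run


def row_key(model: str, config_tag: str = "") -> RowKey:
    """Build a key; a blank or whitespace-only tag counts as a base run."""
    return RowKey(model=model, config_tag=config_tag.strip())


def display_label(key: RowKey) -> str:
    """Show the tag next to the model, or the base marker when the tag is blank."""
    if key.config_tag:
        return f"{key.model} [{key.config_tag}]"
    return f"{key.model} (base)"


# Made-up scores, for illustration only.
board: dict[RowKey, float] = {}
board[row_key("claude-opus-4-7")] = 71.7
board[row_key("claude-opus-4-7", "system-prompt-coder")] = 74.2

for key, score in board.items():
    print(display_label(key), score)
```

The two submissions share a model name but produce two distinct rows, which is the separation the config tag is meant to provide.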

Anti-cheat & integrity

Public set, private held-out set. The community task set is open and reproducible: it ships bundled with the CLI, so a run uses exactly the published tasks. A separate private, rotating held-out set is reserved for canonical lab-verified runs, so models cannot be trained on or pre-tuned against the tasks behind the trusted ranking.
Community vs lab-verified. Community submissions are computed locally by the CLI and are labeled community, not verified; treat them as directional. The trusted ranking is the lab-verified tier, which the lab runs against the private held-out set described above.
Layered rate limits. 20 submissions per IP per hour, 100 per nickname per day, and 5 per (nickname, model) per hour. A request over any limit receives a 429 response with a stamped JSON error identifying which layer fired, plus a Retry-After header.
Lab-verified flag. A small set of runs published by Charles & Roe under controlled conditions is tagged Lab on the board. Community submissions stay on the leaderboard permanently; the score is the score.
Nicknames aren't authenticated, by design. We don't verify that the person submitting as "karpathy" is Andrej. The leaderboard is signal-grade, not reputation-grade. If someone impersonates you, email privacy@pipelinescore.ai and we'll redact on a good-faith basis.
30-day transcript retention. Raw prompt and model-output bodies are kept for 30 days for audit, then overwritten. The score row is permanent. See Privacy for the full policy.