About

The missing axis: where the model is running.

Every other public LLM leaderboard tells you which model is best in the abstract. LMArena measures preference votes. Artificial Analysis aggregates centrally-run benchmarks. Academic tests (MMLU, SWE-bench, Terminal-Bench) are lab-controlled and lose signal fast as models train on them.

None of them answer the question that actually matters to anyone running a model: how well does this model run on the hardware I have? Llama 4 on an M3 Max is not the same product as Llama 4 on a 3090 or Llama 4 on an A100. Same weights, three radically different real-world experiences.

PipelineScore is the local-first benchmark. You point a CLI at your local model server (Ollama, LM Studio, MLX, llama.cpp), tag your hardware, and the 34-task suite runs against your rig. The score and tier land on a public leaderboard that's grouped by (model, hardware) — so the M3 Max + Llama 4 cohort gets its own row, separate from the 3090 + Llama 4 cohort.
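As a rough illustration of what "grouped by (model, hardware)" means, here is a minimal Python sketch. It is not PipelineScore's actual code: the record fields (`model`, `hardware`, `score`), the function name `group_by_cohort`, the choice of a median as the row summary, and the sample numbers are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import median


def group_by_cohort(runs):
    """Collapse individual runs into one leaderboard row per (model, hardware) pair."""
    cohorts = defaultdict(list)
    for run in runs:
        # The cohort key is the pair, so the same weights on different
        # hardware never share a row.
        cohorts[(run["model"], run["hardware"])].append(run["score"])
    return {
        key: {"runs": len(scores), "median_score": median(scores)}
        for key, scores in cohorts.items()
    }


# Placeholder submissions; the scores are invented for illustration only.
runs = [
    {"model": "llama-4", "hardware": "M3 Max", "score": 70.0},
    {"model": "llama-4", "hardware": "M3 Max", "score": 74.0},
    {"model": "llama-4", "hardware": "RTX 3090", "score": 61.0},
]

rows = group_by_cohort(runs)
for (model, hardware), row in sorted(rows.items()):
    print(model, hardware, row)
```

Keying on the pair rather than on the model alone is the whole point: two submissions are comparable only when both the weights and the machine match.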

Five categories: code, reasoning, tool use, RAG, and speed. Every one is scored deterministically: code is executed, and the rest is either exact-match or measured. Five tiers, TRUNK through DRIP, borrowed from the language of actual pipelines. Frontier API runs (Anthropic, OpenAI) work too, but the unique value lives in the local-hardware comparisons no one else publishes.
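To make "deterministic" concrete, here is a hedged Python sketch of the two grading styles named above. The real graders are not shown on this page, so the function names (`exact_match`, `code_passes`), the whitespace normalization, and the 10-second timeout are assumptions, not the project's implementation.

```python
import subprocess
import sys


def exact_match(answer: str, expected: str) -> bool:
    """Pass only if the answer equals the expected string, ignoring outer whitespace."""
    return answer.strip() == expected.strip()


def code_passes(candidate_code: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Run the candidate code plus its tests in a fresh interpreter.

    The task passes only if the process exits with status 0, meaning
    no assertion failed and no exception was raised.
    """
    program = candidate_code + "\n" + test_code
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


# Hypothetical model output for a code task, followed by its hidden test.
candidate = "def add(a, b):\n    return a + b"
test = "assert add(2, 3) == 5"

print(exact_match(" 42\n", "42"))
print(code_passes(candidate, test))
```

Neither check involves a judge model or a human vote, which is why the same output always earns the same result.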

Open source under Apache 2.0. Code at github.com/drewmattie-code/pipelinescore. Built by an operator who runs LLMs on a small fleet of machines and wanted a way to compare them.
