
Benchmark LLMs on YOUR hardware.
M3 Max vs RTX 4090 vs A100 vs cloud API: same 25-task suite, deterministic score, your environment. PipelineScore is the only public LLM leaderboard that ranks by where a model runs, not just which model it is.
Top of the leaderboard
Best run per model across all submissions.
Point the CLI at your model
Ollama, LM Studio, MLX, llama.cpp: anything with an OpenAI-compatible endpoint. Or pass --provider anthropic or --provider openai with your own API key.
Tag your hardware
--hardware-tag m3-max-128gb / rtx-4090-24gb / a100-80gb. The same model on different rigs is ranked separately, and different models on the same rig can be compared apples to apples.
Get your score + tier
A deterministic 0–100 PipelineScore across 25 tasks, a tier badge, total tokens used, average latency, and a spot on the public hardware-aware leaderboard.
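The ranking rule described above (best run per model, with each hardware tag ranked separately) can be sketched roughly as follows. This is only an illustration: the `Run` record, the `best_runs` function, the model names, and the scores are hypothetical placeholders, not PipelineScore's actual schema or results.

```python
from typing import NamedTuple


class Run(NamedTuple):
    """One submitted benchmark run (hypothetical record shape)."""
    model: str
    hardware_tag: str
    score: float  # 0-100 score for the run


def best_runs(runs):
    """Keep the best run per (model, hardware tag), highest score first."""
    best = {}
    for run in runs:
        key = (run.model, run.hardware_tag)
        # A later run replaces the stored one only if it scores strictly higher.
        if key not in best or run.score > best[key].score:
            best[key] = run
    return sorted(best.values(), key=lambda r: r.score, reverse=True)


# Placeholder submissions; the numbers are invented for illustration.
runs = [
    Run("model-a", "m3-max-128gb", 71.0),
    Run("model-a", "m3-max-128gb", 74.5),
    Run("model-a", "rtx-4090-24gb", 78.0),
    Run("model-b", "rtx-4090-24gb", 65.0),
]
board = best_runs(runs)
for rank, run in enumerate(board, start=1):
    print(rank, run.model, run.hardware_tag, run.score)
```

Keying on the (model, hardware tag) pair is what keeps the same model on two rigs as two separate leaderboard entries.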
Six categories. One number.
We weight each category to mirror real-world LLM usage — code first, reasoning close behind, the rest tuned for everyday operator workloads.
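As a rough sketch of how six category scores could collapse into one number, the example below computes a weighted average. The weights and the four placeholder category names are assumptions chosen only to reflect "code first, reasoning close behind"; the real PipelineScore weights and category names are not stated here.

```python
# Hypothetical weights (summing to 100). Only "code" and "reasoning" are
# named in the text; the other four names are placeholders.
WEIGHTS = {
    "code": 30,
    "reasoning": 25,
    "category_c": 15,
    "category_d": 10,
    "category_e": 10,
    "category_f": 10,
}


def combined_score(category_scores, weights=WEIGHTS):
    """Weighted average of per-category scores, each on a 0-100 scale."""
    if set(category_scores) != set(weights):
        raise ValueError("category scores must cover exactly the weighted categories")
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * category_scores[name] for name in weights)
    return weighted_sum / total_weight


# Invented example: strong on code, average everywhere else.
example = {name: 50 for name in WEIGHTS}
example["code"] = 80
print(combined_score(example))
```

Because the calculation is plain arithmetic over fixed weights, the same category scores always produce the same combined number.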