PipelineScore
Local-first LLM benchmark · v1

Benchmark LLMs on YOUR hardware.

M3 Max vs RTX 4090 vs A100 vs cloud API — same 25-task suite, deterministic score, your environment. PipelineScore is the only public LLM leaderboard that ranks where the model is running, not just which model it is.

Submissions424
Users on the board50
Models benchmarked96
STEP 01

Point the CLI at your model

Ollama, LM Studio, MLX, llama.cpp — anything with an OpenAI-compatible endpoint. Or pass --provider anthropic / openai with your own key.

STEP 02

Tag your hardware

--hardware-tag m3-max-128gb / rtx-4090-24gb / a100-80gb. Same model on different rigs gets ranked separately. Same rig with different models lets you compare apples to apples.

STEP 03

Get your score + tier

A deterministic 0–100 PipelineScore across 25 tasks, a tier badge, total tokens used, average latency, and a spot on the public hardware-aware leaderboard.

Six categories. One number.

We weight each category to mirror real-world LLM usage — code first, reasoning close behind, the rest tuned for everyday operator workloads.

Code
25%
Reason
20%
Write
15%
Tool Use
15%
RAG
12%
Speed
13%