LLM benchmarks · v3 testpack · deterministic · no API key

Benchmark LLMs on YOUR hardware.

99 models ranked from 428 real runs on real rigs — M3 Max vs RTX 4090 vs A100 vs cloud API. Same 34 deterministic tasks, scored entirely on your machine, one 0–100 score. The only public LLM board that ranks where the model runs, not just which model it is.

$ npx @pipelinescore/cli

That's the whole command. It finds your Ollama / LM Studio / llama.cpp / MLX server, lists your models, auto-detects your hardware, and walks you through the rest.

Run on your hardware Rank the rigs →See where you'd rank →

Submissions 428Users 52Models 99Testpack v3Tasks 34Scored locally · server never re-scores

The model board

Best run per model. Click a column to re-rank, pick two rows to go head-to-head.

How scores are computed →

Weighting

52 models · sorted by Balanced composite (high → low) · pick any two rows to compare

#	Model	PipelineScore ▾	Code	Reason	Tool Use	RAG	Speed	Runs	Tier
1	openai/gpt-oss-20blocal	93.2	100.0	80.0	87.5	100.0	99.3	2	TRUNK
2	DeepSeek R1 671B-A37BdeepseekLab	89.0	85.0	88.9	90.8	89.8	93.6	4	MAINLINE
3	Qwen 3 235B-A22B MoEalibaba	87.9	89.5	91.5	88.7	89.7	76.6	4	MAINLINE
4	DeepSeek Coder V2 236BdeepseekLab	87.6	89.5	82.7	85.7	89.0	92.2	4	MAINLINE
5	DeepSeek V3 671B-A37Bdeepseek	87.3	82.5	89.6	89.1	90.7	86.8	4	MAINLINE
6	Qwen 3 72B InstructalibabaLab	86.0	87.7	84.9	82.7	90.8	82.8	4	MAINLINE
7	Hermes 3 Llama 3.1 405Bnous	85.1	85.3	81.5	87.3	84.0	88.9	4	MAINLINE
8	DeepSeek V4deepseek	84.9	83.7	86.2	80.7	86.7	88.5	12	MAINLINE
9	Qwen 2.5 72B Instructalibaba	84.6	81.5	85.3	87.2	81.7	89.4	4	MAINLINE
10	DeepSeek R1 Distill Qwen 32Bdeepseek	83.9	80.8	86.5	81.7	90.0	81.7	4	MAINLINE
11	mlx-community/Qwen3.6-35B-A3B-4bitlocal	83.7	75.0	80.0	87.5	100.0	82.5	1	MAINLINE
12	Llama 3.3 70B Instructmeta	83.2	87.2	79.8	78.8	85.8	83.2	4	MAINLINE
13	Llama 4 70B Instructmeta	82.3	84.7	81.7	76.3	87.0	80.4	4	MAINLINE
14	Qwen 3.6 72Balibaba	82.0	78.6	81.4	88.4	79.3	84.4	12	MAINLINE
15	Mixtral 8x22B InstructmistralLab	81.9	84.7	79.7	78.7	82.2	83.2	4	MAINLINE
16	Qwen 3 32B InstructalibabaLab	81.8	87.1	82.0	81.0	79.0	76.0	4	MAINLINE
17	Command AcohereLab	81.7	80.9	78.3	85.2	79.8	86.0	4	MAINLINE
18	Qwen 2.5 Coder 32Balibaba	81.2	76.3	84.5	86.0	82.4	78.3	4	MAINLINE
19	Qwen 2.5 32B Instructalibaba	81.1	82.9	78.5	85.8	84.8	71.4	4	MAINLINE
20	DBRX Instruct 132B-MoEdatabricks	81.1	82.6	75.6	84.6	85.8	76.7	4	MAINLINE
21	Mistral Large 2mistral	81.0	82.4	82.8	78.1	83.8	76.3	12	MAINLINE
22	gemma4:12b-it-qat_gpulocal	80.9	87.5	80.0	87.5	100.0	40.6	1	MAINLINE
23	Qwen 3 14B Instructalibaba	80.7	80.4	79.1	77.8	82.9	84.6	4	MAINLINE
24	WizardLM 2 8x22BmicrosoftLab	80.3	81.4	82.7	76.1	82.0	78.0	4	MAINLINE
25	Qwen 2.5 VL 72Balibaba	80.2	79.3	76.1	78.4	84.3	85.7	4	MAINLINE
26	Hermes 3 Llama 3.1 70Bnous	79.9	83.7	81.5	78.0	73.8	79.8	4	MAINLINE
27	Kimi K2 InstructmoonshotLab	79.8	77.1	86.5	77.5	80.3	77.3	4	MAINLINE
28	Devstral Small 24Bmistral	79.3	76.9	79.9	83.4	76.0	81.7	4	MAINLINE
29	Gemma 3 27B ITgoogle	78.2	82.6	79.2	75.7	78.6	70.8	4	MAINLINE
30	Llama 3.1 70B Instructmeta	77.8	83.7	79.7	79.5	74.1	66.1	4	MAINLINE
31	Qwen 2.5 14B InstructalibabaLab	77.6	79.0	73.4	81.8	73.4	80.8	4	MAINLINE
32	DeepSeek V2.5deepseekLab	77.5	76.1	84.5	78.2	81.4	64.9	4	MAINLINE
33	GLM 4 PluszhipuLab	77.4	71.3	82.6	79.8	79.7	75.9	4	MAINLINE
34	Codestral 22BmistralLab	77.2	80.7	77.9	74.0	79.1	71.6	4	MAINLINE
35	DeepSeek Coder V2 16Bdeepseek	76.2	81.7	75.7	73.5	78.5	67.3	4	MAINLINE
36	Yi 1.5 34B ChatyiLab	76.2	81.4	74.3	81.0	73.5	66.8	3	MAINLINE
37	Magnum V4 72Bcommunity	76.0	72.7	76.2	77.7	74.3	81.7	4	MAINLINE
38	Qwen 2.5 Coder 7Balibaba	75.8	77.2	74.0	75.3	76.0	76.4	1	MAINLINE
39	DeepSeek R1 Distill Llama 8BdeepseekLab	75.5	72.7	71.1	81.1	76.3	79.7	2	MAINLINE
40	Gemma 2 27B ITgoogle	75.0	72.5	75.9	76.3	74.9	76.6	3	FEEDER
41	Gemma 3 12B ITgoogle	74.9	74.4	75.7	73.4	72.3	79.6	3	MAINLINE
42	Llama 4 405Bmeta	74.2	71.6	77.1	73.9	77.6	71.6	12	FEEDER
43	Aya 23 35Bcohere	73.4	73.9	72.9	71.2	73.6	75.3	1	FEEDER
44	Mistral Small 24B Instructmistral	73.1	76.8	71.8	71.8	72.7	70.3	1	FEEDER
45	Command R+cohere	73.1	70.8	75.8	77.4	72.8	68.8	5	FEEDER
46	Phi 3.5 MoE 42Bmicrosoft	73.0	72.4	72.9	73.0	77.5	69.1	1	FEEDER
47	L3 70B Euryalecommunity	72.6	74.7	76.0	74.2	71.5	63.1	1	FEEDER
48	InternLM 2.5 20B Chatinternlm	72.0	70.7	71.9	76.1	74.4	67.1	2	FEEDER
49	Phi 4 14BmicrosoftLab	72.0	72.3	74.0	70.0	69.5	73.7	2	FEEDER
50	StarCoder2 15BbigcodeLab	71.8	70.7	71.7	73.5	72.1	71.7	1	FEEDER
51	Qwen 3 8B InstructalibabaLab	71.8	70.3	69.1	74.3	72.6	74.4	1	FEEDER
52	Grok-1 314Bxai	70.9	69.7	69.2	67.7	72.2	77.8	1	FEEDER

Popular matchups

The rivalries worth settling. Every pair opens a live head-to-head.

openai/gpt-oss-20b vs DeepSeek R1 671B-A37B DeepSeek R1 671B-A37B vs Qwen 3 235B-A22B MoE Qwen 3 235B-A22B MoE vs DeepSeek V3 671B-A37B DeepSeek V3 671B-A37B vs DeepSeek Coder V2 236B DeepSeek Coder V2 236B vs Qwen 2.5 72B Instruct Qwen 3 235B-A22B MoE vs Hermes 3 Llama 3.1 405B Hermes 3 Llama 3.1 405B vs Llama 3.3 70B Instruct

m5-max-48gb vs m2-ultra-192gb m2-ultra-192gb vs b200-192gb b200-192gb vs dgx-h100 dgx-h100 vs a100-80gb

Five measures. One number.

Code is executed, reasoning is exact-match, tool use and RAG are JSON-match, speed is measured throughput. No judge model, no rubric, no API key.

Code

28%

Reason

22%

Tool Use

18%

RAG

17%

Speed

15%

The tiers

Every score maps to a tier, named the way pipelines are: from TRUNK (top of the network) down to DRIP.

TRUNK

90–100

MAINLINE

75–89

FEEDER

60–74

TAP

40–59

DRIP

0–39

Hardware board

Which rig wins?

Every hardware tag ranked by its best run — Apple Silicon vs consumer GPUs vs datacenter cards vs CPU-only.

Users board

Every run. Every user.

The community board — sortable, filterable by provider, tier, and hardware. Find someone who ran your model on your rig.

STEP 01

Point the CLI at your model

Ollama, LM Studio, MLX, llama.cpp — anything OpenAI-compatible. Local runs need no account and no API key.

STEP 02

Tag your hardware

--hardware-tag m3-max-128gb / rtx-4090-24gb / a100-80gb. Same model on different rigs gets ranked separately — that's the point.

STEP 03

Land on the board

A deterministic 0–100 PipelineScore across 34 tasks, computed on your machine, plus a tier badge and a public spot on the hardware-aware leaderboard.