Run on your hardware.
Three minutes. No account. Point the CLI at your local model server — Ollama, LM Studio, MLX, llama.cpp — and tag your hardware. The CLI runs 25 standardized tasks, submits the result with your hardware tag, and your run lands on the public leaderboard alongside others running the same model on different rigs.
--provider local --endpoint http://localhost:11434 \
--model llama-3.3-70b --hardware-tag m3-max-128gb \
--user your-handle
Default Ollama endpoint above. Swap port for LM Studio (1234), llama.cpp (8080), MLX-Omni (10240), or any OpenAI-compatible server.
Run on your machine
- ✓ No API key, no provider account
- ✓ Zero inference cost
- ✓ Compare hardware (M3 Max vs RTX 4090 vs CPU-only)
- ✓ Reproducible — your tokens, your weights
Bring your own key
- • Anthropic / OpenAI / OpenAI-compatible
- • Your key, your machine, never sent to us (how)
- • Set a spending cap at the provider first
- • Hardware tag = cloud-api
npx @pipelinescore/cli run \ --provider anthropic \ --model claude-opus-4-7 \ --user your-handle
Don't have the CLI? Paste this into any AI.
Works with Claude, ChatGPT, Cursor, Codex, Gemini — anything that can run a shell command. The AI walks you through the benchmark.
Power user? Install the skill or MCP.
If you live in Claude Code, Codex, OpenCode, OpenClaw, Cursor, or any MCP-compatible client, you can install PipelineScore as a tool. Your AI will run benchmarks for you without you ever leaving the editor.
Drop-in markdown
Single SKILL.md file that any AI reads at session start. Works in Claude Code, Codex, OpenCode, OpenClaw, Cursor.
mkdir -p ~/.claude/skills/pipelinescore curl -L https://pipelinescore.ai/skills/\ pipelinescore/SKILL.md \ -o ~/.claude/skills/pipelinescore/SKILL.md
Three structured tools
run_benchmark, get_user_leaderboard, get_user_profile. Stdio transport, npm-installed.
// ~/.claude/settings.json
{
"mcpServers": {
"pipelinescore": {
"command": "npx",
"args": ["@pipelinescore/mcp"]
}
}
}Local servers we've tested
Anything with an OpenAI-compatible /v1/chat/completions endpoint works. These five are the most common:
Frontier providers (cloud)
For when you want to benchmark the labs' flagships. Bring your own key.
What happens when you run it
- 01
Fetch today's signed test pack
The CLI pulls the rotating 25-task pack from
api.pipelinescore.ai/v1/testpack— same pack for every submission today, different tomorrow. - 02
Run your model
Each task is sent to the provider you chose. Inputs, outputs, timings, and token counts are captured locally.
- 03
Submit for re-judgment
Transcripts are uploaded and re-graded server-side by a held-out judge model. Your local score is provisional; the server's is canonical.
- 04
See your card
You get a tier badge, a category breakdown, and a shareable URL — ready for the leaderboard.