Get started

Run on your hardware.

First 50 submitters get a permanent Beta #1-50 badge next to their nickname.

Three minutes. No account. Point the CLI at your local model server — Ollama, LM Studio, MLX, llama.cpp — and it runs 34 standardized tasks, auto-detects your hardware, and lands your run on the public leaderboard alongside others running the same model on different rigs.

npx @pipelinescore/cli

That's the whole command. The CLI probes localhost for your model server, lists the models it's actually serving, asks for an optional leaderboard nickname, auto-detects your hardware, runs, and submits. Scripting it, or running non-interactively? Use explicit flags — every local server keeps its OpenAI-compatible API under /v1:

npx @pipelinescore/cli run \
  --provider local --endpoint http://localhost:11434/v1 \
  --model llama3.2 --user yourname
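
Curious what the probe step might look like? Here is a minimal Python sketch that asks an OpenAI-compatible server which models it is serving. It is an illustration under assumptions, not the CLI's actual code: models_url, parse_model_ids, and list_models are hypothetical names, and the response is assumed to follow the standard OpenAI list shape ({"data": [{"id": ...}]}).

```python
import json
import urllib.request


def models_url(endpoint: str) -> str:
    """Build the model-listing URL from a base endpoint such as http://localhost:11434/v1."""
    return endpoint.rstrip("/") + "/models"


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style list response: {"data": [{"id": "..."}]}."""
    return [item["id"] for item in payload.get("data", []) if "id" in item]


def list_models(endpoint: str, timeout: float = 2.0) -> list[str]:
    """Return the model ids the server reports, or an empty list if it is unreachable."""
    try:
        with urllib.request.urlopen(models_url(endpoint), timeout=timeout) as resp:
            return parse_model_ids(json.load(resp))
    except (OSError, ValueError):  # connection refused, timeout, or a non-JSON body
        return []
```

Calling list_models("http://localhost:11434/v1") against a running server should return the ids you can pass to --model. An empty list usually means nothing is listening on that port.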

Local (recommended), free

Run on your machine

✓ No API key, no provider account
✓ Zero inference cost
✓ Compare hardware (M3 Max vs RTX 4090 vs CPU-only)
✓ Reproducible — your tokens, your weights

Cloud API, ~$1-50/run

Bring your own key

• Anthropic / OpenAI / OpenAI-compatible
• Your key stays on your machine and is never sent to us
• Set a spending cap at the provider first
• Hardware tag = cloud-api

npx @pipelinescore/cli run \
  --provider anthropic \
  --model claude-opus-4-7
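
Before a cloud run, it can help to confirm the key is actually exported in your shell. Here is a small Python sketch, assuming the environment variable names this page mentions (ANTHROPIC_API_KEY and OPENAI_API_KEY). KEY_VARS and missing_key are hypothetical helpers, not part of the CLI.

```python
import os
from typing import Mapping, Optional

# Provider -> environment variable holding its key (names taken from this page).
KEY_VARS = {"anthropic": "ANTHROPIC_API_KEY", "openai": "OPENAI_API_KEY"}


def missing_key(provider: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the name of the env var that still needs to be set, or None if nothing is missing."""
    var = KEY_VARS.get(provider)  # local runs have no entry and need no key
    if var is None or env.get(var):
        return None
    return var
```

If missing_key("anthropic") returns a name, export that variable before running the command above.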

Don't have the CLI? Paste this into any AI.

Works with Claude, ChatGPT, Cursor, Codex, Gemini — anything that can run a shell command. The AI walks you through the benchmark.

Help me run a PipelineScore benchmark.

PipelineScore is a 34-task deterministic LLM benchmark with a public leaderboard at pipelinescore.ai. I want to run it and submit my result.

Please do the following:

1. Ask me which provider to use (anthropic, openai, or local).
2. Ask me which model id to benchmark (e.g. claude-opus-4-7, gpt-5.5-2026-04, qwen3.6-27b).
3. Ask me for a leaderboard nickname (alphanum + . _ -, 2-40 chars). If I've never set one, this becomes my public identity on the board.
4. Ask if I'm testing a customized variant (system prompt, LoRA adapter, persona, RAG setup). If yes, ask for a short config_tag (e.g. "lora-domain-finance", "system-prompt-coder", "temp-zero"). If it's a vanilla run, skip this.
5. If I picked a cloud provider, confirm the relevant API key is in my environment (ANTHROPIC_API_KEY or OPENAI_API_KEY). Local runs need no key at all.
6. Run this command in my shell:

   npx @pipelinescore/cli run \
     --provider <provider> \
     --model <model-id> \
     --user <nickname> \
     --config-tag <tag-if-any>

   For provider=local, also include --endpoint <openai-compatible-url>.

7. Show me the score card output verbatim — don't paraphrase the numbers. The output includes my tier (TRUNK/MAINLINE/FEEDER/TAP/DRIP), composite PipelineScore, per-category breakdown, and a share URL.

8. After the CLI prints the card, the CLI will auto-open my browser to https://pipelinescore.ai/users/<my-nickname> so I can see where I rank. Confirm it opened. If it didn't (headless terminal, SSH, locked-down browser), post the URL explicitly in your reply so I can click it. ALSO mention the full board at https://pipelinescore.ai/leaderboard/users — it's searchable by nickname.

9. Offer to compare against another model with a follow-up run, OR show me how a different --config-tag changes the score for the same base model.

Notes:
- The CLI is open source and runs locally; my API key never leaves my machine
- The public leaderboard is at pipelinescore.ai/leaderboard/users
- My profile after I run will be at pipelinescore.ai/users/<nickname>
- Submissions are rate-limited: 20/IP/hour, 100/nickname/day, 5/(nickname,model)/hour — don't retry on 429, just wait

Start with question 1.

Tested with: Claude Code, ChatGPT (GPT-5.5), Cursor, Gemini Pro, Codex. Your AI will ask for your provider, model, nickname, and optional config tag — then run the CLI for you. Takes ~3 minutes including API time.
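
If you would rather script steps 1 through 6 yourself, the sketch below assembles the same command from the collected answers. It uses only the flags shown on this page. build_command and NICKNAME_RE are hypothetical names, and reading "alphanum" as ASCII letters and digits is an assumption.

```python
import re
from typing import Optional

# Nickname rule from the prompt: alphanumerics plus . _ - and 2-40 characters.
# Treating "alphanum" as ASCII-only is an assumption.
NICKNAME_RE = re.compile(r"[A-Za-z0-9._-]{2,40}")


def build_command(provider: str, model: str, user: str,
                  config_tag: Optional[str] = None,
                  endpoint: Optional[str] = None) -> list[str]:
    """Assemble the argv for `npx @pipelinescore/cli run` from the collected answers."""
    if not NICKNAME_RE.fullmatch(user):
        raise ValueError(f"invalid nickname: {user!r}")
    if provider == "local" and not endpoint:
        raise ValueError("local runs need --endpoint <openai-compatible-url>")
    argv = ["npx", "@pipelinescore/cli", "run",
            "--provider", provider, "--model", model, "--user", user]
    if endpoint:
        argv += ["--endpoint", endpoint]
    if config_tag:  # vanilla runs skip the flag entirely
        argv += ["--config-tag", config_tag]
    return argv
```

The returned list can be handed to subprocess.run(argv) as-is. Passing a list rather than a shell string avoids quoting problems with unusual model ids.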

Power user? Install the skill or MCP.

If you live in Claude Code, Codex, OpenCode, OpenClaw, Cursor, or any MCP-compatible client, you can install PipelineScore as a tool. Your AI will run benchmarks for you without you ever leaving the editor.

Skill

Drop-in markdown

Single SKILL.md file that any AI reads at session start. Works in Claude Code, Codex, OpenCode, OpenClaw, Cursor.

mkdir -p ~/.claude/skills/pipelinescore
curl -L https://pipelinescore.ai/skills/pipelinescore/SKILL.md \
  -o ~/.claude/skills/pipelinescore/SKILL.md

MCP server

Three structured tools

run_benchmark, get_user_leaderboard, get_user_profile. Stdio transport, npm-installed.

// ~/.claude/settings.json
{
  "mcpServers": {
    "pipelinescore": {
      "command": "npx",
      "args": ["@pipelinescore/mcp"]
    }
  }
}
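
If the settings file already holds other entries, merging is safer than pasting over it. The Python sketch below adds the entry shown above to an existing file and leaves everything else alone. add_mcp_server is a hypothetical helper, and it assumes the file is strict JSON. Strict JSON has no comments, so leave out the // path line when pasting the snippet by hand.

```python
import json
from pathlib import Path


def add_mcp_server(settings_path: Path) -> dict:
    """Merge the pipelinescore entry into an existing settings file and return the result."""
    settings = json.loads(settings_path.read_text()) if settings_path.exists() else {}
    servers = settings.setdefault("mcpServers", {})  # keep any servers already configured
    servers["pipelinescore"] = {"command": "npx", "args": ["@pipelinescore/mcp"]}
    settings_path.write_text(json.dumps(settings, indent=2) + "\n")
    return settings
```

For example, add_mcp_server(Path.home() / ".claude" / "settings.json") updates the file named in the snippet above.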

Local servers we've tested

Anything with an OpenAI-compatible /v1/chat/completions endpoint works. These five are the most common:

Ollama: --endpoint http://localhost:11434/v1

LM Studio: --endpoint http://localhost:1234/v1

llama.cpp server: --endpoint http://localhost:8080/v1

MLX-Omni / mlx_lm: --endpoint http://localhost:10240/v1

LiteLLM proxy: --endpoint http://localhost:4000/v1
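
When scripting against several of these servers, a tiny helper keeps the endpoints consistent. This Python sketch uses the ports from the list above and appends /v1 when it is missing, matching the explicit-flags example near the top of the page. The dictionary keys and function names are hypothetical labels, not identifiers the CLI knows about.

```python
# Default localhost ports from the list above; the keys are illustrative labels only.
DEFAULT_PORTS = {
    "ollama": 11434,
    "lm-studio": 1234,
    "llama.cpp": 8080,
    "mlx": 10240,
    "litellm": 4000,
}


def with_v1(endpoint: str) -> str:
    """Return the endpoint ending in exactly one /v1 path segment."""
    base = endpoint.rstrip("/")
    return base if base.endswith("/v1") else base + "/v1"


def default_endpoint(server: str) -> str:
    """Build the localhost endpoint for one of the servers listed above."""
    return with_v1(f"http://localhost:{DEFAULT_PORTS[server]}")
```

The ports are each server's usual default. If you started yours on a different port, pass your own URL through with_v1 instead.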

Frontier providers (cloud)

For when you want to benchmark the labs' flagships. Bring your own key.

Anthropic: --provider anthropic --model claude-opus-4-7

OpenAI: --provider openai --model gpt-5-5

Google (via openai-compat proxy): --provider google --model gemini-2-5-pro

Mistral (via openai-compat): --provider mistral --model mistral-large-2

What happens when you run it

01. Load the bundled test pack
The 34-task pack ships inside the npm package (integrity-checked at install) and is executed locally — the CLI never runs tasks fetched over the network. The backend is only asked whether a newer pack version exists.
02. Run your model
Each task is sent to the provider you chose. Inputs, outputs, timings, and token counts are captured locally.
03. Score locally, then submit
Grading happens on your machine — code is executed, everything else is exact-match or measured. The server stores and ranks your client-computed score; it never re-scores. No judge model, no API key.
04. See your card, share your run
A tier badge, a category breakdown, and a share link (pipelinescore.ai/s/…) for the run — your browser opens straight to your spot on the board.
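
To make the run-then-grade loop concrete, here is a rough Python sketch of the middle two steps: send each task to the model, time the call, and grade the answer locally by exact match. Everything in it is an assumption for illustration. The task fields (id, prompt, expected), the whitespace trimming, and the plain pass rate are hypothetical stand-ins. The real pack format, the code-execution grader, and the composite PipelineScore formula are not described on this page.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class TaskResult:
    task_id: str
    passed: bool
    seconds: float


def exact_match(output: str, expected: str) -> bool:
    """Grade by string equality after trimming surrounding whitespace (an assumed rule)."""
    return output.strip() == expected.strip()


def run_pack(tasks: list[dict], ask_model: Callable[[str], str]) -> list[TaskResult]:
    """Send each task prompt to the model, time the call, and grade the answer locally."""
    results = []
    for task in tasks:
        start = time.perf_counter()
        output = ask_model(task["prompt"])
        elapsed = time.perf_counter() - start
        results.append(TaskResult(task["id"], exact_match(output, task["expected"]), elapsed))
    return results


def pass_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks passed; a stand-in, not the real composite score."""
    return sum(r.passed for r in results) / len(results) if results else 0.0
```

Because ask_model is just a function from prompt to text, the same loop works for a local server or a cloud provider, and no judge model is involved at any point.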