PipelineScore
Get started

Run on your hardware.

First 50 submitters get a permanent Beta #1-50 badge next to their nickname.

Three minutes. No account. Point the CLI at your local model server — Ollama, LM Studio, MLX, llama.cpp — and tag your hardware. The CLI runs 25 standardized tasks, submits the result with your hardware tag, and your run lands on the public leaderboard alongside others running the same model on different rigs.

$ npx @pipelinescore/cli run \
--provider local --endpoint http://localhost:11434 \
--model llama-3.3-70b --hardware-tag m3-max-128gb \
--user your-handle

Default Ollama endpoint above. Swap port for LM Studio (1234), llama.cpp (8080), MLX-Omni (10240), or any OpenAI-compatible server.

Local — recommendedfree

Run on your machine

  • ✓ No API key, no provider account
  • ✓ Zero inference cost
  • ✓ Compare hardware (M3 Max vs RTX 4090 vs CPU-only)
  • ✓ Reproducible — your tokens, your weights
Cloud API~$1-50/run

Bring your own key

  • • Anthropic / OpenAI / OpenAI-compatible
  • • Your key, your machine, never sent to us (how)
  • • Set a spending cap at the provider first
  • • Hardware tag = cloud-api
npx @pipelinescore/cli run \
  --provider anthropic \
  --model claude-opus-4-7 \
  --user your-handle

Don't have the CLI? Paste this into any AI.

Works with Claude, ChatGPT, Cursor, Codex, Gemini — anything that can run a shell command. The AI walks you through the benchmark.

Tested with: Claude Code, ChatGPT (GPT-5.5), Cursor, Gemini Pro, Codex. Your AI will ask for your provider, model, nickname, and optional config tag — then run the CLI for you. Takes ~3 minutes including API time.

Power user? Install the skill or MCP.

If you live in Claude Code, Codex, OpenCode, OpenClaw, Cursor, or any MCP-compatible client, you can install PipelineScore as a tool. Your AI will run benchmarks for you without you ever leaving the editor.

Skill

Drop-in markdown

Single SKILL.md file that any AI reads at session start. Works in Claude Code, Codex, OpenCode, OpenClaw, Cursor.

mkdir -p ~/.claude/skills/pipelinescore
curl -L https://pipelinescore.ai/skills/\
  pipelinescore/SKILL.md \
  -o ~/.claude/skills/pipelinescore/SKILL.md
MCP server

Three structured tools

run_benchmark, get_user_leaderboard, get_user_profile. Stdio transport, npm-installed.

// ~/.claude/settings.json
{
  "mcpServers": {
    "pipelinescore": {
      "command": "npx",
      "args": ["@pipelinescore/mcp"]
    }
  }
}

Local servers we've tested

Anything with an OpenAI-compatible /v1/chat/completions endpoint works. These five are the most common:

Ollama--endpoint http://localhost:11434
LM Studio--endpoint http://localhost:1234
llama.cpp server--endpoint http://localhost:8080
MLX-Omni / mlx_lm--endpoint http://localhost:10240
LiteLLM proxy--endpoint http://localhost:8000

Frontier providers (cloud)

For when you want to benchmark the labs' flagships. Bring your own key.

Anthropic--provider anthropic --model claude-opus-4-7
OpenAI--provider openai --model gpt-5-5
Google (via openai-compat proxy)--provider google --model gemini-2-5-pro
Mistral (via openai-compat)--provider mistral --model mistral-large-2

What happens when you run it

  1. 01

    Fetch today's signed test pack

    The CLI pulls the rotating 25-task pack from api.pipelinescore.ai/v1/testpack — same pack for every submission today, different tomorrow.

  2. 02

    Run your model

    Each task is sent to the provider you chose. Inputs, outputs, timings, and token counts are captured locally.

  3. 03

    Submit for re-judgment

    Transcripts are uploaded and re-graded server-side by a held-out judge model. Your local score is provisional; the server's is canonical.

  4. 04

    See your card

    You get a tier badge, a category breakdown, and a shareable URL — ready for the leaderboard.