What we store. How long. Why.
PipelineScore is a public benchmark. The leaderboard is the product. The only way to run that responsibly is a strict retention policy on the things that don't need to live forever, plus a public statement of what those things are.
Stored permanently
The signal of the benchmark itself. Removing any of these would break the leaderboard.
- Model identity (slug, provider, family, release date)
- Composite PipelineScore, tier, per-category scores
- The nickname you chose with --user
- Submission timestamp and lab-verified flag
- Your optional --config-tag (LoRA / system-prompt / persona / etc.)
- Which CLI version submitted
Stored for 30 days, then redacted
The body of each run. We keep it for ~30 days so you can audit your score, file a dispute, or share it. After 30 days we overwrite these fields with [redacted:30d_ttl] — the score row stays, the body is gone.
- Raw prompt + response transcripts
- Per-task input + model output text
- Judge rationales
Why: users sometimes submit prompts or outputs that accidentally contain PII, API keys, or proprietary docs. Keeping those bodies indefinitely would compound risk every day. Thirty days is long enough to audit, short enough to limit liability.
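A minimal sketch of the 30-day overwrite described above, in TypeScript. It is not the code in backend/src/lib/retention.ts: the Submission shape, its field names, and redactExpired are assumptions made for illustration. Only the [redacted:30d_ttl] marker and the 30-day threshold come from the text.

```typescript
// Hypothetical row shape. Field names are assumptions, not the real schema.
interface Submission {
  id: string;
  submittedAt: Date;
  compositeScore: number;  // permanent: survives redaction
  rawTranscripts: string;  // 30-day body field
  taskOutputs: string;     // 30-day body field
  judgeRationales: string; // 30-day body field
}

const REDACTED_MARKER = "[redacted:30d_ttl]";
const BODY_TTL_MS = 30 * 24 * 60 * 60 * 1000;

// Returns a copy of the rows with expired bodies overwritten,
// plus the number of rows that were changed in this pass.
function redactExpired(
  rows: Submission[],
  now: Date
): { rows: Submission[]; redacted: number } {
  let redacted = 0;
  const out = rows.map((row) => {
    const expired = now.getTime() - row.submittedAt.getTime() > BODY_TTL_MS;
    const alreadyDone = row.rawTranscripts === REDACTED_MARKER;
    if (!expired || alreadyDone) return row;
    redacted += 1;
    // Score and identity fields are copied as-is; only the body is replaced.
    return {
      ...row,
      rawTranscripts: REDACTED_MARKER,
      taskOutputs: REDACTED_MARKER,
      judgeRationales: REDACTED_MARKER,
    };
  });
  return { rows: out, redacted };
}
```

Skipping rows that already carry the marker keeps the pass idempotent, so running it again reports zero changes instead of recounting old rows.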
Stored for 90 days
Lightweight request metadata to power product analytics, abuse detection, and aggregated reporting. No request bodies — just shape.
- HTTP method, path, status code, latency
- IP address (for rate-limit enforcement)
- User-agent string
- Nickname, when the request was tied to one
Rolling 90-day window. Older events are hard-deleted by a background job.
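The rolling window can be sketched the same way. This is an illustration under assumptions, not the real job: RequestEvent and pruneEvents are hypothetical names, and a real backend would issue a delete against its database rather than filter an array.

```typescript
// Hypothetical event shape mirroring the metadata listed above.
interface RequestEvent {
  at: Date;
  method: string;
  path: string;
  status: number;
  latencyMs: number;
  ip: string;
  userAgent: string;
  nickname?: string; // only present when the request was tied to one
}

const EVENT_WINDOW_MS = 90 * 24 * 60 * 60 * 1000;

// Keeps events inside the 90-day window and reports how many were dropped.
function pruneEvents(
  events: RequestEvent[],
  now: Date
): { kept: RequestEvent[]; deleted: number } {
  const cutoff = now.getTime() - EVENT_WINDOW_MS;
  const kept = events.filter((e) => e.at.getTime() >= cutoff);
  return { kept, deleted: events.length - kept.length };
}
```

Returning the deleted count matches the idea of logging row counts so an operator can check that the job is doing something.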
Never stored
- Your API key. The CLI calls Anthropic/OpenAI/your local endpoint directly with your key; the backend never sees it.
- Request or response payloads beyond the fields listed above.
- Email, real name, or any personal info beyond the nickname you typed.
How it's enforced
A background job in the API server runs on startup and every hour. It overwrites transcripts on submissions older than 30 days, hard-deletes event-log rows older than 90 days, and logs the row counts to stdout so the operator (Charles & Roe) can verify.
The implementation is open: see backend/src/lib/retention.ts in the repository. You can verify this on any submission by checking whether raw_transcripts contains "redacted":true.
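If you want to script that check, here is a small sketch. The exact stored shape of raw_transcripts isn't spelled out on this page, so the helper (isRedacted, a hypothetical name) accepts either JSON text or an already-parsed value, and also treats the [redacted:30d_ttl] marker as redacted. Treat both of those as assumptions.

```typescript
const REDACTION_MARKER = "[redacted:30d_ttl]";

// Returns true when the value looks like a redacted transcript body.
function isRedacted(rawTranscripts: unknown): boolean {
  if (typeof rawTranscripts === "string") {
    // Plain marker string, as used when the body is overwritten.
    if (rawTranscripts.indexOf(REDACTION_MARKER) !== -1) return true;
    try {
      // JSON text: parse it and inspect the parsed value instead.
      return isRedacted(JSON.parse(rawTranscripts) as unknown);
    } catch {
      return false; // not JSON and no marker: treat as an unredacted body
    }
  }
  if (typeof rawTranscripts === "object" && rawTranscripts !== null) {
    return (rawTranscripts as Record<string, unknown>)["redacted"] === true;
  }
  return false;
}
```

For example, isRedacted('{"redacted":true}') and isRedacted({ redacted: true }) both return true, while an ordinary transcript string returns false.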
Rate limits
To prevent leaderboard flooding and scraping, the backend enforces:
- 200 reads per IP per minute
- 20 submits per IP per hour
- 100 submits per nickname per day
- 5 submits per (nickname, model) per hour
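One way to picture these layers is a set of counters, each with its own key and window. The sketch below uses fixed-window counting, which is an assumption: this page doesn't say which algorithm the backend uses. The names Req, Layer, createLimiter, and the layer names are hypothetical too. Only the numbers come from the list above.

```typescript
interface Req {
  ip: string;
  nickname: string;
  model: string;
}

interface Layer {
  name: string;              // reported back so the caller knows which layer fired
  limit: number;             // max requests per window
  windowMs: number;          // window length in milliseconds
  key: (req: Req) => string; // what the counter is keyed on
}

interface Counter {
  windowStart: number;
  count: number;
}

const MINUTE_MS = 60 * 1000;
const HOUR_MS = 60 * MINUTE_MS;
const DAY_MS = 24 * HOUR_MS;

const READ_LAYERS: Layer[] = [
  { name: "ip_reads_per_minute", limit: 200, windowMs: MINUTE_MS, key: (r) => r.ip },
];

const SUBMIT_LAYERS: Layer[] = [
  { name: "ip_submits_per_hour", limit: 20, windowMs: HOUR_MS, key: (r) => r.ip },
  { name: "nickname_submits_per_day", limit: 100, windowMs: DAY_MS, key: (r) => r.nickname },
  {
    name: "nickname_model_submits_per_hour",
    limit: 5,
    windowMs: HOUR_MS,
    key: (r) => r.nickname + "|" + r.model,
  },
];

// Returns a check function: null means allowed, a string names the layer that fired.
function createLimiter(layers: Layer[]): (req: Req, nowMs: number) => string | null {
  const counters = new Map<string, Counter>();

  // Fetches the counter for this layer and key, starting a new window if the old one expired.
  function current(layer: Layer, req: Req, nowMs: number): Counter {
    const id = layer.name + ":" + layer.key(req);
    let c = counters.get(id);
    if (c === undefined || nowMs - c.windowStart >= layer.windowMs) {
      c = { windowStart: nowMs, count: 0 };
      counters.set(id, c);
    }
    return c;
  }

  return function check(req: Req, nowMs: number): string | null {
    // First pass: reject without consuming quota if any layer is exhausted.
    for (const layer of layers) {
      if (current(layer, req, nowMs).count >= layer.limit) return layer.name;
    }
    // Second pass: count the request against every layer.
    for (const layer of layers) {
      current(layer, req, nowMs).count += 1;
    }
    return null;
  };
}
```

Checking every layer before counting means a rejected request doesn't eat into the other layers' quotas, and the returned name is what lets an error body say which layer fired.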
When you hit a limit you get a 429 response with RateLimit-* headers and a JSON error body identifying which layer fired. The CLI handles this gracefully — your local score is still computed.
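If you are writing your own client, a 429 can be read roughly like this. This is a sketch, not the CLI's code. It assumes the server sends a RateLimit-Reset header carrying seconds until the window resets (the naming used in earlier IETF drafts of the RateLimit header fields) and that the JSON error body has a layer field. Both are guesses, since this page only promises RateLimit-* headers and a body that identifies the layer.

```typescript
interface RateLimitInfo {
  layer: string | null;        // which limit fired, if the body said so
  retryAfterMs: number | null; // how long to wait, if the header said so
}

// Returns null for anything that is not a 429 response.
function describeRateLimit(
  status: number,
  headers: Record<string, string>,
  bodyText: string
): RateLimitInfo | null {
  if (status !== 429) return null;

  // HTTP header names are case-insensitive, so normalize before lookup.
  const lower: Record<string, string> = {};
  for (const name of Object.keys(headers)) {
    const value = headers[name];
    if (value !== undefined) lower[name.toLowerCase()] = value;
  }

  // Assumed header: RateLimit-Reset, in seconds.
  const reset = lower["ratelimit-reset"];
  const seconds = reset === undefined ? NaN : Number(reset);
  const retryAfterMs = isFinite(seconds) && seconds >= 0 ? seconds * 1000 : null;

  // Assumed body field: "layer".
  let layer: string | null = null;
  try {
    const parsed = JSON.parse(bodyText) as { layer?: unknown } | null;
    if (parsed !== null && typeof parsed === "object" && typeof parsed.layer === "string") {
      layer = parsed.layer;
    }
  } catch {
    // Body was not JSON; leave layer as null.
  }

  return { layer, retryAfterMs };
}
```

A client can then sleep for retryAfterMs before retrying the submit, or skip the upload and keep the locally computed score.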
What if I want my nickname off the board?
Email privacy@pipelinescore.ai with the nickname and we'll redact it. Since there's no auth, we can't verify ownership of a nickname — if someone impersonates you, that's the trade-off of an auth-less system. We'll redact on a good-faith basis.