Prompt Versioning and Evals

Summary

Prompt versioning and evals apply software engineering practices -- version control, regression testing, and rollback -- to prompt management. Every prompt change is tracked, tested against a suite of evaluation cases, and deployed only when it passes quality gates.

How it works

  1. Version each prompt -- store every prompt variant with a unique version ID and changelog.
  2. Define test cases -- create a suite of input-output pairs that represent expected behavior.
  3. Run evals on changes -- every prompt update triggers automated evaluation against the test suite.
  4. Track regressions -- compare each version's eval results with those of the previous version to catch quality drops.
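
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names PromptVersion, EvalCase, make_version, run_evals, and has_regressed are hypothetical, the version ID is assumed to be a truncated SHA-256 of the prompt text, and fake_model stands in for a real model call.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Sequence


@dataclass(frozen=True)
class PromptVersion:
    """One immutable prompt variant (hypothetical schema)."""
    version_id: str
    text: str
    changelog: str
    created_at: str


@dataclass(frozen=True)
class EvalCase:
    """An input-output pair describing expected behavior."""
    input: str
    expected: str


def make_version(text: str, changelog: str) -> PromptVersion:
    # Assumption: the ID is derived from the prompt text, so identical
    # prompts always map to the same version ID.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    created = datetime.now(timezone.utc).isoformat()
    return PromptVersion(digest, text, changelog, created)


def run_evals(
    version: PromptVersion,
    cases: Sequence[EvalCase],
    generate: Callable[[str, str], str],
) -> float:
    """Return the fraction of cases whose output matches the expectation."""
    if not cases:
        raise ValueError("eval suite is empty")
    # Assumption: exact match after stripping whitespace. Real suites often
    # need looser checks (substring, regex, or a judge model).
    passed = sum(
        1 for c in cases
        if generate(version.text, c.input).strip() == c.expected
    )
    return passed / len(cases)


def has_regressed(new_score: float, old_score: float, tolerance: float = 0.0) -> bool:
    """True if the new score is lower than the old one by more than tolerance."""
    return new_score < old_score - tolerance


def fake_model(prompt_text: str, user_input: str) -> str:
    # Stand-in for a real model call, used only to make the sketch runnable.
    return user_input.upper() if "uppercase" in prompt_text.lower() else user_input


if __name__ == "__main__":
    suite = [EvalCase("hi", "HI"), EvalCase("ok", "OK")]
    old = make_version("Return the input in uppercase.", "initial version")
    new = make_version("Echo the input.", "simplified wording")
    old_score = run_evals(old, suite, fake_model)
    new_score = run_evals(new, suite, fake_model)
    print(old.version_id, old_score, new.version_id, new_score)
    print("regression:", has_regressed(new_score, old_score))
```

Hashing the prompt text is one possible choice for the ID. A sequential counter or a Git commit hash would serve equally well, as long as each variant is uniquely identifiable.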

What to track

  • Success rate: Percentage of test cases that produce acceptable outputs.
  • Output quality: Human or LLM-as-judge scores for helpfulness, accuracy, tone.
  • Cost per call: Token usage changes introduced by the new prompt.
  • Latency: Time-to-first-token and total generation time.
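
As a rough sketch, the four metrics above could be recorded per call and averaged per eval run. The CallRecord fields, the summarize and compare helpers, and the choice of raw token count as a proxy for cost are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Sequence


@dataclass(frozen=True)
class CallRecord:
    """Measurements for one eval case (hypothetical schema)."""
    passed: bool          # did the output count as acceptable?
    judge_score: float    # human or LLM-as-judge score, scale is up to you
    total_tokens: int     # proxy for cost per call
    ttft_ms: float        # time-to-first-token
    total_ms: float       # total generation time


def summarize(records: Sequence[CallRecord]) -> Dict[str, float]:
    """Average each tracked metric over one eval run.

    Raises statistics.StatisticsError if records is empty.
    """
    return {
        "success_rate": mean(1.0 if r.passed else 0.0 for r in records),
        "avg_judge_score": mean(r.judge_score for r in records),
        "avg_tokens": mean(float(r.total_tokens) for r in records),
        "avg_ttft_ms": mean(r.ttft_ms for r in records),
        "avg_total_ms": mean(r.total_ms for r in records),
    }


def compare(new: Dict[str, float], old: Dict[str, float]) -> Dict[str, float]:
    """Per-metric change from the old version to the new one (new minus old)."""
    return {name: new[name] - old[name] for name in new}
```

A positive change is good for success rate and judge score but bad for tokens and latency, so a promotion gate would need a separate threshold for each metric.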

Rollback

When a prompt change causes a quality regression, revert to the previous version immediately. Keep the failed version and its eval results for post-mortem analysis.
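
One way to get both properties (instant revert, nothing deleted) is to keep a promotion history and treat rollback as moving a pointer. The PromptRegistry class below is a hypothetical in-memory sketch. A real system would persist this state in a database.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PromptRegistry:
    """In-memory registry with a promotion history (illustrative only)."""
    versions: Dict[str, str] = field(default_factory=dict)       # id -> prompt text
    eval_results: Dict[str, dict] = field(default_factory=dict)  # id -> eval results
    promoted: List[str] = field(default_factory=list)            # last entry is active

    def register(self, version_id: str, text: str, results: dict) -> None:
        """Store a version together with its eval results."""
        self.versions[version_id] = text
        self.eval_results[version_id] = results

    def promote(self, version_id: str) -> None:
        """Make a registered version the active one."""
        if version_id not in self.versions:
            raise KeyError(f"unknown version: {version_id}")
        self.promoted.append(version_id)

    @property
    def active(self) -> Optional[str]:
        """ID of the currently active version, or None if nothing is promoted."""
        return self.promoted[-1] if self.promoted else None

    def rollback(self) -> str:
        """Reactivate the previously promoted version and return the failed ID.

        Only the promotion history changes. The failed version's text and
        eval results stay in the registry for post-mortem analysis.
        """
        if len(self.promoted) < 2:
            raise RuntimeError("no previous version to roll back to")
        return self.promoted.pop()


if __name__ == "__main__":
    registry = PromptRegistry()
    registry.register("v1", "Summarize the text.", {"success_rate": 0.9})
    registry.register("v2", "Summarize the text briefly.", {"success_rate": 0.6})
    registry.promote("v1")
    registry.promote("v2")
    failed = registry.rollback()
    print("active:", registry.active, "failed:", failed)
```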

Build This Pattern

Copy this prompt and paste it into Claude Code, OpenCode, Codex, or Cursor to implement this pattern.

Build me a prompt versioning and eval system. Architecture: each prompt version has an ID, hash, timestamp, and associated test suite. Store versions in a database with metadata. Error handling: detect regressions by comparing eval scores across versions. Edge cases: handle prompt rollback, version conflicts in team environments. Best practices: always run the full eval suite before promoting a prompt to production. Testing: verify that rollback correctly restores previous behavior.