Summary
Prompt versioning and evals apply software engineering practices -- version control, regression testing, and rollback -- to prompt management. Every prompt change is tracked, tested against a suite of evaluation cases, and deployed only when it passes quality gates.
How it works
- Version each prompt -- store every prompt variant with a unique version ID and changelog.
- Define test cases -- create a suite of input-output pairs that represent expected behavior.
- Run evals on changes -- every prompt update triggers automated evaluation against the test suite.
- Track regressions -- compare each eval run against the previous version's results to catch quality drops.
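The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `PromptVersion`, `EvalCase`, `run_evals`, `is_regression`, and `fake_model` are hypothetical names, the version ID is assumed to be a truncated content hash, and the acceptance check is a deliberately crude substring match. A real setup would call an actual model and use a richer scoring function.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class PromptVersion:
    template: str   # must contain an {input} placeholder
    changelog: str

    @property
    def version_id(self) -> str:
        # Content hash, so the same template text always maps to the same ID.
        return hashlib.sha256(self.template.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class EvalCase:
    input_text: str
    expected_substring: str  # deliberately crude acceptance check


def run_evals(prompt: PromptVersion, cases: List[EvalCase],
              model: Callable[[str], str]) -> float:
    """Return the fraction of cases whose output contains the expected text."""
    passed = 0
    for case in cases:
        output = model(prompt.template.format(input=case.input_text))
        if case.expected_substring.lower() in output.lower():
            passed += 1
    return passed / len(cases)


def is_regression(new_rate: float, old_rate: float, tolerance: float = 0.0) -> bool:
    """True when the new version scores worse than the old one by more than tolerance."""
    return new_rate < old_rate - tolerance


def fake_model(prompt_text: str) -> str:
    # Stand-in for a real model call: it only answers when the prompt
    # uses the "Question:" framing, which lets the demo show a regression.
    answers = {"2+2": "4", "capital of France": "Paris"}
    if not prompt_text.startswith("Question:"):
        return "I am not sure."
    for key, answer in answers.items():
        if key in prompt_text:
            return answer
    return "I am not sure."


cases = [
    EvalCase("What is 2+2?", "4"),
    EvalCase("What is the capital of France?", "Paris"),
]
v1 = PromptVersion("Question: {input}\nAnswer:", "Initial version")
v2 = PromptVersion("{input}", "Dropped the Question/Answer framing")

v1_rate = run_evals(v1, cases, fake_model)
v2_rate = run_evals(v2, cases, fake_model)
print(v1.version_id, v1_rate)
print(v2.version_id, v2_rate, "regression:", is_regression(v2_rate, v1_rate))
```

In this toy run the second version fails every case, so the comparison flags it as a regression. The `tolerance` parameter is one possible way to absorb run-to-run noise when the model is non-deterministic.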
What to track
- Success rate: Percentage of test cases that produce acceptable outputs.
- Output quality: Human or LLM-as-judge scores for helpfulness, accuracy, tone.
- Cost per call: Token usage per call, and how much the new prompt changes it.
- Latency: Time-to-first-token and total generation time.
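One way to make these four metrics comparable across versions is to record them per call and aggregate them per eval run. The sketch below assumes a hypothetical `CallRecord` shape and uses simple means; the field names, the judge-score scale, and the sample numbers are illustrative only, and a production setup might prefer percentiles for latency.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class CallRecord:
    passed: bool         # did the output meet the acceptance check?
    judge_score: float   # human or LLM-as-judge score (scale is an assumption)
    total_tokens: int    # prompt + completion tokens for this call
    ttft_ms: float       # time to first token
    total_ms: float      # total generation time


def summarize(records: List[CallRecord]) -> Dict[str, float]:
    """Aggregate per-call records into one row of metrics for an eval run."""
    return {
        "success_rate": sum(r.passed for r in records) / len(records),
        "mean_judge_score": mean(r.judge_score for r in records),
        "mean_tokens": mean(r.total_tokens for r in records),
        "mean_ttft_ms": mean(r.ttft_ms for r in records),
        "mean_total_ms": mean(r.total_ms for r in records),
    }


def diff(old: Dict[str, float], new: Dict[str, float]) -> Dict[str, float]:
    """Per-metric change from the old version to the new one."""
    return {name: new[name] - old[name] for name in old}


# Illustrative numbers only, not measurements.
old_run = [
    CallRecord(True, 4.0, 100, 200.0, 1000.0),
    CallRecord(True, 5.0, 120, 220.0, 1200.0),
]
new_run = [
    CallRecord(True, 4.0, 150, 250.0, 1500.0),
    CallRecord(False, 2.0, 170, 270.0, 1700.0),
]

old_summary = summarize(old_run)
new_summary = summarize(new_run)
for name, delta in diff(old_summary, new_summary).items():
    print(f"{name}: {delta:+.2f}")
```

Reporting deltas rather than absolute values keeps the focus on what the prompt change itself introduced.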
Rollback
When a prompt change causes a quality regression, revert to the previous version immediately. Keep the failed version and its eval results for post-mortem analysis.
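A rollback along these lines could look like the following sketch. `PromptRegistry` is a hypothetical in-memory structure, and the version IDs and eval numbers are made up for the example; a real system would persist this state. The point it illustrates is that rolling back changes which version is active without deleting the failed version or its eval results.

```python
from dataclasses import dataclass, field


@dataclass
class PromptRegistry:
    # version_id -> eval results; entries are never deleted
    eval_results: dict = field(default_factory=dict)
    # ordered version IDs; the last entry is the active one
    deploy_log: list = field(default_factory=list)
    # version IDs that were rolled back, kept for post-mortem analysis
    failed: list = field(default_factory=list)

    def deploy(self, version_id: str, results: dict) -> None:
        self.eval_results[version_id] = results
        self.deploy_log.append(version_id)

    @property
    def active(self) -> str:
        return self.deploy_log[-1]

    def rollback(self) -> str:
        """Reactivate the previous version and return its ID."""
        if len(self.deploy_log) < 2:
            raise RuntimeError("no previous version to roll back to")
        self.failed.append(self.deploy_log.pop())
        return self.active


registry = PromptRegistry()
registry.deploy("v1", {"success_rate": 0.95})
registry.deploy("v2", {"success_rate": 0.70})  # quality regression

restored = registry.rollback()
print("active:", restored)
print("kept for post-mortem:", registry.failed, registry.eval_results["v2"])
```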