Evaluator Optimizer: AI Pattern

Summary

The Evaluator-Optimizer pattern implements an iterative improvement loop where a generator LLM creates solutions, an evaluator LLM assesses them against specific criteria, and then the generator refines the solution based on feedback. This cycle continues until the solution meets the desired quality threshold or reaches a maximum number of iterations.

How it works

Generate: Producer model creates initial solution
Evaluate: Critic model scores against criteria
Feedback Loop: If insufficient quality, return feedback to generator
Refine: Generator produces improved version
Terminate: Stop when threshold met or max iterations reached

Evaluation dimensions

Accuracy: Factual correctness, mathematical precision
Style: Tone consistency, voice adherence
Completeness: Coverage of required elements
Safety: No harmful content, bias detection

Use cases

Content creation where quality and adherence to specific criteria are important
Problem-solving tasks that benefit from iterative refinement and critical feedback
Creative writing with specific style, tone, or structural requirements
Code generation that needs to meet specific performance or style guidelines
Educational content that requires accurate information and appropriate difficulty levels

Build This Pattern

Copy this prompt and paste it into Claude Code, OpenCode, Codex, or Cursor to implement this pattern.

Build me an evaluator-optimizer loop for LLM output refinement. Architecture: implement a generator module that creates the initial output, and an evaluator module that reviews output against quality criteria and provides specific, actionable feedback. The generator uses the feedback to produce an improved version. Loop continues until the evaluator passes the output or max iterations is reached (configurable default 3). Define quality criteria as an array of rules with weights and pass/fail thresholds. Track full iteration history including each output version, feedback, and criterion scores. Error handling: detect loops where output stops improving between iterations by comparing similarity scores. If no improvement over 2 consecutive iterations, return the best version and break. Handle evaluator failure by using the last valid evaluation. Edge cases: handle single-criterion optimization efficiently by bailing early when met. Support criteria that are pass/fail versus scored (1-10). Handle evaluator contradicting itself between iterations. Best practices: return final output plus full iteration log showing improvement trajectory. Make quality criteria configurable at runtime. Testing: test with criteria always met (0 iterations), never met (max iterations), and met after specific iterations. Verify iteration log captures all versions. TypeScript with OpenAI SDK.