Summary

The Evaluator-Optimizer pattern implements an iterative improvement loop: a generator LLM creates a solution, an evaluator LLM assesses it against specific criteria, and the generator then refines the solution based on that feedback. The cycle repeats until the solution meets the desired quality threshold or the loop reaches a maximum number of iterations.

How it works

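  1. Generate: The generator model creates an initial solution.
  2. Evaluate: The evaluator model scores the solution against the chosen criteria.
  3. Feedback loop: If the solution falls short of the quality threshold, the evaluator's feedback is returned to the generator.
  4. Refine: The generator uses that feedback to produce an improved version.
  5. Terminate: The loop stops when the threshold is met or the maximum number of iterations is reached.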
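The five steps above can be sketched as a single loop. This is a minimal TypeScript illustration under stated assumptions, not a reference implementation: `evaluatorOptimizer`, the `Generate`/`Evaluate` function types, the 0-10 score, and the default threshold of 8 are all hypothetical, and the mock functions stand in for real LLM calls so the example runs offline.

```typescript
// Hypothetical types: the pattern description names no concrete API.
interface Evaluation {
  score: number;    // 0-10 quality score from the evaluator
  feedback: string; // actionable notes passed back to the generator
}

type Generate = (task: string, previous?: string, feedback?: string) => Promise<string>;
type Evaluate = (task: string, candidate: string) => Promise<Evaluation>;

interface LoopResult {
  output: string;
  evaluation: Evaluation;
  iterations: number; // refinement rounds after the initial draft
}

async function evaluatorOptimizer(
  task: string,
  generate: Generate,
  evaluate: Evaluate,
  threshold = 8,     // hypothetical pass mark
  maxIterations = 3, // hypothetical refinement budget
): Promise<LoopResult> {
  let output = await generate(task);             // 1. Generate
  let evaluation = await evaluate(task, output); // 2. Evaluate
  let iterations = 0;

  // 3-5. Feed back, refine, and re-evaluate until the threshold or the budget is hit.
  while (evaluation.score < threshold && iterations < maxIterations) {
    output = await generate(task, output, evaluation.feedback); // 4. Refine
    evaluation = await evaluate(task, output);
    iterations++;
  }
  return { output, evaluation, iterations };
}

// Deterministic stand-ins for LLM calls so the sketch runs offline.
const mockGenerate: Generate = async (task, previous) =>
  previous === undefined ? `draft for ${task}` : `${previous} +fix`;

const mockEvaluate: Evaluate = async (_task, candidate) => {
  const fixes = candidate.split("+fix").length - 1;
  return { score: Math.min(10, 4 + 2 * fixes), feedback: "add more detail" };
};
```

Passing `generate` and `evaluate` in as functions keeps the loop independent of any particular model provider. With the mocks above, the draft scores 4, then 6, then 8, so it reaches the default threshold after two refinement rounds.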

Evaluation dimensions

  • Accuracy: Factual correctness, mathematical precision
  • Style: Tone consistency, voice adherence
  • Completeness: Coverage of required elements
  • Safety: Absence of harmful content and bias
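One way an evaluator might combine these dimensions into a single verdict is a weighted average plus a per-criterion minimum. The sketch below is hypothetical throughout: the `Criterion` shape, the weights, the minimum scores, and the choice to map a pass/fail result onto 10 or 0 are illustrative assumptions, not part of the pattern itself.

```typescript
// Hypothetical criterion model; weights and thresholds below are illustrative only.
type Criterion =
  | { name: string; kind: "scored"; weight: number; minScore: number } // scored on 1-10
  | { name: string; kind: "passFail"; weight: number };                // must pass outright

type CriterionResult =
  | { name: string; kind: "scored"; score: number }
  | { name: string; kind: "passFail"; passed: boolean };

interface Verdict {
  weightedScore: number; // weighted average on a 0-10 scale
  passed: boolean;       // true only if every criterion met its own bar
  failing: string[];     // names of criteria that fell short
}

function aggregate(criteria: Criterion[], results: CriterionResult[]): Verdict {
  const byName = new Map(results.map((r): [string, CriterionResult] => [r.name, r]));
  let total = 0;
  let weightSum = 0;
  const failing: string[] = [];

  for (const c of criteria) {
    weightSum += c.weight;
    const r = byName.get(c.name);
    if (r === undefined || r.kind !== c.kind) {
      failing.push(c.name); // a missing or mismatched result fails and contributes 0
      continue;
    }
    // Map pass/fail onto the same 0-10 scale so both kinds can be averaged together.
    const score = r.kind === "scored" ? r.score : r.passed ? 10 : 0;
    total += c.weight * score;
    const ok = c.kind === "scored" ? score >= c.minScore : score === 10;
    if (!ok) failing.push(c.name);
  }

  const weightedScore = weightSum > 0 ? total / weightSum : 0;
  return { weightedScore, passed: failing.length === 0, failing };
}

// The four evaluation dimensions, with made-up weights and minimum scores.
const dimensions: Criterion[] = [
  { name: "accuracy", kind: "scored", weight: 4, minScore: 7 },
  { name: "style", kind: "scored", weight: 2, minScore: 6 },
  { name: "completeness", kind: "scored", weight: 3, minScore: 7 },
  { name: "safety", kind: "passFail", weight: 1 },
];

// Sample evaluator output for one draft.
const sample: CriterionResult[] = [
  { name: "accuracy", kind: "scored", score: 9 },
  { name: "style", kind: "scored", score: 5 },
  { name: "completeness", kind: "scored", score: 8 },
  { name: "safety", kind: "passFail", passed: true },
];

const verdict = aggregate(dimensions, sample);
```

With these sample results the weighted score is (4·9 + 2·5 + 3·8 + 1·10) / 10 = 8, yet `style` still misses its own minimum, so the draft would go back for refinement. Requiring every criterion to clear its own bar, rather than only the average, keeps a strong dimension from masking a weak one.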

Use cases

  • Content creation where quality and adherence to specific criteria are important
  • Problem-solving tasks that benefit from iterative refinement and critical feedback
  • Creative writing with specific style, tone, or structural requirements
  • Code generation that needs to meet specific performance or style guidelines
  • Educational content that requires accurate information and appropriate difficulty levels

Build this pattern

Copy this prompt and paste it into Claude Code, OpenCode, Codex, or Cursor to implement this pattern.

Build me an evaluator-optimizer loop for LLM output refinement. Architecture: implement a generator module that creates the initial output and an evaluator module that reviews output against quality criteria and provides specific, actionable feedback. The generator uses that feedback to produce an improved version. The loop continues until the evaluator passes the output or the maximum number of iterations is reached (configurable, default 3). Define quality criteria as an array of rules with weights and pass/fail thresholds. Track the full iteration history, including each output version, its feedback, and its criterion scores. Error handling: detect stalls where the output stops improving between iterations by comparing similarity scores; if there is no improvement over 2 consecutive iterations, return the best version and break. Handle evaluator failure by using the last valid evaluation. Edge cases: handle single-criterion optimization efficiently by exiting early once the criterion is met. Support both pass/fail and scored (1-10) criteria. Handle the evaluator contradicting itself between iterations. Best practices: return the final output plus a full iteration log showing the improvement trajectory. Make quality criteria configurable at runtime. Testing: test with criteria that are always met (0 refinement iterations), never met (max iterations), and met after a specific iteration. Verify that the iteration log captures all versions. TypeScript with OpenAI SDK.
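For orientation, here is a minimal sketch of three behaviours the prompt asks for: iteration history, stall detection, and evaluator-failure fallback. It is not what the prompt would produce and it does not call the OpenAI SDK; `generate` and `evaluate` are injected so real model calls can be plugged in. The names `optimize`, `isBetter`, and `jaccard`, the token-set Jaccard similarity, and the 0.95 cutoff are assumptions. "No improvement" is interpreted here as "the best version did not change, or the new draft is near-identical to the previous one".

```typescript
// Hypothetical shapes: the prompt describes behaviour, not a concrete API.
interface Evaluation {
  score: number;    // weighted score, e.g. 0-10
  passed: boolean;  // all criteria met
  feedback: string; // actionable notes for the generator
}

interface IterationRecord {
  iteration: number;                   // 0 = first draft
  output: string;
  evaluation: Evaluation;
  similarityToPrevious: number | null; // null for the first draft
  evaluatorFailed: boolean;            // true if `evaluation` was reused
}

type Generate = (previous?: string, feedback?: string) => Promise<string>;
type Evaluate = (candidate: string) => Promise<Evaluation>;

// Token-set Jaccard similarity: one simple choice of similarity score (an assumption).
function jaccard(a: string, b: string): number {
  const sa = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const sb = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (sa.size === 0 && sb.size === 0) return 1;
  const shared = Array.from(sa).filter((t) => sb.has(t)).length;
  return shared / (sa.size + sb.size - shared);
}

// A passing evaluation beats a failing one; otherwise the higher score wins.
function isBetter(a: Evaluation, b: Evaluation): boolean {
  if (a.passed !== b.passed) return a.passed;
  return a.score > b.score;
}

async function optimize(
  generate: Generate,
  evaluate: Evaluate,
  maxIterations = 3,
  stallLimit = 2,
  similarityCutoff = 0.95, // hypothetical: at or above this, drafts count as unchanged
) {
  let output = await generate();
  let lastValid = await evaluate(output); // no fallback exists yet, so errors propagate
  const history: IterationRecord[] = [
    { iteration: 0, output, evaluation: lastValid, similarityToPrevious: null, evaluatorFailed: false },
  ];
  let best = history[0];
  let stalled = 0;

  for (let i = 1; i <= maxIterations && !best.evaluation.passed; i++) {
    const previous = output;
    output = await generate(previous, lastValid.feedback);

    let evaluation = lastValid;
    let evaluatorFailed = false;
    try {
      evaluation = await evaluate(output);
      lastValid = evaluation;
    } catch {
      evaluatorFailed = true; // keep using the last valid evaluation
    }

    const similarity = jaccard(previous, output);
    const record: IterationRecord = {
      iteration: i, output, evaluation, similarityToPrevious: similarity, evaluatorFailed,
    };
    history.push(record);

    // Only a genuinely evaluated draft can become the best version.
    if (!evaluatorFailed && isBetter(evaluation, best.evaluation)) best = record;

    // No progress: the best version did not change, or the draft barely changed.
    const noProgress = best !== record || similarity >= similarityCutoff;
    stalled = noProgress ? stalled + 1 : 0;
    if (stalled >= stallLimit) break;
  }

  const reason = best.evaluation.passed ? "passed" : stalled >= stallLimit ? "stalled" : "maxIterations";
  return { output: best.output, evaluation: best.evaluation, reason, history };
}
```

Ranking a passing draft above any failing one means the function always returns the best evaluated output, never an unevaluated draft that merely inherited a fallback evaluation. The prompt's remaining requirements (single-criterion early exit, self-contradicting evaluators, runtime-configurable criteria) are left out of this sketch.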