Model Cascade Routing

Agent Patterns

Summary

Model cascade routing reduces average cost and latency by sending each request to the cheapest capable model first. The request is escalated to a more powerful (and more expensive) model only when the initial model cannot produce a satisfactory result.

How it works

  1. Tier system -- models are organized into tiers by capability and cost.
  2. Confidence check -- after a model produces an output, a confidence estimator evaluates whether the result is acceptable.
  3. Escalation -- if the confidence score falls below a set threshold, the request is forwarded to the next tier up.
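
The three steps above can be sketched as a short loop. This is a minimal illustration rather than a reference implementation: the `Tier` and `CascadeResult` types, the `run_cascade` function, the stub models, the length-based confidence estimator, and the 0.8 threshold are all hypothetical stand-ins. A real system would call actual model APIs and would likely use a much stronger confidence signal.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Tier:
    name: str
    generate: Callable[[str], str]  # hypothetical model call: prompt -> output


@dataclass
class CascadeResult:
    output: str
    tier: str          # name of the tier whose output was returned
    confidence: float
    escalations: int   # how many times the request moved up a tier


def run_cascade(
    prompt: str,
    tiers: Sequence[Tier],
    estimate_confidence: Callable[[str, str], float],
    threshold: float = 0.8,  # assumed value; tune per workload
) -> CascadeResult:
    """Try tiers from cheapest to most expensive until one is confident enough."""
    if not tiers:
        raise ValueError("at least one tier is required")
    for index, tier in enumerate(tiers):
        output = tier.generate(prompt)
        confidence = estimate_confidence(prompt, output)
        is_last = index == len(tiers) - 1
        # Accept a confident answer, or fall back to the last tier's answer.
        if confidence >= threshold or is_last:
            return CascadeResult(output, tier.name, confidence, index)
    raise AssertionError("unreachable: the last tier always returns")


# Stub models and a toy estimator, for illustration only.
def small_model(prompt: str) -> str:
    return "short answer"


def large_model(prompt: str) -> str:
    return "a longer, more detailed answer"


def length_confidence(prompt: str, output: str) -> float:
    # Toy heuristic: longer outputs score higher, capped at 1.0.
    return min(1.0, len(output) / 20)


TIERS = [Tier("fast-cheap", small_model), Tier("powerful", large_model)]
result = run_cascade("Summarize this ticket", TIERS, length_confidence)
print(result.tier, result.escalations)
```

In this sketch the last tier's output is returned even when it scores below the threshold. That is one possible policy; raising an error or flagging the response for review would be equally reasonable.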

Tiers

  • Fast-cheap: Small models for simple tasks (classification, extraction, formatting).
  • Balanced: Mid-size models for most routine reasoning and generation tasks.
  • Powerful: Large frontier models reserved for complex reasoning, creative work, and edge cases.
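
One way to represent these tiers is as plain data sorted by cost, so that the cascade order is derived rather than hard-coded. The sketch below assumes a hypothetical `TierSpec` type, and its cost figures are placeholder units, not real pricing. It also illustrates a consequence of cascading: a request answered by a higher tier has already paid for every tier tried before it.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class TierSpec:
    name: str
    cost_per_call: float          # placeholder units, not real pricing
    suited_for: Tuple[str, ...]   # task types taken from the tier list above


TIER_SPECS = [
    TierSpec("powerful", 20.0, ("complex reasoning", "creative work", "edge cases")),
    TierSpec("fast-cheap", 1.0, ("classification", "extraction", "formatting")),
    TierSpec("balanced", 5.0, ("routine reasoning", "generation")),
]


def cascade_order(tiers: Sequence[TierSpec]) -> List[TierSpec]:
    """Return tiers from cheapest to most expensive."""
    return sorted(tiers, key=lambda t: t.cost_per_call)


def cost_of_attempts(tiers: Sequence[TierSpec], accepted_index: int) -> float:
    """Total cost when the tier at accepted_index (in cascade order) answers.

    Every cheaper tier was tried first, so its cost is included.
    """
    ordered = cascade_order(tiers)
    return sum(t.cost_per_call for t in ordered[: accepted_index + 1])


print([t.name for t in cascade_order(TIER_SPECS)])
print(cost_of_attempts(TIER_SPECS, 0), cost_of_attempts(TIER_SPECS, 2))
```

With these placeholder numbers, a request that reaches the powerful tier costs 26 units rather than 20, which is why a high escalation rate can erase the savings from routing cheap-first.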

Metrics

  • Cost per request: Average cost of a request across all tiers, weighted by how requests are distributed among them.
  • Escalation rate: Percentage of requests that require a higher tier.
  • Latency per tier: P50 and P95 response times for each model tier.
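
The three metrics can be computed from a simple log of attempts. The sketch below is illustrative only: the `Attempt` record, the `summarize` function, and the sample numbers are hypothetical. It assumes cost per request counts every tier a request touched, and it uses the nearest-rank method for percentiles. Other percentile definitions would give slightly different values.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Attempt:
    tier: str
    latency_ms: float
    cost: float  # placeholder units


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(requests: Sequence[Sequence[Attempt]]) -> dict:
    """Each inner sequence lists one request's attempts in the order tiers were tried."""
    if not requests:
        raise ValueError("no requests to summarize")
    total_cost = sum(a.cost for attempts in requests for a in attempts)
    # A request escalated if it needed more than one attempt.
    escalated = sum(1 for attempts in requests if len(attempts) > 1)
    latencies: Dict[str, List[float]] = defaultdict(list)
    for attempts in requests:
        for attempt in attempts:
            latencies[attempt.tier].append(attempt.latency_ms)
    return {
        "cost_per_request": total_cost / len(requests),
        "escalation_rate": escalated / len(requests),
        "latency_ms": {
            tier: {"p50": percentile(v, 50), "p95": percentile(v, 95)}
            for tier, v in latencies.items()
        },
    }


# Made-up sample log: two requests answered by the first tier, two escalated.
LOG = [
    [Attempt("fast-cheap", 100, 1)],
    [Attempt("fast-cheap", 120, 1)],
    [Attempt("fast-cheap", 110, 1), Attempt("balanced", 400, 5)],
    [Attempt("fast-cheap", 130, 1), Attempt("balanced", 500, 5)],
]
print(summarize(LOG))
```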

Build This Pattern

Copy this prompt and paste it into Claude Code, OpenCode, Codex, or Cursor to implement this pattern.

Build me a model cascade routing system.

Architecture: define a model tier list (fast-cheap, balanced, powerful-expensive). Route each request to the cheapest tier first. If confidence is below the threshold, escalate to the next tier.

Error handling: handle all tiers failing and cascade timeouts.

Edge cases: handle requests that need specific model capabilities, and handle models with different context limits.

Best practices: track cost per request, latency per tier, and escalation rate.

Testing: verify correct tier selection based on request complexity.