Summary
Model cascade routing lowers average cost and latency by sending each request to the cheapest capable model first. A request is escalated to a more powerful (and more expensive) model only when the current model's output is not satisfactory.
How it works
- Tier system -- models are organized into tiers by capability and cost.
- Confidence check -- after a model produces an output, a confidence estimator evaluates whether the result is acceptable.
- Escalation -- if confidence is below the threshold, the request is forwarded to the next tier.
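The three steps above can be sketched as a short loop. This is a minimal illustration, not a reference implementation: the `Tier` class, the `route` function, the `estimate_confidence` callback, and the default threshold of 0.8 are all hypothetical names and values chosen for the example. It also assumes that when no tier meets the threshold, the output of the last (most capable) tier is returned.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Tier:
    """One cascade tier. `generate` stands in for a real model call (hypothetical)."""
    name: str
    generate: Callable[[str], str]


def route(
    request: str,
    tiers: Sequence[Tier],
    estimate_confidence: Callable[[str, str], float],
    threshold: float = 0.8,  # illustrative value, not taken from the source
) -> Tuple[str, List[Tuple[str, float]]]:
    """Try tiers in order (cheapest first) and escalate while confidence is low.

    Returns the accepted output and the list of (tier name, confidence) attempts.
    """
    if not tiers:
        raise ValueError("at least one tier is required")

    attempts: List[Tuple[str, float]] = []
    output = ""
    for tier in tiers:
        output = tier.generate(request)
        confidence = estimate_confidence(request, output)
        attempts.append((tier.name, confidence))
        if confidence >= threshold:
            break  # result is acceptable, stop escalating
    # Assumption: if every tier falls short, the last tier's output is returned.
    return output, attempts
```

Returning the attempt list alongside the output keeps enough information to compute escalation rate and per-tier latency later without changing the routing logic.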
Tiers
- Fast-cheap: Small models for simple tasks (classification, extraction, formatting).
- Balanced: Mid-size models for most routine reasoning and generation tasks.
- Powerful: Large frontier models reserved for complex reasoning, creative work, and edge cases.
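One way to represent these tiers in code is an ordered tuple, cheapest first, so that "the next tier" is simply the next element. The sketch below is illustrative only: `TierSpec`, `TIERS`, and `next_tier` are hypothetical names, and the `relative_cost` values are placeholders, not real prices.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TierSpec:
    name: str
    relative_cost: float            # placeholder units, not real prices
    intended_for: Tuple[str, ...]   # task types listed for this tier


# Ordered from cheapest to most expensive.
TIERS: Tuple[TierSpec, ...] = (
    TierSpec("fast-cheap", 1.0, ("classification", "extraction", "formatting")),
    TierSpec("balanced", 10.0, ("routine reasoning", "generation")),
    TierSpec("powerful", 100.0, ("complex reasoning", "creative work", "edge cases")),
)


def next_tier(name: str) -> Optional[TierSpec]:
    """Return the tier after `name`, or None if `name` is already the top tier.

    Raises ValueError if `name` is not a known tier.
    """
    names = [tier.name for tier in TIERS]
    index = names.index(name)
    return TIERS[index + 1] if index + 1 < len(TIERS) else None
```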
Metrics
- Cost per request: Average cost of serving a request across all tiers, weighted by the share of requests that reach each tier.
- Escalation rate: Percentage of requests that require a higher tier.
- Latency per tier: P50 and P95 response times for each model tier.
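These metrics can be computed from a per-request log of attempts. The following is a sketch under stated assumptions: the `Attempt` record, the `summarize` function, and the nearest-rank percentile method are choices made for this example. It treats a request as escalated when it has more than one attempt, charges each request for every tier it touched, and reports the escalation rate as a fraction rather than a percentage.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Attempt:
    """One model call made while serving a request (hypothetical record shape)."""
    tier: str
    cost: float
    latency_ms: float


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile of a non-empty sequence, for p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(requests: List[List[Attempt]]) -> dict:
    """Compute cost per request, escalation rate, and P50/P95 latency per tier."""
    if not requests:
        raise ValueError("no requests to summarize")

    # A request pays for every tier it passed through.
    total_cost = sum(a.cost for attempts in requests for a in attempts)
    # Assumption: more than one attempt means the request was escalated.
    escalated = sum(1 for attempts in requests if len(attempts) > 1)

    latencies: Dict[str, List[float]] = defaultdict(list)
    for attempts in requests:
        for a in attempts:
            latencies[a.tier].append(a.latency_ms)

    return {
        "cost_per_request": total_cost / len(requests),
        "escalation_rate": escalated / len(requests),  # fraction in [0, 1]
        "latency_ms": {
            tier: {"p50": percentile(v, 50), "p95": percentile(v, 95)}
            for tier, v in latencies.items()
        },
    }
```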