Summary
A prompt-caching-aware layout arranges prompt sections so that the prefix shared across requests -- the part that is identical from one request to the next -- is contiguous and as long as possible. This lets API providers cache the processed representation of that static prefix and reuse it, reducing both latency and cost.
How it works
- Static first -- place system instructions, tool definitions, and few-shot examples at the very beginning.
- Semi-dynamic next -- follow with session-level context that changes infrequently (conversation summary, user profile).
- Dynamic last -- the current user input and immediate context go at the end.
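
The ordering above can be sketched as a small prompt builder. This is a minimal illustration rather than any provider's API: `PromptSections`, `build_prompt`, and the sample section text are hypothetical names and values, and the sections are simply joined as plain text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptSections:
    """Hypothetical container for the three tiers of a prompt."""
    static: tuple[str, ...]        # system instructions, tool definitions, few-shot examples
    semi_dynamic: tuple[str, ...]  # conversation summary, user profile
    dynamic: tuple[str, ...]       # current user input and immediate context


def build_prompt(sections: PromptSections) -> str:
    """Join sections in cache-friendly order: static, semi-dynamic, dynamic."""
    ordered = [*sections.static, *sections.semi_dynamic, *sections.dynamic]
    return "\n\n".join(ordered)


# Illustrative content only; real static sections are usually much longer.
STATIC = (
    "You are a support assistant.",
    "Tools: search(query)",
    "Example: Q: hi A: hello",
)
PROFILE = ("User profile: prefers short answers",)

first = build_prompt(PromptSections(STATIC, PROFILE, ("How do I reset my password?",)))
second = build_prompt(PromptSections(STATIC, PROFILE, ("Where is my invoice?",)))

# Both requests begin with the same static text, so that prefix is reusable.
static_prefix = "\n\n".join(STATIC)
print(first.startswith(static_prefix) and second.startswith(static_prefix))
```

Keeping the tiers as separate fields makes it hard to place per-request content ahead of the static block by accident.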
Cache strategy
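Maximize prefix reuse across requests by keeping the static portion as large as possible and byte-for-byte identical between turns. Prefix caches generally match from the start of the prompt, so any change invalidates reuse for everything after it. Avoid interleaving static and dynamic content: a dynamic value (for example, a timestamp) placed inside the static block cuts the reusable prefix short at that point.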
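
The effect of interleaving can be checked with a rough sketch that compares the shared prefix of two consecutive prompts. It measures characters, not tokens, so it is only a proxy for what a provider's cache would see. The helper names (`shared_prefix_len`, `static_first`, `interleaved`) and the prompt text are hypothetical.

```python
import os


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that two prompts have in common."""
    return len(os.path.commonprefix([a, b]))


STATIC = "SYSTEM: You are a helpful assistant.\nTOOLS: search, calculator\n"


def static_first(timestamp: str, user_input: str) -> str:
    # Static block first, dynamic values afterwards.
    return f"{STATIC}TIME: {timestamp}\nUSER: {user_input}"


def interleaved(timestamp: str, user_input: str) -> str:
    # A dynamic timestamp ahead of the static block breaks the prefix early.
    return f"TIME: {timestamp}\n{STATIC}USER: {user_input}"


good = shared_prefix_len(static_first("10:00", "hi"), static_first("10:05", "bye"))
bad = shared_prefix_len(interleaved("10:00", "hi"), interleaved("10:05", "bye"))

# With the static block first, the whole block is shared between requests;
# with the timestamp first, the shared prefix ends inside the timestamp.
print(good >= len(STATIC), bad < len(STATIC))
```

A check like this could run in a test suite to catch a dynamic field that has crept into the static section.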
Benefits
- Lower latency: Cached prefixes skip re-computation on subsequent requests.
- Reduced cost: Many providers charge less for cached prompt tokens.
- Higher throughput: Cache hits reduce per-request processing time on the server side.
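
The cost benefit can be estimated with simple arithmetic. The sketch below assumes a hypothetical pricing model in which cached prompt tokens are billed at a fixed fraction of the normal input price. The function name, prices, and discount are placeholders, not any provider's actual rates. It also ignores any cache-write surcharge or cache expiry a provider may apply.

```python
def prompt_cost(total_tokens: int, cached_tokens: int,
                price_per_token: float, cached_discount: float) -> float:
    """Estimate input cost when `cached_tokens` of the prompt hit the cache.

    `cached_discount` is the fraction of the normal price charged for cached
    tokens (hypothetical; 0.5 means cached tokens cost half price).
    """
    if not 0 <= cached_tokens <= total_tokens:
        raise ValueError("cached_tokens must be between 0 and total_tokens")
    uncached = total_tokens - cached_tokens
    return uncached * price_per_token + cached_tokens * price_per_token * cached_discount


# Placeholder numbers chosen only to make the arithmetic easy to follow.
no_cache = prompt_cost(10_000, 0, price_per_token=1.0, cached_discount=0.5)
with_cache = prompt_cost(10_000, 8_000, price_per_token=1.0, cached_discount=0.5)
print(no_cache, with_cache)
```

Under these assumptions the saving grows with the share of the prompt in the cached prefix, which is why the layout puts as much content as possible into the static section.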