
Prompt Caching-Aware Layout


Summary

A prompt caching-aware layout arranges prompt sections so that the prefix of the prompt -- the part that is identical across many requests -- is contiguous and as long as possible. Providers that support prompt caching can then reuse the processed representation of that static prefix instead of recomputing it, which reduces both latency and cost.

How it works

  1. Static first -- place system instructions, tool definitions, and few-shot examples at the very beginning.
  2. Semi-dynamic next -- place session-level context that changes infrequently (conversation summary, user profile) directly after the static block.
  3. Dynamic last -- place the current user input and immediate context at the end.
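
The ordering above can be sketched as a small helper that always concatenates the three tiers in the same order. This is a minimal sketch rather than a provider-specific implementation: the names PromptLayout, static_prefix, and render are hypothetical, and joining sections with blank lines is an assumption about how the final prompt is serialized.

```python
from dataclasses import dataclass, field


@dataclass
class PromptLayout:
    """Hypothetical container for the three tiers, most stable first."""

    # Tier 1: system instructions, tool definitions, few-shot examples.
    static: list[str] = field(default_factory=list)
    # Tier 2: conversation summary, user profile.
    semi_dynamic: list[str] = field(default_factory=list)
    # Tier 3: current user input and immediate context.
    dynamic: list[str] = field(default_factory=list)

    def static_prefix(self) -> str:
        """Return only the part expected to be identical across requests."""
        return "\n\n".join(self.static)

    def render(self) -> str:
        """Concatenate the tiers in a fixed order: static, semi-dynamic, dynamic."""
        return "\n\n".join(self.static + self.semi_dynamic + self.dynamic)
```

Two requests built from the same static list then share an identical leading block, regardless of what the later tiers contain.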

Cache strategy

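Maximize prefix reuse across requests by keeping the static portion as large as possible and ensuring it does not change between turns. Prefix caches generally match from the start of the prompt, so a change at any position prevents reuse of everything after it. For that reason, avoid interleaving static and dynamic content, and keep volatile values such as timestamps or request IDs out of the static portion.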
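
One way to check this during development is to compare consecutive prompts and measure how much of the new prompt matches the previous one from the start. The sketch below is an approximation under stated assumptions: it compares characters, whereas providers match on tokens, and the function names shared_prefix_ratio and prefix_fingerprint are hypothetical.

```python
import hashlib
from os.path import commonprefix


def shared_prefix_ratio(previous: str, current: str) -> float:
    """Fraction of `current` that matches `previous` from the first character.

    Character-level comparison is only a proxy for token-level cache matching.
    """
    if not current:
        return 0.0
    return len(commonprefix([previous, current])) / len(current)


def prefix_fingerprint(static_prefix: str) -> str:
    """Hash the static prefix so unintended changes between turns are easy to spot."""
    return hashlib.sha256(static_prefix.encode("utf-8")).hexdigest()


static = "SYSTEM: You are a helpful assistant.\n"

# Static first: the shared prefix covers the whole static block.
layered_a = static + "USER: hello"
layered_b = static + "USER: goodbye"

# Dynamic content placed before the static block: the prompts diverge almost immediately.
interleaved_a = "TIME: 10:00\n" + static + "USER: hello"
interleaved_b = "TIME: 10:01\n" + static + "USER: hello"

layered_ratio = shared_prefix_ratio(layered_a, layered_b)
interleaved_ratio = shared_prefix_ratio(interleaved_a, interleaved_b)
```

Here layered_ratio is much higher than interleaved_ratio, even though both pairs contain the same static text. Logging prefix_fingerprint per request is one way to notice when the static portion changes unexpectedly.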

Benefits

  • Lower latency: Cached prefixes skip re-computation on subsequent requests.
  • Reduced cost: Many providers charge less for cached prompt tokens.
  • Higher throughput: Cache hits reduce per-request processing time on the server side.
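
The cost benefit can be estimated with simple arithmetic. The sketch below is illustrative only: estimated_input_cost is a hypothetical helper, the price and discount are parameters the caller must supply from the provider's actual pricing, and the model ignores any surcharge a provider may apply when a cache entry is first written.

```python
def estimated_input_cost(
    total_tokens: int,
    cached_tokens: int,
    price_per_token: float,
    cached_discount: float,
) -> float:
    """Estimate the input cost of one request when part of the prompt is a cache hit.

    cached_discount is the fractional price reduction for cached tokens
    (0.0 = no discount, 1.0 = free). All values are caller-supplied assumptions.
    """
    if not 0 <= cached_tokens <= total_tokens:
        raise ValueError("cached_tokens must be between 0 and total_tokens")
    if not 0.0 <= cached_discount <= 1.0:
        raise ValueError("cached_discount must be between 0.0 and 1.0")
    uncached_tokens = total_tokens - cached_tokens
    cached_cost = cached_tokens * price_per_token * (1.0 - cached_discount)
    return uncached_tokens * price_per_token + cached_cost
```

Calling it once with cached_tokens=0 and once with the expected static prefix length gives a rough per-request saving for a given layout.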

Build This Pattern

Copy this prompt and paste it into Claude Code, OpenCode, Codex, or Cursor to implement this pattern.

Build me a prompt layout optimizer for caching. Architecture: structure prompts in three sections: STATIC (system instructions, tool definitions, fixed examples), SEMI-DYNAMIC (per-session context, conversation history), and DYNAMIC (current user input, fresh tool outputs). Put static content first, then semi-dynamic, then dynamic. Error handling: detect when the cache hit rate drops below a configurable threshold. Edge cases: handle very long static sections and differing cache behavior across providers. Best practices: measure cache hit rate and latency improvement. Testing: verify the cache hit rate with repeated identical prefixes.