KV-Cache Optimization: AI Pattern


Build This Pattern

Copy this prompt and paste it into Claude Code, OpenCode, Codex, or Cursor to implement this pattern.

Explain KV-cache optimization for transformer inference.

Architecture: describe how naive autoregressive generation recomputes the key and value projections for all previous tokens at each step, and how caching eliminates this redundancy. Walk through the implementation: allocate K and V tensors per attention layer, fill them while processing the prompt (prefill), then append the new token's projections at each generation step. Derive the reduction in per-step attention cost from O(n^2) without a cache to O(n) with one, where n is the current sequence length and only the new token's query attends over the cached keys and values. Cover memory overhead: the KV-cache grows linearly with sequence length and batch size, and can dominate VRAM for long contexts.

Error handling: discuss stale or missing cache state when generation is interrupted or reset, accumulated numerical error over very long generations (for example with reduced-precision or quantized cache entries), and memory fragmentation with variable-length sequences.

Edge cases: cover very long generations where the cache exceeds VRAM, beam search, which needs a separate cache (or shared prefix blocks) per hypothesis, and speculative decoding, where several tokens are proposed per step and the cache entries of rejected tokens must be discarded.

Best practices: include practical guidance on implementing PagedAttention (non-contiguous memory blocks), choosing between MHA, MQA, and GQA based on the trade-off between memory, latency, and quality, and applying KV-cache quantization (INT8, FP8) with calibration data. Discuss sliding-window caches for generation of unbounded length.

Testing: suggest measuring peak memory usage against sequence length, validating cached outputs against a non-cached baseline, and profiling cache-related operations to find bottlenecks.
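For orientation, the core of what the prompt asks for can be sketched in a few lines. This is a minimal, illustrative sketch and not a reference implementation: it assumes a single attention layer with a single head, plain Python lists instead of GPU tensors, and random weights. The names `KVCache`, `step_cached`, `step_uncached`, and `kv_cache_bytes` are hypothetical helpers invented for this example. It shows the prefill and append steps, checks the cached output against a non-cached baseline, and estimates cache size from the model shape.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over all keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

class KVCache:
    """Hypothetical per-layer cache: one K and one V vector per token seen."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def step_uncached(Wq, Wk, Wv, hidden):
    """Baseline: re-project K and V for every token, attend from the last one."""
    keys = [matvec(Wk, x) for x in hidden]
    values = [matvec(Wv, x) for x in hidden]
    return attend(matvec(Wq, hidden[-1]), keys, values)

def step_cached(Wq, Wk, Wv, cache, x_new):
    """Cached: project only the new token, append it, attend over the cache."""
    cache.append(matvec(Wk, x_new), matvec(Wv, x_new))
    return attend(matvec(Wq, x_new), cache.keys, cache.values)

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Total size of K plus V over all layers (the leading 2 covers K and V)."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

def rand_matrix(d):
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

random.seed(0)
d = 4
Wq, Wk, Wv = rand_matrix(d), rand_matrix(d), rand_matrix(d)
# Stand-ins for the hidden states of 6 tokens (3 prompt tokens, 3 generated).
hidden = [[random.gauss(0, 1) for _ in range(d)] for _ in range(6)]

cache = KVCache()
for x in hidden[:3]:  # prefill: populate the cache from the prompt
    step_cached(Wq, Wk, Wv, cache, x)

max_diff = 0.0
for t in range(3, len(hidden)):  # decode: one new token per step
    cached = step_cached(Wq, Wk, Wv, cache, hidden[t])
    baseline = step_uncached(Wq, Wk, Wv, hidden[: t + 1])
    max_diff = max(max_diff, max(abs(a - b) for a, b in zip(cached, baseline)))

# Assumed example shape: 32 layers, head_dim 128, 4096 tokens, batch 1, FP16.
mha_bytes = kv_cache_bytes(32, 32, 128, 4096, 1)  # 32 KV heads (MHA)
gqa_bytes = kv_cache_bytes(32, 8, 128, 4096, 1)   # 8 KV heads (GQA)
```

In the baseline, each step re-projects every previous token, while the cached step projects only the new one, which is where the per-step saving comes from. Under the assumed shape, the size estimate works out to 2 GiB for 32 KV heads and a quarter of that for 8, which illustrates why the choice between MHA and GQA matters for long contexts.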