Encoder-decoder models use both halves of the original Transformer architecture: an encoder that reads the whole input sequence and a decoder that generates the output one token at a time. Models such as T5 (Text-to-Text Transfer Transformer) and BART are well suited to sequence-to-sequence tasks like translation and summarization.
Key Characteristics
Sequence-to-Sequence: Process an input sequence and generate an output sequence of potentially different length
Cross-Attention: The decoder attends to the encoder's outputs, so each generated token is conditioned on the input sequence
Versatile Tasks: Many different tasks can be handled by one model once they are framed as text-to-text problems
Bidirectional Encoder: The encoder's self-attention is unmasked, so each input token sees context on both its left and its right
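The three attention patterns behind these characteristics can be illustrated with a minimal sketch. It is a simplification, not how any particular model is implemented: it uses plain Python lists, a single head, and no learned query/key/value projections, and the `attend` helper and toy vectors are made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values, causal=False):
    """Scaled dot-product attention (hypothetical helper, no learned weights).

    queries: list of d-dim vectors, one per query position
    keys, values: lists of d-dim vectors, one per key position
    causal=True hides key positions that come after the query position.
    Returns (outputs, weights).
    """
    d = len(keys[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # masked: gets weight 0.0
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d))
        w = softmax(scores)
        weights.append(w)
        outputs.append([
            sum(w[j] * values[j][c] for j in range(len(values)))
            for c in range(len(values[0]))
        ])
    return outputs, weights

# Toy data: 3 source positions, 2 target positions generated so far.
encoder_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_in = [[0.5, 0.5], [1.0, 0.0]]

# Encoder self-attention: bidirectional, every position sees every other.
enc_out, enc_w = attend(encoder_in, encoder_in, encoder_in)
# Decoder self-attention: causal, position i sees only positions <= i.
dec_out, dec_w = attend(decoder_in, decoder_in, decoder_in, causal=True)
# Cross-attention: decoder states are the queries, encoder outputs the keys and values.
ctx, cross_w = attend(dec_out, enc_out, enc_out)

print("cross-attention weights:", cross_w)
```

Each row of `cross_w` has one weight per source position, which is why the output length is free to differ from the input length: the decoder can run for any number of steps while always looking back at the same encoder outputs.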
Popular Models
T5 (Text-to-Text Transfer Transformer): Frames all NLP tasks as text generation
BART (Bidirectional and Auto-Regressive Transformers): Combines bidirectional encoding with autoregressive decoding
mBART: Multilingual variant of BART for machine translation
Pegasus: Pre-trained with a gap-sentence generation objective designed for abstractive summarization
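To make the text-to-text framing concrete, here is a minimal sketch of how different tasks could be turned into plain input strings using T5-style prefixes. The `TASK_PREFIXES` table, its keys, and the `to_text_to_text` helper are hypothetical names chosen for this example; a real setup would pass the resulting strings to a tokenizer and model.

```python
# Hypothetical mapping from task name to a T5-style text prefix.
TASK_PREFIXES = {
    "summarize": "summarize: ",
    "translate_en_de": "translate English to German: ",
}

def to_text_to_text(task, text):
    """Return the prefixed input string for a registered task.

    Whitespace in the input is collapsed so the prefix format stays consistent.
    """
    if task not in TASK_PREFIXES:
        raise KeyError(f"no prefix registered for task {task!r}")
    return TASK_PREFIXES[task] + " ".join(text.split())

examples = [
    to_text_to_text("summarize", "The quick brown fox  jumped over the lazy dog."),
    to_text_to_text("translate_en_de", "That is good."),
]

for line in examples:
    print(line)
```

Because every task becomes "string in, string out", the same model, loss, and decoding code can serve all of them; only the prefix changes.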
Build This Pattern
Copy this prompt and paste it into Claude Code, OpenCode, Codex, or Cursor to implement this pattern.
Explain the full encoder-decoder (T5-style) Transformer architecture.
Architecture: describe the two-component design: an encoder with bidirectional attention processes the full input sequence, and a decoder with cross-attention to the encoder representations generates the output autoregressively. Explain the text-to-text framework, in which all NLP tasks are framed as text generation with task-specific prefixes (e.g., summarize:, translate English to German:).
Training objectives: walk through span corruption for T5 (masking contiguous spans of tokens), text infilling for BART (replacing each corrupted span with a single mask token), and denoising autoencoding.
Model families: compare key model families and trade-offs.
Contrast with decoder-only: encoder-decoder excels at tasks needing strong input understanding (translation, summarization, table-to-text) but typically has more parameters at the same depth and must run an encoder pass before decoding starts.
Error handling: discuss challenges with cross-attention over very long inputs, decoding errors amplified by poor encoder representations, and catastrophic forgetting during fine-tuning.
Edge cases: cover tasks without a clear prefix mapping, copy-heavy tasks (extractive summarization, data-to-text), and multilingual processing with vocabulary overlap or different scripts.
Best practices: include guidance on prefix design, choosing encoder-decoder versus decoder-only based on task requirements, and efficient decoding techniques like beam search with length penalty and early stopping.
Testing: suggest evaluating on standard benchmarks (CNN/DailyMail for summarization, WMT for translation), testing prefix formatting consistency, and measuring encoder-decoder alignment through cross-attention visualization.
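As a reference point for the span-corruption objective the prompt names, here is a minimal sketch of how a corrupted input and its target could be built. It makes two assumptions: the spans are picked by hand rather than sampled at random, and the sentinel names follow the `<extra_id_N>` convention used by the Hugging Face T5 tokenizer. The `span_corrupt` function is a hypothetical helper, not part of any library.

```python
def span_corrupt(tokens, spans):
    """Build a T5-style (corrupted input, target) pair.

    tokens: list of string tokens
    spans: sorted, non-overlapping (start, end) half-open index pairs to drop
    Each dropped span is replaced by one sentinel in the input; the target
    lists each sentinel followed by the tokens it replaced, then a final
    sentinel that marks the end of the target.
    """
    corrupted, target = [], []
    cursor = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        corrupted.extend(tokens[cursor:start])  # keep text before the span
        corrupted.append(sentinel)              # one sentinel per dropped span
        target.append(sentinel)
        target.extend(tokens[start:end])        # the dropped tokens
        cursor = end
    corrupted.extend(tokens[cursor:])           # keep text after the last span
    target.append(f"<extra_id_{len(spans)}>")   # closing sentinel
    return corrupted, target

tokens = "Thank you for inviting me to your party last week .".split()
inp, tgt = span_corrupt(tokens, [(2, 4), (8, 9)])

print(" ".join(inp))
print(" ".join(tgt))
```

The encoder would read the first string and the decoder would be trained to produce the second, so the model only has to generate the missing spans rather than reconstruct the whole input.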