Build This Pattern

Copy this prompt and paste it into Claude Code, OpenCode, Codex, or Cursor to implement this pattern.

Compare the four main transformer architectures: encoder-only (BERT-style), decoder-only (GPT-style), encoder-decoder (T5-style), and Mixture of Experts (MoE-style).

Architecture: for each type, describe the core design, training objective, inference pattern, and representative models. Compare them across these dimensions: training efficiency (compute, data, stability), inference speed (latency, throughput, memory), parameter count versus effective compute per token, use case suitability (understanding versus generation), scaling behavior, and implementation complexity. Create a decision matrix with clear recommendations: text classification to encoder-only, chat and creative writing to decoder-only, translation and summarization to encoder-decoder, and massive scale under compute constraints to MoE.

Error handling: discuss the failure modes specific to each architecture: encoding forgetting in encoder-only, hallucination in decoder-only, encoder-decoder alignment failures, and expert imbalance in MoE.

Edge cases: cover tasks that cross architecture boundaries (e.g., using decoder-only for classification via prompting), performance when scaling beyond typical parameter ranges, and hardware-specific considerations (MoE communication overhead on slow interconnects).

Best practices: include guidance on choosing the right architecture for a given deployment constraint (latency-sensitive, memory-bound, throughput-oriented). Discuss hybrid approaches such as using encoder embeddings as input to a decoder model.

Testing: suggest evaluating multiple architectures on the same downstream task under controlled budgets, measuring both quality metrics and system-level performance (latency, throughput, memory, cost per query).
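As a rough illustration of two items the prompt asks for, the sketch below encodes the decision matrix as a lookup table and works through the "parameter count versus effective compute per token" distinction for an MoE model. The task labels, the `recommend` function, and the `MoEConfig` class are hypothetical names introduced here, and the parameter figures are made-up round numbers, not measurements of any real model. The MoE arithmetic is deliberately simplified: it assumes top-k routing and ignores router parameters.

```python
from dataclasses import dataclass

# Task-to-architecture mapping copied from the prompt's decision matrix.
# The dictionary keys are hypothetical labels chosen for this sketch.
RECOMMENDATIONS = {
    "text_classification": "encoder-only",
    "chat": "decoder-only",
    "creative_writing": "decoder-only",
    "translation": "encoder-decoder",
    "summarization": "encoder-decoder",
    "massive_scale_compute_constrained": "moe",
}


def recommend(task: str) -> str:
    """Return the architecture the decision matrix suggests for a task label."""
    try:
        return RECOMMENDATIONS[task]
    except KeyError:
        raise ValueError(f"no recommendation for task {task!r}") from None


@dataclass
class MoEConfig:
    """Simplified MoE parameter accounting (assumption: top-k routing, no router params)."""

    shared_params: int      # parameters every token uses (attention, embeddings, ...)
    params_per_expert: int  # parameters in one expert, summed over all layers
    num_experts: int        # experts available to the router
    experts_per_token: int  # experts actually selected for each token (top-k)

    def total_params(self) -> int:
        # Everything that must be stored in memory.
        return self.shared_params + self.params_per_expert * self.num_experts

    def active_params(self) -> int:
        # Parameters that take part in computing a single token.
        return self.shared_params + self.params_per_expert * self.experts_per_token


if __name__ == "__main__":
    print(recommend("translation"))
    # Hypothetical round numbers: 1B shared, 0.5B per expert, 8 experts, top-2 routing.
    cfg = MoEConfig(1_000_000_000, 500_000_000, 8, 2)
    print(cfg.total_params(), cfg.active_params())
```

With these illustrative numbers the model stores 5B parameters but only 2B participate per token, which is why the prompt pairs MoE with the "massive scale with compute constraints" case while still flagging memory and interconnect overhead as concerns.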