Summary

Mixture of Experts (MoE) models use sparse activation to increase model capacity without a proportional increase in per-token compute: each token is processed by only a few expert sub-networks rather than by the whole model. Models like Mixtral, Switch Transformer, and GLaM have shown that MoE architectures can perform strongly at total parameter counts ranging from tens of billions to over a trillion.

Key Characteristics

  • Sparse Activation: Only a small subset of experts (commonly 1 or 2, as in Switch Transformer and Mixtral) is active per token, so per-token compute scales with the active experts rather than the total
  • Gating Network: Learns to route tokens to the most appropriate experts based on input content
  • Massive Parameters: Total parameter counts can reach the trillion scale while per-token inference compute stays close to that of a much smaller dense model
  • Expert Specialization: Different experts can develop distinct strengths for different types of inputs
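
The routing described in the list above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not how any particular model implements it: the names (route_top_k, moe_forward, make_expert) are hypothetical, each expert is a single random linear map standing in for a feed-forward network, and the gate weights of the selected experts are renormalized to sum to 1, which is one common convention among several.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a (rows x cols) nested-list matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def route_top_k(token, router_weights, k=2):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts.

    Gate weights are the softmax probabilities of the selected experts,
    renormalized so they sum to 1 (an assumed convention).
    """
    probs = softmax(matvec(router_weights, token))
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(token, router_weights, experts, k=2):
    """Run only the k selected experts and mix their outputs by gate weight."""
    selected = route_top_k(token, router_weights, k)
    output = [0.0] * len(token)
    for idx, gate in selected:
        expert_out = experts[idx](token)  # the other experts are never evaluated
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, selected

def make_expert(rng, dim):
    """Toy expert: one random linear map standing in for an FFN (hypothetical)."""
    weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
    return lambda token: matvec(weights, token)

rng = random.Random(0)
dim, num_experts = 4, 8
router_weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_experts)]
experts = [make_expert(rng, dim) for _ in range(num_experts)]

token = [0.5, -1.0, 0.25, 2.0]
output, selected = moe_forward(token, router_weights, experts, k=2)
print(selected)  # two (expert_index, gate_weight) pairs, highest gate first
print(output)    # weighted combination of the two selected experts' outputs
```

With 8 experts and k=2, only two of the eight expert functions are called for this token. That is the source of the compute saving: the cost of the expert step depends on k, not on the number of experts.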

Popular Models

  • Mixtral 8x7B: 8 experts per layer, 2 active per token, 46.7B total parameters, open weights
  • Switch Transformer: Google Research (2021), simplified top-1 routing (one expert per token), up to 1.6T parameters
  • GLaM: Google (2021), 64 experts per layer, 1.2T total parameters
  • DeepSeek V2: 236B total parameters, 21B active per token
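
The gap between total and active parameters can be made concrete with a little arithmetic. The sketch below uses the Mixtral 8x7B figures from the list (46.7B total, 8 experts, 2 active) plus one number that is not on this page: the roughly 12.9B parameters used per token, as reported in Mistral AI's release announcement. It assumes a simplified model in which parameters are either shared by every token (attention, embeddings) or belong to exactly one expert; the function name split_parameters is hypothetical.

```python
def split_parameters(total_b, active_b, num_experts, active_experts):
    """Solve total = shared + N * per_expert and active = shared + k * per_expert.

    All figures are in billions of parameters. "per_expert" means one expert
    summed across every MoE layer; "shared" covers everything every token uses.
    """
    per_expert = (total_b - active_b) / (num_experts - active_experts)
    shared = total_b - num_experts * per_expert
    return shared, per_expert

# 46.7B total, 8 experts, 2 active: from the list above.
# 12.9B active per token: outside figure from Mistral AI's announcement.
shared, per_expert = split_parameters(46.7, 12.9, num_experts=8, active_experts=2)
active_fraction = 12.9 / 46.7

print(f"shared ~ {shared:.1f}B, per expert ~ {per_expert:.1f}B, active ~ {active_fraction:.0%}")
# prints: shared ~ 1.6B, per expert ~ 5.6B, active ~ 28%
```

Under these assumptions each token touches a bit more than a quarter of the weights. It also shows why "8x7B" totals 46.7B rather than 56B: only the expert layers are replicated eight times, while the shared parameters are counted once.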

Build This Pattern

Copy this prompt and paste it into Claude Code, OpenCode, Codex, or Cursor to implement this pattern.

Explain Mixture of Experts (MoE) architecture for transformers. Architecture: describe how dense FFN layers are replaced with multiple expert sub-networks, with a learned gating network that selects top-k experts per token (typically k=2). Explain sparse activation: only a small subset of experts process each token, enabling vastly more parameters without proportional compute. Walk through the routing mechanism with softmax over expert logits, top-k selection, and weighted combination of expert outputs. Cover the load balancing loss that prevents expert collapse. Reference representative model families. Error handling: discuss expert collapse where the router assigns all tokens to a subset of experts, routing instability during training, and expert capacity overflow where tokens are dropped or padded. Edge cases: behavior during inference when an expert or router fails, handling tokens that fall outside all expert specializations, and performance at different capacity factors. Discuss scaling to hundreds of experts across multiple machines. Best practices: include guidance on choosing number of experts and top-k value based on compute budget, implementing efficient all-to-all communication for distributed MoE, and using expert choice routing as an alternative for better load balance. Testing: suggest evaluating expert utilization entropy, load balancing loss contribution, and per-expert specialization through activation statistics.