Mixture of Experts

Sparse MoE

Models with specialized subnetworks for scalable capacity.

Overview

Mixture of Experts (MoE) models represent a shift in neural network design, using sparse activation to dramatically increase model capacity while maintaining computational efficiency. Models like Mixtral, Switch Transformer, and GLaM have shown that MoE architectures can scale to hundreds of billions or even trillions of parameters while keeping per-token compute close to that of a much smaller dense model.

Key characteristics

  • Sparse Activation: Only a subset of experts (typically 2-4) are active per token, reducing computation
  • Gating Network: Learns to route tokens to the most appropriate experts based on input content
  • Massive Parameters: Can have trillions of parameters while keeping inference compute manageable
  • Expert Specialization: Different experts can develop distinct strengths for different types of inputs
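The gating step behind sparse activation can be sketched in a few lines of TypeScript. The `softmax` and `topKGate` helpers below are illustrative names, not part of any framework: the gate scores every expert, keeps the top-k, and renormalizes their weights so they sum to 1.

```typescript
// Numerically stable softmax over the gate logits.
function softmax(logits: number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Select the top-k experts for one token and renormalize their gate weights.
function topKGate(logits: number[], k: number): { index: number; weight: number }[] {
  const ranked = softmax(logits)
    .map((p, index) => ({ index, weight: p }))
    .sort((a, b) => b.weight - a.weight)
    .slice(0, k);
  const total = ranked.reduce((acc, e) => acc + e.weight, 0);
  return ranked.map((e) => ({ index: e.index, weight: e.weight / total }));
}
```

In a real model the gate logits come from a learned linear projection of the token's hidden state; here they are passed in directly.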

Popular models

  • Mixtral 8x7B: 8 experts per layer, 2 active per token, 46.7B total parameters, open weights
  • Switch Transformer: Google Research (2021), simplified routing, trillion parameter scale
  • GLaM: Google (2021), 64 experts per layer, 1.2T total parameters
  • DeepSeek V2: DeepSeek (2024), 236B total parameters with roughly 21B active per token

Core steps

  1. Input Processing: Token embeddings are passed through the network
  2. Gating Computation: Gating network computes expert selection scores
  3. Expert Routing: Top-k experts are selected based on gating scores
  4. Parallel Processing: Selected experts process the token in parallel
  5. Weighted Combination: Expert outputs are weighted by gating scores and combined
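The five steps above can be condensed into one illustrative forward pass. `moeForward` and the `Expert` type are hypothetical names for this sketch, not a framework API; each expert stands in for a feed-forward subnetwork, and the gate logits are taken as given rather than computed by a learned projection.

```typescript
// An expert is any function from a hidden vector to a hidden vector.
type Expert = (x: number[]) => number[];

function moeForward(token: number[], experts: Expert[], gateLogits: number[], k: number): number[] {
  // Step 2: gating scores via softmax over the logits.
  const max = Math.max(...gateLogits);
  const exps = gateLogits.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  const probs = exps.map((e) => e / sum);

  // Step 3: route to the top-k experts by gate score.
  const selected = probs
    .map((p, i) => ({ i, p }))
    .sort((a, b) => b.p - a.p)
    .slice(0, k);
  const norm = selected.reduce((acc, s) => acc + s.p, 0);

  // Steps 4-5: run only the selected experts and combine their outputs,
  // weighted by the renormalized gate scores.
  const out: number[] = new Array(token.length).fill(0);
  for (const { i, p } of selected) {
    const y = experts[i](token);
    for (let d = 0; d < out.length; d++) out[d] += (p / norm) * y[d];
  }
  return out;
}
```

With k much smaller than the number of experts, most expert parameters are untouched for any given token, which is where the compute savings come from.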

Routing strategies

  • Top-k: Select top-k scoring experts
  • Expert choice: Each expert selects tokens
  • Hash-based: Fixed routing based on hashing
  • Load-balanced: Explicit capacity constraints
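Hash-based routing is the simplest of these to sketch: the expert is a deterministic function of the token id, so no gating network is needed. The multiplicative hash below is an arbitrary choice for illustration; real systems differ in the hash they use.

```typescript
// Fixed hash-based routing: each token id always maps to the same expert.
function hashRoute(tokenId: number, numExperts: number): number {
  // Knuth-style multiplicative hash, truncated to 32 bits; any deterministic
  // function of the token id would serve the same purpose.
  return ((tokenId * 2654435761) >>> 0) % numExperts;
}
```

Because the mapping never changes, hash routing sidesteps load-balancing losses and expert collapse, at the cost of ignoring input content entirely.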

Training considerations

  • Load imbalance: Auxiliary balancing loss
  • Expert collapse: Noisy top-k routing
  • Training stability: Gradient clipping, warmup
  • Inference efficiency: Expert caching, batching
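A common remedy for load imbalance is the auxiliary loss introduced by Switch Transformer: for each expert, multiply the fraction of tokens routed to it by the mean gate probability it receives, sum over experts, and scale by the number of experts. The sketch below assumes hard top-1 assignments; `loadBalancingLoss` is an illustrative name for this document, not a library function.

```typescript
// Switch-Transformer-style auxiliary load-balancing loss:
//   loss = E * sum_i f_i * P_i
// where f_i is the fraction of tokens routed to expert i and
// P_i is the mean gate probability assigned to expert i over the batch.
function loadBalancingLoss(
  gateProbs: number[][], // [token][expert] gate probabilities
  assignments: number[], // top-1 expert index per token
  numExperts: number,
): number {
  const tokens = assignments.length;
  const f: number[] = new Array(numExperts).fill(0);
  const p: number[] = new Array(numExperts).fill(0);
  for (let t = 0; t < tokens; t++) {
    f[assignments[t]] += 1 / tokens;
    for (let e = 0; e < numExperts; e++) p[e] += gateProbs[t][e] / tokens;
  }
  let loss = 0;
  for (let e = 0; e < numExperts; e++) loss += f[e] * p[e];
  return numExperts * loss;
}
```

The loss reaches its minimum of 1.0 when routing is perfectly uniform, so adding it (with a small coefficient) to the task loss nudges the gate toward balanced expert usage.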

Architecture overview

MoE models replace the dense feed-forward layers in Transformers with sparse expert layers. A gating network routes each token to the top-k experts, and their outputs are weighted and combined.

Applications

Large Language Models

  • Mixtral: Mistral AI's open-weight MoE family (8x7B and 8x22B)
  • DBRX: Databricks' open model with 16 experts per layer, 4 active per token
  • Grok-1: xAI's open-weight MoE model with 314B total parameters

Benefits and Challenges

  • Benefits: Scalability, specialization, efficiency, parallelism
  • Challenges: Complex training, load balancing, memory requirements

Industry applications

  • AI Research: Pushing the boundaries of model scale and capability
  • Enterprise AI: Large-scale deployments with efficient inference
  • Multilingual Models: Experts specialized for different languages

Mixture of Experts Implementation

// Mixture of Experts recipe using OpenAI
// Install: bun add openai

import OpenAI from "openai";

async function main() {
  const input = "Add your prompt here.";
  const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
  const system = "You are a senior AI engineer and technical writer. Explain how the architecture applies to the request and outline practical implementation guidance. Recipe: Mixture of Experts. Description: Models with specialized subnetworks for scalable capacity. Focus: Sparse MoE. Provide actionable, implementation-ready guidance.";
  const user = `Request: ${input}`;

  const openaiResponse = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: system },
      { role: "user", content: user },
    ],
  });

  const openaiText = openaiResponse.choices[0]?.message?.content?.trim() ?? "";

  console.log(openaiText);
}

main().catch((error) => {
  console.error(error);
  process.exitCode = 1;
});