The MoE Promise and What It Leaves Out
The MoE efficiency argument is this: replace each dense FFN layer with E expert FFNs. Route each token to k of them (k ≪ E). Pay per-token compute proportional to k active experts, not E total experts. Use total parameters (and therefore model capacity) proportional to E, but compute proportional to k. Win.
DeepSeek-V3 makes this concrete: 671B total parameters, 37B active per token with E=256, k=8. The served model has the representational capacity of a 671B dense model at the per-token compute cost of a 37B dense model. If you believe that benchmark, the math is genuinely remarkable.
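That arithmetic is worth making explicit. A quick sanity check of the cited ratios; the only assumption is that per-token compute scales with active, not total, parameters:

```python
# Sanity check of the cited ratios. Assumption: per-token compute scales
# with ACTIVE parameters, not total parameters.
total_params  = 671e9          # DeepSeek-V3 total parameter count
active_params = 37e9           # parameters touched per token (k=8 of E=256)

capacity_per_flop = total_params / active_params
print(f"{capacity_per_flop:.1f}x parameters per unit of per-token compute")
# → 18.1x
```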
But the benchmark is almost always measured on a single GPU, running one token at a time, in a controlled evaluation harness. Production serving is different. In production, you have expert parallelism across multiple GPUs, dynamic routing that creates load imbalance, capacity buffers that occasionally drop tokens, and a router that must make decisions — and communicate them — millions of times per second. The router is not free. And the costs it introduces are systemic.
The router's computational cost is trivial. The router's communication, load imbalance, cache miss, and tail latency costs are not. They are what the headline throughput numbers are measured in spite of, not because of.
What the Router Actually Does, Step by Step
Let us be precise about the router's operation in a production MoE serving system. For each token in the current batch, per MoE layer:
1. Gate computation: Compute logits = W_gate × hidden, where W_gate has shape [d_model × E]. For d_model=7168, E=256 this is a 7168×256 matrix–vector product: ~3.7 MB of weight reads and ~3.7 MFLOP at BF16. Trivially fast.
2. Top-k selection: Find the k=8 highest logits and apply softmax over those k to get routing weights. For E=256, k=8: a top-k over 256 floats. Also trivial.
3. Dispatch: Group tokens by their assigned experts. For expert-parallel serving (experts sharded across GPUs), this requires an all-to-all communication: each GPU sends the tokens assigned to remote experts, and receives tokens assigned to its local experts from remote GPUs. This is not trivial.
4. Expert GEMM: Each GPU runs GEMMs for its local experts on the received token subset. This is the intended work: the FFN forward pass on the active subset.
5. Gather: Collect expert outputs back to their originating GPUs via another all-to-all, scale by the routing weights, and sum.
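Steps 1 and 2 are cheap enough to sketch in a few lines. A minimal NumPy version (shapes follow the text; note that the softmax runs over the selected k logits only, not over all E):

```python
import numpy as np

def route(hidden, W_gate, k=8):
    """Steps 1-2 of the router: gate logits, then top-k selection.
    hidden: [B, d_model], W_gate: [d_model, E]. Shapes are illustrative."""
    logits = hidden @ W_gate                                   # [B, E] gate scores
    topk_idx = np.argpartition(-logits, k, axis=-1)[:, :k]     # k highest per token
    topk_logits = np.take_along_axis(logits, topk_idx, axis=-1)
    # Softmax over only the selected k logits -> routing weights
    w = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    weights = w / w.sum(axis=-1, keepdims=True)
    return topk_idx, weights
```

Everything after this function returns — dispatch, expert GEMM, gather — is where the real cost lives.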
Steps 3 and 5 — dispatch and gather — are the router tax. They are all-to-all collective operations whose cost scales with the number of GPUs, the number of tokens, and the token-to-expert assignment pattern. They are also deeply unpredictable in latency because they depend on the actual routing decisions, which are input-dependent.
All-to-All at Scale: The Dispatch/Gather Tax Quantified
For a production MoE serving system with expert parallelism degree EP, each MoE layer requires two all-to-all collective operations. The data volume per all-to-all is:
```python
# Volume per all-to-all for one MoE layer
# B: batch size (tokens), d_model: model dimension, EP: expert parallel degree
bytes_per_dispatch = B * d_model * bytes_per_element
# Example: B=512, d_model=7168, BF16 (2 bytes)
bytes_per_dispatch_example = 512 * 7168 * 2      # = 7.3 MB
# Note: a lower bound. It assumes each token's hidden state crosses the
# fabric once; with k=8 experts spread across EP GPUs, replication can
# multiply this by up to min(k, EP - 1).

# Two all-to-alls per MoE layer (dispatch + gather)
bytes_per_moe_layer = 2 * bytes_per_dispatch     # = 14.6 MB

# Across L_moe MoE layers in the model
# DeepSeek-V3: ~58 MoE layers out of 61 total
total_alltoall_bytes = 58 * bytes_per_moe_layer  # = 847 MB per forward pass

# At 400 Gbps NVLink per direction:
# 847 MB / (400 Gbps / 8) = 847 MB / 50 GB/s ≈ 17 ms
# But this assumes perfect overlap with compute — rarely achieved
```
That is ~17 ms of all-to-all communication per forward pass at batch size 512. At batch size 64 (more typical for long-context serving), it scales down proportionally to ~2 ms. But communication time does not overlap cleanly with compute, because the expert GEMM cannot start until the dispatch all-to-all completes: the dispatch serializes with the compute. Only the gather can overlap with subsequent attention computation.
The Worst Property of Real Routing: It Is Never Uniform
The MoE efficiency math assumes that routing is perfectly balanced — each expert receives exactly B×k/E tokens per batch. In practice, routing distributions are highly skewed. Tokens with similar semantic properties activate similar experts. A batch of coding tokens will heavily load coding-specialized experts and leave math-specialized experts idle. A batch of multilingual tokens may have complex bimodal distributions across language-specialized expert subsets.
The consequence is structural. In expert-parallel serving, the total layer latency is determined by the slowest GPU — the one that received the most tokens from the all-to-all dispatch. An imbalanced batch where one GPU receives 3× the average token load forces all other GPUs to wait after completing their expert GEMMs.
At serving concurrency of 100 requests, the probability that at least one expert receives 2× the mean token load approaches 1 as batch size grows above ~64 tokens. Every such event adds a bubble to the entire MoE forward pass. Unlike GPU kernel latency — which is roughly deterministic — routing imbalance is input-dependent and cannot be predicted or pre-scheduled. It is the primary source of p99 tail latency spikes in production MoE serving.
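The claim is easy to check even under the most favorable assumption. A Monte Carlo sketch under uniform routing (real, semantically skewed routing is strictly worse):

```python
import random

def overload_probability(B=64, E=256, k=8, trials=200, seed=0):
    """Fraction of batches in which some expert receives >= 2x the mean
    load B*k/E, assuming UNIFORM routing of k distinct experts per token."""
    rng = random.Random(seed)
    mean = B * k / E                           # = 2 tokens/expert at these defaults
    hits = 0
    for _ in range(trials):
        counts = [0] * E
        for _ in range(B):
            for e in rng.sample(range(E), k):  # k distinct experts per token
                counts[e] += 1
        if max(counts) >= 2 * mean:
            hits += 1
    return hits / trials
```

At B=64 the mean load is only 2 tokens per expert, so even uniform randomness produces a 2× overload in essentially every batch; skew only raises the multiplier.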
The Capacity Buffer Tax
The standard mitigation for routing imbalance is the expert capacity buffer: each expert is allocated a capacity of C = (B × k / E) × capacity_factor token slots per step (B × k is the total number of token-to-expert assignments in the batch), where capacity_factor is typically 1.25. Tokens assigned to an expert that has already hit its capacity are dropped: they are not processed by that expert and its contribution to their output is zeroed out.
This introduces a quality–efficiency tradeoff that does not appear in evaluation benchmarks:
| Capacity Factor | Token Drop Rate | Output Quality | Max GPU Stall | Notes |
|---|---|---|---|---|
| 1.0 | ~8–15% | Noticeable degradation | Minimal | Aggressive, drops frequently |
| 1.25 | ~2–4% | Slight degradation | Moderate | Common production default |
| 1.5 | <0.5% | Near-lossless | High (worst-case 50% overhead) | Conservative, memory-heavy |
| 2.0 | ~0% | Lossless | Very high | Rarely used in serving |
The capacity buffer is a memory tax too: it pre-allocates expert input buffers of size capacity × d_model for every expert, on every GPU. For E=256 experts across 8 GPUs (32 experts per GPU), capacity_factor=1.25, d_model=7168, BF16: that is 32 × (B × k / E) × 1.25 × 7168 × 2 bytes per GPU. At B=512, this is ~9 MB of buffer allocation per GPU per MoE layer, roughly 530 MB per GPU across all 58 MoE layers, that exists solely to absorb routing imbalance.
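The sizing arithmetic as a function of the serving configuration (defaults follow the numbers used throughout this section):

```python
def capacity_buffer_bytes(B=512, E=256, k=8, EP=8, d_model=7168,
                          capacity_factor=1.25, bytes_per_el=2, n_moe_layers=58):
    """Per-GPU expert input buffer sizing for the capacity-factor scheme.
    Returns (bytes per MoE layer, bytes across all MoE layers)."""
    cap_per_expert = (B * k / E) * capacity_factor   # token slots per expert
    experts_per_gpu = E // EP
    per_layer = int(experts_per_gpu * cap_per_expert * d_model * bytes_per_el)
    return per_layer, per_layer * n_moe_layers
```

With the defaults this comes to ~9.2 MB per GPU per MoE layer, or ~530 MB per GPU across 58 MoE layers — memory that comes directly out of KV cache headroom.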
Cold Experts Are a KV Cache Problem for Weights
Here is the aspect of the router tax that connects most directly to the memory systems work elsewhere in this series: expert weights are not always hot in HBM.
For a large MoE model with E=256 experts, each expert's FFN weights occupy approximately 2–4 GB (depending on the model's hidden dimension per expert). Storing all expert weights for a 671B-parameter model requires the full ~1.3 TB of weight storage we calculated in the weight streaming essay. No single GPU holds all of it. In expert-parallel serving, each GPU holds E/EP experts. For EP=8, each GPU holds 32 experts × ~4 GB = ~128 GB — which already exceeds the H100's 80 GB HBM.
The consequence: expert weights are tiered. Frequently-routed ("hot") experts stay in HBM. Infrequently-routed ("cold") experts are evicted to DRAM or NVMe. When the router assigns a token to a cold expert, the system must:
- Detect the cold-expert assignment (requires tracking expert residency alongside token dispatch)
- Issue a DMA transfer from DRAM to HBM for the cold expert's weight tensors
- Stall the expert computation for that token (or the entire affected GPU slice) until the transfer completes
- Run the expert GEMM once the weights are resident
For a DRAM→HBM transfer at ~50 GB/s (PCIe P2P), loading a 4 GB expert takes ~80 ms. That is 80 ms of stall for the affected token. With 8 of 256 experts activated per token, the probability that a batch of B tokens touches at least one cold expert is 1 − (n_hot/256)^(8B), treating activations as independent and uniform. For n_hot=48 (a 75th-percentile hot set for typical distributions), even a single token misses with probability 1 − (0.1875)^8 ≈ 1 − 2×10⁻⁶: a cold-expert stall is effectively certain for any batch.
Expert weight residency in HBM is exactly the same problem as KV cache residency — it is a working-set management problem with unpredictable access patterns driven by routing decisions. The right architecture applies the same tools: access frequency tracking, predictive prefetch using router history, LFU-with-aging eviction policy, and bandwidth budget coordination with the KV prefetch path. Most current serving frameworks treat expert weights as static after loading. That assumption breaks at large E and variable request distributions.
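A minimal sketch of the eviction side of that architecture, assuming a slot-based cache with LFU-with-aging (class and parameter names are illustrative, not from any particular serving framework):

```python
class ExpertCache:
    """LFU-with-aging residency tracker for expert weights in HBM.
    capacity is in whole-expert slots; age() should be called periodically
    to decay old access counts so stale experts can be evicted."""
    def __init__(self, capacity, decay=0.5):
        self.capacity = capacity
        self.decay = decay
        self.freq = {}          # expert_id -> aged access count
        self.resident = set()   # expert_ids currently in HBM

    def access(self, expert_id):
        """Record a routing decision. Returns True on a hit (weights in HBM),
        False on a miss (caller pays the DRAM->HBM load stall)."""
        self.freq[expert_id] = self.freq.get(expert_id, 0.0) + 1.0
        if expert_id in self.resident:
            return True
        if len(self.resident) >= self.capacity:
            victim = min(self.resident, key=lambda e: self.freq.get(e, 0.0))
            self.resident.remove(victim)     # evict the coldest resident expert
        self.resident.add(expert_id)         # simulate DMA load DRAM -> HBM
        return False

    def age(self):
        """Decay all counts so recency matters, not just lifetime frequency."""
        for e in self.freq:
            self.freq[e] *= self.decay
```

A production version would additionally coordinate load bandwidth with the KV prefetch path and consult router history for predictive prefetch; this sketch shows only the residency bookkeeping.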
The Load-Balancing Loss Changes Routing — and That Has Inference Consequences
MoE models are trained with an auxiliary load-balancing loss that penalizes skewed routing distributions. The intent is to prevent expert collapse — where a few experts absorb all tokens and the rest atrophy. The loss encourages more uniform routing across training batches.
At inference time, the load-balancing loss does not run. The model routes based solely on the gate logits it learned during training. But those gate logits were shaped by a training objective that included the load-balancing penalty. The resulting routing distribution is more uniform than it would be without the loss — but it is not uniform, because the loss weight α is deliberately small (typically on the order of 10⁻²) to avoid sacrificing model quality for routing balance.
This creates a subtle but important inference-time phenomenon: the routing distribution depends on the training α, which was chosen for training stability, not inference efficiency. A lower α produces better model quality but more skewed routing and more load imbalance at inference time. A higher α produces more balanced routing but worse model quality. This tradeoff is made once at training time and cannot be adjusted at serving time without retraining.
Some recent work explores inference-time routing modifications — rescaling gate logits, temperature adjustments, or alternative top-k selection rules — to improve inference load balance without retraining. These are promising but they change the model's output distribution, which may or may not be acceptable depending on the application.
A Corrected Cost Model That Includes the Router Tax
The standard MoE cost model compares FLOPs: a 671B MoE with k=8 active experts costs ~37B-equivalent FLOPs per token, matching a 37B dense model. This is arithmetically correct. Here is a more complete cost model:
| Cost Component | Dense 70B | MoE 671B (k=8, E=256) | Notes |
|---|---|---|---|
| Expert GEMM FLOPs | 70B FLOPs/token | ~37B FLOPs/token | MoE wins 1.9× |
| Weight bandwidth (BF16) | 140 GB/step | ~74 GB active + cold miss | MoE wins IF no cold misses |
| Router compute | — | ~0.5% of layer time | Negligible |
| Dispatch all-to-all | — | 10–25% of layer time | MoE loses; scales with EP |
| Load imbalance stall | — | 5–30% of p99 latency | MoE loses; input-dependent |
| Capacity buffer memory | — | ~530 MB extra per GPU (all layers) | MoE loses; reduces KV headroom |
| Expert cache miss stall | — | Up to 80 ms per cold miss | MoE loses; distribution-dependent |
| Total effective latency (B=1) | ~41 ms | ~35–55 ms | MoE ~1.2× faster at best; slower when tails dominate |
The net efficiency advantage of MoE in production is significantly smaller than the raw FLOP ratio suggests. The gap between 1.9× FLOP reduction and ~1.2–1.4× actual throughput improvement is the router tax.
What Helps and What Doesn't
What Actually Helps
- Expert parallelism within a node (NVLink) rather than across nodes (Ethernet/IB): NVLink all-to-all latency is ~10–50 µs. Cross-node all-to-all at MoE dispatch sizes typically runs from hundreds of microseconds into the millisecond range at scale — one to three orders of magnitude worse for dispatch latency. Keeping expert parallelism within the NVLink domain constrains maximum EP to 8 (one DGX node) for latency-sensitive serving.
- Expert grouping by access pattern: Arrange expert weights so that the k most frequently co-activated experts are co-located on the same GPU. This reduces cross-GPU dispatch frequency, lowering all-to-all volume for the common case.
- Router history prefetch: Use the previous step's routing decisions (which are known at decode step N) to prefetch expert weights for decode step N+1. Because routing is temporally correlated for typical text sequences, prefetch accuracy is surprisingly high (~65–75% for top-4 predictions).
- Smaller E, larger k: Fewer experts with more activation per token reduces routing variance (load imbalance shrinks), reduces the expert cache footprint, and reduces communication volume while preserving most of the parameter efficiency benefit.
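The router-history prefetch idea above reduces to a frequency count over a short sliding window of recent routing decisions. A toy sketch (window size and top-n are illustrative parameters, not from a real system):

```python
from collections import Counter, deque

class RouterHistoryPrefetcher:
    """Predict the next decode step's hot experts from a sliding window of
    past routing decisions, exploiting temporal correlation in routing."""
    def __init__(self, window=8, top_n=4):
        self.history = deque(maxlen=window)   # recent steps' expert id lists
        self.top_n = top_n

    def observe(self, expert_ids):
        """Record the expert ids activated at the current decode step."""
        self.history.append(list(expert_ids))

    def predict(self):
        """Return the top_n most frequently seen experts in the window;
        prefetch these into HBM before the next step's dispatch."""
        counts = Counter(e for step in self.history for e in step)
        return [e for e, _ in counts.most_common(self.top_n)]
```

Because routing is temporally correlated within a sequence, even this naive frequency predictor captures much of the achievable prefetch accuracy; the ~65–75% figure cited above comes from predictors of roughly this shape.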
What Doesn't Help As Much As Expected
- Increasing EP beyond node boundary: Reduces per-GPU expert memory pressure but adds cross-node all-to-all latency that dominates the gain. The tradeoff is almost never favorable for decode latency at EP>8.
- Higher capacity_factor alone: Reduces token drops but increases memory pressure and worst-case stall time. It does not reduce dispatch latency or load imbalance — it just hides the quality impact of drops.
- Auxiliary loss tuning post-training: Cannot be done without retraining. The routing distribution is baked into the gate weights.
What the Router Tax Means for MoE Economics
MoE models are not a free lunch, and the router is not free. The real serving cost of a MoE model is the sum of: expert compute (the part the marketing counts), all-to-all communication overhead (10–25% of layer time), load imbalance tail latency (input-dependent, 5–30%), expert cache miss stalls (distribution-dependent, potentially severe), and capacity buffer memory tax (reduces KV headroom).
For well-tuned MoE serving with expert parallelism within NVLink, good router history prefetch, and carefully tuned capacity buffers, the total overhead over a well-optimized dense model of equivalent active parameter count is approximately 20–35%. That is small enough that MoE remains the dominant architecture for large-scale frontier models. But it is large enough to matter for $/token cost calculations and for the decision about whether to prefer a 671B MoE or a smaller dense model for a given latency SLO.
The router decides in microseconds. Its consequences last for the full layer forward pass across every GPU in the expert-parallel group. Accounting for those consequences accurately is the difference between a theoretical efficiency argument and a deployable serving system.