
The Router Tax: What Nobody Accounts For in MoE Inference Cost


Mixture-of-Experts models are sold as a 10× efficiency win: 10× more parameters, same per-token compute. That math is correct. But it omits the router — the gating network that decides which experts fire — and everything the router forces downstream. This essay accounts for the tax.

  • 15–25% — overhead from router-induced all-to-all communication in expert-parallel serving
  • ~5 ms — minimum routing latency floor from dispatch → gather at 50+ experts on H100
  • 3–4× — load imbalance amplification factor at high concurrency with non-uniform routing
  • O(E) — expert cache miss cost: cold experts require HBM re-load per activation, not amortized

01 — The Premise

The MoE Promise and What It Leaves Out

The MoE efficiency argument is this: replace each dense FFN layer with E expert FFNs. Route each token to k of them (k ≪ E). Pay per-token compute proportional to k active experts, not E total experts. Use total parameters (and therefore model capacity) proportional to E, but compute proportional to k. Win.

DeepSeek-V3 makes this concrete: 671B total parameters, 37B active per token with E=256, k=8. The served model has the representational capacity of a 671B dense model at the per-token compute cost of a 37B dense model. If you believe that benchmark, the math is genuinely remarkable.
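The headline ratio is easy to verify from the figures above:

```python
# Quick check of the DeepSeek-V3 active-vs-total arithmetic cited above.
# Parameter counts are the essay's figures; the ratio is what the
# "10x efficiency" framing rests on.
total_params = 671e9   # total parameters
active_params = 37e9   # parameters activated per token (k=8 of E=256 experts)

capacity_ratio = total_params / active_params
print(f"capacity vs. per-token compute: {capacity_ratio:.1f}x")  # → 18.1x
```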

But the benchmark is almost always measured on a single GPU, running one token at a time, in a controlled evaluation harness. Production serving is different. In production, you have expert parallelism across multiple GPUs, dynamic routing that creates load imbalance, capacity buffers that occasionally drop tokens, and a router that must make decisions — and communicate them — millions of times per second. The router is not free. And the costs it introduces are systemic.

The router's computational cost is trivial. The router's communication, load imbalance, cache miss, and tail latency costs are not. They are what the headline throughput numbers are measured in spite of, not because of.

02 — First Principles

What the Router Actually Does, Step by Step

Let us be precise about the router's operation in a production MoE serving system. For each token in the current batch, per MoE layer:

  • Gate computation: Compute logits = W_gate × hidden where W_gate has shape [d_model × E]. For d_model=7168, E=256: this is a 7168×256 matrix-vector product. At BF16: ~3.7 MB weight read, ~3.7 MFLOP. Trivially fast.
  • Top-k selection: Find the k=8 highest logits. Apply softmax over those k to get routing weights. For E=256, k=8: a sort/top-k over 256 floats. Also trivial.
  • Dispatch: Group tokens by their assigned experts. For expert-parallel serving (experts sharded across GPUs), this requires an all-to-all communication: each GPU sends the tokens assigned to remote experts, and receives tokens assigned to its local experts from remote GPUs. This is not trivial.
  • Expert GEMM: Each GPU runs GEMM for its local experts on the received token subset. This is the intended work: FFN forward pass on the active subset.
  • Gather: Collect expert outputs back to the originating GPUs via another all-to-all. Scale by routing weights and sum.

Steps 3 and 5 — dispatch and gather — are the router tax. They are all-to-all collective operations whose cost scales with the number of GPUs, the number of tokens, and the token-to-expert assignment pattern. They are also deeply unpredictable in latency because they depend on the actual routing decisions, which are input-dependent.
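Steps 1 and 2 can be sketched in a few lines. This is a minimal illustration with a random gate matrix, using the essay's example shapes (d_model=7168, E=256, k=8); a real router loads trained weights, and the dispatch/gather collectives are not shown:

```python
import numpy as np

rng = np.random.default_rng(0)
B, d_model, E, k = 4, 7168, 256, 8

hidden = rng.standard_normal((B, d_model)).astype(np.float32)
W_gate = rng.standard_normal((d_model, E)).astype(np.float32)  # illustrative random gate

logits = hidden @ W_gate                           # [B, E] gate logits
topk_idx = np.argsort(logits, axis=-1)[:, -k:]     # k highest-scoring experts per token
topk_logits = np.take_along_axis(logits, topk_idx, axis=-1)

# Softmax over the selected k logits only -> routing weights
z = topk_logits - topk_logits.max(axis=-1, keepdims=True)
weights = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

print(topk_idx.shape, weights.shape)  # → (4, 8) (4, 8)
```

The gate matmul and top-k are exactly as cheap as the essay claims; everything expensive happens after this point, when `topk_idx` has to be materialized as a cross-GPU token exchange.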

03 — The Communication Cost

All-to-All at Scale: The Dispatch/Gather Tax Quantified

For a production MoE serving system with expert parallelism degree EP, each MoE layer requires two all-to-all collective operations. The data volume per all-to-all is:

# Volume per all-to-all for one MoE layer
# B: batch size (tokens); d_model: model dimension; BF16 = 2 bytes/element
B = 512
d_model = 7168
bytes_per_element = 2

bytes_per_dispatch = B * d_model * bytes_per_element   # ≈ 7.3 MB

# Two all-to-alls per MoE layer (dispatch + gather)
bytes_per_moe_layer = 2 * bytes_per_dispatch           # ≈ 14.7 MB

# Across the model's MoE layers (DeepSeek-V3: 58 of 61 layers are MoE)
total_alltoall_bytes = 58 * bytes_per_moe_layer        # ≈ 851 MB per forward pass

# At a 400 Gbps (= 50 GB/s) per-direction interconnect:
seconds = total_alltoall_bytes / 50e9                  # ≈ 17 ms
# This assumes perfect overlap with compute — rarely achieved

17 ms of all-to-all communication per forward pass, at batch size 512. At batch size 64 (more typical for long-context serving), this scales down: ~2 ms of communication time. But communication time does not overlap cleanly with compute because the expert GEMM cannot start until the dispatch all-to-all completes. The dispatch serializes with the compute. Only the gather can overlap with subsequent attention computation.

Fig 1 — MoE Layer Timeline: Where Time Goes
[Figure: timeline of a single MoE layer across two expert-parallel GPUs during decode (B=64) — gate, dispatch all-to-all, expert GEMM, gather all-to-all, combine; GPU 1 stalls waiting for remote tokens. Legend: communication (router tax), expert compute (useful work), stall (load imbalance). Router tax ≈ 20–30% of layer time.]
MoE layer timeline across two GPUs in expert-parallel serving. The router tax (dispatch + gather all-to-all) is serialized before and after expert compute. GPU 1 shows a stall waiting for imbalanced token assignments from GPU 0. This pattern repeats for every MoE layer in the model.
04 — Load Imbalance

The Worst Property of Real Routing: It Is Never Uniform

The MoE efficiency math assumes that routing is perfectly balanced — each expert receives exactly B×k/E tokens per batch. In practice, routing distributions are highly skewed. Tokens with similar semantic properties activate similar experts. A batch of coding tokens will heavily load coding-specialized experts and leave math-specialized experts idle. A batch of multilingual tokens may have complex bimodal distributions across language-specialized expert subsets.

The consequence is structural. In expert-parallel serving, the total layer latency is determined by the slowest GPU — the one that received the most tokens from the all-to-all dispatch. An imbalanced batch where one GPU receives 3× the average token load forces all other GPUs to wait after completing their expert GEMMs.

The Tail Latency Amplification

At serving concurrency of 100 requests, the probability that at least one expert receives 2× the mean token load approaches 1 as batch size grows above ~64 tokens. Every such event adds a bubble to the entire MoE forward pass. Unlike GPU kernel latency — which is roughly deterministic — routing imbalance is input-dependent and cannot be predicted or pre-scheduled. It is the primary source of p99 tail latency spikes in production MoE serving.
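Even perfectly uniform routing produces a meaningful straggler effect, which a short Monte Carlo makes concrete. This sketch assumes uniform random routing, so it lower-bounds the imbalance; real, semantically skewed routing produces the 3–4× amplification cited above:

```python
import numpy as np

rng = np.random.default_rng(0)
B, E, k, trials = 512, 256, 8, 100

ratios = []
for _ in range(trials):
    # Each token picks k of E experts uniformly without replacement:
    # the k smallest random keys per row form a uniform k-subset.
    picks = np.argpartition(rng.random((B, E)), k, axis=1)[:, :k]
    loads = np.bincount(picks.ravel(), minlength=E)
    ratios.append(loads.max() / loads.mean())

# Mean expert load is exactly B*k/E = 16; the max is substantially higher.
print(f"max/mean expert load under uniform routing: {np.mean(ratios):.2f}")
```

The slowest GPU's latency scales with the max load, not the mean, so this ratio translates directly into stall time for every other GPU in the expert-parallel group.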

The Capacity Buffer Tax

The standard mitigation for routing imbalance is the expert capacity buffer: each expert is allocated a capacity of C = (B × k / E) × capacity_factor tokens per step (B × k / E is the mean expert load under perfectly uniform routing), where capacity_factor is typically 1.25. Tokens assigned to an expert that has already hit its capacity are dropped — they are not processed by any expert and their contribution to the output is zeroed out.
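A toy simulation of the capacity mechanism, taking B·k/E as the mean per-expert load and an assumed mildly skewed expert popularity (the skew parameter is illustrative, not from any particular model):

```python
import numpy as np

rng = np.random.default_rng(0)
B, E, k, capacity_factor = 512, 256, 8, 1.25
C = int(np.ceil(B * k / E * capacity_factor))   # capacity per expert: 20 tokens

# Mildly skewed expert popularity — an illustrative assumption
popularity = np.exp(0.3 * rng.standard_normal(E))
popularity /= popularity.sum()

dropped = 0
loads = np.zeros(E, dtype=int)
for _ in range(B):
    experts = rng.choice(E, size=k, replace=False, p=popularity)
    for e in experts:
        if loads[e] < C:
            loads[e] += 1
        else:
            dropped += 1   # over-capacity assignment: output contribution zeroed

print(f"capacity C = {C}, drop rate = {dropped / (B * k):.1%}")
```

Even this mild skew produces a nonzero drop rate at capacity_factor=1.25; sharper skews push it toward the higher end of the ranges in the table below.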

This introduces a quality–efficiency tradeoff that does not appear in evaluation benchmarks:

Capacity Factor | Token Drop Rate | Output Quality | Max GPU Stall | Notes
1.0  | ~8–15% | Noticeable degradation | Minimal | Aggressive, drops frequently
1.25 | ~2–4%  | Slight degradation | Moderate | Common production default
1.5  | <0.5%  | Near-lossless | High (worst-case 50% overhead) | Conservative, memory-heavy
2.0  | ~0%    | Lossless | Very high | Rarely used in serving

The capacity buffer is a memory tax too: it pre-allocates expert input buffers of size capacity × d_model for every local expert, on every GPU. For E=256 experts across 8 GPUs (32 experts per GPU), capacity_factor=1.25, d_model=7168, BF16, B=512: capacity is (512 × 8 / 256) × 1.25 = 20 tokens per expert, so each GPU pre-allocates 32 × 20 × 7168 × 2 bytes ≈ 9.2 MB per MoE layer — roughly 530 MB across the model's 58 MoE layers, allocated solely to absorb routing imbalance.

05 — Expert Cache Misses

Cold Experts Are a KV Cache Problem for Weights

Here is the aspect of the router tax that connects most directly to the memory systems work elsewhere in this series: expert weights are not always hot in HBM.

For a large MoE model with E=256 experts, one expert's FFN weights, summed across the model's MoE layers, occupy roughly 2–4 GB (depending on the per-expert hidden dimension). Storing all expert weights for a 671B-parameter model requires the full ~1.3 TB of weight storage we calculated in the weight streaming essay. No single GPU holds all of it. In expert-parallel serving, each GPU holds E/EP experts. For EP=8, that is 32 experts × ~4 GB ≈ 128 GB per GPU — which already exceeds the H100's 80 GB of HBM.

The consequence: expert weights are tiered. Frequently-routed ("hot") experts stay in HBM. Infrequently-routed ("cold") experts are evicted to DRAM or NVMe. When the router assigns a token to a cold expert, the system must:

  • Detect the cold expert assignment (requires tracking expert residency alongside token dispatch)
  • Stall the expert computation for that token (or the entire affected GPU slice) until the expert weights are loaded
  • Issue a DMA prefetch from DRAM to HBM for the cold expert's weight tensors
  • Complete the DMA before the expert GEMM can run

For a DRAM→HBM transfer at ~50 GB/s (PCIe P2P), loading a 4 GB expert takes ~80 ms. That is 80 ms of stall for the affected GPU slice. Treating expert draws as independent, a token touches only hot experts with probability (n_hot/E)^k, so the probability that at least one token in a batch of B=64 activates a cold expert is 1 − (n_hot/E)^(k×B). For n_hot=48 (75th percentile hot set for typical distributions), (48/256)^8 ≈ 1.5×10⁻⁶ per token — a cold activation is essentially certain for any non-trivial batch.
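The arithmetic, under the same independence approximation (expert draws treated as independent with hot probability n_hot/E):

```python
# Probability that a batch touches at least one cold expert, plus the
# DRAM -> HBM reload cost. Figures follow the essay's example.
E, k, B = 256, 8, 64
n_hot = 48

p_token_all_hot = (n_hot / E) ** k            # one token avoids all cold experts
p_batch_any_cold = 1 - p_token_all_hot ** B   # at least one cold hit in the batch

print(f"P(token all-hot)  = {p_token_all_hot:.2e}")   # ~1.5e-06
print(f"P(batch any-cold) = {p_batch_any_cold:.6f}")  # ~1.0

# Reloading a 4 GB expert over a ~50 GB/s DRAM -> HBM path
expert_bytes = 4e9
stall_s = expert_bytes / 50e9
print(f"cold-miss stall ≈ {stall_s * 1e3:.0f} ms")    # → 80 ms
```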

Expert Residency as a Scheduling Problem

Expert weight residency in HBM is exactly the same problem as KV cache residency — it is a working-set management problem with unpredictable access patterns driven by routing decisions. The right architecture applies the same tools: access frequency tracking, predictive prefetch using router history, LFU-with-aging eviction policy, and bandwidth budget coordination with the KV prefetch path. Most current serving frameworks treat expert weights as static after loading. That assumption breaks at large E and variable request distributions.
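A minimal sketch of the LFU-with-aging idea applied to expert residency. The decay constant and HBM slot count are illustrative assumptions, not values from any serving framework; a real implementation would also issue the DMA loads asynchronously and coordinate with the KV prefetch bandwidth budget:

```python
# Hedged sketch: LFU-with-aging residency tracking for expert weights.
class ExpertResidency:
    def __init__(self, n_slots: int, decay: float = 0.9):
        self.n_slots = n_slots             # experts that fit in HBM
        self.decay = decay                 # aging factor applied each step
        self.freq: dict[int, float] = {}   # expert id -> aged access frequency
        self.resident: set[int] = set()

    def step(self, routed_experts: list[int]) -> list[int]:
        """Record one decode step's routing; return cold experts to load."""
        for e in self.freq:
            self.freq[e] *= self.decay              # age all counters
        to_load = []
        for e in routed_experts:
            self.freq[e] = self.freq.get(e, 0.0) + 1.0
            if e not in self.resident:
                to_load.append(e)                   # cold miss -> DMA load
                self.resident.add(e)
        # Evict lowest-frequency residents if over the HBM budget
        while len(self.resident) > self.n_slots:
            victim = min(self.resident, key=lambda e: self.freq.get(e, 0.0))
            self.resident.remove(victim)
        return to_load

cache = ExpertResidency(n_slots=4)
print(cache.step([1, 2, 3]))     # all three cold → [1, 2, 3]
print(cache.step([1, 2, 5, 6]))  # 5 and 6 cold → [5, 6]; expert 3 evicted
```

The aging step is what distinguishes this from plain LFU: an expert that was hot for one request burst decays out of the resident set instead of pinning HBM indefinitely.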

06 — Auxiliary Losses and Their Hidden Cost

The Load-Balancing Loss Changes Routing — and That Has Inference Consequences

MoE models are trained with an auxiliary load-balancing loss that penalizes skewed routing distributions. The intent is to prevent expert collapse — where a few experts absorb all tokens and the rest atrophy. The loss encourages more uniform routing across training batches.

At inference time, the load-balancing loss does not run. The model routes based solely on the gate logits it learned during training. But those gate logits were shaped by a training objective that included the load-balancing penalty. The resulting routing distribution is more uniform than it would be without the loss — but it is not uniform, because the loss weight α is deliberately small (typically 10⁻²) to avoid sacrificing model quality for routing balance.

This creates a subtle but important inference-time phenomenon: the routing distribution depends on the training α, which was chosen for training stability, not inference efficiency. A lower α produces better model quality but more skewed routing and more load imbalance at inference time. A higher α produces more balanced routing but worse model quality. This tradeoff is made once at training time and cannot be adjusted at serving time without retraining.

Some recent work explores inference-time routing modifications — rescaling gate logits, temperature adjustments, or alternative top-k selection rules — to improve inference load balance without retraining. These are promising but they change the model's output distribution, which may or may not be acceptable depending on the application.
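One of those modifications — a gate temperature — is a one-line change. This sketch is illustrative: note that dividing logits by a positive temperature does not change which experts win top-k (the ordering is preserved), only how the selected experts' outputs are mixed; changing the selection itself requires noise injection or alternative top-k rules:

```python
import numpy as np

def route(logits: np.ndarray, k: int, temperature: float = 1.0):
    """Top-k routing with a temperature on the gate logits (sketch)."""
    scaled = logits / temperature
    topk = np.argsort(scaled, axis=-1)[:, -k:]
    sel = np.take_along_axis(scaled, topk, axis=-1)
    z = sel - sel.max(axis=-1, keepdims=True)
    weights = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return topk, weights

rng = np.random.default_rng(0)
logits = rng.standard_normal((2, 256))
_, w_sharp = route(logits, k=8, temperature=0.5)  # T < 1: peakier weights
_, w_flat = route(logits, k=8, temperature=2.0)   # T > 1: flatter weights
print(w_sharp.max(), w_flat.max())
```

Flatter mixing weights spread output mass across the selected experts, which is exactly the kind of distribution shift that may or may not be acceptable downstream.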

07 — The MoE Inference Cost Model

A Corrected Cost Model That Includes the Router Tax

The standard MoE cost model compares FLOPs: a 671B MoE with k=8 active experts costs ~37B-equivalent FLOPs per token, matching a 37B dense model. This is arithmetically correct. Here is a more complete cost model:

Cost Component | Dense 70B | MoE 671B (k=8, E=256) | Notes
Expert GEMM FLOPs | 70B FLOPs/token | ~37B FLOPs/token | MoE wins 1.9×
Weight bandwidth (BF16) | 140 GB/step | ~74 GB active + cold misses | MoE wins IF no cold misses
Router compute | — | ~0.5% of layer time | Negligible
Dispatch all-to-all | — | 10–25% of layer time | MoE loses; scales with EP
Load imbalance stall | — | 5–30% of p99 latency | MoE loses; input-dependent
Capacity buffer memory | — | ~500 MB extra per GPU | MoE loses; reduces KV headroom
Expert cache miss stall | — | Up to 80 ms per cold miss | MoE loses; distribution-dependent
Total effective latency (B=1) | ~41 ms | ~35–55 ms | MoE ~1.2× faster at best; slower under heavy stalls

The net efficiency advantage of MoE in production is significantly smaller than the raw FLOP ratio suggests. The gap between 1.9× FLOP reduction and ~1.2–1.4× actual throughput improvement is the router tax.
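The structure of that gap can be captured in a back-of-envelope layer model. All constants here are illustrative assumptions, not measurements; the point is the shape of the formula — dispatch serializes before the GEMM, the straggler GPU sets the GEMM time, and only part of the gather overlaps:

```python
# Back-of-envelope MoE layer cost model including the router tax terms.
def moe_layer_time_us(gemm_us: float,
                      dispatch_us: float,
                      gather_us: float,
                      imbalance_factor: float = 1.15,
                      gather_overlap: float = 0.7) -> float:
    """Wall-clock estimate: dispatch serializes before the expert GEMM,
    the slowest GPU sets the GEMM time, gather partially overlaps."""
    gemm = gemm_us * imbalance_factor          # straggler GPU dominates
    gather = gather_us * (1 - gather_overlap)  # exposed (non-overlapped) part
    return dispatch_us + gemm + gather

ideal = moe_layer_time_us(100, 0, 0, imbalance_factor=1.0)  # FLOP-only view
real = moe_layer_time_us(100, 15, 15)                       # with the tax
print(f"router tax multiplier: {real / ideal:.2f}x")
```

With these assumed constants the multiplier lands in the 20–35% overhead range argued throughout this essay; pushing dispatch cross-node or the imbalance factor toward 1.5 moves it well beyond that.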

08 — Architectures That Minimize the Tax

What Helps and What Doesn't

What Actually Helps

  • Expert parallelism within a node (NVLink) rather than across nodes (Ethernet/IB): an NVLink all-to-all completes in ~10–50 µs, while a cross-node all-to-all typically takes hundreds of microseconds to several milliseconds at these message sizes. Keeping expert parallelism within the NVLink domain is one to two orders of magnitude better for dispatch latency. This constrains maximum EP to 8 (one DGX node) for latency-sensitive serving.
  • Expert grouping by access pattern: Arrange expert weights so that the k most frequently co-activated experts are co-located on the same GPU. This reduces cross-GPU dispatch frequency, lowering all-to-all volume for the common case.
  • Router history prefetch: Use the previous step's routing decisions (which are known at decode step N) to prefetch expert weights for decode step N+1. Because routing is temporally correlated for typical text sequences, prefetch accuracy is surprisingly high (~65–75% for top-4 predictions).
  • Smaller E, larger k: Fewer experts with more activation per token reduces routing variance (load imbalance shrinks), reduces the expert cache footprint, and reduces communication volume while preserving most of the parameter efficiency benefit.
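The router history prefetch item above is simple enough to sketch. This toy predictor takes the most frequently routed experts over a short history window as the prefetch candidates for the next step; the window size and top-4 cut are assumptions for illustration:

```python
from collections import Counter

def predict_next_experts(history: list[list[int]], top: int = 4) -> list[int]:
    """Most frequently routed experts over the recent window (sketch)."""
    counts = Counter(e for step in history for e in step)
    return [e for e, _ in counts.most_common(top)]

# Routing decisions from the last three decode steps (hypothetical ids)
history = [[3, 17, 42, 42], [3, 42, 99, 17], [3, 42, 17, 5]]
prediction = predict_next_experts(history)
print(prediction)  # experts 42, 3, 17 dominate the window
```

A production version would run per layer, feed the prediction into the residency manager's prefetch queue, and track hit rate to throttle prefetch bandwidth when accuracy drops.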

What Doesn't Help As Much As Expected

  • Increasing EP beyond node boundary: Reduces per-GPU expert memory pressure but adds cross-node all-to-all latency that dominates the gain. The tradeoff is almost never favorable for decode latency at EP>8.
  • Higher capacity_factor alone: Reduces token drops but increases memory pressure and worst-case stall time. It does not reduce dispatch latency or load imbalance — it just hides the quality impact of drops.
  • Auxiliary loss tuning post-training: Cannot be done without retraining. The routing distribution is baked into the gate weights.

09 — The Bottom Line

What the Router Tax Means for MoE Economics

MoE models are not a free lunch, and the router is not free. The real serving cost of a MoE model is the sum of: expert compute (the part the marketing counts), all-to-all communication overhead (15–25% of layer time), load imbalance tail latency (input-dependent, 5–30%), expert cache miss stalls (distribution-dependent, potentially severe), and capacity buffer memory tax (reduces KV headroom).

For well-tuned MoE serving with expert parallelism within NVLink, good router history prefetch, and carefully tuned capacity buffers, the total overhead over a well-optimized dense model of equivalent active parameter count is approximately 20–35%. That is small enough that MoE remains the dominant architecture for large-scale frontier models. But it is large enough to matter for $/token cost calculations and for the decision about whether to prefer a 671B MoE or a smaller dense model for a given latency SLO.

The router decides in microseconds. Its consequences last for the full layer forward pass across every GPU in the expert-parallel group. Accounting for those consequences accurately is the difference between a theoretical efficiency argument and a deployable serving system.