Sparse MQA + Fused MoE Mega Kernel + Hyper-connections: The Trifecta Behind Modern Frontier Models
Three architectural innovations each solve one wall that stops dense transformers from scaling — memory, compute, and depth. Together they unlock 2M-token context, trillion-parameter capacity, and 120-layer depth on existing H100 clusters. Here is exactly how each one works and why they are inseparable.
1. Introduction: Solving the Three Walls #
Scaling laws hold, but naive dense scaling breaks in three independent ways. Every lab building GPT-5-class systems has converged on some variant of this trifecta — not because it is fashionable, but because each wall is a hard physical constraint that no amount of additional compute can paper over.
KV-cache grows as O(S × n_heads × D) per layer. For a 70B model with 64 heads at 128-dim across 80 layers at FP16, a 1M-token context requires roughly 2.6 TB of KV cache — far beyond any single GPU's HBM. Inference becomes IO-bound before any arithmetic begins.
Dense models activate 100% of parameters per token. A 1T dense model requires roughly 2 × 10¹² FLOPs per token (about 2 FLOPs per parameter), and its FP16 weights occupy 2 TB, forcing sharding across dozens of GPUs and a full sweep of those weights through HBM for every generated token of every concurrent request. Power and cost make this untenable at scale.
Beyond 80–100 layers, gradients reach early layers attenuated by repeated multiplication. Under standard Pre-LN residuals, ∂L/∂x₀ ∝ (1 + ∂F/∂x)^L — for realistic values of ∂F/∂x, signal vanishes before training converges. Models simply stop improving with depth.
Sparse MQA attacks the memory wall by collapsing KV to a single shared head with learned sparsity patterns. Fused MoE Mega Kernels attack the compute wall by activating a sparse subset of parameters per token without paying the kernel launch and HBM round-trip overhead of naive MoE. Hyper-connections attack the depth wall by creating direct learned pathways from the loss to every layer, making gradient highways that bypass the vanishing problem entirely.
2. Sparse MQA: Killing the KV-Cache Tax #
MHA → GQA → MQA: The KV Head Reduction Story
Multi-Head Attention allocates separate K and V projection matrices per head. For sequence length S, batch B, H heads, and head dimension D, the KV-cache size is:
# MHA: H separate K/V heads
KV_size = 2 × B × S × H × D × dtype_bytes
# 70B model: B=1, S=1M, H=64, D=128, FP16 (2 bytes)
# KV_size ≈ 2 × 1 × 1M × 64 × 128 × 2 = 32 GB per layer
# × 80 layers = 2.56 TB total — far beyond H200's 141 GB HBM
# GQA (Grouped Query Attention): G groups, G << H
KV_size = 2 × B × S × G × D × dtype_bytes
# Llama-3.1 405B: G=8 heads → 8x KV reduction vs MHA
# MQA (Multi-Query): single shared K/V head
KV_size = 2 × B × S × 1 × D × dtype_bytes
# At H=64: 64x KV reduction vs MHA. Quality cost ~1–2% on benchmarks.
The reduction is substantial: GQA with 8 KV heads (as used in Llama-3 models) achieves 8× KV compression. Full MQA achieves 64× at H=64, enabling the same GPU memory to serve far longer contexts or larger batches. The quality penalty is modest for most tasks but measurable on long-document reasoning benchmarks.
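The arithmetic above condenses into one helper for back-of-envelope sizing (a sketch; `kv_cache_bytes` is an illustrative name, not a library function):

```python
def kv_cache_bytes(batch, seq_len, kv_heads, head_dim, dtype_bytes=2):
    """Total K+V cache for one layer: 2 tensors of [batch, seq_len, kv_heads, head_dim]."""
    return 2 * batch * seq_len * kv_heads * head_dim * dtype_bytes

# Per-layer cache at B=1, S=1M, D=128, FP16
mha = kv_cache_bytes(1, 1_000_000, 64, 128)   # 64 KV heads: ~32.8 GB per layer
gqa = kv_cache_bytes(1, 1_000_000, 8, 128)    # 8 KV groups: 8x smaller
mqa = kv_cache_bytes(1, 1_000_000, 1, 128)    # single shared KV head: 64x smaller
```

The 8× and 64× ratios fall straight out of the kv_heads factor, since everything else is unchanged between the three variants.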
Adding Sparsity: Local+Sink and A² Routing
Pure MQA still scales linearly with context length — it just scales with a smaller constant. For million-token contexts, even MQA's single KV head can overflow HBM. Attention sparsity cuts each query's work from O(S) to O(K), where K is a fixed key budget independent of context length. Two dominant patterns:
Local + sink. Retain k_sink initial tokens (shown empirically to stabilise attention distributions) plus a sliding window of W recent tokens. KV-cache becomes O(k_sink + W) — constant with respect to total sequence length. First demonstrated at scale in StreamingLLM (Xiao et al., ICLR 2024).
Tradeoff: tokens outside the window are permanently invisible. Acceptable for streaming tasks; degrades on tasks requiring exact retrieval from mid-sequence.
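A minimal sketch of the retention policy in plain Python (the function name and defaults are illustrative, not the StreamingLLM API):

```python
def retained_positions(seq_len, k_sink=4, window=1024):
    """Indices whose KV entries survive eviction: attention sinks + sliding window."""
    sinks = list(range(min(k_sink, seq_len)))
    recent = list(range(max(k_sink, seq_len - window), seq_len))
    return sinks + recent
```

The retained set never exceeds k_sink + window entries, which is what makes the KV-cache constant in total sequence length.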
A² routing. Learn a lightweight router that selects top-K keys per query based on a compressed representation. Complexity drops from O(S²) to O(S·K). For S=1M and K=256, this is a 4,000× FLOPs reduction in attention compute.
Tradeoff: router adds a forward pass per attention layer. Routing quality is critical — missed relevant keys cannot be recovered.
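A toy version of router-selected sparse attention in plain Python (illustrative names; a real router scores compressed key summaries rather than full keys, which is where the FLOPs saving comes from):

```python
import math

def sparse_topk_attend(q, keys, values, k=2):
    """Score keys against the query, then attend only over the top-k selected keys."""
    scores = [sum(qi * ki for qi, ki in zip(q, key)) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)
    w = [math.exp(scores[i] - m) for i in top]     # softmax over selected keys only
    z = sum(w)
    out = [0.0] * len(values[0])
    for wi, i in zip(w, top):
        for d in range(len(out)):
            out[d] += (wi / z) * values[i][d]
    return out, sorted(top)
```

A missed key never contributes: there is no path for its value to reach the output, which is the tradeoff noted above.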
Combined Sparse MQA — single KV head with local+sink sparsity pattern — is the mechanism by which models in the Gemini 1.5 class achieve million-token contexts without KV cache exceeding the GPU memory budget. A 2M-token context with this approach can fit in under 500 GB of KV memory vs. the multi-terabyte requirement of naive MHA.
3. Fused MoE Mega Kernel: Making Sparsity Fast #
Active vs Total Parameters: Why MoE Changes the Cost Function
Mixture-of-Experts replaces the FFN block in each transformer layer with N expert networks. A learned router selects top-k experts per token (typically k=2). Each token only passes through k experts, regardless of N.
| Model | Total Params | Active Params / token | Active / Total ratio |
|---|---|---|---|
| Mixtral 8×7B | 46.7B | 12.9B | 28% |
| DBRX | 132B | 36B | 27% |
| DeepSeek-V3 | 671B | 37B | 5.5% |
| Hypothetical 1.8T MoE | 1,800B | ~400B | ~22% |
The economics are stark: DeepSeek-V3 delivers 671B total parameters — with access to the full representational capacity of a frontier-class model — while activating only 37B parameters per token. Serving cost scales with active parameters, not total. Inference pricing should be quoted against active parameters.
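The table's ratios and the cost claim are quick to verify (illustrative helpers; the 2-FLOPs-per-active-parameter rule of thumb is the standard estimate for a forward pass):

```python
def active_ratio_pct(total_b, active_b):
    """Share of parameters a single token actually touches, in percent."""
    return 100 * active_b / total_b

def flops_per_token(active_params):
    """Serving FLOPs scale with ACTIVE params: ~2 FLOPs per active parameter."""
    return 2 * active_params
```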
Why Naive MoE Is Slow: The 7-Kernel Problem
A straightforward PyTorch loop launches seven separate CUDA kernels per layer and streams expert weights from HBM multiple times:
# Naive: 7 kernel launches, 7 HBM round-trips per layer
logits = x @ W_gate                                  # kernel 1: gate GEMM
weights, topk_idx = topk(softmax(logits), k=2)       # kernel 2: sort + select
x_out = zeros_like(x)
for i in range(k):
    x_i = gather(x, topk_idx[:, i])                  # kernels 3, 5: irregular gather
    y_i = expert_ffn(topk_idx[:, i], x_i)            # kernels 4, 6: expert GEMM (HBM weight load)
    x_out = scatter_add(x_out, y_i, weights[:, i])   # kernel 7: weighted combine
# Problems:
# - Each kernel launch: ~5–10µs overhead
# - Expert weights reloaded from HBM per kernel: no L2 residency
# - Dispatch imbalance causes tail latency: slow experts block fast ones
# - At B=256, S=2048: ~30% slower than equivalent dense layer
Mega Kernel Fusion: Three Key Tricks
A Mega Kernel fuses all seven operations into a single persistent CUDA kernel that launches once and handles all experts. Three tricks make it work:
L2-resident hot tiles. Keep the most frequently routed expert weights in H100's 50 MB L2 cache across the entire batch. For FP8 experts at 2B parameters each, an expert's full FFN weights occupy ~2 GB (one byte per parameter) — far too large for L2. Instead, keep the top-8 most active experts' hot tiles (the frequently reused GEMM tiles) pinned in L2, and stream the remaining experts through SMEM. H100's L2 is large enough to hold the working set of 8–16 FP8 expert tiles at once.
Grouped GEMM. CUTLASS grouped GEMM executes multiple matrix multiplications with different shapes in a single kernel launch. All active experts for the current token batch run in one call with dynamic shapes, entirely avoiding the per-expert kernel launch overhead. This alone accounts for roughly 1.4× of the total speedup vs the naive implementation.
TMA prefetch. The H100's Tensor Memory Accelerator enables asynchronous HBM→SMEM copies that execute concurrently with tensor core arithmetic. The kernel issues TMA loads for Expert N+1's weight tiles while Expert N's compute is in flight — hiding the ~200-cycle HBM latency entirely behind computation. This is described in detail in Section 5.
Combined effect: 2.1× throughput vs naive MoE, and 1.3× vs dense at equal active parameter counts. MoE is no longer just cheaper than dense — it is faster. TRT-LLM and MegaBlocks implement production variants of this.
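The dispatch step that grouped GEMM consumes can be sketched in plain Python: tokens are permuted so each expert's tokens are contiguous, and the per-expert counts become the grouped GEMM's dynamic problem sizes (a sketch of the idea, not the CUTLASS or MegaBlocks API):

```python
def group_tokens_by_expert(topk_idx, n_experts):
    """Permute token slots so each expert's tokens are contiguous in memory.

    topk_idx[t] is the list of expert ids token t routes to.
    Returns the gather permutation and per-expert group sizes.
    """
    buckets = [[] for _ in range(n_experts)]
    for t, experts in enumerate(topk_idx):
        for e in experts:
            buckets[e].append(t)
    order = [t for bucket in buckets for t in bucket]   # gather permutation
    counts = [len(b) for b in buckets]                  # grouped GEMM shapes
    return order, counts
```

Once tokens are grouped, each expert's GEMM reads a dense contiguous slab instead of scattered rows, which is what makes the single fused launch possible.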
4. Hyper-connections: Training 120+ Layer Models #
Why Standard Residuals Fail at Depth
Pre-LN transformers use residual connections of the form x_{l+1} = x_l + F_l(LN(x_l)). The gradient of the loss with respect to the input at layer l=0 is the product of Jacobians across all subsequent layers:
∂L/∂x_0 = ∏_{l=0}^{L-1} (I + ∂F_l/∂x_l)
# If ||∂F_l/∂x_l|| < 1 consistently: gradient → 0 (vanishing)
# If ||∂F_l/∂x_l|| > 1 consistently: gradient → ∞ (exploding)
# At L=120: even a mean factor of 0.99 per layer gives 0.99^120 = 0.30
# → 70% signal attenuation to layer 0. Training converges poorly.
Empirically, for models above 10B parameters, standard Pre-LN transformers deeper than roughly 80 layers show training instability, slower convergence, and diminishing returns from additional depth. This caps the practical depth of naive architectures well below what is theoretically desirable.
Hyper-connections: Learned Gradient Highways
Hyper-connections (ByteDance, arXiv:2409.19606, September 2024) generalize the residual connection by introducing learned per-layer weight vectors that allow each layer to attend to multiple prior representations, not just the immediately preceding one. The core formulation:
# Standard Pre-LN residual
x_{l+1} = x_l + F_l(LN(x_l))
# Hyper-connection (simplified scalar form)
# alpha, beta are learned parameters — initialised near 1/n (n = connection width) for stability
x_2 = alpha_{2,0} * x_0 + alpha_{2,1} * x_1 + F_2(LN(x_1))
x_3 = alpha_{3,0} * x_0 + alpha_{3,1} * x_1 + alpha_{3,2} * x_2 + F_3(LN(x_2))
# Full form: connection width n, expansion factor r
# Each layer l receives a weighted sum of n prior hidden states
# The weight matrix W_c (n×n) is learned; initial value = I/n
h_l = W_c[l] @ [x_{l-n}, ..., x_{l-1}] # blend prior n states
x_l = h_l + F_l(LN(h_l))
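The simplified form above can be sketched end to end, with the blend weights frozen at their 1/n initialisation (in the real model the alphas are learned per layer; the function name is illustrative):

```python
def hyper_forward(x0, layer_fns, n=4):
    """Forward pass with hyper-connections: blend the last n hidden states."""
    history = [x0]
    for F in layer_fns:
        prior = history[-n:]                  # up to n most recent states
        alpha = 1.0 / len(prior)              # learned per layer in practice; init 1/n
        h = sum(alpha * p for p in prior)     # blended input h_l
        history.append(h + F(h))              # residual update x_l = h_l + F(h_l)
    return history[-1]
```

With the layer functions suppressed (F ≡ 0), the input passes through 120 layers unchanged, which is the structural gradient-highway property the bullets below describe.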
The mechanism provides three guarantees:
- → Direct gradient path to all layers. Because layer 120's output depends directly on layer 0's output through the learned alpha terms, the gradient can reach layer 0 in a single step regardless of depth. Vanishing gradients across the full depth are structurally impossible.
- → Feature reuse across depth. Layer 80 can directly access token embeddings from layer 0 when the learned alpha weights place significant mass on early-layer representations. This is architecturally equivalent to the DenseNet insight applied to transformers.
- → Learned layer skipping. If a layer's output is destabilising (e.g. during training), the model can learn to down-weight that layer's contribution through its alpha parameters while maintaining gradient flow through other paths. This is automatic regularisation of depth.
The hyper-connections paper reports zero training divergence events at 100+ layers with hyper-connections vs approximately 40% divergence rate for equivalent Pre-LN architectures. The parameter overhead is negligible: for connection width n=4 and model width d, the additional parameters per layer are on the order of n × d — a vanishing fraction (well under 0.1%) of total parameter count for typical models.
5. Hardware Specifics: H100 TMA + FP8 / FP4 for MoE #
TMA and Async Copies: Hiding Dispatch Latency
The fundamental bottleneck in MoE is not arithmetic — it is the irregular memory access pattern of token dispatch. Token 5 routes to Expert 3, token 6 to Expert 17, token 7 back to Expert 3: the memory access pattern is unpredictable and cannot be coalesced into efficient sequential HBM reads. This creates latency rather than bandwidth pressure.
H100's Tensor Memory Accelerator (TMA) provides cp.async.bulk.tensor — an asynchronous bulk copy instruction that transfers tensor tiles from HBM to shared memory at up to 512 bytes/cycle (vs 128 bytes/cycle for standard cp.async). Critically, TMA operations proceed concurrently with tensor core execution. The Mega Kernel uses a software pipeline to overlap Expert N's compute with Expert N+1's weight load:
// CUDA C++ — Mega Kernel TMA pipeline (simplified; libcu++ cuda::pipeline API)
#include <cuda/pipeline>
__shared__ half smem_weights[2][EXPERT_TILE_SIZE];   // double buffer
auto pipe = cuda::make_pipeline();

// Stage 0: start loading expert[0]'s weights asynchronously
pipe.producer_acquire();
cuda::memcpy_async(smem_weights[0], &gmem_expert[0],
                   sizeof(half) * EXPERT_TILE_SIZE, pipe);
pipe.producer_commit();

for (int e = 0; e < n_active_experts; ++e) {
    // Prefetch expert[e+1]'s weights into the other buffer
    // while expert[e]'s GEMM is in flight
    if (e + 1 < n_active_experts) {
        pipe.producer_acquire();
        cuda::memcpy_async(smem_weights[(e + 1) % 2], &gmem_expert[e + 1],
                           sizeof(half) * EXPERT_TILE_SIZE, pipe);
        pipe.producer_commit();
    }
    pipe.consumer_wait();        // wait until expert[e]'s weights have landed
    __syncthreads();             // SMEM coherence across the block
    // Tensor core GEMM — expert weights already in SMEM
    compute_expert_gemm(smem_weights[e % 2], token_tiles);
    pipe.consumer_release();     // free the buffer for the next prefetch
}
// Result: HBM load latency (~200 cycles) fully hidden behind the 256-token GEMM
The net effect: expert weight load from HBM is completely hidden behind tensor core arithmetic. The kernel is now compute-bound, not memory-bandwidth-bound. This is the primary reason H100 MoE throughput is approximately 2× that of A100 for the same model — not clock frequency or FLOP count, but the TMA enabling this overlap.
FP8 E4M3 vs E5M2: Right Format for the Right Tensor
FP8 is not a single format — it is two distinct 8-bit floating-point formats with different precision/range tradeoffs. Using the wrong format for the wrong tensor is a silent quality failure.
| Format | Exponent bits | Mantissa bits | Max value | Best for |
|---|---|---|---|---|
| E4M3 | 4 | 3 | ±448 | Expert weights, activations — needs precision, not range |
| E5M2 | 5 | 2 | ±57,344 | Router logits, gradients — needs range, not precision |
MoE softmax routing logits can span a wide dynamic range — the difference between a strongly preferred and strongly avoided expert can be several orders of magnitude. E4M3 caps at 448; values exceeding this saturate to the maximum, causing softmax to collapse to near-uniform distribution. The router loses its discriminative ability entirely. This is a silent bug — the model runs, loss looks reasonable, but expert assignment degrades to near-random for tokens with high-magnitude routing signals.
During training, log the fraction of values in each E4M3-quantised tensor group — weights, activations, and especially router logits — whose FP32 magnitude exceeds 448 (the E4M3 saturation threshold) every 100 steps, per layer. Any layer where the saturation fraction exceeds ~0.1% warrants investigation. TRT-LLM handles format selection automatically, but custom training loops using Transformer Engine must configure this explicitly per tensor group.
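A plain-Python illustration of both the collapse and the monitoring check (448 is E4M3's maximum value; the helper names are illustrative):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def clip_e4m3(xs, max_val=448.0):
    """Model E4M3 saturation: out-of-range values pin to +/-448."""
    return [max(-max_val, min(max_val, x)) for x in xs]

def e4m3_saturation_fraction(xs, max_val=448.0):
    """The metric to log every N steps: share of values that would saturate."""
    return sum(abs(x) > max_val for x in xs) / len(xs)

logits = [600.0, 500.0, 10.0]        # router strongly prefers expert 0
p_raw = softmax(logits)              # expert 0 wins decisively
p_sat = softmax(clip_e4m3(logits))   # experts 0 and 1 now tie: routing collapsed
```

Nothing crashes and the loss stays plausible; only the saturation metric reveals that the router's preference has been flattened.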
H100 tensor cores support mixed FP8 formats natively: the input operands can be E4M3 and E5M2 independently, with FP32 accumulation. No explicit dequantization kernel is required.
FP4 Block-Scaling: The B200 Kernel Rewrite
Blackwell B200 adds FP4 precision (NVFP4): 4-bit values with one shared scale factor per 16-element block. This provides 2× weight bandwidth vs FP8, enabling more experts or larger experts for the same HBM bandwidth budget. But it requires a fundamentally different kernel memory layout.
- →Layout change. FP8 weights are stored as contiguous row-major tensors. FP4 weights must be stored as interleaved [blockscales | packed_fp4_values] structures. TMA descriptors must load scales and values as a single bulk tensor to avoid two separate HBM transactions per block.
- →Dequant in register. FP4 values are dequantized to BF16 in register during the GEMM epilogue — keeping the L2 footprint in FP4 but performing arithmetic in BF16. The index math overhead prevents keeping dequantized values in L2.
- →Tile size constraints. Grouped GEMM tile sizes must be multiples of 16 (one block = 16 elements). This sets a floor on minimum expert size and forces expert count down or parameters per expert up for B200-optimised models.
The architectural implication: B200 shifts the optimal MoE design toward more experts with fewer parameters each, since FP4 makes bandwidth-bound expert weight reads cheaper. Expect models designed for B200 to push toward 256+ expert counts rather than the 64–128 common on H100.
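The block-scaling mechanics can be sketched in plain Python, using symmetric int4 as a stand-in for NVFP4's actual E2M1 values (an assumption for clarity; the real format, rounding, and interleaved layout differ — this shows only the one-scale-per-16-elements structure):

```python
def fp4_block_quantize(x, block=16):
    """Quantise to 4-bit ints with one shared scale per 16-element block."""
    assert len(x) % block == 0
    q, scales = [], []
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        scale = max(abs(v) for v in blk) / 7 or 1.0   # guard all-zero blocks
        scales.append(scale)
        q.append([max(-8, min(7, round(v / scale))) for v in blk])
    return q, scales

def fp4_dequantize(q, scales):
    """Expand packed 4-bit values back using the per-block scale."""
    return [qi * s for blk, s in zip(q, scales) for qi in blk]
```

Each block carries exactly one scale, which is why grouped GEMM tiles must align to 16-element boundaries.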
6. Non-determinism and Auditability: The Routing Problem #
Why MoE Routing Is Inherently Non-Deterministic
MoE routing computes softmax(W_gate @ x) and selects top-k. Two sources of non-determinism make reproducibility difficult:
Accumulation-order drift. TMA async GEMMs and warp-level reductions reorder floating-point additions depending on batch size and warp scheduling. The same prompt at batch=1 vs batch=256 produces different FP32 accumulation order, which propagates to different expert selections for tokens near the top-k decision boundary. This is not a bug — it is a consequence of IEEE 754 non-associativity and is expected behaviour for any system using FP32 reductions at high concurrency.
Routing flips. Token "Paris" at position 5 may route to experts {3, 17} in run A but {8, 19} in run B due to the above. The downstream attention layers see different expert outputs — logit differences of ~10⁻⁴ that compound over 2,048 tokens to produce measurably different final output distributions. At 2M context, this drift is substantial.
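The accumulation-order effect reproduces on the CPU in float64; the same mechanism, at smaller magnitudes, flips FP32 GEMM reductions:

```python
a = 2.0 ** 53                    # large partial sum, as in a long reduction
left_to_right = (a + 1.0) - a    # the 1.0 is absorbed: a + 1.0 rounds back to a
regrouped = (a - a) + 1.0        # a different grouping keeps the 1.0
```

Two mathematically identical sums, two different results; when the difference lands on a top-k decision boundary, the expert assignment flips.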
Practical Mitigation Strategies
Trace replay. Cache the expert trace from a canonical first run as a [seq_len, k] uint16 tensor per MoE layer, and replay those exact expert assignments for subsequent evaluation runs on the same prompt. Storage cost: 4 MB per layer per 1M-context request at k=2 (roughly 0.5 GB across 120 layers). Acceptable for eval pipelines; not practical for production serving.
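A sketch of the replay mechanism (illustrative names; a production implementation records one trace per MoE layer and keys it by prompt hash):

```python
def trace_bytes(seq_len, k=2, bytes_per_id=2):
    """Storage for one layer's trace: seq_len x k uint16 expert ids."""
    return seq_len * k * bytes_per_id

def route(logits, k=2, replay=None):
    """Top-k routing; if a recorded trace is supplied, replay it verbatim."""
    if replay is not None:
        return list(replay)
    return sorted(range(len(logits)), key=lambda i: -logits[i])[:k]
```

On replay, the router's ranking is bypassed entirely, so accumulation-order drift in the gate GEMM can no longer change which experts fire.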
Audit logging. Log {layer_id, token_pos, expert_ids[], routing_probs[]} to a sidecar process for each request. Latency overhead ~3% for the write path. Allows post-hoc reconstruction of the model's reasoning path — important for regulatory purposes. Particularly relevant for the EU AI Act's requirements on high-risk system transparency.
Deterministic mode. Expose a deterministic=true serving parameter that forces synchronous GEMM accumulation order (disabling TMA async reordering). Expected throughput penalty: ~15% reduction. Justified for financial, healthcare, and scientific use cases where identical outputs for identical inputs are a hard requirement. Price deterministic mode at a premium in API offerings.
7. Putting It All Together: The Trifecta Is Multiplicative #
Each innovation enables the next. Remove any one and the system collapses back to hitting the wall it solves.
| Component | Wall Solved | Mechanism | Why the Next Component Depends on It |
|---|---|---|---|
| Sparse MQA | Memory: KV-cache O(S) → O(1) | Single shared KV head + local/sink sparsity pattern | MoE requires all active experts to read KV state. Without Sparse MQA, 128 experts × full-context KV overflows HBM at 128k+ context. MoE cannot deploy at long context without it. |
| Fused MoE Mega Kernel | Compute: FLOPs/token → active-params cost | Top-2 routing + TMA prefetch + grouped GEMM = 2.1× throughput | Enables 120-layer depth economically. A 120-layer dense model costs 5× a 24-layer dense model. A 120-layer MoE model with 5% activation costs the same compute as a 6-layer dense model. Depth becomes free. |
| Hyper-connections | Depth: gradient vanishing → O(1) gradient path | Learned residual weight vectors create direct gradient highways to all layers | Makes 120 trainable layers achievable. Without it, MoE's compute efficiency advantage at depth is unrealisable because training diverges before the model converges. Depth × MoE efficiency only materialises together. |
A representative modern frontier model combining all three: 2M token context via Sparse MQA, 1.8T total / ~400B active parameters via MoE, 120 layers via hyper-connections — running on an 8,000-GPU H100 cluster. Without all three simultaneously, you are forced to pick two: long context or deep model or large model, but not all three on the same hardware.
8. What This Means for Builders #
ML Engineers
- → Implement GQA before MQA. GQA with 8 KV groups gives 80% of MQA's memory saving with less than 0.5% quality loss on most tasks. Full MQA requires careful fine-tuning. Use FlashAttention-3 with num_key_value_heads=8 as the starting point.
- → MoE requires custom kernels from day one. Do not prototype with vanilla PyTorch loops — the performance gap vs Mega Kernels is 2×, which will mislead your architectural decisions. Start with MegaBlocks or Tutel. Profile TMA pipeline utilisation with nsys to verify async overlap is actually occurring.
- → Monitor FP8 scale factors obsessively. Log the histogram of absolute values per tensor group per 100 training steps. E4M3 saturation in router logits (values > 448) is a silent bug that degrades routing to near-random without crashing training. Set up an alert at >0.1% saturation rate.
- → Hyper-connections are ten lines of code. Add a learned alpha vector of shape [n_prior_layers] per layer. Initialise all weights to 1/n. Regularise with L2 on the alpha norms if you see instability. The gradient benefit is immediate at depth > 60 layers.
Technical PMs
- → Total parameters ≠ inference cost. When vendors quote model size, always ask for active parameters per token. DeepSeek-V3 at 671B total activates 37B — comparable to a 40B dense model. A 1.8T MoE at 22% activation costs similarly to a 400B dense. Do not overpay inference vendors who quote total parameters in their pricing materials.
- → Determinism is a feature, price it accordingly. Non-deterministic MoE routing means the same prompt returns different outputs across runs at the ~10⁻⁴ logit level. For finance, healthcare, and scientific applications this is unacceptable. Architect a deterministic=true mode with a defined latency penalty (expect ~15%), and price it at a premium in your SLA.
- → Context length degrades past training length × 8. Sparse MQA extrapolates, but not infinitely. A model trained on 256k context using RoPE + sparse attention will show measurable quality degradation past approximately 2M tokens. Budget fine-tuning cycles if your application requires near-training-length quality at 8× the training context.
- → B200 / FP4 changes expert design space. Models optimised for B200's FP4 will trend toward more experts (256+) with fewer parameters each. This changes routing granularity, load balancing requirements, and inter-GPU communication patterns. If you are planning a 2026 model training run targeting B200, factor in a full Mega Kernel rewrite — the FP4 layout is incompatible with FP8 kernels.
9. Closing Thought #
The bitter lesson told us scale wins. The 2024–2026 corollary is that sparsity wins at scale.
Dense transformers were the proof of concept. Sparse MQA + fused MoE + hyper-connections are the engineering that makes the proof of concept economically sustainable. Without all three, you are picking two walls to solve and accepting the third as your ceiling — the memory ceiling, the compute ceiling, or the depth ceiling.
The new bottlenecks are not architectural. They are data quality, alignment, and HBM supply chain. The architecture problem — how to train a 120-layer, trillion-parameter, 2M-context model on commodity GPU clusters — is largely solved. Build accordingly.
References: StreamingLLM (Xiao et al., ICLR 2024) · Hyper-Connections (Zhu et al., ByteDance, arXiv:2409.19606) · DeepSeek-V3 Technical Report (2024) · NVIDIA H100 Hopper Architecture Whitepaper · CUTLASS Grouped GEMM Documentation · MegaBlocks (Gale et al., 2023)