Weight Streaming: Why Your Model's Weights Are the Other Decode Bottleneck
The KV cache gets all the attention. But for large models at low batch sizes, loading transformer layer weights from HBM dominates decode latency just as much. This essay derives the weight bandwidth problem from first principles, shows where current systems leave performance on the table, and describes what a weight-streaming architecture looks like.
Everything We Talk About Is KV Cache. Here Is What We Are Not Talking About.
The standard narrative for LLM decode optimization goes like this: autoregressive generation is memory-bound, the dominant memory pressure is the KV cache, and the solution space is KV compression, GQA, paging, tiering, and eviction policy. That narrative is correct. But it is incomplete in a way that becomes consequential at large model scale and low batch sizes.
Consider what happens during a single decode step for a 70B BF16 model at batch size 1. The GPU must: load every transformer layer's weight matrices from HBM into the register file and L1 cache; compute a matrix-vector product with the current token's hidden state; apply the attention mechanism to a KV cache of some size; and emit a logit distribution for sampling.
The weight matrices dominate. At BF16, a 70B-parameter model occupies approximately 140 GB of HBM. Every single decode step requires reading all 140 GB across the HBM bus. At the H100 SXM's 3.35 TB/s bandwidth, that takes roughly 41 milliseconds of pure memory time — before any arithmetic, before any KV cache reads, before any software overhead.
For a single decode step: FLOPs ≈ 2 × parameters = 2 × 70B = 140 GFLOPs. Bytes read ≈ 2 × parameters (BF16) = 140 GB of weights, plus KV reads. Arithmetic intensity = FLOPs / Bytes ≈ 1 FLOP/byte. H100 ridge point: ~495 TFLOP/s (dense BF16) ÷ 3.35 TB/s ≈ 148 FLOPs/byte. The GPU is operating at under 1% of its peak compute. It is a ~495 TFLOP/s machine being used as a 3.35 TB/s memory bus.
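These figures are easy to reproduce. A quick sketch, assuming ~495 TFLOP/s of dense BF16 tensor-core throughput against 3.35 TB/s of HBM bandwidth:

```python
# Back-of-envelope decode roofline for a 70B BF16 model.
PARAMS = 70e9
HBM_BW = 3.35e12          # bytes/s
PEAK_FLOPS = 494.5e12     # dense BF16 throughput assumed for the ridge point

weight_bytes = 2 * PARAMS                   # BF16: 2 bytes/param -> 140 GB
flops = 2 * PARAMS                          # ~140 GFLOPs per decoded token

weight_read_ms = weight_bytes / HBM_BW * 1e3   # pure memory time per step
intensity = flops / weight_bytes               # FLOPs per byte of weights read
ridge = PEAK_FLOPS / HBM_BW                    # intensity needed to be compute-bound

print(f"{weight_read_ms:.1f} ms | {intensity:.0f} FLOP/B | ridge ~{ridge:.0f} FLOP/B")
```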
This is the weight bandwidth problem. It is not new — it has been known since the first generation of large-model serving work. But its implications are underappreciated, particularly for the question of how to architect weight residency across a multi-tier memory hierarchy.
At batch size 1, the 140 GB weight read from HBM dwarfs even the KV-cache read for a 32K-token context (roughly 10 GB with GQA). Neither is free. Neither can be ignored.
Where the Bytes Come From: A Layer-by-Layer Breakdown
To understand the weight bandwidth problem precisely, we need to account for where the bytes actually come from in a transformer layer. Take a 70B-class model with the LLaMA-3 architecture as a concrete reference: 80 layers, model dimension 8192, GQA with 64 query heads and 8 KV heads, SwiGLU FFN with intermediate dimension ~28672.
Per transformer layer, the weight matrices are:
| Matrix | Shape | BF16 Size | Purpose |
|---|---|---|---|
| W_q | 8192 × 8192 | 128 MB | Query projection |
| W_k | 8192 × 1024 | 16 MB | Key projection (8 KV heads) |
| W_v | 8192 × 1024 | 16 MB | Value projection |
| W_o | 8192 × 8192 | 128 MB | Output projection |
| W_gate | 8192 × 28672 | 448 MB | SwiGLU gate branch |
| W_up | 8192 × 28672 | 448 MB | SwiGLU up branch |
| W_down | 28672 × 8192 | 448 MB | FFN down projection |
| Per-layer total | | ~1.63 GB | FFN is ~82% of per-layer weight |
Across 80 layers: ~130 GB. Plus embeddings, LM head, and norms: ~140 GB total. This entire volume must cross the HBM bus every decode step, because the GPU's on-chip SRAM (50 MB L2 + 256 KB register file per SM) is far too small to hold more than a tiny fraction of it.
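The table can be reproduced from the architecture dimensions alone. A quick sketch, using binary MB to match the table's per-matrix sizes:

```python
# Per-layer weight accounting for a LLaMA-3-70B-shaped model (BF16).
D_MODEL, D_FFN, N_KV_HEADS, HEAD_DIM = 8192, 28672, 8, 128
BYTES = 2  # BF16

attn_bytes = (2 * D_MODEL * D_MODEL                           # W_q, W_o
              + 2 * D_MODEL * N_KV_HEADS * HEAD_DIM) * BYTES  # W_k, W_v (GQA)
ffn_bytes = 3 * D_MODEL * D_FFN * BYTES                       # W_gate, W_up, W_down

per_layer_mb = (attn_bytes + ffn_bytes) / 2**20   # "MB" as in the table above
ffn_share = ffn_bytes / (attn_bytes + ffn_bytes)
layers_gb = 80 * per_layer_mb / 1000
print(f"{per_layer_mb:.0f} MB/layer, FFN share {ffn_share:.0%}, "
      f"~{layers_gb:.0f} GB across 80 layers")
```

With these dimensions the FFN share comes out near 82%, and the layer stack totals roughly 130 GB before embeddings and the LM head.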
The FFN Dominates
Notice the asymmetry above: the FFN weights (W_gate + W_up + W_down) account for roughly 1.34 GB per layer — about 82% of the per-layer weight. The attention weights (W_q + W_k + W_v + W_o) account for the remaining 18%. This is not an accident. SwiGLU FFNs with a 3.5× expansion ratio are parameter-heavy by design; the tradeoff is quality per FLOP.
For weight streaming optimization, this asymmetry matters. Compressing or quantizing the FFN matrices delivers more bandwidth relief per unit of implementation complexity than compressing the attention matrices. A 4-bit FFN with a 16-bit attention module is a reasonable design point. The reverse is less compelling.
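A quick check of that claim, using the per-layer sizes from the table above: quantizing only the FFN to 4-bit cuts per-layer weight traffic far more than quantizing only attention.

```python
# Per-layer bandwidth relief from quantizing FFN vs. attention weights
# (70B-class layer: 288 MB attention, 1344 MB FFN in BF16).
ATTN_MB, FFN_MB = 288, 1344
INT4_SCALE = 0.25                              # 4-bit is a quarter of BF16

bf16 = ATTN_MB + FFN_MB
ffn_int4 = ATTN_MB + FFN_MB * INT4_SCALE       # quantize the FFN only
attn_int4 = ATTN_MB * INT4_SCALE + FFN_MB      # quantize attention only

print(f"INT4 FFN:  {bf16 / ffn_int4:.1f}x less weight traffic")
print(f"INT4 attn: {bf16 / attn_int4:.1f}x less weight traffic")
```

The FFN-only design point delivers roughly 2.6× traffic reduction; the reverse buys barely 1.2×.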
Why Batch Size Is the Free Variable That Hides the Problem
The textbook answer to the weight bandwidth problem is: increase batch size. If you process B tokens simultaneously, you pay the 140 GB weight-read cost once but generate B output tokens — amortizing the bandwidth cost by B. The arithmetic intensity scales as:
# Arithmetic intensity as a function of batch size B
intensity(B) = FLOPs / Bytes = (2 × params × B) / (2 × params + B × kv_bytes_per_seq)
# B = 1: the weight read dominates the denominator → ~1 FLOP/byte (deeply memory-bound)
# Large B: weight bytes amortize; with BF16 weights and small KV, intensity(B) ≈ B FLOPs/byte
# Ridge point for H100: ~148 FLOPs/byte → B_ridge ≈ 148 tokens decoded simultaneously
This analysis implies that a batch of 148 tokens would keep the H100 tensor cores fully occupied during decode. That sounds achievable. But it requires 148 concurrent sessions, each contributing exactly one decode step to the same batch. In practice, request arrivals are Poisson-distributed, context lengths vary, and prefill/decode mixing creates contention. Achieving B=148 consistently is harder than the math suggests.
| Batch Size | Arithmetic Intensity | Decode Latency (70B) | GPU Efficiency | Notes |
|---|---|---|---|---|
| 1 | ~1 FLOP/B | ~41 ms | 0.7% | Single user, terrible efficiency |
| 8 | ~8 FLOP/B | ~42 ms | 5% | 8× throughput, same latency |
| 32 | ~32 FLOP/B | ~43 ms | 22% | Still memory-bound |
| 128 | ~128 FLOP/B | ~46 ms | 86% | Near ridge point |
| 256 | ~148 FLOP/B | ~65 ms | ~100% | Compute-bound, latency rises |
The punchline is that for batch sizes below about 64, the decode step is dominated by weight reads, not by arithmetic. Every optimization effort focused on KV cache for these workloads is addressing the second bottleneck, not the first.
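The shape of the table falls out of a two-term latency model: per-step time is roughly the maximum of weight-read time and GEMM compute time. This toy version (which omits the KV and overhead terms the table includes) reproduces the flat memory-bound region and the crossover near B ≈ 148:

```python
# Two-term decode latency model: max(weight-read time, compute time) per step,
# assuming perfect overlap and ignoring KV reads and kernel overhead.
PARAMS = 70e9
HBM_BW, PEAK_FLOPS = 3.35e12, 494.5e12   # 3.35 TB/s; dense BF16 assumed

def step_ms(batch):
    mem_s = 2 * PARAMS / HBM_BW                  # weight read: batch-invariant
    compute_s = 2 * PARAMS * batch / PEAK_FLOPS  # FLOPs grow linearly with batch
    return max(mem_s, compute_s) * 1e3

for b in (1, 8, 32, 128, 256):
    print(f"B={b:3d}: {step_ms(b):5.1f} ms/step")
```

Below the ridge point the per-step latency barely moves as batch grows, which is exactly why batching amortizes the weight read "for free" until compute takes over.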
INT4 Weights Are Not About Accuracy. They Are About Bandwidth.
This is the most important reframe in this essay. The motivation for weight quantization in LLM serving is widely described in terms of model compression, memory footprint, or cost reduction. These are real benefits. But the primary mechanism of action for decode throughput improvement is bandwidth reduction.
INT4 weights occupy a quarter of the bytes of BF16. Reading 140 GB of BF16 weights takes ~41 ms at 3.35 TB/s. Reading the same model in INT4 (~35 GB) takes ~10 ms. That is roughly a 4× decode throughput improvement at B=1, which degrades gradually as batch size increases (because compute begins to dominate). At B=32, the improvement is still roughly 3×. At B=128, it is ~1.2×. The entire "LLM.int8() is lossless, GPTQ is nearly lossless" literature is really an argument about whether we can capture the bandwidth benefit without sacrificing model quality — not an argument about compression for its own sake.
Weight quantization from BF16 to INT4 is a 4× bandwidth multiplier on the weight-read path. At B=1, this is equivalent to upgrading from an H100 (3.35 TB/s) to a hypothetical 13.4 TB/s memory system that does not exist. No hardware upgrade achieves this. Quantization is the only practical path to weight-bandwidth relief at current silicon economics.
The Accuracy-Bandwidth Tradeoff Landscape
| Format | Bytes/param | Weights read/step (70B) | Decode ms (B=1, H100) | Relative Quality |
|---|---|---|---|---|
| BF16 | 2 | 140 GB | ~41 ms | Baseline |
| FP8 | 1 | 70 GB | ~21 ms | ~0.1% degradation |
| INT8 (W8A16) | 1 | 70 GB | ~21 ms | Nearly lossless |
| INT4 (GPTQ) | 0.5 | 35 GB | ~10 ms | 0.5–2% perplexity rise |
| INT4 (AWQ) | 0.5 | 35 GB | ~10 ms | <0.5% perplexity rise |
| INT2 (AQLM) | 0.25 | 17.5 GB | ~5 ms | 2–5% perplexity rise |
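The middle columns of this table are pure arithmetic: bytes per parameter times parameter count, divided by bandwidth. A sketch that reproduces them:

```python
# Weight-read time per decode step for a 70B model at 3.35 TB/s,
# as a function of weight format.
PARAMS, HBM_BW = 70e9, 3.35e12
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "INT8": 1.0, "INT4": 0.5, "INT2": 0.25}

read_ms = {fmt: PARAMS * b / HBM_BW * 1e3 for fmt, b in BYTES_PER_PARAM.items()}
for fmt, ms in read_ms.items():
    gb = PARAMS * BYTES_PER_PARAM[fmt] / 1e9
    print(f"{fmt:5s}: {gb:6.1f} GB -> {ms:4.1f} ms")
```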
What a Weight-Streaming System Actually Looks Like
Current serving systems load the full model into HBM at startup and keep it resident for the lifetime of the server process. This is optimal when HBM is large enough to hold both the model weights and the KV cache for the concurrent request load — which was true for 7B–13B models on 80 GB H100s, but is no longer true for 70B+ models at long context or high concurrency.
A weight-streaming architecture abandons the assumption of full static residency. Instead, it treats weight tensors as data objects with known access patterns, and manages their placement across the memory hierarchy just as the KV cache manager manages KV pages.
Layer Ordering Gives You Perfect Prefetch
Unlike the KV cache — whose access patterns are partially unpredictable (request length, tool calls, speculative rollbacks) — weight access during decode is completely deterministic. Layer 0 is always followed by layer 1. Layer 1 is always followed by layer 2. The weight streaming engine knows the entire access schedule in advance and can issue DMA prefetches with arbitrarily long lookahead.
This is the key advantage weight streaming has over KV streaming: prediction is trivial. The only complexity is bandwidth budget management — ensuring that weight DMA transfers do not starve the KV prefetch path that is also competing for HBM write bandwidth.
# Pseudocode: weight streaming loop (layers 0 and 1 are staged before the loop)
for decode_step in range(max_tokens):
    for layer_idx in range(num_layers):
        # Fire DMA for layer_idx + 2 while computing layer_idx
        if layer_idx + 2 < num_layers:
            dma.prefetch_async(
                src=weight_store.get(layer_idx + 2),
                dst=hbm_staging_buffer,
                priority=WEIGHT_STREAM,
                bandwidth_cap=bw_budget * 0.4,  # leave 60% for KV
            )
        # Wait only if the prefetch hasn't landed yet (rare)
        weights = hbm_staging_buffer.wait(layer_idx)
        # Decompress INT4 → BF16 in registers during GEMV setup
        weights_bf16 = dequantize(weights, scale=scale_factors[layer_idx])
        # Execute the GEMV against the current hidden state
        hidden_state = matmul(weights_bf16, hidden_state)
Weight Streaming Competes With KV Prefetch for the Same HBM Write Port
Here is the tension nobody talks about enough: weight streaming and KV cache prefetch both write to HBM. They share the same memory controllers and the same write bandwidth. If weight DMA and KV DMA are scheduled naively, they can overlap in the worst possible way — both saturating the HBM write path during the same GPU compute phase — causing stalls that eliminate the benefit of prefetch on either path.
The right architecture treats HBM write bandwidth as an explicit resource with a scheduler. The weight streaming engine and the KV cache manager submit DMA requests to a shared bandwidth arbitrator, which:
- Tracks available HBM write bandwidth in real time (measured via hardware performance counters)
- Prioritizes KV prefetch for sessions with imminent next-token generation over weight prefetch
- Schedules weight DMA during compute-bound FFN phases (high arithmetic intensity = low HBM read pressure)
- Avoids DMA during memory-bound attention phases (long sequence = high HBM read pressure)
A weight streaming implementation that does not model the GPU's memory access pattern can easily make decode latency worse than static HBM residency, by issuing weight DMA at exactly the wrong moment and creating HBM contention spikes. The bandwidth arbitrator is not optional. It is what makes streaming work.
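One way to make the arbitrator concrete is a priority queue drawing against a per-quantum byte budget. This is a sketch under assumed semantics, not any shipping system's API; all names (`BandwidthArbitrator`, `submit`, `tick`) are illustrative:

```python
import heapq

# Hypothetical HBM write-bandwidth arbitrator: DMA requests carry a priority
# and a byte count; grants are issued against a budget refilled each quantum.
KV_PREFETCH, WEIGHT_STREAM = 0, 1   # lower number = higher priority

class BandwidthArbitrator:
    def __init__(self, budget_bytes_per_quantum):
        self.budget = budget_bytes_per_quantum
        self.queue = []          # (priority, seq, nbytes, tag)
        self.seq = 0

    def submit(self, priority, nbytes, tag):
        heapq.heappush(self.queue, (priority, self.seq, nbytes, tag))
        self.seq += 1

    def tick(self):
        """Refill the budget and grant queued DMA in priority order."""
        available = self.budget
        granted = []
        while self.queue and self.queue[0][2] <= available:
            _, _, nbytes, tag = heapq.heappop(self.queue)
            available -= nbytes
            granted.append(tag)
        return granted

arb = BandwidthArbitrator(budget_bytes_per_quantum=1 << 30)  # 1 GiB/quantum
arb.submit(WEIGHT_STREAM, 800 << 20, "layer_41_weights")
arb.submit(KV_PREFETCH, 300 << 20, "session_7_kv")
print(arb.tick())   # KV prefetch granted first; the weight DMA waits a quantum
```

A real arbitrator would also need preemption, starvation guards, and counter-driven budget measurement; the point here is only that every DMA grant is made against an explicit, shared budget rather than fired blindly.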
Sparse Expert Activation Creates a Weight Residency Prediction Problem
Everything above assumes a dense model where all weights are accessed at every layer. Mixture-of-Experts changes the problem fundamentally: in a 256-expert model (like DeepSeek-V3), each token activates 8 experts per FFN layer. The other 248 experts' weights are never read. This creates a new opportunity: only prefetch the experts that will actually be activated.
This is weight streaming's killer application. For a 671B-parameter MoE model with 37B active parameters per token, the FFN weight read per decode step is not 671B × 2 bytes = 1.34 TB. It is 37B × 2 bytes = 74 GB — only the active experts. But to achieve this, you need to know which experts will be activated before you issue the DMA.
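The savings are pure arithmetic, under the simplifying assumption that routing is predicted well enough that only active-parameter weights cross the bus:

```python
# Active-weight read per decode step for a DeepSeek-V3-scale MoE (BF16).
TOTAL_PARAMS, ACTIVE_PARAMS = 671e9, 37e9
BYTES, HBM_BW = 2, 3.35e12

dense_read_tb = TOTAL_PARAMS * BYTES / 1e12    # 1.34 TB if every expert is read
active_read_gb = ACTIVE_PARAMS * BYTES / 1e9   # 74 GB with perfect routing
active_ms = ACTIVE_PARAMS * BYTES / HBM_BW * 1e3
print(f"{dense_read_tb:.2f} TB vs {active_read_gb:.0f} GB (~{active_ms:.0f} ms/step)")
```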
The router logit prediction problem: can you predict which experts a token will route to, before running the full forward pass that produces the routing decision? Partly yes. Several properties help:
- Expert locality: Tokens of similar semantic type tend to route to similar expert subsets across multiple decode steps. A coding token is likely to route to coding-specialized experts on the next step too.
- Historical trace: The router's decisions for the last 8–16 tokens are known and can be used to predict the next step's routing via a lightweight secondary model or frequency table.
- Prefetch tolerance: Expert weights are large enough (hundreds of MB to a few GB each, depending on expert width) that even a rough prediction (say 6 of 8 experts correct) meaningfully reduces stalls compared to reactive loading.
This is why the MoE router history signal appears in the predictive weight orchestration essay — it is not just a curiosity, it is the signal that makes expert-granular weight streaming viable.
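As an illustration, a frequency-table predictor over a sliding window of router decisions takes only a few lines. `ExpertPredictor` and its parameters are hypothetical; a real system would verify predictions against the actual router output and fetch misses reactively:

```python
from collections import Counter, deque

# Hypothetical expert-prefetch predictor: predict the next step's top-k
# experts for a layer from the last `window` routing decisions.
class ExpertPredictor:
    def __init__(self, window=16, top_k=8):
        self.history = deque(maxlen=window)  # each entry: set of expert ids
        self.top_k = top_k

    def observe(self, activated_experts):
        self.history.append(set(activated_experts))

    def predict(self):
        counts = Counter(e for step in self.history for e in step)
        return {e for e, _ in counts.most_common(self.top_k)}

pred = ExpertPredictor(window=4, top_k=2)
for step in ([3, 7], [3, 9], [3, 7], [7, 9]):
    pred.observe(step)
print(pred.predict())   # {3, 7}: the most frequently activated experts
```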
What Changes in the Serving Stack
Implementing weight streaming end-to-end requires changes at several layers of the serving stack. Here is a concrete inventory:
| Layer | Current State | Weight-Streaming State |
|---|---|---|
| Weight storage | BF16 tensors in HBM, static | Quantized (INT4/FP8) in tiered store, metadata-annotated |
| Layer execution | GEMV on pre-resident weights | GEMV + inline dequantization from streaming buffer |
| DMA engine | Used for KV prefetch only | Shared arbitration: weight stream + KV prefetch |
| HBM allocation | Static model partition + dynamic KV | Fluid: HBM shared between hot weights and hot KV |
| Inference framework | Weight loading at startup | Layer-granular fetch, cancel, and re-prioritize APIs |
| MoE routing | Router executes, then experts load | Router prediction → expert prefetch → verify-and-correct |
The hardest change is the HBM allocation model. Current systems partition HBM statically: N GB for model weights, M GB for KV cache, with a hard boundary between them. Weight streaming requires a fluid partition that can shift the boundary in real time as request load changes. A serving system with 10 long-context requests should hold fewer weight layers in HBM (more KV needed) than one with 100 short-context requests (KV pressure low, keep more weights hot). The memory scheduler needs to manage this boundary.
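The boundary policy itself is simple arithmetic; the hard part is migrating tensors safely as it moves. A sketch with illustrative sizes (80 GB card, 70B model in INT4, GQA KV at ~0.33 MB/token; all constants are assumptions):

```python
# Hypothetical fluid HBM split: given current KV demand, how many
# quantized transformer layers can stay resident in HBM?
HBM_GB = 80.0                    # 80 GB card
N_LAYERS = 80
LAYER_GB = 35.0 / N_LAYERS       # 70B model in INT4 (~35 GB total)
KV_GB_PER_TOKEN = 327_680 / 1e9  # 80 layers x 8 KV heads x 128 dim x 2 (K,V) x BF16
RESERVED_GB = 6.0                # activations, buffers, fragmentation slack

def resident_layers(n_requests, avg_ctx_tokens):
    kv_gb = n_requests * avg_ctx_tokens * KV_GB_PER_TOKEN
    free_gb = HBM_GB - RESERVED_GB - kv_gb
    return max(0, min(N_LAYERS, int(free_gb / LAYER_GB)))

# 10 long-context sessions leave no room: every layer streams.
# 100 short-context sessions let ~15 layers stay hot.
print(resident_layers(10, 32768), resident_layers(100, 2048))
```

The numbers only illustrate the direction of the policy: KV demand pushes the boundary toward streaming more layers, and a production scheduler would add hysteresis so the boundary does not thrash.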
HBM4 Changes the Math, But Doesn't Solve the Problem
HBM4, expected in GPU products around 2026–2027, targets approximately 2 TB/s per stack, roughly double the per-stack bandwidth of HBM3e. A successor part with a similar stack count would roughly double aggregate bandwidth over the 3.35 TB/s assumed here, moving B=1 decode latency from ~41 ms toward ~20 ms for a 70B BF16 model.
That is genuinely useful. But it does not change the fundamental arithmetic: decode will still be memory-bandwidth-bound for batch sizes below the ridge point, and the ridge point will still require 100+ concurrent tokens. The weight bandwidth problem shrinks by 2×; it does not go away. And models are getting larger: a plausible 400B-parameter frontier MoE with 80B active parameters hits the same wall at a different scale.
The real unlock is the combination: HBM4 bandwidth × INT4 quantization × weight streaming with expert prediction. Together they can compress the effective weight-read time by 8–12× relative to BF16 static residency on HBM3e. That is the regime where decode latency for B=1 drops below 5 ms for a 70B-equivalent model — approaching the latency profile of a much smaller, less capable model.
Weights are not infrastructure. They are data. Once you treat them that way — with residency policies, prefetch schedules, and quantization as a bandwidth contract — the serving stack becomes significantly more capable at the same hardware cost.