AI Infrastructure · Weight Serving · Decode Optimization

Weight Streaming: Why Your Model's Weights Are the Other Decode Bottleneck

The KV cache gets all the attention. But for large models at low batch sizes, loading transformer layer weights from HBM dominates decode latency just as much. This essay derives the weight bandwidth problem from first principles, shows where current systems leave performance on the table, and describes what a weight-streaming architecture looks like.

MANISH AI · 14 min read
140 GB — Weight footprint of a 70B BF16 model; it must cross the HBM bus every decode step
~1 FLOP/B — Arithmetic intensity during decode; memory-bound by a factor of ~150×
~41 ms — Weight-read time at 3.35 TB/s for a single 70B BF16 decode step, batch=1
3–7× — Theoretical decode throughput gain from weight streaming with quantization
01 — The Framing

Everything We Talk About Is KV Cache. Here Is What We Are Not Talking About.

The standard narrative for LLM decode optimization goes like this: autoregressive generation is memory-bound, the dominant memory pressure is the KV cache, and the solution space is KV compression, GQA, paging, tiering, and eviction policy. That narrative is correct. But it is incomplete in a way that becomes consequential at large model scale and low batch sizes.

Consider what happens during a single decode step for a 70B BF16 model at batch size 1. The GPU must: load every transformer layer's weight matrices from HBM into the register file and L1 cache; compute a matrix-vector product with the current token's hidden state; apply the attention mechanism to a KV cache of some size; and emit a logit distribution for sampling.

The weight matrices dominate. At BF16, a 70B-parameter model occupies approximately 140 GB of HBM. Every single decode step requires reading all 140 GB across the HBM bus. At H200's 3.35 TB/s bandwidth, that takes 41 milliseconds of pure memory time — before any arithmetic, before any KV cache reads, before any software overhead.

📐
The Arithmetic

For a single decode step: FLOPs ≈ 2 × parameters = 2 × 70B = 140 GFLOPs. Bytes read ≈ 2 bytes × parameters (BF16) = 140 GB of weights, plus KV reads. Arithmetic intensity = FLOPs / bytes ≈ 1 FLOP/byte. H200 ridge point: ~148 FLOPs/byte (taking ~494 TFLOP/s of sustained dense BF16 throughput over 3.35 TB/s). The GPU is operating at well under 1% of its compute capability: a machine rated for nearly a petaFLOP/s of tensor throughput is being used as a 3.35 TB/s memory bus.
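The box's arithmetic can be reproduced in a few lines (a sketch; the ~494 TFLOP/s sustained dense BF16 figure is an assumption chosen to match the ~148 FLOPs/byte ridge point used throughout this essay):

```python
# Decode-step roofline arithmetic for a 70B BF16 model on H200-class hardware.
# Hardware figures are the ones quoted in the text; dense_flops is an assumption.
params = 70e9            # model parameters
bytes_per_param = 2      # BF16
hbm_bw = 3.35e12         # HBM bandwidth, bytes/s
dense_flops = 494e12     # sustained dense BF16 FLOP/s (assumed)

flops_per_step = 2 * params                   # ~140 GFLOPs per decode step
weight_bytes = params * bytes_per_param       # ~140 GB read per step
intensity = flops_per_step / weight_bytes     # ~1 FLOP/byte
ridge = dense_flops / hbm_bw                  # ~148 FLOPs/byte
weight_read_ms = weight_bytes / hbm_bw * 1e3  # ~42 ms of pure memory time

print(f"intensity ≈ {intensity:.0f} FLOP/B, ridge ≈ {ridge:.0f} FLOP/B, "
      f"weight read ≈ {weight_read_ms:.0f} ms")
```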

This is the weight bandwidth problem. It is not new — it has been known since the first generation of large-model serving work. But its implications are underappreciated, particularly for the question of how to architect weight residency across a multi-tier memory hierarchy.

At batch size 1, loading model weights from HBM (~140 GB) costs an order of magnitude more than loading the KV cache for a 32K-token context (~10 GB with GQA). Neither is free. Neither can be ignored.

02 — First Principles

Where the Bytes Come From: A Layer-by-Layer Breakdown

To understand the weight bandwidth problem precisely, we need to account for where the bytes actually come from in a transformer layer. Take a 70B-class model with the LLaMA-3 architecture as a concrete reference: 80 layers, model dimension 8192, GQA with 64 query heads and 8 KV heads, SwiGLU FFN with intermediate dimension ~28672.

Per transformer layer, the weight matrices are:

| Matrix | Shape | BF16 Size | Purpose |
|---|---|---|---|
| W_q | 8192 × 8192 | 128 MB | Query projection |
| W_k | 8192 × 1024 | 16 MB | Key projection (8 KV heads) |
| W_v | 8192 × 1024 | 16 MB | Value projection |
| W_o | 8192 × 8192 | 128 MB | Output projection |
| W_gate | 8192 × 28672 | 448 MB | SwiGLU gate branch |
| W_up | 8192 × 28672 | 448 MB | SwiGLU up branch |
| W_down | 28672 × 8192 | 448 MB | FFN down projection |
| Per-layer total | | ~1.63 GB | FFN dominates (~82% of per-layer weight) |

Across 80 layers: ~130 GB. Plus embeddings, LM head, and norms: ~140 GB total. This entire volume must cross the HBM bus every decode step, because the GPU's on-chip SRAM (50 MB L2 + 256 KB register file per SM) is far too small to hold more than a tiny fraction of it.
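The byte accounting above can be checked mechanically (a sketch using the reference dimensions; the table's "MB" are binary mebibytes, and in decimal gigabytes the 80-layer total comes out near 137 GB, consistent with the ~140 GB overall figure):

```python
# Per-layer weight bytes for the 70B reference architecture (dims from the text)
d_model, d_ff, n_layers = 8192, 28672, 80
kv_dim = 8 * 128    # 8 KV heads x head_dim 128
bf16 = 2            # bytes per parameter

attn_bytes = (d_model * d_model        # W_q
              + 2 * d_model * kv_dim   # W_k, W_v
              + d_model * d_model) * bf16   # W_o
ffn_bytes = 3 * d_model * d_ff * bf16       # W_gate, W_up, W_down
per_layer = attn_bytes + ffn_bytes

print(f"attention: {attn_bytes / 2**20:.0f} MiB, FFN: {ffn_bytes / 2**20:.0f} MiB")
print(f"per layer: {per_layer / 2**20:.0f} MiB, FFN share {ffn_bytes / per_layer:.0%}")
print(f"80 layers: {n_layers * per_layer / 1e9:.0f} GB (plus embeddings, LM head, norms)")
```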

The FFN Dominates

Notice the asymmetry above: the FFN weights (W_gate + W_up + W_down) account for roughly 1.34 GB per layer — about 82% of the per-layer weight. The attention weights (W_q + W_k + W_v + W_o) account for the remaining ~18%. This is not an accident. SwiGLU FFNs with a 3.5× expansion ratio are parameter-heavy by design; the tradeoff is quality per FLOP.

For weight streaming optimization, this asymmetry matters. Compressing or quantizing the FFN matrices delivers more bandwidth relief per unit of implementation complexity than compressing the attention matrices. A 4-bit FFN with a 16-bit attention module is a reasonable design point. The reverse is less compelling.
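A quick comparison of per-layer bytes under each mixed-precision assignment shows why the 4-bit-FFN / 16-bit-attention split is the attractive one (a sketch using the dimensions above; it ignores quantization scales and group metadata):

```python
# Per-layer weight bytes under mixed-precision assignments (dims from the text)
d_model, d_ff, kv_dim = 8192, 28672, 1024
attn_params = 2 * d_model * d_model + 2 * d_model * kv_dim  # W_q, W_o, W_k, W_v
ffn_params = 3 * d_model * d_ff                             # W_gate, W_up, W_down

def layer_bytes(attn_bits: int, ffn_bits: int) -> float:
    return attn_params * attn_bits / 8 + ffn_params * ffn_bits / 8

base = layer_bytes(16, 16)  # all-BF16 baseline
for name, attn_b, ffn_b in [("INT4 FFN, BF16 attention", 16, 4),
                            ("BF16 FFN, INT4 attention", 4, 16),
                            ("INT4 everywhere", 4, 4)]:
    b = layer_bytes(attn_b, ffn_b)
    print(f"{name}: {b / 2**20:.0f} MiB/layer -> {base / b:.1f}x bandwidth relief")
```

Quantizing only the FFN already buys ~2.6× of the full 4×; quantizing only attention buys ~1.2× — which is the asymmetry the paragraph above describes.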

03 — Batching as the Partial Cure

Why Batch Size Is the Free Variable That Hides the Problem

The textbook answer to the weight bandwidth problem is: increase batch size. If you process B tokens simultaneously, you pay the 140 GB weight-read cost once but generate B output tokens — amortizing the bandwidth cost by B. The arithmetic intensity scales as:

# Arithmetic intensity as a function of decode batch size B.
# FLOPs ≈ 2 * params * B; weight bytes ≈ 2 * params (read once per step,
# amortized over the batch); KV bytes grow linearly with B.
def intensity(B, params, kv_bytes_per_seq):
    return (2 * params * B) / (2 * params + B * kv_bytes_per_seq)

# B = 1:      ≈ 1 FLOP/byte (deeply memory-bound)
# large B:    → 2 * params / kv_bytes_per_seq (weight cost fully amortized)
# H200 ridge: ~148 FLOPs/byte, so B ≈ 148 is needed to saturate compute

This analysis implies that a batch of 148 tokens would keep the H200 tensor cores fully occupied during decode. That sounds achievable. But it requires 148 concurrent sessions, each contributing exactly one decode step to the same batch. In practice, request arrivals are Poisson-distributed, context lengths vary, and prefill/decode mixing creates contention. Achieving B=148 consistently is harder than the math suggests.
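Little's law makes the sustained-concurrency requirement concrete (a sketch; the 10-second mean session duration is an illustrative assumption):

```python
import math

# Little's law: mean concurrent sessions = arrival rate x mean session duration.
# Sustaining a decode batch of B ≈ 148 therefore pins down the required traffic.
def required_arrival_rate(target_batch: float, mean_session_s: float) -> float:
    return target_batch / mean_session_s

rate = required_arrival_rate(148, 10.0)   # ~14.8 requests/s, sustained
# Even at that rate, Poisson occupancy fluctuates with std ≈ sqrt(mean),
# so the batch sits below its target roughly half the time.
sigma = math.sqrt(148)
print(f"{rate:.1f} req/s sustained, ±{sigma:.0f} sessions of typical fluctuation")
```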

| Batch Size | Arithmetic Intensity | Decode Latency (70B) | GPU Efficiency | Notes |
|---|---|---|---|---|
| 1 | ~1 FLOP/B | ~41 ms | 0.7% | Single user, terrible efficiency |
| 8 | ~8 FLOP/B | ~42 ms | 5% | 8× throughput, same latency |
| 32 | ~32 FLOP/B | ~43 ms | 22% | Still memory-bound |
| 128 | ~128 FLOP/B | ~46 ms | 86% | Near ridge point |
| 256 | ~148 FLOP/B | ~65 ms | ~100% | Compute-bound, latency rises |

The punchline is that for batch sizes below about 64, the decode step is dominated by weight reads, not by arithmetic. Every optimization effort focused on KV cache for these workloads is addressing the second bottleneck, not the first.

04 — Weight Quantization as Bandwidth Relief

INT4 Weights Are Not About Accuracy. They Are About Bandwidth.

This is the most important reframe in this essay. The motivation for weight quantization in LLM serving is widely described in terms of model compression, memory footprint, or cost reduction. These are real benefits. But the primary mechanism of action for decode throughput improvement is bandwidth reduction.

INT4 weights occupy a quarter of the bytes of BF16. Reading 140 GB of BF16 weights takes ~41 ms at 3.35 TB/s. Reading the same model in INT4 takes ~10 ms. That is close to a 4× decode throughput improvement at B=1 on the weight-read path, and the benefit shrinks gradually as batch size increases, because the weight read amortizes and compute begins to dominate. The entire "LLM.int8 is lossless, GPTQ is nearly lossless" literature is really an argument about whether we can capture the bandwidth benefit without sacrificing model quality — not an argument about compression for its own sake.

💡
The Quantization Reframe

Weight quantization from BF16 to INT4 is a 4× bandwidth multiplier on the weight-read path. At B=1 and H200 bandwidth, this is equivalent to upgrading from an H200 (3.35 TB/s) to a hypothetical 13.4 TB/s memory system that does not exist. No hardware upgrade achieves this. Quantization is the only practical path to weight-bandwidth relief at current silicon economics.

The Accuracy-Bandwidth Tradeoff Landscape

| Format | Bytes/param | Weight Read (70B) | Decode ms (B=1, H200) | Relative Quality |
|---|---|---|---|---|
| BF16 | 2 | 140 GB | ~41 ms | Baseline |
| FP8 | 1 | 70 GB | ~21 ms | ~0.1% degradation |
| INT8 (W8A16) | 1 | 70 GB | ~21 ms | Nearly lossless |
| INT4 (GPTQ) | 0.5 | 35 GB | ~10 ms | 0.5–2% perplexity rise |
| INT4 (AWQ) | 0.5 | 35 GB | ~10 ms | <0.5% perplexity rise |
| INT2 (AQLM) | 0.25 | 17.5 GB | ~5 ms | 2–5% perplexity rise |
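The latency column in the table above reduces to one line of arithmetic per format (a sketch; the quality column, of course, does not):

```python
# Weight-read time per decode step as a function of weight precision
# (H200 figures from the text)
params = 70e9       # parameters
hbm_bw = 3.35e12    # bytes/s

for fmt, bytes_per_param in [("BF16", 2.0), ("FP8 / INT8", 1.0),
                             ("INT4", 0.5), ("INT2", 0.25)]:
    weight_gb = params * bytes_per_param / 1e9
    read_ms = params * bytes_per_param / hbm_bw * 1e3
    print(f"{fmt}: {weight_gb:.1f} GB -> ~{read_ms:.0f} ms per decode step")
```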
05 — Weight Streaming Architecture

What a Weight-Streaming System Actually Looks Like

Current serving systems load the full model into HBM at startup and keep it resident for the lifetime of the server process. This is optimal when HBM is large enough to hold both the model weights and the KV cache for the concurrent request load — which was true for 7B–13B models on 80 GB H100s, but is no longer true for 70B+ models at long context or high concurrency.

A weight-streaming architecture abandons the assumption of full static residency. Instead, it treats weight tensors as data objects with known access patterns, and manages their placement across the memory hierarchy just as the KV cache manager manages KV pages.

Fig 1 — Weight Streaming: Tier Placement and Access Patterns
  • GPU HBM (80–141 GB, 3.35 TB/s): tensor cores execute the GEMV (weights × hidden); ~41 ms/step at BF16, ~10 ms at INT4
  • Host DRAM (512 GB – 2 TB, ~50 GB/s over PCIe P2P): staging tier for mid layers
  • NVMe SSD (10–100 TB, ~12 GB/s): cold model variants — draft models, LoRA adapters
  • Weight streaming engine: layer scheduler, DMA orchestrator, prefetch predictor, quantization decompressor, residency tracker; its bandwidth budget is shared with KV prefetch
  • Weight residency policy: hot layers (embeddings, first 8, last 8) pinned in HBM; mid layers prefetched 2 ahead via DRAM staging → HBM DMA
Weight streaming architecture. The streaming engine manages weight placement across HBM, DRAM, and NVMe tiers, prefetching layer weights ahead of the decode loop and decompressing quantized weights on-the-fly as they enter the GPU. The bandwidth budget is shared with the KV prefetch path.

Layer Ordering Gives You Perfect Prefetch

Unlike the KV cache — whose access patterns are partially unpredictable (request length, tool calls, speculative rollbacks) — weight access during decode is completely deterministic. Layer 0 is always followed by layer 1. Layer 1 is always followed by layer 2. The weight streaming engine knows the entire access schedule in advance and can issue DMA prefetches with arbitrarily long lookahead.

This is the key advantage weight streaming has over KV streaming: prediction is trivial. The only complexity is bandwidth budget management — ensuring that weight DMA transfers do not starve the KV prefetch path that is also competing for HBM write bandwidth.

# Pseudocode: weight streaming decode loop with 2-layer lookahead
NUM_STAGING = 3  # ring of staging buffers: one computing, one landed, one in flight
for decode_step in range(max_tokens):
    for layer_idx in range(num_layers):
        # Fire DMA for layer+2 into its staging slot while computing this layer
        if layer_idx + 2 < num_layers:
            dma.prefetch_async(
                src=weight_store.get(layer_idx + 2),
                dst=staging[(layer_idx + 2) % NUM_STAGING],
                priority=WEIGHT_STREAM,
                bandwidth_cap=bw_budget * 0.4,  # leave 60% for KV prefetch
            )
        # Block only if this layer's prefetch hasn't landed yet (rare)
        weights = staging[layer_idx % NUM_STAGING].wait()
        # Dequantize INT4 -> BF16 in registers during GEMV setup
        weights_bf16 = dequantize(weights, scale=scale_factors[layer_idx])
        # Execute the layer's GEMV against the current hidden state
        hidden_state = matmul(weights_bf16, hidden_state)
06 — The Bandwidth Budget Problem

Weight Streaming Competes With KV Prefetch for the Same HBM Write Port

Here is the tension nobody talks about enough: weight streaming and KV cache prefetch both write to HBM. They share the same memory controllers and the same write bandwidth. If weight DMA and KV DMA are scheduled naively, they can overlap in the worst possible way — both saturating the HBM write path during the same GPU compute phase — causing stalls that eliminate the benefit of prefetch on either path.

The right architecture treats HBM write bandwidth as an explicit resource with a scheduler. The weight streaming engine and the KV cache manager submit DMA requests to a shared bandwidth arbitrator, which:

  • Tracks available HBM write bandwidth in real time (measured via hardware performance counters)
  • Prioritizes KV prefetch for sessions with imminent next-token generation over weight prefetch
  • Schedules weight DMA during compute-bound FFN phases (high arithmetic intensity = low HBM read pressure)
  • Avoids DMA during memory-bound attention phases (long sequence = high HBM read pressure)
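A minimal arbitrator along these lines might look like the following (a sketch; the class and method names are hypothetical, not from any real serving stack):

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class DmaRequest:
    priority: int                      # lower = more urgent; imminent-KV wins
    bytes_needed: int = field(compare=False)
    path: str = field(compare=False)   # "kv_prefetch" or "weight_stream"

class BandwidthArbitrator:
    """Admits DMA requests against a per-interval HBM write-bandwidth budget."""
    def __init__(self, write_bw_bytes_per_s: float, interval_s: float = 1e-3):
        self.budget = write_bw_bytes_per_s * interval_s  # bytes per interval
        self.queue: list[DmaRequest] = []

    def submit(self, req: DmaRequest) -> None:
        heapq.heappush(self.queue, req)

    def schedule(self, measured_read_pressure: float) -> list[DmaRequest]:
        # Back off when compute is in a memory-bound (attention) phase:
        # high read pressure -> little write bandwidth to spare for DMA
        available = self.budget * max(0.0, 1.0 - measured_read_pressure)
        admitted = []
        while self.queue and self.queue[0].bytes_needed <= available:
            req = heapq.heappop(self.queue)
            available -= req.bytes_needed
            admitted.append(req)
        return admitted
```

KV prefetch for sessions with imminent next-token generation submits at priority 0; weight prefetch submits lower, so it is naturally deferred into compute-bound FFN phases, when `measured_read_pressure` is low.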
⚠️
The Naive Implementation Hazard

A weight streaming implementation that does not model the GPU's memory access pattern can easily make decode latency worse than static HBM residency, by issuing weight DMA at exactly the wrong moment and creating HBM contention spikes. The bandwidth arbitrator is not optional. It is what makes streaming work.

07 — MoE Makes This Much More Interesting

Sparse Expert Activation Creates a Weight Residency Prediction Problem

Everything above assumes a dense model where all weights are accessed at every layer. Mixture-of-Experts changes the problem fundamentally: in a 256-expert model (like DeepSeek-V3), each token activates 8 experts per FFN layer. The other 248 experts' weights are never read. This creates a new opportunity: only prefetch the experts that will actually be activated.

This is weight streaming's killer application. For a 671B-parameter MoE model with 37B active parameters per token, the FFN weight read per decode step is not 671B × 2 bytes = 1.34 TB. It is 37B × 2 bytes = 74 GB — only the active experts. But to achieve this, you need to know which experts will be activated before you issue the DMA.

The router logit prediction problem: can you predict which experts a token will route to, before running the full forward pass that produces the routing decision? Partly yes. Several properties help:

  • Expert locality: Tokens of similar semantic type tend to route to similar expert subsets across multiple decode steps. A coding token is likely to route to coding-specialized experts on the next step too.
  • Historical trace: The router's decisions for the last 8–16 tokens are known and can be used to predict the next step's routing via a lightweight secondary model or frequency table.
  • Prefetch tolerance: Expert weights are large enough — on the order of gigabytes per expert in large-expert MoE designs — that even a rough prediction (6 of 8 correct) meaningfully reduces stalls compared to reactive loading.
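The frequency-table variant of the history predictor is almost trivial to sketch (hypothetical interface; real routers emit per-layer top-k indices, and a production predictor would track history per layer):

```python
from collections import Counter, deque

class ExpertPredictor:
    """Predicts next-step expert activations from recent router history."""
    def __init__(self, window: int = 16, top_k: int = 8):
        self.history: deque[list[int]] = deque(maxlen=window)
        self.top_k = top_k

    def observe(self, activated_experts: list[int]) -> None:
        # Record the router's actual decision for the current decode step
        self.history.append(activated_experts)

    def predict(self) -> list[int]:
        # Prefetch the experts seen most often in the recent window
        counts = Counter(e for step in self.history for e in step)
        return [e for e, _ in counts.most_common(self.top_k)]
```

With expert locality, even a 6-of-8 hit rate lets the streaming engine stage most active experts before the router runs; mispredicted experts fall back to reactive loads.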

This is why the MoE router history signal appears in the predictive weight orchestration essay — it is not just a curiosity, it is the signal that makes expert-granular weight streaming viable.

08 — Practical Architecture

What Changes in the Serving Stack

Implementing weight streaming end-to-end requires changes at several layers of the serving stack. Here is a concrete inventory:

| Layer | Current State | Weight-Streaming State |
|---|---|---|
| Weight storage | BF16 tensors in HBM, static | Quantized (INT4/FP8) in tiered store, metadata-annotated |
| Layer execution | GEMV on pre-resident weights | GEMV + inline dequantization from streaming buffer |
| DMA engine | Used for KV prefetch only | Shared arbitration: weight stream + KV prefetch |
| HBM allocation | Static model partition + dynamic KV | Fluid: HBM shared between hot weights and hot KV |
| Inference framework | Weight loading at startup | Layer-granular fetch, cancel, and re-prioritize APIs |
| MoE routing | Router executes, then experts load | Router prediction → expert prefetch → verify-and-correct |

The hardest change is the HBM allocation model. Current systems partition HBM statically: N GB for model weights, M GB for KV cache, with a hard boundary between them. Weight streaming requires a fluid partition that can shift the boundary in real time as request load changes. A serving system with 10 long-context requests should hold fewer weight layers in HBM (more KV needed) than one with 100 short-context requests (KV pressure low, keep more weights hot). The memory scheduler needs to manage this boundary.
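A sketch of the fluid-boundary computation (all policy numbers here — headroom, reservation, per-token KV cost — are illustrative assumptions, not measurements):

```python
# Fluid HBM boundary between hot weight layers and KV cache (illustrative)
def hbm_weight_budget_gb(hbm_total_gb: float,
                         context_lens: list[int],          # tokens per active request
                         kv_gb_per_token: float = 0.00033, # ~0.33 MB/token, 70B GQA reference
                         kv_headroom: float = 1.25,        # slack for growth and arrivals
                         reserved_gb: float = 6.0) -> float:
    """HBM left for hot weight layers after KV demand is provisioned."""
    kv_demand = sum(context_lens) * kv_gb_per_token * kv_headroom
    return max(0.0, hbm_total_gb - kv_demand - reserved_gb)

# 10 long-context requests squeeze hot weights far harder than 100 short ones
long_ctx = hbm_weight_budget_gb(141, [32_000] * 10)   # ~3 GB left for weights
short_ctx = hbm_weight_budget_gb(141, [2_000] * 100)  # ~52 GB left for weights
```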

09 — Where This Is Going

HBM4 Changes the Math, But Doesn't Solve the Problem

HBM4, expected in H200-successor products around 2026–2027, targets approximately 2 TB/s per stack — roughly 2× the HBM3e bandwidth. This will halve weight-read latency for the same model and batch size, moving the B=1 decode latency from ~41 ms to ~20 ms for a 70B BF16 model.

That is genuinely useful. But it does not change the fundamental arithmetic: decode will still be memory-bandwidth-bound for batch sizes below the ridge point, and the ridge point will still require 100+ concurrent tokens. The weight bandwidth problem shrinks by 2×; it does not go away. And models are getting larger. An expected 400B-parameter frontier MoE model with 80B active parameters hits the same wall at different scale.

The real unlock is the combination: HBM4 bandwidth × INT4 quantization × weight streaming with expert prediction. Together they can compress the effective weight-read time by 8–12× relative to BF16 static residency on HBM3e. That is the regime where decode latency for B=1 drops below 5 ms for a 70B-equivalent model — approaching the latency profile of a much smaller, less capable model.
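The 8–12× claim decomposes multiplicatively (a sketch; the streaming-plus-expert-prediction factor is this essay's estimate, not a measurement):

```python
# Combined effective weight-read speedup vs. BF16 static residency on HBM3e
hbm4_gain = 2.0            # HBM3e -> HBM4 bandwidth (text's estimate)
int4_gain = 4.0            # 2 bytes -> 0.5 bytes per parameter
stream_gain = (1.0, 1.5)   # residency/overlap/prediction benefit (assumed range)

bf16_hbm3e_ms = 41.8       # 140 GB / 3.35 TB/s
lo = hbm4_gain * int4_gain * stream_gain[0]   # 8x
hi = hbm4_gain * int4_gain * stream_gain[1]   # 12x
print(f"{lo:.0f}-{hi:.0f}x -> effective weight read "
      f"{bf16_hbm3e_ms / hi:.1f}-{bf16_hbm3e_ms / lo:.1f} ms")
```

The upper end of the range lands in the sub-5 ms regime the paragraph describes.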

Weights are not infrastructure. They are data. Once you treat them that way — with residency policies, prefetch schedules, and quantization as a bandwidth contract — the serving stack becomes significantly more capable at the same hardware cost.