1. The fundamental resource mismatch
Every LLM inference request has two distinct phases. The prefill phase processes the full input prompt — potentially thousands of tokens — in a single forward pass. The decode phase generates output tokens one at a time, autoregressively, until the sequence is complete.
These two phases have opposite resource requirements, and the difference is not subtle:
Prefill: compute-bound
- Processes all input tokens simultaneously — large matrix-matrix multiplies
- Arithmetic intensity: ~100–1000 FLOP/byte depending on sequence length
- Wants: high TFLOP/s, wide tensor cores, FLOP density
- Bottleneck: compute throughput, not memory bandwidth
- Duration: milliseconds to seconds (proportional to prompt length)
- GPU utilization: very high — tensors fill every core
Decode: memory-bandwidth-bound
- Generates one token at a time — matrix-vector multiplies
- Arithmetic intensity: ~1–10 FLOP/byte (reads all weights for one token)
- Wants: high HBM bandwidth, large KV cache capacity, low latency
- Bottleneck: memory bandwidth, not compute
- Duration: seconds to minutes (proportional to output length)
- GPU utilization: low — compute cores sit idle waiting on memory
When prefill and decode run on the same GPU, neither runs at its hardware-optimal point. During prefill, the GPU's HBM bandwidth sits partially idle while compute cores are busy. During decode, the GPU's tensor cores sit mostly idle while HBM bandwidth is the critical resource. This is not a software problem that can be optimized away — it is a hardware profile mismatch baked into the workload structure.
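The gap in arithmetic intensity can be made concrete with a simple model: count FLOPs and HBM traffic for one matmul, with prefill as a matrix-matrix product over the whole prompt and decode as a matrix-vector product for one token. The model dimension here is hypothetical, and the model ignores on-chip reuse, so the absolute numbers are rough; the ratio between the two is the point.

```python
def arithmetic_intensity(m: int, k: int, n: int, dtype_bytes: int = 2) -> float:
    """FLOP/byte for an (m x k) @ (k x n) matmul, counting each operand
    and the output as moved over HBM exactly once (BF16 by default)."""
    flops = 2 * m * k * n                         # each MAC is 2 FLOPs
    bytes_moved = dtype_bytes * (m * k + k * n + m * n)
    return flops / bytes_moved

hidden = 8192                                     # hypothetical model width
prefill = arithmetic_intensity(4096, hidden, hidden)   # whole prompt at once
decode = arithmetic_intensity(1, hidden, hidden)       # one token per step

print(f"prefill: {prefill:7.1f} FLOP/byte")       # compute-bound regime
print(f"decode:  {decode:7.1f} FLOP/byte")        # ~1: bandwidth-bound regime
```

With these toy dimensions, prefill lands in the thousands of FLOP/byte and decode at roughly one, which is why the same GPU cannot be at its optimal operating point for both.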
2. What disaggregation means in practice
Disaggregated inference separates the serving fleet into two pools:
- Prefill nodes — optimized for compute throughput. Could be the same H100/H200 GPUs used for training, or could be specialized ASICs with more TFLOP/s and less HBM. Each prefill node processes one or more prompts, generates the full KV cache, and then transfers the KV cache to a decode node.
- Decode nodes — optimized for memory bandwidth and KV cache capacity. These could be the same H100s, or future memory-bandwidth-optimized hardware. Each decode node receives KV caches from one or more prefill nodes and then handles the full autoregressive generation loop.
Requests are not shuttled between the two pools token by token. The prefill node does its work once, transfers the KV state, and hands off; the decode node owns the session from that point until the sequence completes.
3. KV transfer: the hard part nobody talks about enough
The compute argument for disaggregation is clean. The memory problem is messier. Once prefill and decode are on different machines, the full KV cache for every completed prefill must travel from the prefill node to the decode node before decoding can begin. This transfer is not free.
Transfer volume is large
For a 70B-class model (80 layers, grouped-query attention with 8 KV heads, head dimension 128) on a 128K-token prompt, the KV cache is roughly:
layers × 2 × num_kv_heads × head_dim × seq_len × dtype_bytes = 80 × 2 × 8 × 128 × 128,000 × 2 (BF16) ≈ 42 GB
Forty-two gigabytes must travel from the prefill node to the decode node before decoding can start. At NVLink speeds (~200 GB/s per direction between directly connected GPUs), that is about 210 milliseconds of transfer time. At PCIe bandwidth (~20–30 GB/s realistic throughput across switches), it can exceed 1.5 seconds. That delay directly adds to time-to-first-token latency.
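The arithmetic above is worth checking end to end. A minimal sketch, assuming the GQA configuration from the example (8 KV heads) and the nominal link bandwidths quoted in the text:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    # factor of 2: one K tensor and one V tensor per layer
    return layers * 2 * kv_heads * head_dim * seq_len * dtype_bytes

size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=128_000)
print(f"KV cache: {size / 1e9:.1f} GB")

# Nominal per-direction bandwidths from the text; real links deliver less.
for link, gbps in [("NVLink", 200.0), ("PCIe", 25.0)]:
    print(f"{link:6s}: {size / (gbps * 1e9):.2f} s to transfer")
```

This reproduces the ~42 GB figure, ~0.21 s over NVLink and ~1.7 s over PCIe, all of which lands directly on time-to-first-token.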
Transfer must be atomic from the decode node's perspective
When a KV cache transfer is in flight, the decode node cannot begin processing — it needs the complete KV state before it can compute attention for the first decode step. The transfer is a synchronization point. This means overlapping transfers with decode computation requires careful orchestration: the decode node can begin computing attention only for layers whose KV data has fully arrived.
Layer-by-layer pipelining can overlap transfer with early decode layers
One optimization is to pipeline the KV transfer layer by layer: send the KV cache for layer 0 first, start decode computation for layer 0 while layer 1's KV data is still in flight, and so on. This requires the decode node to begin execution with partial KV state, which complicates residency management but can reduce effective time-to-first-token by hiding transfer latency behind compute.
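A toy timeline makes the pipelining win concrete. The model below assumes decode compute for layer i can start once layer i's KV has arrived and layer i-1's compute has finished; the per-layer times are illustrative (the 42 GB example split across 80 layers at NVLink speed), not measurements.

```python
def pipelined_finish(n_layers: int, xfer_per_layer: float,
                     compute_per_layer: float) -> float:
    arrive = done = 0.0
    for _ in range(n_layers):
        arrive += xfer_per_layer        # layer KV blocks stream in order
        done = max(done, arrive) + compute_per_layer
    return done

layers, tx, tc = 80, 2.6e-3, 0.1e-3     # ~42 GB / 80 layers at ~200 GB/s
serial = layers * (tx + tc)             # transfer everything, then compute
overlap = pipelined_finish(layers, tx, tc)
print(f"serial:    {serial * 1e3:.1f} ms")
print(f"pipelined: {overlap * 1e3:.1f} ms")
```

In this regime transfer dominates, so pipelining hides nearly all of the first decode step's compute behind the transfer; when compute dominates, the same overlap hides the transfer instead.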
4. KV residency management at the decode node
Once the decode node receives a KV cache and begins serving the session, it faces the same residency management problem as any long-running decode session — but with a new constraint: the KV cache arrived from outside, was transferred in full, and cannot be regenerated locally without re-running the prefill node. That changes the eviction calculus.
In co-located serving, a KV block that is evicted from HBM can be regenerated by running prefill on the relevant context prefix — expensive but possible. In disaggregated serving, evicting a KV block from the decode node's HBM means either:
- Transfer it to host DRAM on the decode node — preserving state but adding latency on re-access.
- Transfer it back to the prefill pool — preserving state but adding round-trip transfer cost.
- Drop it and re-request regeneration from the prefill node — full prefill cost, plus another transfer.
Option 3 is catastrophically expensive. Option 2 is architecturally complex. Option 1 is the practical baseline, but it requires the decode node to have a well-managed DRAM tier specifically for in-flight session KV state — not just cold archived state.
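The relative costs can be sketched numerically. All constants here are assumptions (a ~2.6 GB block of 8K tokens, a 50 GB/s DRAM spill path, a 25 GB/s network path, and 10K prefill tokens/s), chosen only to show the ordering the text argues for:

```python
def eviction_cost_s(block_gb: float, block_tokens: int = 8192,
                    dram_gbps: float = 50.0, net_gbps: float = 25.0,
                    prefill_tok_per_s: float = 10_000.0) -> dict:
    return {
        "spill_to_dram": block_gb / dram_gbps,
        "return_to_prefill_pool": 2 * block_gb / net_gbps,    # round trip
        "drop_and_regenerate": block_tokens / prefill_tok_per_s
                               + block_gb / net_gbps,         # prefill + resend
    }

for option, cost in eviction_cost_s(2.6).items():
    print(f"{option:22s}: {cost * 1e3:6.0f} ms")
```

Under these assumptions the DRAM spill is roughly 4× cheaper than the pool round trip and 15-20× cheaper than regeneration, which is why option 1 is the practical baseline.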
5. The router and scheduler: harder than it looks
Disaggregation replaces one hard scheduling problem (how do we batch prefill and decode together on the same GPU?) with two easier problems connected by a harder coordination problem (how do we route requests across two pools with different resource profiles, connected by a transfer bottleneck?).
The routing decisions are non-trivial:
Prefill pool admission
The router must decide which prefill node to assign each incoming request. This depends on available compute capacity, expected prefill duration (proportional to prompt length), and whether the prefill node can reach a decode node with available KV capacity via a fast transfer path. The router must avoid assigning a prefill to a node whose nearest decode nodes are full.
Decode node selection and KV reservation
Before prefill starts, the router should reserve KV capacity on a decode node — because if no decode node has room for the incoming KV cache when prefill completes, the prefill node must either wait (tying up its GPU) or drop the KV (wasting all the prefill compute). Reservation before prefill avoids both failure modes.
Transfer topology awareness
NVLink between directly connected GPUs is much faster than PCIe across switch hops. The router should prefer prefill-decode pairs on the same NVLink fabric. When topology constraints force slower transfer paths, the router must account for the higher transfer latency in its TTFT estimates.
Decode load balancing across sessions
Decode nodes have different KV pressure depending on the active session mix — long contexts fill HBM faster than short ones. The router must balance incoming sessions across decode nodes to avoid HBM saturation on any single node, which would force eviction and damage latency for all sessions on that node.
Prefill pool sizing
The right ratio of prefill nodes to decode nodes depends on the workload's prompt/output length ratio and the cluster's SLA targets. A workload with very long prompts and short outputs needs more prefill capacity. A workload with short prompts and long outputs needs more decode capacity. Because the workload mix shifts over time, the scheduler must be able to re-balance the pool ratio dynamically.
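A back-of-envelope sizing rule: the pool ratio is the ratio of prefill-node-seconds to decode-node-seconds a typical request consumes. The throughput constants below are assumptions for illustration, not measurements of any system.

```python
def pool_ratio(prompt_tokens: int, output_tokens: int,
               prefill_tok_per_s: float = 50_000.0,
               decode_tok_per_s: float = 1_000.0) -> float:
    """Prefill-node-seconds needed per decode-node-second of work."""
    prefill_time = prompt_tokens / prefill_tok_per_s
    decode_time = output_tokens / decode_tok_per_s
    return prefill_time / decode_time

print(f"{pool_ratio(100_000, 500):.2f}")   # long prompts, short outputs
print(f"{pool_ratio(2_000, 2_000):.2f}")   # short prompts, long outputs
```

Under these assumptions the first workload wants roughly 4 prefill nodes per decode node, while the second wants about 1 per 50, a two-order-of-magnitude swing from workload shape alone.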
6. Chunked prefill and its interaction with disaggregation
Chunked prefill — processing the prompt in fixed-size chunks rather than all at once — is often discussed as a technique for reducing prefill latency variance in co-located serving. In disaggregated serving, it has a second purpose: it enables the KV cache transfer to begin during prefill, rather than waiting for prefill to complete.
The mechanism: as each chunk of the prompt is processed, the KV state for those tokens is complete and can be transmitted to the decode node immediately. By the time the last chunk of prefill completes, the decode node already holds most of the KV cache — only the KV state for the last chunk needs to arrive before decoding can start.
This is the most aggressive latency optimization available in disaggregated serving, but it creates coordination requirements: the decode node must implement partial-KV-cache attention, the transfer must be layer-aligned, and the prefill node must handle chunk completion signaling.
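The TTFT benefit of streaming per-chunk KV can be sketched with the same kind of toy timeline: each chunk's KV ships as soon as that chunk's prefill finishes, so only the final chunk's transfer sits on the critical path. The 2.0 s prefill and 1.6 s transfer figures are illustrative.

```python
def chunked_ttft(n_chunks: int, prefill_s: float, transfer_s: float) -> float:
    chunk_p = prefill_s / n_chunks
    chunk_x = transfer_s / n_chunks
    compute_done = xfer_done = 0.0
    for _ in range(n_chunks):
        compute_done += chunk_p                       # chunks prefill in order
        xfer_done = max(xfer_done, compute_done) + chunk_x
    return xfer_done                                  # last chunk's KV lands

prefill_s, transfer_s = 2.0, 1.6
print(f"monolithic: {prefill_s + transfer_s:.2f} s")  # prefill, then transfer
print(f"chunked x8: {chunked_ttft(8, prefill_s, transfer_s):.2f} s")
```

With eight chunks, TTFT drops from 3.6 s to 2.2 s in this model: prefill time plus one chunk's worth of transfer, exactly the "only the last chunk needs to arrive" behavior described above.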
7. Memory policy implications for disaggregated serving
Disaggregation changes the assumptions that existing KV memory policies were designed around. Most production KV cache management — including PagedAttention, confidence-gated residency, and region-based tiering — was designed for co-located serving where the KV cache is generated and consumed on the same GPU. In disaggregated serving, several of those assumptions break:
Assumption broken: regeneration is cheap
Co-located serving can evict KV and recompute. Disaggregated serving must transfer instead of recompute. Eviction penalty goes up by 10–100×.
Assumption broken: KV locality
Co-located serving has KV cache locality within one GPU's HBM. Disaggregated serving has KV state arriving from a remote node with transfer latency.
Assumption broken: admission = generation
Co-located: KV enters the cache when prefill runs on that GPU. Disaggregated: KV arrives as a transfer from a different machine at an unpredictable time.
These broken assumptions require a different memory policy framework at decode nodes. Specifically:
- More conservative eviction thresholds. Since eviction triggers transfer rather than recompute, the cost of a wrong eviction is much higher. The hysteresis band discussed in the HBM Fragmentation Guard essay should be widened for disaggregated decode nodes.
- Transfer-aware admission control. When a KV transfer is in flight but the decode node's HBM is nearly full, the node must decide whether to evict an existing session's KV (risking that session's latency) or delay the incoming transfer (adding latency to the new session). This is a new scheduling decision that co-located serving never faces.
- DRAM tier as a first-class cache, not an overflow buffer. Since eviction to DRAM is far preferred over eviction to re-transfer, decode nodes should maintain a larger and more actively managed DRAM KV tier than co-located serving would suggest.
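The transfer-aware admission decision in the middle bullet can be sketched as a cost comparison. Everything here is an assumed constant: a per-GB spill penalty charged to the resident session (spill plus later re-access) versus waiting for resident sessions to finish and free HBM naturally.

```python
def admit_inbound(inbound_gb: float, hbm_free_gb: float,
                  spill_s_per_gb: float = 0.04,
                  free_rate_gbps: float = 5.0) -> str:
    shortfall = max(0.0, inbound_gb - hbm_free_gb)
    evict_cost = shortfall * spill_s_per_gb   # resident session pays on re-access
    delay_cost = shortfall / free_rate_gbps   # new session waits for natural frees
    return "evict" if evict_cost < delay_cost else "delay"

print(admit_inbound(42, 30))                        # cheap spill path: evict
print(admit_inbound(42, 30, spill_s_per_gb=0.5))    # costly spill path: delay
```

The point of the sketch is that the decision flips with the cost model: a fast, well-managed DRAM tier makes eviction the right answer, while a slow spill path pushes the node toward delaying the inbound transfer instead.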
8. Prefix caching and disaggregation: a complicated relationship
Prefix caching — reusing KV state from a prior session's prompt when a new session shares the same prefix — is a major throughput optimization for serving systems with many requests sharing system prompts, instruction templates, or common context prefixes.
In co-located serving, prefix caching is relatively clean: if the KV for prefix X is already in HBM, skip the prefill for that prefix entirely. In disaggregated serving, the prefix caching question becomes: where is the cached prefix KV? It could be on any prefill node (if a prefill node cached it) or on any decode node (if a decode node received and retained it after a prior session ended).
This creates a routing constraint: if a new request shares a prefix with a session that was recently served by prefill node P3 and decode node D2, the most efficient path is to route the new request through the same P3-D2 pair so the prefix KV is already local. But that requires the router to maintain prefix-to-node affinity, which conflicts with pure load balancing.
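A minimal sketch of the affinity lookup, with hypothetical node names and a stubbed fallback standing in for the load balancer:

```python
# Hypothetical affinity table: known prefix -> (prefill node, decode node)
affinity = {"SYSTEM: You are a helpful assistant.": ("P3", "D2")}

def route(prompt: str, fallback=("P1", "D1")):
    # Prefer the pair that last served a matching prefix; otherwise fall
    # back to ordinary load balancing (stubbed here as a fixed pair).
    for prefix, pair in affinity.items():
        if prompt.startswith(prefix):
            return pair
    return fallback

print(route("SYSTEM: You are a helpful assistant.\nUser: hi"))  # ('P3', 'D2')
print(route("no shared prefix"))                                # ('P1', 'D1')
```

A production router would bound the table's size and age out stale entries, since affinity that outlives the cached KV on P3/D2 is worse than no affinity at all; that tension with pure load balancing is the conflict described above.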
9. What disaggregation tells us about future hardware
The architectural separation of prefill and decode in software has implications for what hardware should look like. Once the two workloads are cleanly separated, it becomes possible to build hardware optimized specifically for each:
Prefill-optimized hardware
- More FLOP/s, wider tensor cores, higher compute density
- Less HBM — only weights + current-batch activations needed
- High-bandwidth transfer interface to decode nodes
- Example direction: training-class accelerators repurposed for prefill
Decode-optimized hardware
- More HBM bandwidth and capacity per unit cost
- Fewer TFLOP/s — matrix-vector multiplies at one token per step
- Large, efficient KV cache management
- Example direction: memory-bandwidth-optimized ASICs, CXL-attached memory expansions
This is the same insight that motivated NVMe SSDs for cold storage and HBM for hot compute — different media optimized for different workload profiles, connected by explicit movement. Disaggregated inference is applying the same principle to GPU compute and memory bandwidth.
10. The systems view: what disaggregation actually requires
Disaggregated inference is a genuinely better architecture for large-scale LLM serving. The theoretical benefits are real: better hardware utilization, lower TTFT for long prompts, higher throughput per dollar, and a cleaner path to purpose-built hardware.
But implementing it correctly requires getting several memory system problems right simultaneously:
- KV transfer infrastructure that minimizes latency, handles topology heterogeneity (NVLink vs. PCIe vs. RDMA), supports compression, and enables chunked streaming delivery.
- Decode node memory management with conservative eviction policies, transfer-aware admission control, and an actively managed DRAM tier for in-flight session KV state.
- Router and scheduler that maintains prefix affinity, reserves decode capacity before prefill completes, balances decode KV pressure across nodes, and handles transfer failures gracefully.
- Prefix cache coherence across a distributed fleet of prefill and decode nodes with heterogeneous KV residency.