
Prefill-Decode Disaggregation:
Why the Next Big Inference Architecture
Splits the Job in Two

Prefill and decode have opposite resource profiles. Prefill is compute-bound — it processes the entire input context in one parallel pass of large matrix multiplies and wants high FLOP density. Decode is memory-bandwidth-bound — it generates one token per step and spends most of its time reading weights and KV cache from HBM. Running them on the same GPU imposes a hardware compromise that satisfies neither. Disaggregation separates them — and exposes a set of memory orchestration problems that will define the next generation of inference infrastructure.

By Manish KL · April 2026 · ~15 min read
Core thesis

Prefill and decode are different workloads running on the same hardware. Disaggregation fixes the hardware mismatch — but creates a new problem: KV cache transfer, residency management at disaggregated decode nodes, and the scheduler that routes requests between the two pools become the dominant engineering challenges. The compute win is straightforward. The memory system complexity is where the real work is.

~10–100×
Compute intensity ratio between prefill and decode (FLOP/byte)
~3.35 TB/s
HBM bandwidth H100 — the resource decode is bottlenecked on
50–200 GB/s
Typical NVLink / PCIe KV transfer bandwidth between disaggregated nodes
1–10 ms
KV transfer latency between prefill and decode nodes at production batch sizes

1. The fundamental resource mismatch

Every LLM inference request has two distinct phases. The prefill phase processes the full input prompt — potentially thousands of tokens — in a single forward pass. The decode phase generates output tokens one at a time, autoregressively, until the sequence is complete.

These two phases have opposite resource requirements, and the difference is not subtle:

Prefill: compute-bound

  • Processes all input tokens simultaneously — large matrix-matrix multiplies
  • Arithmetic intensity: ~100–1000 FLOP/byte depending on sequence length
  • Wants: high TFLOP/s, wide tensor cores, FLOP density
  • Bottleneck: compute throughput, not memory bandwidth
  • Duration: milliseconds to seconds (proportional to prompt length)
  • GPU utilization: very high — tensors fill every core

Decode: memory-bandwidth-bound

  • Generates one token at a time — matrix-vector multiplies
  • Arithmetic intensity: ~1–10 FLOP/byte (reads all weights for one token)
  • Wants: high HBM bandwidth, large KV cache capacity, low latency
  • Bottleneck: memory bandwidth, not compute
  • Duration: seconds to minutes (proportional to output length)
  • GPU utilization: low — compute cores sit idle waiting on memory
Prefill wants more arithmetic throughput. Decode wants more memory bandwidth. Serving both phases on one hardware design is a compromise that wastes resources in each phase.

When prefill and decode run on the same GPU, neither runs at its hardware-optimal point. During prefill, the GPU's HBM bandwidth sits partially idle while compute cores are busy. During decode, the GPU's tensor cores sit mostly idle while HBM bandwidth is the critical resource. This is not a software problem that can be optimized away — it is a hardware profile mismatch baked into the workload structure.
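The intensity gap can be made concrete with a back-of-the-envelope roofline model. A sketch, with illustrative dimensions not tied to any specific model, and ignoring KV cache reads:

```python
def gemm_intensity(batch_tokens, d, dtype_bytes=2):
    """Arithmetic intensity (FLOP/byte) of a (batch_tokens x d) @ (d x d)
    matmul: 2*batch_tokens*d*d FLOPs against weight reads plus
    input/output activation traffic, all at dtype_bytes per element."""
    flops = 2 * batch_tokens * d * d
    bytes_moved = dtype_bytes * (d * d + 2 * batch_tokens * d)
    return flops / bytes_moved

# Prefill: thousands of tokens amortize every weight read.
prefill = gemm_intensity(batch_tokens=4096, d=8192)
# Decode: one token per step -- effectively a matrix-vector multiply.
decode = gemm_intensity(batch_tokens=1, d=8192)
print(f"prefill ~{prefill:.0f} FLOP/B, decode ~{decode:.2f} FLOP/B")
```

With these (assumed) shapes, prefill lands in the thousands of FLOP/byte while decode sits near 1 FLOP/byte, which is the orders-of-magnitude intensity ratio quoted at the top of this essay.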

Fig 1 — Resource utilization: prefill vs. decode on the same GPU
[Bar chart: prefill phase runs at ~95% compute / ~40% HBM bandwidth utilization (compute is the bottleneck, bandwidth underused); decode phase runs at ~20% compute / ~95% HBM bandwidth (bandwidth is the bottleneck, compute cores mostly idle). In co-located serving, every GPU is suboptimal for both phases simultaneously.]
The resource mismatch between prefill and decode phases is fundamental. A GPU that is well-utilized during prefill (high compute, moderate HBM) is poorly utilized during decode (low compute, near-saturated HBM), and vice versa. This mismatch is the motivation for disaggregation: run each phase on hardware matched to its resource profile.

2. What disaggregation means in practice

Disaggregated inference separates the serving fleet into two pools:

  • Prefill nodes — optimized for compute throughput. Could be the same H100/H200 GPUs used for training, or could be specialized ASICs with more TFLOP/s and less HBM. Each prefill node processes one or more prompts, generates the full KV cache, and then transfers the KV cache to a decode node.
  • Decode nodes — optimized for memory bandwidth and KV cache capacity. These could be the same H100s, or future memory-bandwidth-optimized hardware. Each decode node receives KV caches from one or more prefill nodes and then handles the full autoregressive generation loop.

The request does not flow from client to prefill to decode once per token. The prefill node does its work once, transfers the KV state, and hands off. The decode node owns the session from that point until the sequence completes.

The key architectural invariant: after disaggregation, no GPU is doing both prefill and decode simultaneously. Each GPU specializes. The coupling between the two pools is exactly the KV cache transfer that happens once per request, at the transition from prefill completion to decode start.
Fig 2 — Disaggregated inference system architecture
[Diagram: client request → router/scheduler → prefill pool (GPU-P1..P3, compute-optimized) → KV transfer over NVLink/RDMA, once per request → decode pool (GPU-D1..D4, memory-bandwidth-optimized) → token stream back to the client.]
Disaggregated inference routes each request through a prefill node (which processes the prompt and generates the full KV cache) and then transfers the KV cache to a decode node (which owns the autoregressive generation loop). The transfer happens once per request. The decode node handles the session from KV receipt until sequence completion. The router matches requests to available prefill and decode capacity.

3. KV transfer: the hard part nobody talks about enough

The compute argument for disaggregation is clean. The memory problem is messier. Once prefill and decode are on different machines, the full KV cache for every completed prefill must travel from the prefill node to the decode node before decoding can begin. This transfer is not free.

Transfer volume is large

For a 70B-class model (80 layers, grouped-query attention with 8 KV heads, head dim 128) on a 128K-token prompt, the KV cache is roughly:

layers × 2 × num_kv_heads × head_dim × seq_len × dtype_bytes
= 80 × 2 × 8 × 128 × 128,000 × 2  (BF16)
≈ 42 GB
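The same arithmetic, packaged as a helper alongside transfer-time estimates. A sketch assuming a 70B-class model with GQA (8 KV heads, 80 layers, head dim 128) and the link bandwidths quoted below:

```python
def kv_cache_bytes(layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """KV cache size: one K and one V tensor per layer, per KV head, per token."""
    return layers * 2 * num_kv_heads * head_dim * seq_len * dtype_bytes

def transfer_seconds(nbytes, link_gb_per_s):
    """Time to move the KV state over a link of the given bandwidth (GB/s)."""
    return nbytes / (link_gb_per_s * 1e9)

# 70B-class model, 128K-token prompt, BF16 KV cache.
kv = kv_cache_bytes(layers=80, num_kv_heads=8, head_dim=128, seq_len=128_000)
print(f"KV cache: {kv / 1e9:.1f} GB")                           # ~41.9 GB
print(f"NVLink ~200 GB/s: {transfer_seconds(kv, 200):.2f} s")   # ~0.21 s
print(f"PCIe   ~25 GB/s:  {transfer_seconds(kv, 25):.2f} s")    # ~1.68 s
```

Swapping in a model's actual layer count, KV head count, and dtype reproduces the TTFT penalty for any prompt length and link.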

Forty-two gigabytes must travel from the prefill node to the decode node before decoding can start. At NVLink speeds (~200 GB/s per direction between directly connected GPUs), that is about 210 milliseconds of transfer time. At PCIe bandwidth (~20–30 GB/s realistic throughput across switches), it can exceed 1.5 seconds. That delay directly adds to time-to-first-token latency.

The transfer cost is not background work: the user experiences the KV transfer latency directly, as added time before the first token appears. Minimizing transfer time — through compression, pipelining, direct NVLink topology, or chunked prefill — is a first-order user experience optimization, not a back-of-the-house implementation detail.

Transfer must be atomic from the decode node's perspective

When a KV cache transfer is in flight, the decode node cannot begin processing — it needs the complete KV state before it can compute attention for the first decode step. The transfer is a synchronization point. Overlapping transfers with decode computation therefore requires careful orchestration: the decode node can begin computing attention only for the layers whose KV data has fully arrived.

Layer-by-layer pipelining can overlap transfer with early decode layers

One optimization is to pipeline the KV transfer layer by layer: send the KV cache for layer 0 first, start decode computation for layer 0 while layer 1's KV data is still in flight, and so on. This requires the decode node to begin execution with partial KV state, which complicates residency management but can reduce effective time-to-first-token by hiding transfer latency behind compute.
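The overlap can be modeled with a simple per-layer timeline. A sketch with made-up per-layer transfer and compute times (chosen to be comparable, the regime where pipelining pays off most):

```python
def ttft_monolithic(layers, t_xfer, t_compute):
    """Wait for every layer's KV to arrive, then run the first decode step."""
    return layers * t_xfer + layers * t_compute

def ttft_pipelined(layers, t_xfer, t_compute):
    """Layer i's decode compute can start once its KV has arrived AND
    layer i-1's compute has finished; track both constraints per layer."""
    arrived = finished = 0.0
    for _ in range(layers):
        arrived += t_xfer                          # this layer's KV lands
        finished = max(arrived, finished) + t_compute
    return finished

# 80 layers; 3 ms transfer and 2 ms compute per layer (illustrative).
mono = ttft_monolithic(80, t_xfer=0.003, t_compute=0.002)
pipe = ttft_pipelined(80, t_xfer=0.003, t_compute=0.002)
print(f"monolithic: {mono:.3f}s  pipelined: {pipe:.3f}s")
```

Pipelined TTFT approaches max(total transfer, total compute) plus one layer's compute, instead of their sum — consistent with hiding roughly half the smaller of the two costs when they are comparable.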

Transfer strategy · Latency impact · Complexity

  • Monolithic (wait for all layers): full transfer time added to TTFT; simplest to implement. Complexity: low (no partial-state decode required).
  • Layer-pipelined transfer: overlaps transfer with decode layers, hiding ~40–60% of transfer latency behind compute. Complexity: medium (requires partial-layer-ready decode logic).
  • Compressed + pipelined: KV quantization (FP8 or INT4) reduces transfer volume 2–4×; combines with pipelining. Complexity: high (compression/decompression on the transfer path, quality tradeoff).
  • Chunked prefill + streaming: starts sending KV chunks during prefill itself; decode begins as soon as the first chunk arrives. Complexity: highest (requires tight prefill/decode coordination and chunked attention).

4. KV residency management at the decode node

Once the decode node receives a KV cache and begins serving the session, it faces the same residency management problem as any long-running decode session — but with a new constraint: the KV cache arrived from outside, was transferred in full, and cannot be regenerated locally without re-running the prefill node. That changes the eviction calculus.

In co-located serving, a KV block that is evicted from HBM can be regenerated by running prefill on the relevant context prefix — expensive but possible. In disaggregated serving, evicting a KV block from the decode node's HBM means either:

  1. Transfer it to host DRAM on the decode node — preserving state but adding latency on re-access.
  2. Transfer it back to the prefill pool — preserving state but adding round-trip transfer cost.
  3. Drop it and re-request regeneration from the prefill node — full prefill cost, plus another transfer.

Option 3 is catastrophically expensive. Option 2 is architecturally complex. Option 1 is the practical baseline, but it requires the decode node to have a well-managed DRAM tier specifically for in-flight session KV state — not just cold archived state.

Disaggregation changes the eviction penalty: in co-located serving, KV eviction triggers a recompute. In disaggregated serving, KV eviction triggers a transfer (at best) or a full recompute-plus-transfer (at worst). The eviction penalty goes up, which means the eviction policy must be more conservative — and the decode node must have more capacity headroom than co-located serving would suggest.
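The three eviction options can be compared with a rough cost model. A sketch in which every bandwidth figure is an illustrative placeholder, not a measured number:

```python
def eviction_penalty(kv_gb, option,
                     dram_gb_per_s=50,          # HBM <-> host DRAM staging path
                     link_gb_per_s=25,          # inter-node transfer path
                     prefill_gb_per_s_equiv=5): # recompute cost as effective GB/s
    """Rough re-access penalty (seconds) for a KV block evicted from decode HBM.
    Models the three options from the text; all rates are assumptions."""
    if option == "demote_to_dram":         # option 1: local DRAM round trip
        return 2 * kv_gb / dram_gb_per_s
    if option == "return_to_prefill":      # option 2: ship back, ship again later
        return 2 * kv_gb / link_gb_per_s
    if option == "drop_and_regenerate":    # option 3: full re-prefill + re-transfer
        return kv_gb / prefill_gb_per_s_equiv + kv_gb / link_gb_per_s
    raise ValueError(option)

for opt in ("demote_to_dram", "return_to_prefill", "drop_and_regenerate"):
    print(f"{opt}: {eviction_penalty(42, opt):.2f} s")
```

Even with generous assumptions, the ordering is stable: local DRAM demotion is cheapest, round-trip transfer is several times worse, and drop-and-regenerate is worst by a wide margin — which is why option 1 is the practical baseline.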
Fig 3 — KV state lifecycle in disaggregated serving
[Diagram: KV is generated in prefill HBM, staged for transfer, and freed after the transfer ACK; it crosses NVLink/RDMA (~42 GB) into decode HBM, where it stays resident and grows each step of the autoregressive loop (generate t₁, t₂, ... until EOS or max length). Under HBM pressure the decode node evicts to local DRAM to preserve state; dropping KV entirely forces re-prefill plus re-transfer, the expensive path.]
KV state is generated on the prefill node, transferred once to the decode node, and then managed entirely by the decode node for the session's lifetime. If HBM pressure forces eviction, the decode node demotes to local DRAM — losing the regeneration-free fallback that co-located serving provides. Dropping KV state entirely forces a full re-prefill plus re-transfer, which is the most expensive failure mode in disaggregated serving.

5. The router and scheduler: harder than it looks

Disaggregation replaces one hard scheduling problem (how do we batch prefill and decode together on the same GPU?) with two easier problems connected by a harder coordination problem (how do we route requests across two pools with different resource profiles, connected by a transfer bottleneck?).

The routing decisions are non-trivial:

01
Prefill pool admission

The router must decide which prefill node to assign each incoming request. This depends on available compute capacity, expected prefill duration (proportional to prompt length), and whether the prefill node can reach a decode node with available KV capacity via a fast transfer path. The router must avoid assigning a prefill to a node whose nearest decode nodes are full.

02
Decode node selection and KV reservation

Before prefill starts, the router should reserve KV capacity on a decode node — because if no decode node has room for the incoming KV cache when prefill completes, the prefill node must either wait (tying up its GPU) or drop the KV (wasting all the prefill compute). Reservation before prefill avoids both failure modes.

03
Transfer topology awareness

NVLink between directly connected GPUs is much faster than PCIe across switch hops. The router should prefer prefill-decode pairs on the same NVLink fabric. When topology constraints force slower transfer paths, the router must account for the higher transfer latency in its TTFT estimates.

04
Decode load balancing across sessions

Decode nodes have different KV pressure depending on the active session mix — long contexts fill HBM faster than short ones. The router must balance incoming sessions across decode nodes to avoid HBM saturation on any single node, which would force eviction and damage latency for all sessions on that node.

05
Prefill pool sizing

The right ratio of prefill nodes to decode nodes depends on the workload's prompt/output length ratio and the cluster's SLA targets. A workload with very long prompts and short outputs needs more prefill capacity. A workload with short prompts and long outputs needs more decode capacity. The router must be able to operate over a dynamically re-balanced pool ratio.
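The reserve-before-prefill step (item 02) can be sketched as a simple admission check. The node structure and capacity numbers below are hypothetical, not from any real router:

```python
from dataclasses import dataclass

@dataclass
class DecodeNode:
    name: str
    hbm_free_gb: float
    reserved_gb: float = 0.0

    def try_reserve(self, kv_gb: float) -> bool:
        """Reserve KV capacity before prefill starts, so the finished
        KV cache is guaranteed a destination on completion."""
        if self.hbm_free_gb - self.reserved_gb >= kv_gb:
            self.reserved_gb += kv_gb
            return True
        return False

def admit(request_kv_gb, decode_pool):
    """Pick the decode node with the most unreserved headroom; reject the
    request up front (rather than waste prefill compute) if none fits."""
    node = max(decode_pool, key=lambda n: n.hbm_free_gb - n.reserved_gb)
    return node if node.try_reserve(request_kv_gb) else None

pool = [DecodeNode("D1", 80.0), DecodeNode("D2", 40.0)]
first = admit(42.0, pool)    # lands on D1 (most headroom)
second = admit(42.0, pool)   # D1 headroom now 38, D2 only 40: rejected
print(first.name, second)    # D1 None
```

A real router would layer topology awareness (item 03) and prefix affinity on top of this headroom check, but the invariant is the same: never start a prefill whose KV has nowhere to land.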

6. Chunked prefill and its interaction with disaggregation

Chunked prefill — processing the prompt in fixed-size chunks rather than all at once — is often discussed as a technique for reducing prefill latency variance in co-located serving. In disaggregated serving, it has a second purpose: it enables the KV cache transfer to begin during prefill, rather than waiting for prefill to complete.

The mechanism: as each chunk of the prompt is processed, the KV state for those tokens is complete and can be transmitted to the decode node immediately. By the time the last chunk of prefill completes, the decode node already holds most of the KV cache — only the KV state for the last chunk needs to arrive before decoding can start.

This is the most aggressive latency optimization available in disaggregated serving, but it creates coordination requirements: the decode node must implement partial-KV-cache attention, the transfer must be layer-aligned, and the prefill node must handle chunk completion signaling.

Chunked prefill + disaggregation = streaming KV delivery: this combination treats the KV cache as a stream that the decode node begins consuming before it is fully produced. It is the same conceptual pattern as streaming prefetch in memory hierarchies — start using data as it arrives rather than waiting for the full transfer to complete. The implementation complexity is significant but the latency benefit is large, especially for long prompts over slow transfer paths.
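The producer/consumer shape of streaming KV delivery can be sketched with generators. The KV tensors are stand-in tuples here; a real system ships device buffers over RDMA:

```python
def prefill_chunks(prompt_tokens, chunk_size):
    """Process the prompt chunk by chunk; each chunk's KV state is final
    as soon as the chunk is computed, so it can ship immediately."""
    for start in range(0, len(prompt_tokens), chunk_size):
        chunk = prompt_tokens[start:start + chunk_size]
        kv_chunk = [("kv", tok) for tok in chunk]  # stand-in for real KV tensors
        yield kv_chunk                             # streamed to the decode node

def decode_node_receive(kv_stream):
    """Accumulate KV chunks as they arrive; decoding can begin once the last
    chunk lands, instead of after a monolithic post-prefill transfer."""
    kv_cache = []
    for kv_chunk in kv_stream:
        kv_cache.extend(kv_chunk)   # partial KV state grows in place
    return kv_cache

prompt = list(range(10))
kv = decode_node_receive(prefill_chunks(prompt, chunk_size=4))
print(len(kv))   # 10 -- full KV state assembled chunk by chunk
```

The generator never materializes the full KV cache on the producer side, which is exactly the streaming-prefetch pattern the text describes: the consumer starts absorbing data before the producer has finished producing it.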

7. Memory policy implications for disaggregated serving

Disaggregation changes the assumptions that existing KV memory policies were designed around. Most production KV cache management — including PagedAttention, confidence-gated residency, and region-based tiering — was designed for co-located serving where the KV cache is generated and consumed on the same GPU. In disaggregated serving, several of those assumptions break:

Assumption broken: regeneration is cheap

Co-located serving can evict KV and recompute. Disaggregated serving must transfer instead of recompute. Eviction penalty goes up by 10–100×.

Assumption broken: KV locality

Co-located serving has KV cache locality within one GPU's HBM. Disaggregated serving has KV state arriving from a remote node with transfer latency.

Assumption broken: admission = generation

Co-located: KV enters the cache when prefill runs on that GPU. Disaggregated: KV arrives as a transfer from a different machine at an unpredictable time.

These broken assumptions require a different memory policy framework at decode nodes. Specifically:

  • More conservative eviction thresholds. Since eviction triggers transfer rather than recompute, the cost of a wrong eviction is much higher. The hysteresis band discussed in the HBM Fragmentation Guard essay should be widened for disaggregated decode nodes.
  • Transfer-aware admission control. When a KV transfer is in flight but the decode node's HBM is nearly full, the node must decide whether to evict an existing session's KV (risking that session's latency) or delay the incoming transfer (adding latency to the new session). This is a new scheduling decision that co-located serving never faces.
  • DRAM tier as a first-class cache, not an overflow buffer. Since eviction to DRAM is far preferred over eviction to re-transfer, decode nodes should maintain a larger and more actively managed DRAM KV tier than co-located serving would suggest.
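The wider-hysteresis point can be made concrete with a two-threshold eviction trigger. The threshold values are illustrative, not tuned:

```python
def should_evict(hbm_used_frac, evicting, co_located=False):
    """Hysteresis-banded eviction trigger. A disaggregated decode node gets
    a lower ceiling and a wider band than a co-located one, because eviction
    there costs a transfer rather than a cheap local recompute."""
    if co_located:
        high, low = 0.95, 0.90   # recompute is cheap: run HBM hot, narrow band
    else:
        high, low = 0.85, 0.70   # transfer is expensive: keep headroom, wide band
    if evicting:
        return hbm_used_frac > low   # keep evicting until safely below the band
    return hbm_used_frac > high      # start evicting only above the ceiling

# At 90% HBM pressure a co-located node keeps serving; a disaggregated
# decode node has already started demoting KV to its DRAM tier.
print(should_evict(0.90, evicting=False, co_located=True))    # False
print(should_evict(0.90, evicting=False, co_located=False))   # True
```

The two-state (`evicting`) flag is what makes this hysteresis rather than a single threshold: once eviction starts, it continues until utilization drops well below the trigger point, preventing oscillation around the ceiling.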

8. Prefix caching and disaggregation: a complicated relationship

Prefix caching — reusing KV state from a prior session's prompt when a new session shares the same prefix — is a major throughput optimization for serving systems with many requests sharing system prompts, instruction templates, or common context prefixes.

In co-located serving, prefix caching is relatively clean: if the KV for prefix X is already in HBM, skip the prefill for that prefix entirely. In disaggregated serving, the prefix caching question becomes: where is the cached prefix KV? It could be on any prefill node (if a prefill node cached it) or on any decode node (if a decode node received and retained it after a prior session ended).

This creates a routing constraint: if a new request shares a prefix with a session that was recently served by prefill node P3 and decode node D2, the most efficient path is to route the new request through the same P3-D2 pair so the prefix KV is already local. But that requires the router to maintain prefix-to-node affinity, which conflicts with pure load balancing.

Prefix caching in disaggregated serving is a distributed cache coherence problem. The cached KV state is spread across prefill and decode nodes, transfer paths have cost, and routing decisions must account for cache locality. This is a harder version of the same problem that distributed database sharding has been solving for decades.
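The affinity-versus-load-balance tension can be sketched as a fingerprint-keyed routing table. This is a toy illustration; a production router would also need TTLs, invalidation when nodes drop cached KV, and load-aware override of affinity:

```python
import hashlib

class PrefixAffinityRouter:
    """Route requests sharing a prompt prefix back to the node pair that
    already holds that prefix's KV state; fall back to load balancing."""

    def __init__(self):
        self.affinity = {}   # prefix fingerprint -> (prefill_node, decode_node)

    @staticmethod
    def fingerprint(prefix_tokens):
        """Content fingerprint of the shared prefix (e.g. a system prompt)."""
        return hashlib.sha256(repr(prefix_tokens).encode()).hexdigest()[:16]

    def route(self, prefix_tokens, pick_least_loaded):
        key = self.fingerprint(prefix_tokens)
        if key in self.affinity:
            return self.affinity[key]      # cache-local pair: prefix KV is warm
        pair = pick_least_loaded()         # cold prefix: pure load balancing
        self.affinity[key] = pair
        return pair

router = PrefixAffinityRouter()
pair1 = router.route([1, 2, 3], lambda: ("P3", "D2"))
pair2 = router.route([1, 2, 3], lambda: ("P1", "D4"))  # same prefix: same pair
print(pair1 == pair2)   # True
```

The second request is steered to the P3-D2 pair even though the load balancer would have chosen differently — which is precisely the coherence-versus-balance conflict the paragraph above describes.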

9. What disaggregation tells us about future hardware

The architectural separation of prefill and decode in software has implications for what hardware should look like. Once the two workloads are cleanly separated, it becomes possible to build hardware optimized specifically for each:

Prefill-optimized hardware

  • More FLOP/s, wider tensor cores, higher compute density
  • Less HBM — only weights + current-batch activations needed
  • High-bandwidth transfer interface to decode nodes
  • Example direction: training-class accelerators repurposed for prefill

Decode-optimized hardware

  • More HBM bandwidth and capacity per unit cost
  • Fewer TFLOP/s — matrix-vector multiplies at one token per step
  • Large, efficient KV cache management
  • Example direction: memory-bandwidth-optimized ASICs, CXL-attached memory expansions

This is the same insight that motivated NVMe SSDs for cold storage and HBM for hot compute — different media optimized for different workload profiles, connected by explicit movement. Disaggregated inference is applying the same principle to GPU compute and memory bandwidth.

10. The systems view: what disaggregation actually requires

Disaggregated inference is a genuinely better architecture for large-scale LLM serving. The theoretical benefits are real: better hardware utilization, lower TTFT for long prompts, higher throughput per dollar, and a cleaner path to purpose-built hardware.

But implementing it correctly requires getting several memory system problems right simultaneously:

  • KV transfer infrastructure that minimizes latency, handles topology heterogeneity (NVLink vs. PCIe vs. RDMA), supports compression, and enables chunked streaming delivery.
  • Decode node memory management with conservative eviction policies, transfer-aware admission control, and an actively managed DRAM tier for in-flight session KV state.
  • Router and scheduler that maintains prefix affinity, reserves decode capacity before prefill completes, balances decode KV pressure across nodes, and handles transfer failures gracefully.
  • Prefix cache coherence across a distributed fleet of prefill and decode nodes with heterogeneous KV residency.
The pattern this series keeps finding: every major inference architecture advance — speculative decoding, sparse attention, KV tiering, and now prefill-decode disaggregation — produces a compute optimization that converts cleanly into a memory orchestration problem. Compute wins are finite and predictable. Memory orchestration wins compound because they determine how efficiently the compute wins can be applied across a real production workload.
Connection to earlier essays: disaggregated serving depends on all the memory policy machinery this series has analyzed: confidence-gated KV residency (HBM Fragmentation Guard), explicit movement and tier orchestration (Memory Scheduler), bounce-buffer elimination (Bounce Buffers essay), RDMA-based KV transfer (RDMA in the Age of AI), and workload-aware KV routing (Seam Orchestrator). Disaggregation is not a new problem — it is the composition of all those problems at fleet scale.