Most discussions of LLM inference treat memory as "GPU memory" and "not GPU memory," as if the boundary were obvious and the decision trivial. It is neither. The placement of KV state, model weights, activations, staging buffers, and control metadata across the available memory hierarchy determines your actual achievable throughput, your tail latency under load, and whether your serving stack degrades gracefully or falls off a cliff when capacity is reached.
This post goes through each category of data that lives in a serving runtime, explains what physical properties determine where it should live, and gives concrete guidance for both GPU-based and CPU-only deployments.
The single most important principle: data should live as close to the hardware that consumes it as possible, constrained by capacity. The entire art of memory placement is managing the tension between that principle and the fact that the fastest memory is also the smallest.
1 · The physical memory landscape
The bandwidth gap between layers is the central constraint. HBM to on-die SRAM is fast enough that modern attention kernels barely notice it. HBM to host RAM over PCIe is ~30× slower. Host RAM to NVMe is ~10× slower still. Every time data crosses one of these boundaries, that is real latency and throughput cost paid by your users.
| Memory tier | Bandwidth | Capacity range | Latency | Role |
|---|---|---|---|---|
| On-die SRAM | ~10s of TB/s | 10–256 MB | ~1 ns | flash-attn tiles |
| HBM3e (H200) | ~4.8 TB/s | 80–141 GB | ~100 ns | weights + KV |
| HBM3 (H100) | ~3.4 TB/s | 40–80 GB | ~100 ns | weights + KV |
| DDR5 host RAM | ~100 GB/s | 128 GB – 2 TB | ~60 ns | KV spill staging |
| PCIe 5.0 ×16 | ~64 GB/s | (interconnect) | ~2–5 µs | H→D transfer |
| NVMe SSD | ~5–14 GB/s | 1–100 TB | ~100 µs | cold archive |
2 · What belongs in HBM
HBM is the home for data that the GPU needs to access at high throughput during every decode step. The governing rule: if it is on the critical path of a forward pass, it belongs in HBM.
All transformer layer weights (QKV projections, FFN, norms, embeddings). Always in HBM for any inference that fits. For a 70B fp16 model that's ~140 GB — may require sharding across multiple GPUs.
The hot set of KV blocks for currently running sequences. The goal is to keep all active request KV state in HBM throughout decode. Every block evicted to host RAM costs a PCIe retrieval on the next decode step.
Shared prefix blocks with high reference counts. These serve hundreds of concurrent requests and should be the last things evicted from HBM, not the first.
Intermediate activations during a forward pass — relatively small (O(batch × seq × hidden)), reused within a step, then discarded. Must be in HBM but doesn't need to persist across steps.
Flash attention or equivalent requires a scratch buffer proportional to seq length and head count. Typically pre-allocated at startup as part of the HBM workspace.
Logit buffers, softmax scratch space, top-k/top-p buffers. Small relative to weights and KV, but must be GPU-local for fast per-step sampling.
How much HBM does each component use?
```python
# HBM budget breakdown — 80 GB H100, Llama-3 70B fp16
model_weights = 70e9 * 2 / 1e9    # = 140 GB ← needs 2× H100 or quantization

# With INT4 quantization (4-bit weights):
model_weights = 70e9 * 0.5 / 1e9  # = 35 GB ← fits a single H100

kv_cache = 80 - 35 - 4            # = 41 GB available for KV (~4 GB runtime overhead)

# KV capacity at 41 GB (80 layers, 8 KV heads GQA, d=128, fp16):
tokens_per_GB = 1e9 / (80 * 8 * 128 * 2 * 2)  # ≈ 3,052 tokens/GB
total_tokens = 41 * 3052                       # ≈ 125K tokens of KV

# That's ~31 concurrent 4K-token sessions,
# or ~10 concurrent 12K-token sessions.
```
Quantization is the highest-leverage HBM optimization. Halving weight precision (fp16→int8) frees roughly as much HBM as the entire KV budget on a well-utilized server — effectively doubling your token capacity without adding hardware.
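The arithmetic behind this claim can be sketched directly. A minimal calculator (the 70B geometry and the ~4 GB runtime-overhead figure come from the budget above; everything else is illustrative, not a measurement):

```python
# Sketch: how weight precision trades off against KV capacity on one 80 GB GPU.
# Assumes the Llama-3-70B geometry from the budget above; overhead is illustrative.

def kv_token_capacity(hbm_gb, weight_bytes_per_param, params=70e9,
                      runtime_overhead_gb=4,
                      kv_bytes_per_token=80 * 8 * 128 * 2 * 2):
    """Tokens of KV cache that fit after weights and runtime overhead."""
    weights_gb = params * weight_bytes_per_param / 1e9
    free_gb = hbm_gb - weights_gb - runtime_overhead_gb
    if free_gb <= 0:
        return 0  # weights alone exceed HBM: must shard or quantize
    return int(free_gb * 1e9 / kv_bytes_per_token)

print(kv_token_capacity(80, 2.0))   # fp16: weights don't fit at all
print(kv_token_capacity(80, 1.0))   # int8: ~6 GB left for KV, ~18K tokens
print(kv_token_capacity(80, 0.5))   # int4: ~41 GB left for KV, ~125K tokens
```

Every byte shaved off weight precision converts one-for-one into KV capacity, which is why quantization dominates every other HBM optimization.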
3 · What belongs in host RAM
Host RAM is large, slow relative to HBM, and accessible to both the CPU and the GPU (via DMA/PCIe). Its role in inference serving is as a capacity extension and staging ground, not as a primary compute substrate.
KV blocks that were in HBM and got evicted to make room for hotter state. Should be in pinned memory so the GPU can DMA-retrieve them without the OS swapping them out mid-transfer. Retrieval costs a PCIe round trip (~2–5 µs latency, plus transfer time at up to ~64 GB/s per direction for PCIe 5.0 ×16).
Transfer buffers for streaming KV blocks between GPU and CPU. Pinned (locked in physical RAM) so the DMA engine can address them safely. Sized to pipeline transfers without stalling the decode loop.
Prefix blocks that are valid and potentially reusable but not currently referenced. Kept in host RAM rather than HBM when HBM is under pressure. Promoted back to HBM on a cache hit if the block is referenced again.
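The promote-on-hit behavior can be sketched as a two-tier cache. This is a toy model with hypothetical block IDs and capacities; a real runtime tracks reference counts and uses smarter eviction than plain LRU:

```python
from collections import OrderedDict

class TieredPrefixCache:
    """Toy two-tier cache: HBM holds the hot set, host RAM holds demoted
    prefix blocks. A host-RAM hit promotes the block back to HBM, demoting
    the least-recently-used HBM block if HBM is at capacity."""

    def __init__(self, hbm_capacity):
        self.hbm = OrderedDict()   # block_id -> KV payload, in LRU order
        self.host = {}             # demoted but still-valid prefix blocks
        self.hbm_capacity = hbm_capacity

    def insert(self, block_id, payload):
        self.hbm[block_id] = payload
        self.hbm.move_to_end(block_id)       # mark most-recently-used
        self._evict_if_needed()

    def lookup(self, block_id):
        if block_id in self.hbm:             # HBM hit: free
            self.hbm.move_to_end(block_id)
            return self.hbm[block_id], "hbm"
        if block_id in self.host:            # host hit: promote (costs a PCIe copy)
            self.insert(block_id, self.host.pop(block_id))
            return self.hbm[block_id], "host"
        return None, "miss"                  # must recompute the prefill

    def _evict_if_needed(self):
        while len(self.hbm) > self.hbm_capacity:
            victim, payload = self.hbm.popitem(last=False)  # LRU victim -> host RAM
            self.host[victim] = payload
```

The key asymmetry: an HBM hit is free, a host hit costs one PCIe transfer, and a miss costs a full prefill recompute — which is exactly why high-refcount prefixes deserve eviction priority.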
In CPU-only serving, model weights and active KV state both live in system RAM. This is discussed in detail in section 6.
The pinned memory requirement is critical: KV spill buffers in host RAM must be pinned (mlock'd or allocated with cudaMallocHost). If the OS migrates or swaps a spill buffer's physical pages between when the GPU initiates a DMA transfer and when it completes, the transfer reads stale or garbage data. Pinning prevents this. It is not optional for any buffer that a GPU or other DMA-capable device will access directly.
4 · The placement decision matrix
Here is a systematic summary of what lives where, under what conditions:
| Data category | On-die SRAM | HBM | Host RAM | NVMe |
|---|---|---|---|---|
| Model weights (fp16) | ✗ | Primary | If CPU-only | ✗ |
| Model weights (quantized) | ✗ | Primary | If CPU-only | Cold load |
| Active KV (hot, running seqs) | ✗ (too small) | Primary | Spill only | ✗ |
| Shared prefix KV (high refcount) | ✗ | Priority hold | If evicted | ✗ |
| Evicted KV (spill) | ✗ | Evicted from | Pinned buffers | Cold sessions |
| Attention tiles (flash-attn) | Working set | Source/dest | ✗ | ✗ |
| Intermediate activations | Partial | Primary | ✗ | ✗ |
| Logit / sampler buffers | ✗ | Primary | ✗ | ✗ |
| DMA transfer buffers | ✗ | Destination | Pinned | ✗ |
| Session checkpoints | ✗ | ✗ | Hot | Cold archive |
5 · DMA paths and pinned buffers in detail
When a KV block is evicted from HBM and later needed again, the recovery path involves a DMA transfer from host RAM back to HBM. Understanding this path is essential to knowing when host spill is practical and when it is too expensive.
The transfer cost is larger than intuition suggests. With the section-2 geometry (80 layers, 8 KV heads GQA, d=128, fp16), KV is ~320 KB per token, so a 32-token block is ~10 MB. At an effective PCIe 5.0 throughput of ~32 GB/s (roughly half the ~64 GB/s peak once protocol and scheduling overhead are paid), that is ~320 µs per block. Restoring a 100-token sequence (4 blocks) costs ~1.3 ms of transfer time — noticeable in tail latency. For 4000-token sequences (125 blocks) it is ~40 ms — a severe stall.
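The arithmetic can be packaged as a small calculator. Geometry matches the section-2 budget; the effective-bandwidth figure is an assumption, not a measurement:

```python
# Sketch: cost of restoring spilled KV over PCIe. Geometry from the section-2
# budget (80 layers, 8 KV heads GQA, d=128, fp16); bandwidth is assumed.
KV_BYTES_PER_TOKEN = 80 * 8 * 128 * 2 * 2   # ~320 KB per token
BLOCK_TOKENS = 32
EFFECTIVE_PCIE_BPS = 32e9                   # assumed effective H->D throughput

def restore_cost_ms(seq_tokens):
    """Milliseconds of PCIe transfer to bring a sequence's KV back into HBM."""
    blocks = -(-seq_tokens // BLOCK_TOKENS)             # ceiling division
    bytes_moved = blocks * BLOCK_TOKENS * KV_BYTES_PER_TOKEN
    return bytes_moved / EFFECTIVE_PCIE_BPS * 1e3

print(round(restore_cost_ms(100), 2))    # ~1.3 ms: visible in tail latency
print(round(restore_cost_ms(4000), 1))   # ~41 ms: a severe stall
```

Note the whole-block granularity: restoring even one token of a block pays for the full block transfer, which is an argument for keeping blocks small when spill is frequent.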
6 · CPU-only inference: a different world
In CPU-only inference, there is no GPU memory hierarchy. Model weights and KV cache both live in system RAM, accessed through the CPU's L1→L2→L3→DRAM hierarchy. The rules change substantially.
The defining characteristic of CPU-only inference is that KV cache and model weights compete for the same DRAM bandwidth. There is no separate high-bandwidth pool for the computation-intensive weights. Every decode step reads weights and KV from the same ~100 GB/s DDR5 bus. This makes memory bandwidth — not compute — the dominant bottleneck.
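This bandwidth bound can be sketched as a roofline: every decode step streams the full weight set plus the sequence's KV from DRAM, so tokens/s is at most bus bandwidth divided by bytes touched per token. The figures below are illustrative assumptions (int8 70B weights, the section-2 KV geometry, ~100 GB/s DDR5):

```python
# Sketch: bandwidth-bound ceiling for CPU-only decode at batch size 1.
# All numbers are illustrative; real throughput is lower (compute, cache
# effects, NUMA, and imperfect overlap all take their cut).

def decode_tokens_per_sec(dram_bps, weight_bytes, kv_bytes_per_token, context_len):
    """Upper bound: each decoded token reads all weights plus the KV so far."""
    bytes_per_step = weight_bytes + kv_bytes_per_token * context_len
    return dram_bps / bytes_per_step

WEIGHTS_INT8 = 70e9                   # 70B params at 1 byte each
KV_PER_TOKEN = 80 * 8 * 128 * 2 * 2   # same GQA geometry, fp16 KV

# ~1.4 tokens/s at a 4K context: weight reads dominate, which is why
# quantizing weights is the first lever in CPU-only serving.
print(decode_tokens_per_sec(100e9, WEIGHTS_INT8, KV_PER_TOKEN, 4096))
```

The weight term dwarfs the KV term at moderate context lengths, so batching (amortizing one weight read across many sequences) and weight quantization are the two levers that actually move this ceiling.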
CPU-only placement rules differ from GPU serving
1. Pin inference threads to one NUMA node. Crossing NUMA boundaries over QPI/UPI (~50 GB/s) cuts effective bandwidth in half. The KV cache and weights assigned to a socket must be accessed by that socket's cores.
2. Use huge pages (2 MB) for KV and weight arenas. With working sets that far exceed L3, TLB pressure is the second bottleneck after raw bandwidth; a 2 MB page covers ~500× more address space per TLB entry than a 4 KB page.
3. Pre-fault all arenas at startup. Demand paging creates latency spikes on first access to new pages. `mlock()` the KV arena and weight buffers to ensure they are resident and pre-faulted before serving begins.
4. Quantize weights aggressively. INT8 weights halve the weight footprint and therefore halve the bandwidth consumed by weight reads — the single biggest reduction in memory pressure available.
5. Separate weight traffic from KV traffic where possible. On multi-socket systems, keep weights on one socket's DRAM and active KV on the other. Parallel reads from two independent memory controllers nearly double effective throughput.
7 · NUMA and memory locality
NUMA (Non-Uniform Memory Access) is the architecture of all multi-socket servers: each CPU socket has fast access to its local DRAM and slower access to remote sockets' DRAM via a high-speed interconnect (Intel QPI/UPI, AMD Infinity Fabric). The latency and bandwidth asymmetry is significant: remote access is typically 1.5–2× slower.
The common mistakes in serving stacks:
- Threads on socket 0 accessing KV allocated on socket 1's DRAM
- Memory allocated before thread pinning — falls to first-touch DRAM which may be remote
- Default OS allocator scattering KV blocks across sockets for "balance"
- GPU PCIe connected to socket 1 but transfer buffers allocated on socket 0
The mitigations:
- Use `numactl --membind=N` or `numa_alloc_onnode()` for the KV arena
- Pin decode threads to the socket local to both KV memory and (if applicable) the GPU's PCIe root
- Allocate memory from the same thread that will access it (first-touch policy)
- Verify NUMA topology with `numastat` under load — remote access > 5% is a red flag
8 · Practical placement rules
A condensed decision guide for placement under the most common configurations:
1. Model weights: always HBM for GPU serving. Always local NUMA DRAM plus quantization for CPU serving. Never NVMe during inference.
2. Active KV (running sequences): always HBM. Never tolerate active KV living in host RAM unless PCIe bandwidth is underutilized and the sequence is unlikely to need frequent attention access.
3. Shared high-refcount prefix blocks: last to evict from HBM. Assign explicit "do not evict" status if reference count exceeds a threshold (e.g. > 50 active references).
4. Evicted KV (spill): pinned host RAM, allocated with `cudaMallocHost` or equivalent. Never unpinned. Never swap-eligible via the OS.
5. DMA transfer buffers: always pinned. Pre-allocated at startup in a ring buffer sized for your expected peak spill rate.
6. CPU-only KV and weights: local NUMA node, huge pages, pre-faulted, quantized weights, separate arena per resource type to avoid false sharing.
7. NVMe: only for session archiving, long-session cold storage, or model checkpoint loading — never on the hot serving path.
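Rule 5's pre-allocated transfer ring can be sketched as follows. This is a pure-Python stand-in: a real implementation would allocate each buffer once at startup with `cudaMallocHost`, and the buffer count and size here are illustrative:

```python
class TransferRing:
    """Fixed pool of pre-allocated staging buffers for host<->device copies.

    Stand-in sketch: a real ring would hold cudaMallocHost'd pinned buffers.
    The invariant that matters is that exhaustion back-pressures the
    scheduler; the spill path never allocates (or pins) memory on the
    hot path.
    """

    def __init__(self, num_buffers, buffer_bytes):
        # Allocate everything up front, once, at startup.
        self.free = [bytearray(buffer_bytes) for _ in range(num_buffers)]

    def acquire(self):
        """Take a buffer for an in-flight transfer; None means stall or shed load."""
        return self.free.pop() if self.free else None

    def release(self, buf):
        """Return a buffer once the DMA completion event has fired."""
        self.free.append(buf)
```

Sizing the ring is a throughput question: buffer count times buffer size must cover the spill bandwidth you expect over the longest DMA completion interval, or the decode loop stalls waiting for a free slot.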
The single most common mistake: evicting KV blocks to unpinned host RAM. This appears to work in testing (small loads, no concurrent access), fails silently under concurrent load (OS migration), and blows up under memory pressure (OS swap). Pinned spill buffers are not a premature optimization — they are a correctness requirement.
The complete picture
This post closes out the foundational layer of the series. From KV as a memory system, through paged allocation and continuous batching, through eviction policy theory, and now to physical placement — the arc is complete.
The inference serving systems that get this right maintain high GPU utilization, low tail latency, and predictable behavior under load. Those that get it wrong exhibit mysterious degradation at concurrency, catastrophic tail latency during spill, and fragility under memory pressure — not because the model is slow, but because the memory system underneath it is misbehaving.
Understanding where things physically live, why they live there, and what happens when they are in the wrong place is the foundation for reasoning clearly about inference performance.