MANISH AI
Inference Infrastructure Series Part 5 of 6
Memory Architecture · Placement Theory

Host RAM vs HBM for Inference: What Really Lives Where

Device-local KV, host spill, pinned buffers, DMA paths, and the physical constraints that determine what your serving stack should put where — and why getting it wrong costs you 10× on tail latency.

HBM · DDR5 · PCIe · NVLink · DMA · pinned memory · CPU-only serving · GPU spill paths · NUMA
Quick Reference — Bandwidth
On-die SRAM · ~10+ TB/s
HBM3e (H200) · ~4.8 TB/s
HBM3 (H100) · ~3.35 TB/s
NVLink 4.0 · ~900 GB/s
DDR5 (host) · ~100–150 GB/s
PCIe 5.0 ×16 · ~128 GB/s
PCIe 4.0 ×16 · ~64 GB/s
NVMe (PCIe 5) · ~14 GB/s

Most discussions of LLM inference treat memory as "GPU memory" and "not GPU memory," as if the boundary were obvious and the decision trivial. It is neither. The placement of KV state, model weights, activations, staging buffers, and control metadata across the available memory hierarchy determines your actual achievable throughput, your tail latency under load, and whether your serving stack degrades gracefully or falls off a cliff when capacity is reached.

This post goes through each category of data that lives in a serving runtime, explains what physical properties determine where it should live, and gives concrete guidance for both GPU-based and CPU-only deployments.

The single most important principle: data should live as close to the hardware that consumes it as possible, constrained by capacity. The entire art of memory placement is managing the tension between that principle and the fact that the fastest memory is also the smallest.

1 · The physical memory landscape

Full memory hierarchy — GPU inference server
[Figure: concentric memory zones for a GPU inference server, innermost to outermost. On-die SRAM (per-SM shared memory, ~10+ TB/s: active attention tiles, flash-attn working set) → GPU L2 cache (~6–50 MB per GPU: recently used weight and KV tiles) → GPU HBM (1–4.8 TB/s, 16–192 GB: model weights, KV pool, activations, workspace) → host DDR5 over PCIe 5.0 / NVLink 4.0 (~100–150 GB/s, 128–2048 GB: KV spill, staging, CPU model runtime) → NVMe SSD (~4–14 GB/s, TBs: deep spill, checkpoints, cold KV archive). Each layer is ~10–30× slower than the one inside it; crossing layer boundaries is the dominant cost driver.]

The bandwidth gap between layers is the central constraint. HBM to on-die SRAM is fast enough that modern attention kernels barely notice it. HBM to host RAM over PCIe is ~30× slower. Host RAM to NVMe is ~10× slower still. Every time data crosses one of these boundaries, that is real latency and throughput cost paid by your users.

Memory tier       Bandwidth     Capacity range   Latency    Role
On-die SRAM       ~10+ TB/s     10–256 MB        ~1 ns      flash-attn tiles
HBM3e (H200)      ~4.8 TB/s     80–141 GB        ~100 ns    weights + KV
HBM3 (H100)       ~3.35 TB/s    40–80 GB         ~100 ns    weights + KV
DDR5 host RAM     ~100 GB/s     128 GB – 2 TB    ~60 ns     KV spill, staging
PCIe 5.0 ×16      ~128 GB/s     interconnect     ~2–5 µs    H→D transfer
NVMe SSD          ~14 GB/s      1–100 TB         ~100 µs    cold archive
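To make the gaps concrete, here is a back-of-envelope sketch of the time to move a 10 GB working set at each tier's nominal bandwidth from the table. Real transfers add latency and protocol overhead, so treat these as lower bounds:

# Time to move a 10 GB working set at nominal bandwidth (lower bounds)
working_set_gb = 10
tiers_gbps = {
    "HBM3 (H100)":  3350,   # GB/s
    "PCIe 5.0 x16":  128,
    "DDR5 host RAM": 100,
    "NVMe (PCIe 5)":  14,
}
for name, bw in tiers_gbps.items():
    print(f"{name:14s} {working_set_gb / bw * 1e3:7.0f} ms")
# HBM3 (H100)        3 ms
# PCIe 5.0 x16      78 ms
# DDR5 host RAM    100 ms
# NVMe (PCIe 5)    714 ms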

2 · What belongs in HBM

HBM is the home for data that the GPU needs to access at high throughput during every decode step. The governing rule: if it is on the critical path of a forward pass, it belongs in HBM.

Model weights

All transformer layer weights (QKV projections, FFN, norms, embeddings). Always in HBM whenever they fit. For a 70B fp16 model that's ~140 GB, which may require sharding across multiple GPUs.

Active KV cache

The hot set of KV blocks for currently running sequences. The goal is to keep all active request KV state in HBM throughout decode. Every block evicted to host RAM costs a PCIe retrieval on the next decode step.
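A scheduler enforces this with an admission check before starting a request. A minimal sketch, with hypothetical names, assuming 32-token blocks:

import math

def can_admit(prompt_tokens: int, max_new_tokens: int,
              free_hbm_blocks: int, block_tokens: int = 32) -> bool:
    # Admit only if the request's worst-case KV footprint fits in the
    # HBM block pool; otherwise it will force evictions mid-decode.
    blocks_needed = math.ceil((prompt_tokens + max_new_tokens) / block_tokens)
    return blocks_needed <= free_hbm_blocks

Admitting on prompt length alone raises utilization, but it reintroduces exactly the mid-decode eviction this rule exists to prevent.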

Hot prefix blocks

Shared prefix blocks with high reference counts. These serve hundreds of concurrent requests and should be the last things evicted from HBM, not the first.

Activation workspace

Intermediate activations during a forward pass — relatively small (O(batch × seq × hidden)), reused within a step, then discarded. Must be in HBM but doesn't need to persist across steps.

Attention workspace

Flash attention or equivalent requires a scratch buffer proportional to seq length and head count. Typically pre-allocated at startup as part of the HBM workspace.

Sampler state

Logit buffers, softmax scratch space, top-k/top-p buffers. Small relative to weights and KV, but must be GPU-local for fast per-step sampling.

How much HBM does each component use?

# HBM budget breakdown — 80 GB H100, Llama-3 70B fp16
model_weights  = 70e9 * 2 / 1e9   # = 140 GB  ← needs 2× H100 or quantization

# With INT4 quantization (4-bit weights):
model_weights  = 70e9 * 0.5 / 1e9 # = 35 GB   ← fits a single H100
kv_cache       = 80 - 35 - 4      # = ~41 GB for KV (4 GB reserved for
                                  #   activations + attention workspace)

# KV capacity at 41 GB (80 layers, 8 KV heads GQA, d=128, fp16):
tokens_per_GB  = 1e9 / (80 * 8 * 128 * 2 * 2)   # 2 for K+V, 2 bytes fp16
                                                # = ~3,052 tokens/GB
total_tokens   = 41 * 3052                      # = ~125K tokens of KV
# That's ~31 concurrent 4K-token sessions
# Or ~10 concurrent 12K-token sessions

Quantization is the highest-leverage HBM optimization. Halving weight precision (fp16→int8) on a 70B model frees ~70 GB of HBM, more than the entire 41 GB KV budget in the example above, and every freed gigabyte buys roughly 3,000 more resident KV tokens without adding hardware.

3 · What belongs in host RAM

Host RAM is large, slow relative to HBM, and accessible to both the CPU and the GPU (via DMA/PCIe). Its role in inference serving is as a capacity extension and staging ground, not as a primary compute substrate.

Evicted KV blocks (spill)

KV blocks that were in HBM and got evicted to make room for hotter state. Should be in pinned memory so the GPU can DMA-retrieve them without the OS swapping them out mid-transfer. Retrieval costs a PCIe round-trip (~2–5 µs latency, plus transfer time at ~100 GB/s).

Host-side staging buffers

Transfer buffers for streaming KV blocks between GPU and CPU. Pinned (locked in physical RAM) so the DMA engine can address them safely. Sized to pipeline transfers without stalling the decode loop.

Cold prefix block store

Prefix blocks that are valid and potentially reusable but not currently referenced. Kept in host RAM rather than HBM when HBM is under pressure. Promoted back to HBM on a cache hit if the block is referenced again.
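One way to structure the demote/promote path, as a sketch rather than any particular engine's API (copy_to_pinned_host and copy_to_device are hypothetical helpers):

class PrefixBlockStore:
    # Two tiers: hot prefix blocks in HBM, cold ones in pinned host RAM.
    def __init__(self):
        self.hbm = {}    # prefix_hash -> device-resident block
        self.host = {}   # prefix_hash -> pinned host copy

    def demote(self, h):
        # Under HBM pressure: device -> pinned host, free the HBM block.
        self.host[h] = copy_to_pinned_host(self.hbm.pop(h))  # hypothetical

    def lookup(self, h):
        if h in self.hbm:                    # hot hit: free
            return self.hbm[h]
        if h in self.host:                   # cold hit: one PCIe restore
            self.hbm[h] = copy_to_device(self.host.pop(h))   # hypothetical
            return self.hbm[h]
        return None                          # miss: recompute the prefix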

CPU model runtime (CPU-only)

In CPU-only serving, model weights and active KV state both live in system RAM. This is discussed in detail in section 6.

The pinned memory requirement is critical: KV spill buffers in host RAM must be pinned (mlock'd or allocated with cudaMallocHost). If the OS migrates or swaps a spill buffer's physical pages between when the GPU initiates a DMA transfer and when it completes, the transfer reads stale or garbage data. Pinning prevents this. It is not optional for any buffer that a GPU or other DMA-capable device will access directly.
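In PyTorch terms, a minimal sketch of the pinned spill/restore buffers, assuming a PyTorch-based runtime and the 70B GQA layout from section 2:

import torch

# One 32-token KV block: layers × (K,V) × kv_heads × tokens × head_dim, fp16.
BLOCK_SHAPE = (80, 2, 8, 32, 128)

# Page-locked host buffer: the OS can neither swap nor migrate these pages,
# so the GPU's DMA engine can address them safely for the whole transfer.
staging = torch.empty(BLOCK_SHAPE, dtype=torch.float16, pin_memory=True)

def spill(dev_block: torch.Tensor) -> torch.Tensor:
    # Device -> pinned host. non_blocking=True overlaps the copy with
    # compute, and is only truly async because the destination is pinned.
    staging.copy_(dev_block, non_blocking=True)
    torch.cuda.synchronize()   # real engines use an event, not a full sync
    return staging

def restore(host_block: torch.Tensor) -> torch.Tensor:
    # Pinned host -> device. With a pageable source this silently degrades
    # to a staged, synchronous copy.
    return host_block.to("cuda", non_blocking=True)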

4 · The placement decision matrix

Here is a systematic summary of what lives where, under what conditions:

Data category                     On-die SRAM     HBM            Host RAM        NVMe
Model weights (fp16)              –               Primary        If CPU-only     –
Model weights (quantized)         –               Primary        If CPU-only     Cold load
Active KV (hot, running seqs)     ✗ (too small)   Primary        Spill only      –
Shared prefix KV (high refcount)  –               Priority hold  If evicted      –
Evicted KV (spill)                –               Evicted from   Pinned buffers  Cold sessions
Attention tiles (flash-attn)      Working set     Source/dest    –               –
Intermediate activations          Partial         Primary        –               –
Logit / sampler buffers           –               Primary        –               –
DMA transfer buffers              –               Destination    Pinned          –
Session checkpoints               –               –               Hot             Cold archive

5 · DMA paths and pinned buffers in detail

When a KV block is evicted from HBM and later needed again, the recovery path involves a DMA transfer from host RAM back to HBM. Understanding this path is essential to knowing when host spill is practical and when it is too expensive.

KV spill and recovery — DMA pipeline
[Figure: timeline of the spill/restore path. Eviction decision (~1 µs) → DMA from HBM to pinned host RAM → block resident in host RAM at a stable pinned address → cache hit (~1–2 µs) → DMA restore from pinned host back to HBM, during which the sequence stalls. Callout: if the host buffer is not pinned, the OS may migrate the physical page between DMA initiation and completion, producing a corrupt or invalid transfer.]

At the running configuration (80 layers, 8 KV heads, d=128, fp16), a 32-token KV block is ~10 MB, so restoring it over PCIe 5.0 at ~100 GB/s effective takes ~100 µs. Restoring a 128-token sequence requires 4 blocks, roughly 400 µs of transfer time, plus transfer latency and scheduling overhead. For 1,000-token sequences the cost is ~3 ms, noticeable in tail latency. For 4,000-token sequences it is ~13 ms, a severe stall.
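The same arithmetic as code, using values from the sections above:

# Restore cost for spilled KV over PCIe 5.0 at ~100 GB/s effective
bytes_per_token = 80 * 8 * 128 * 2 * 2           # = 327,680 B ≈ 320 KB
block_bytes     = 32 * bytes_per_token           # 32-token block ≈ 10 MB

block_restore_us = block_bytes / 100e9 * 1e6     # ≈ 105 µs per block
ms_per_1k_tokens = 1000 * bytes_per_token / 100e9 * 1e3   # ≈ 3.3 ms
ms_per_4k_tokens = 4000 * bytes_per_token / 100e9 * 1e3   # ≈ 13.1 ms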

6 · CPU-only inference: a different world

In CPU-only inference, there is no GPU memory hierarchy. Model weights and KV cache both live in system RAM, accessed through the CPU's L1→L2→L3→DRAM hierarchy. The rules change substantially.

CPU-only inference — memory architecture
[Figure: two-socket NUMA layout. NUMA node 0 (cores 0–23, shared 64 MB L3, local DDR5 ~100 GB/s · 256 GB: KV cache, model weights, activations, workspace) and NUMA node 1 (cores 24–47, shared 64 MB L3, local DDR5 ~100 GB/s · 256 GB: KV cache, model weights), linked by QPI at ~50 GB/s. CPU-only takeaway: KV and weights compete for the same DRAM bandwidth, so NUMA placement is critical.]

The defining characteristic of CPU-only inference is that KV cache and model weights compete for the same DRAM bandwidth. There is no separate high-bandwidth pool for the computation-intensive weights. Every decode step reads weights and KV from the same ~100 GB/s DDR5 bus. This makes memory bandwidth — not compute — the dominant bottleneck.
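That observation yields a rough upper bound on decode rate. A sketch, assuming every weight and KV byte is read from DRAM once per step (caching improves this only marginally at these sizes):

# Bandwidth-bound decode rate, one socket of DDR5 at ~100 GB/s
weights_bytes      = 70e9 * 0.5                  # 70B at INT4
kv_bytes_per_token = 80 * 8 * 128 * 2 * 2        # same layout as section 2
context            = 4096

bytes_per_step = weights_bytes + context * kv_bytes_per_token
tokens_per_s   = 100e9 / bytes_per_step          # ≈ 2.8 tok/s at batch 1
# Batching amortizes the weight read across requests, which is why
# CPU-only serving favors large batches despite the latency cost.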

CPU-only placement rules therefore differ from GPU serving: there is no spill tier to manage, but bandwidth budgeting and NUMA locality, covered next, take its place.

7 · NUMA and memory locality

NUMA (Non-Uniform Memory Access) is the architecture of all multi-socket servers: each CPU socket has fast access to its local DRAM and slower access to remote sockets' DRAM via a high-speed interconnect (Intel QPI/UPI, AMD Infinity Fabric). The latency and bandwidth asymmetry is significant: remote access is typically 1.5–2× slower.

NUMA mistakes (common)
  • Threads on socket 0 accessing KV allocated on socket 1's DRAM
  • Memory allocated before thread pinning — falls to first-touch DRAM which may be remote
  • Default OS allocator scattering KV blocks across sockets for "balance"
  • GPU PCIe connected to socket 1 but transfer buffers allocated on socket 0
NUMA best practice
  • Use numactl --membind=N or numa_alloc_onnode() for KV arena
  • Pin decode threads to the socket local to both KV memory and (if applicable) GPU PCIe root
  • Allocate memory from the same thread that will access it (first-touch policy)
  • Verify NUMA topology with numastat under load — remote access > 5% is a red flag
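A minimal Linux sketch of the pin-then-first-touch pattern, with core ranges following the two-socket diagram above (numactl or libnuma give stronger guarantees in production):

import os
import numpy as np

# Pin this process to socket 0's cores (0–23 in the layout above)
# BEFORE allocating anything the decode loop will touch.
os.sched_setaffinity(0, set(range(0, 24)))

# Under the kernel's first-touch policy, pages land on the NUMA node of
# the thread that first WRITES them, so allocate and touch the KV arena
# from the pinned process. np.empty alone does not touch the pages.
kv_arena = np.empty(1 * 2**30, dtype=np.uint8)   # 1 GB arena (illustrative)
kv_arena[:] = 0                                  # first touch -> local DRAM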

8 · Practical placement rules

A condensed decision guide for placement under the most common configurations:

  • Weights: always HBM for GPU serving; quantize until they fit, shard only when they cannot. Host RAM holds weights only in CPU-only deployments.
  • Active KV: HBM, sized by admission control so running sequences never spill mid-decode.
  • Shared prefixes: hold in HBM while referenced; demote to host RAM under pressure, promote back on a hit.
  • Spill: pinned host RAM only, with restore cost (~100 µs per 32-token block in the running configuration) budgeted into tail latency.
  • CPU-only: bind KV, weights, and decode threads to one NUMA node; treat DRAM bandwidth as the scarce resource.
  • NVMe: cold sessions and checkpoints only, never anything on the decode path.

The single most common mistake: evicting KV blocks to unpinned host RAM. This appears to work in testing (small loads, no concurrent access), fails silently under concurrent load (OS migration), and blows up under memory pressure (OS swap). Pinned spill buffers are not a premature optimization — they are a correctness requirement.

The complete picture

This post closes out the foundational layer of the series. From KV as a memory system, through paged allocation and continuous batching, through eviction policy theory, and now to physical placement — the arc is complete.

The inference serving systems that get this right maintain high GPU utilization, low tail latency, and predictable behavior under load. Those that get it wrong exhibit mysterious degradation at concurrency, catastrophic tail latency during spill, and fragility under memory pressure — not because the model is slow, but because the memory system underneath it is misbehaving.

Understanding where things physically live, why they live there, and what happens when they are in the wrong place is the foundation for reasoning clearly about inference performance.

← Previous · Part 4
KV Cache Eviction Is Becoming the New OS Scheduler
Coming · Part 6 →
Disaggregated Prefill and Decode: Splitting the Inference Pipeline