AI Inference · GPU Memory · Systems Architecture

When VRAM Stops Being a Weight Warehouse

A systems primer on HBM, KV cache, PagedAttention, weight offload — and why the real end state isn't smarter offloading, it's treating GPU memory as a bounded execution working set that weights pass through, not live in.

Manish KL · April 2026 · ~18 min read · Systems Primer
[Interactive figure: H100 SXM · 80 GB HBM3 — live memory-pressure simulation. Scenarios partition the 80 GB among model weights, KV cache, activations + buffers, and free space.]

For a long time the dominant mental model for AI inference was simple: load the model weights into GPU memory, keep them there, and run inference as fast as the hardware allows. That model worked when models were smaller and GPU memory was mostly treated as a place to store the model. Modern inference systems expose a different reality.

Part 1 — What people mean by VRAM and HBM

In casual conversation, people say "VRAM" to mean "GPU memory." That's fine informally, but for systems work the distinction matters.

VRAM is the general term for any memory attached to a GPU. HBM — High Bandwidth Memory — is the specific technology used in modern AI accelerators. It's not just fast memory: it's DRAM stacked in 3D dies directly adjacent to the GPU die, connected through a wide silicon interposer rather than conventional PCB traces. That physical proximity is what makes its bandwidth possible.

  • 3.35 TB/s — HBM3 memory bandwidth on H100 SXM
  • 80 GB — HBM capacity on H100 SXM, the "constraint"
  • ~48 GB/s — PCIe Gen5 x16, the offload ceiling

HBM matters not just because it's memory, but because it can sustain the data rate that modern GPU tensor cores need. A100 and H100 tensor cores can deliver hundreds to roughly a thousand dense TFLOP/s — but only if they're continuously fed with data. If weights and activations arrive too slowly, the cores stall and that compute goes to waste. At scale, the memory system is the bottleneck.

Fig. 01 — GPU Memory Hierarchy: Bandwidth and Latency by Tier

Ordered by distance from compute (latency rising, bandwidth falling):

  • L1 / shared memory — ~19 TB/s · ~128 KB per SM (on-chip)
  • HBM3 (VRAM) — 3.35 TB/s · 80 GB
  • NVLink / NVSwitch — 900 GB/s · multi-GPU fabric
  • PCIe Gen5 / host — ~48–128 GB/s · TBs of host DRAM
The bandwidth cliff between HBM and PCIe is 70× on a modern system. Any system design that forces data across PCIe during the compute hot path pays that tax in GPU stall cycles.
When people say "the model has to fit in VRAM" — what they really mean is: the model has to fit in the GPU's highest-bandwidth memory tier, because that's the only tier that can feed the compute units without stalling them.
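The hierarchy above is easiest to feel as time. A back-of-envelope sketch, using the article's illustrative bandwidths and a ~437 MB layer (one layer of a 4-bit-quantized 70B model — an assumption, not a measurement):

```python
# How long one ~437 MB quantized layer takes to move across each
# memory tier at the bandwidths quoted above. Illustrative only.

LAYER_BYTES = 437e6  # ~one layer of a 4-bit-quantized 70B model (assumption)

TIER_BW_GBPS = {     # GB/s, from the hierarchy figure above
    "L1/shared (on-chip)": 19_000,
    "HBM3":                3_350,
    "NVLink/NVSwitch":       900,
    "PCIe Gen5 x16":          48,
}

for tier, bw in TIER_BW_GBPS.items():
    ms = LAYER_BYTES / (bw * 1e9) * 1e3  # bytes / (bytes per second) -> ms
    print(f"{tier:22s} {ms:8.3f} ms")
```

The PCIe row lands around 9 ms per layer — the ~70× cliff below HBM that the rest of the article keeps running into.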

Part 2 — What actually lives in GPU memory

During inference, GPU memory is not consumed by one thing. It's shared across several major tenants — and they compete with each other for the same scarce bytes.

Fig. 02 — VRAM Tenancy on an H100 SXM (80 GB): 70B Model at Scale

Memory pressure by scenario (GB):

  • 70B, 4K ctx, batch=1 — 35 GB weights · ~1.3 GB KV cache · ~42 GB free
  • 70B, 32K ctx, batch=2 — 35 GB weights · ~21 GB KV cache · ~22 GB free
  • 70B, 128K ctx, batch=2 — 35 GB weights + ~86 GB KV cache → OOM
  • Streaming mode, same 70B model — ~4 GB weights (active layer + prefetch in flight) · up to ~70 GB KV cache (vs. ~45 GB when weights are resident) · remainder buffers

The weight footprint is static. The KV cache is dynamic and grows with context length, batch size, and concurrency. In streaming mode, weights occupy only 4–8 GB at any moment (active layer + prefetch), freeing the rest for KV cache. Numbers are illustrative approximations for a 70B model with 4-bit weights and a BF16 KV cache.

The three tenants and their characters

  • Weights — static, large (~35 GB for a 4-bit-quantized 70B model; ~140 GB in BF16), reused every forward pass. They're the most evictable tenant: read-only, with a fully predictable, layer-by-layer access pattern.
  • KV cache — dynamic, grows with context and concurrency, directly drives serving throughput. Losing KV cache means recomputation. It's the highest-value resident per byte.
  • Activations and buffers — ephemeral per token/step, generally small but non-negotiable. Workspace for the active forward pass.
The key insight: Weights are static and predictable. KV cache is dynamic and drives revenue (throughput, concurrency, context length). Any system that prioritizes weight residency over KV cache residency has its incentives backwards.

Part 3 — Why KV cache is the crux

The KV cache is the most misunderstood concept in LLM inference. People reach for it to explain memory pressure, but rarely work through the math to understand why it dominates at scale.

The mathematics of KV cache size

For a transformer model with the following parameters:

```python
# KV cache size formula:
#   KV_bytes = seq_len × num_layers × 2 (K and V) × num_kv_heads × head_dim × bytes_per_element
#
# LLaMA-3 70B: 80 layers, 8 KV heads, head_dim = 128, BF16 KV (2 bytes)
# Per token: 80 × 2 × 8 × 128 × 2 ≈ 320 KB
#
# At seq_len = 32K, batch = 1:
#   KV = 32768 × 80 × 2 × 8 × 128 × 2 ≈ 10.7 GB per request
#
# batch = 8,  seq = 32K  → ~86 GB   ← exceeds an H100 before weights are counted
# batch = 16, seq = 128K → ~690 GB  ← multi-GPU territory
```
Fig. 03 — KV Cache Growth: LLaMA-3 70B, BF16 KV Cache, Varying Context and Batch

KV cache in GB (weights excluded — add ~35 GB of 4-bit weights for total VRAM):

Context | batch=1 | batch=4 | batch=8
4K | 1.3 | 5.4 | 10.7
8K | 2.7 | 10.7 | 21.5
16K | 5.4 | 21.5 | 43
32K | 10.7 | 43 | 86 ← exceeds the 80 GB H100 limit
64K | 21.5 | 86 | 172
128K | 43 | 172 | 344

KV cache grows linearly with both sequence length and batch size. At 32K context and batch=8 — or 128K and batch=2 — the KV cache alone exceeds the H100's 80 GB capacity, before weights are even counted. Numbers are approximate for LLaMA-3 70B with GQA (8 KV heads, head_dim 128) and BF16 KV entries.
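The formula is simple enough to make runnable. A minimal sketch, with defaults set to LLaMA-3 70B's published GQA geometry (80 layers, 8 KV heads, head_dim 128) and BF16 KV entries:

```python
# KV cache sizing for a GQA transformer. Defaults are LLaMA-3 70B's
# geometry; the 2x factor accounts for both K and V tensors.

def kv_cache_bytes(seq_len, batch=1, num_layers=80, num_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Total KV cache bytes for `batch` requests of `seq_len` tokens."""
    return batch * seq_len * num_layers * 2 * num_kv_heads * head_dim * bytes_per_elem

GB = 1e9
print(f"{kv_cache_bytes(32_768) / GB:.1f} GB")             # → 10.7 GB (one 32K request)
print(f"{kv_cache_bytes(32_768, batch=8) / GB:.1f} GB")    # → 85.9 GB (over 80 GB on its own)
print(f"{kv_cache_bytes(131_072, batch=16) / GB:.0f} GB")  # → 687 GB (multi-GPU territory)
```

Swapping `bytes_per_elem=1` models an FP8 KV cache, which halves every number — one reason KV quantization is its own active optimization front.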

This is the core tension: KV cache is unbounded and valuable. Weights are large and static. The naive approach — "fit everything in VRAM simultaneously" — simply doesn't scale to modern serving requirements.

"More free GPU memory often means more KV cache. More KV cache often means more throughput, more concurrency, longer context. In inference economics, KV cache space is revenue." — the hidden economic truth of LLM serving
§

Part 4 — What vLLM and PagedAttention actually solved

vLLM emerged from the observation that naive LLM serving wastes enormous GPU memory — not because there isn't enough, but because it's managed badly.

Before PagedAttention, serving frameworks pre-allocated a contiguous KV cache buffer for each request at its maximum possible length. If a request might reach 4K tokens, KV space for all 4K tokens was reserved in HBM — even if the request only used 300. The rest was stranded. Under high concurrency, this fragmentation meant GPUs could be memory-bound even while utilization metrics looked fine.

Fig. 04 — PagedAttention: Virtual KV Cache Blocks vs. Contiguous Pre-allocation

Naive contiguous pre-allocation: each request (Req A, Req B, Req C) reserves a max-length slab up front, leaving large unused tails — on the order of 40% of reserved space wasted.

PagedAttention virtual block mapping: a pool of fixed-size physical pages (A1, B1, C1, A2, C2, B2, A3, C3, free, free) plus a per-request virtual block table:

  Req A: vblk[0→A1, 1→A2, 2→A3]
  Req B: vblk[0→B1, 1→B2]
  Req C: vblk[0→C1, 1→C2, 2→C3]

Blocks are allocated only when actually needed, so waste approaches zero. Prefix sharing bonus: multiple requests sharing a common prompt prefix can point to the same physical blocks.
PagedAttention uses virtual block tables to map logical KV positions to physical non-contiguous memory pages — eliminating pre-allocation waste. An important bonus: requests sharing a common prompt prefix (system prompt, few-shot examples) can share physical KV blocks, multiplying effective capacity.
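The mechanics fit in a few dozen lines. A minimal sketch — a hypothetical simplification of vLLM's allocator, with refcounted prefix sharing but without copy-on-write for partially filled shared blocks (the block size of 16 tokens matches vLLM's default):

```python
# PagedAttention-style block management: fixed-size physical pages,
# a per-request virtual block table, refcounted prefix sharing.

BLOCK_TOKENS = 16  # tokens per KV block (vLLM's default granularity)

class BlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def alloc(self):
        blk = self.free.pop()
        self.refcount[blk] = 1
        return blk

    def share(self, blk):            # prefix sharing: bump the refcount
        self.refcount[blk] += 1
        return blk

    def release(self, blk):
        self.refcount[blk] -= 1
        if self.refcount[blk] == 0:  # last reference frees the page
            del self.refcount[blk]
            self.free.append(blk)

class Request:
    def __init__(self, pool, shared_prefix=()):
        self.pool = pool
        # virtual block index i -> physical block id; prefix blocks are shared
        self.table = [pool.share(b) for b in shared_prefix]
        self.tokens = len(shared_prefix) * BLOCK_TOKENS

    def append_token(self):
        if self.tokens % BLOCK_TOKENS == 0:  # current page full -> grab a new one
            self.table.append(self.pool.alloc())
        self.tokens += 1

pool = BlockPool(num_blocks=1024)
sys_prompt = Request(pool)                   # build a shared system prompt
for _ in range(48):                          # 48 tokens -> 3 physical pages
    sys_prompt.append_token()
a = Request(pool, shared_prefix=sys_prompt.table)  # both requests reuse
b = Request(pool, shared_prefix=sys_prompt.table)  # the same 3 pages
```

After this runs, three physical pages back the prefix for all three logical owners — the "multiplying effective capacity" effect from the caption, in miniature.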

PagedAttention matters here not just as a KV cache trick, but as a signal that the field was beginning to treat GPU memory as something to be orchestrated rather than simply allocated. It is the first major production-grade step away from "GPU memory as warehouse."

The conceptual shift vLLM made: GPU memory is not a fixed allocation. It's a pool of blocks to be dynamically managed according to request-level demand. The OS virtual memory analogy is intentional — and it points toward where the next steps go.

Part 5 — Weight offload: the right problem, imperfect solution

If weights consume the largest static slice of VRAM, the most obvious solution is: move them somewhere cheaper. Keep them in host RAM (CPU DRAM) and stream them to the GPU on demand. This is the intuition behind weight offload approaches used in frameworks like Hugging Face Accelerate, DeepSpeed Inference, and others.

Why it helps

Without offload

Weights stay in HBM

The ~35 GB of a 4-bit-quantized 70B model occupies HBM permanently (in BF16, the same model's ~140 GB of weights wouldn't fit on a single H100 at all). That leaves ~45 GB for KV cache — enough for moderate concurrency at moderate context length, but you hit the wall fast.

With naive offload

Weights move to host RAM

Weights live in CPU DRAM (~TBs available). GPU fetches each layer when needed. Full ~80 GB is theoretically available for KV cache. But now execution stalls waiting for layer weights to arrive over PCIe.

The real win of offload: if VRAM isn't consumed by weights, it can fund KV cache, longer context, and more concurrent requests. This is the genuine performance lever — but only if the offload is fast enough not to stall the GPU.

Why naive offload fails at scale

The fatal problem with simple offload is that it's reactive. The runtime discovers it needs layer N's weights, then issues a PCIe DMA to fetch them from host memory. While that transfer is in flight, the GPU's tensor cores sit idle. This is the worst possible outcome for an expensive accelerator.

  • 35 GB — weights of a 4-bit-quantized 70B model
  • ~0.73 s — time to stream 35 GB over PCIe Gen5 at 48 GB/s
  • 0 — useful tokens generated while the GPU waits

Part 6 — The bandwidth wall

Once you commit to moving weights off-chip, the bottleneck shifts from capacity to bandwidth and scheduling. Understanding this precisely is the key to building a system that actually works.

Fig. 05 — Transfer Time vs. Compute Time: Where the Stall Lives

Layer weight size vs. transfer time (70B model, 80 layers; compute time per layer ~0.5 ms at small batch, illustrative):

Scenario | Layer weight size | Transfer time | Compute time per layer | Overlap possible?
70B BF16, PCIe Gen5 (48 GB/s) | ~1.75 GB | ~36 ms | ~0.5 ms | No — ~70× too slow
70B INT4, PCIe Gen5 | ~437 MB | ~9 ms | ~0.5 ms | No — ~18× too slow
70B INT4, NVLink (900 GB/s) | ~437 MB | ~0.49 ms | ~0.5 ms | Yes — roughly matched
70B INT4, PCIe Gen5 + large batch | ~437 MB | ~9 ms | ~9–10 ms (prefill-heavy / very large batch) | Roughly — with careful scheduling

The insight: streaming weights works only if transfer time ≤ compute time for the preceding layer. Three levers: reduce weight size (quantization), increase compute time per layer (batching), start transfers early (prefetch scheduling).

The arithmetic of weight streaming is unforgiving. At BF16 over PCIe Gen5, per-layer transfer is roughly 70× longer than per-layer compute; INT4 quantization cuts that to ~18× — still far too slow to hide at small batch. Only INT4 plus very large batches, or an NVLink-class link, brings transfer close enough to compute parity for prefetching to work.

This is the key constraint: the transfer budget must fit inside the compute window of the preceding layer. If it doesn't, the GPU stalls. The three independent levers that bring transfer time down are: quantization (reduces bytes), batching (increases compute time per layer pass), and prefetch scheduling (starts transfers before they're urgently needed).
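The constraint and its levers reduce to one comparison. A hedged sketch of the Part-6 arithmetic — bandwidths and per-layer compute times are the article's illustrative figures, not measurements:

```python
# Can streaming layer N be hidden under layer N-1's compute window?

def can_hide_transfer(layer_bytes, link_gbps, compute_ms_per_layer):
    """Return (hidden?, transfer_ms) for one layer over one link."""
    transfer_ms = layer_bytes / (link_gbps * 1e9) * 1e3
    return transfer_ms <= compute_ms_per_layer, transfer_ms

MB = 1e6
cases = [
    ("BF16 layer, PCIe Gen5, small batch", 1750 * MB,  48, 0.5),
    ("INT4 layer, PCIe Gen5, small batch",  437 * MB,  48, 0.5),
    ("INT4 layer, PCIe Gen5, large batch",  437 * MB,  48, 9.5),
    ("INT4 layer, NVLink",                  437 * MB, 900, 0.5),
]
for name, size, bw, comp in cases:
    ok, t = can_hide_transfer(size, bw, comp)
    print(f"{name:36s} xfer {t:6.2f} ms vs compute {comp:4.1f} ms "
          f"-> {'hidden' if ok else 'STALL'}")
```

Only the last two cases come out "hidden" — pulling each lever (quantization, batching, a faster link) moves a case across that single inequality.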

The real constraint shift: once you decide to stream weights, the limiting resource is no longer memory capacity. It becomes the system's ability to orchestrate data movement fast enough — and far enough in advance — to keep compute busy. The constraint moves from GB to GB/s to scheduling quality.
§

Part 7 — Scheduled weight streaming: the stronger model

This is the architectural leap. Instead of "weights don't fit, so offload them," think: weights shouldn't be assumed to live permanently in HBM at all. Under this model, HBM is a bounded execution working set. Weights pass through it on schedule. They don't warehouse in it.

Fig. 06 — Reactive Offload vs. Scheduled Streaming: Execution Timeline

Reactive offload (stall-dominated): Compute L1 → stall, fetch L2 over PCIe → Compute L2 → stall, fetch L3 over PCIe → Compute L3 → ··· Total time is dominated by stalls; GPU utilization is low.

Scheduled streaming (compute-bound): Compute L1 while prefetching L2 → Compute L2 while prefetching L3 → Compute L3 while prefetching L4 → ··· Transfers are hidden; the GPU stays compute-bound.

Condition for success: Transfer(layer N) ≤ Compute(layer N−1) — i.e., the next layer arrives before it's needed.
In reactive offload, the GPU stalls at every layer boundary waiting for weights to arrive. In scheduled streaming, each layer's transfer is initiated during the preceding layer's computation — so the GPU never stalls. The condition for this to work: transfer time must not exceed compute time.
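The two timelines can be modeled in a few lines. A toy cost model, assuming uniform per-layer times and perfect double buffering, using the article's illustrative INT4-over-PCIe numbers:

```python
# Toy timeline model of Fig. 06. Times in ms per layer; assumes uniform
# layers and an idealized prefetcher (no scheduling overhead).

def reactive_ms(n_layers, t_xfer, t_comp):
    # fetch layer N, then compute layer N, strictly serialized
    return n_layers * (t_xfer + t_comp)

def scheduled_ms(n_layers, t_xfer, t_comp):
    # fetch layer 0 up front; then each step overlaps compute of layer i
    # with the prefetch of layer i+1, so it costs max() of the two;
    # the last layer's compute has nothing left to overlap with.
    return t_xfer + (n_layers - 1) * max(t_xfer, t_comp) + t_comp

# 80 layers, ~9.1 ms transfer (INT4 over PCIe), ~9.5 ms compute (large batch)
print(reactive_ms(80, 9.1, 9.5))   # ~1488 ms per forward pass
print(scheduled_ms(80, 9.1, 9.5))  # ~769 ms — transfer almost fully hidden
```

When compute dominates (`t_comp ≥ t_xfer`), scheduled time collapses to roughly `n_layers × t_comp` — the GPU is compute-bound, which is the whole point.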

What changes under the streaming model

  • KV cache becomes the primary VRAM resident — not weights. The hot-tier assignment flips.
  • Only a small window of weights lives in HBM — the active layer plus a few prefetched layers (~2–4 GB for 70B INT4), not 35 GB.
  • Prefetch scheduling is now a first-class system concern — the runtime must know which layer comes next and start the transfer at the right time.
  • Quantization is no longer optional — INT4 or INT8 is needed to keep the transfer budget inside the compute window.
  • Batch size is a performance lever, not just a throughput knob — larger batches mean longer compute time per layer, which gives the prefetch more time to complete.
The architectural implication: under scheduled streaming, the VRAM budget for weights shrinks from ~35 GB to ~4 GB. That ~30 GB delta goes directly to KV cache. A single H100 can then hold a full 128K-context request's KV cache (~43 GB) with room to spare, or several concurrent 32–64K requests — capacity that simply doesn't exist when weights are warehoused.

Part 8 — VRAM as a working set, not a warehouse

The conceptual shift deserves to be stated clearly, because it changes how you think about every memory decision in the system.

Old model: warehouse

Store everything, use what you need

Weights are loaded at startup and kept permanently. VRAM is sized to fit the whole model. Memory pressure comes from fitting everything at once. Optimization target: model fits → done.

New model: working set

Keep only what's needed now

VRAM holds only the active execution window: current + next layer's weights, KV cache, buffers. Everything else is staged elsewhere. Optimization target: maximize useful residency, minimize stall cycles.

Fig. 07 — Working Set Model: What Should Live in HBM and Why

HBM (80 GB) — the working set:
  • KV cache, ~60 GB — drives throughput, concurrency, and context length. Primary resident; never evicted without completing the request.
  • Active layer weights, ~2 GB (layer n).
  • Prefetch buffer, ~2 GB (layer n+1, in flight over PCIe during layer n's compute).

Host DRAM — the staging tier:
  • All model weights, ~35 GB (70B INT4) — streamed layer-by-layer into HBM.
  • Spilled KV blocks — evicted only under extreme pressure.

NVMe / fabric storage — cold weights, checkpoints.
Under the working set model, HBM is dominated by KV cache (~60 GB) rather than weights. The weight footprint shrinks to ~4 GB (active layer + prefetch buffer). Host DRAM becomes the staging tier. The critical PCIe stream must be scheduled to complete before each layer's compute window opens.
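The split converts directly into serving capacity. A sketch using the article's illustrative budget (80 GB HBM, ~4 GB weight window, a buffer allowance that is my assumption) and the corrected per-token KV cost for 70B/GQA/BF16:

```python
# From working-set split to resident context budget. Illustrative numbers.

GB = 1e9
HBM = 80 * GB
WEIGHT_WINDOW = 4 * GB   # active layer + prefetch, streaming mode
BUFFERS = 6 * GB         # activations + workspace (assumption)
KV_PER_TOKEN = 327_680   # 80 layers x 2 (K,V) x 8 KV heads x 128 x 2 bytes

kv_budget = HBM - WEIGHT_WINDOW - BUFFERS
max_tokens = int(kv_budget // KV_PER_TOKEN)
print(f"KV budget: {kv_budget / GB:.0f} GB")                  # → 70 GB
print(f"Total resident context: ~{max_tokens:,} tokens")      # → ~213,623
print(f"e.g. one 128K request plus ~{(max_tokens - 131_072) // 1024}K headroom")
```

The same arithmetic with warehoused weights (`WEIGHT_WINDOW = 35 GB`) drops the budget to ~39 GB — below a single 128K request's KV cache.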
Question | Warehouse model | Working set model
What is HBM for? | Store the whole model | Hold the active execution window
Primary HBM resident? | Weights (static, idle most of the time) | KV cache (drives throughput now)
Optimization target? | Fit the model | Maximize useful residency + overlap movement with compute
What happens when context grows? | OOM — can't serve | KV cache expands into freed weight space
Bottleneck? | Memory capacity | Bandwidth + scheduling quality
Quantization role? | Optional (accuracy tradeoff) | Required (enables transfer/compute parity)

Part 9 — Why this architectural shift matters

There's a deeper pattern here worth naming. We're watching GPU memory undergo the same conceptual evolution that CPU memory went through decades ago — from a fixed allocation ("load your program, run it") to a dynamically managed working set (virtual memory, paging, OS-level orchestration).

PagedAttention was the first production instance of this thinking applied to AI inference. Scheduled weight streaming is the next. The long arc is toward a memory runtime that treats every byte of HBM as deliberately placed — not accidentally resident.

"The question changes. Instead of asking 'How do I fit the model?' the better question is 'How do I move the model through the machine efficiently enough to maximize useful residency?'" — the shift from warehouse to working set

The second-order implication

Once memory stops being a passive container and becomes an actively orchestrated execution resource, the system has opinions about what matters. That's the same architectural moment that justified operating systems, file systems, and virtual memory managers. For AI infrastructure, it suggests the next layer of the stack is not another offload library. It's a memory runtime with explicit residency semantics — one that speaks the language of weights, KV blocks, and attention state, not pages and virtual addresses.

That runtime needs to know what an object is, how long it lives, how urgently it's needed, and what it costs to evict or recompute. Readers of this series will recognize this as precisely the problem the Memory Intent IR was designed to solve.

The through-line: VRAM-as-working-set is the systems architecture argument. Memory Intent IR is the compiler mechanism that makes it work at line rate. PagedAttention-style block management is the KV cache layer. Together they sketch an AI memory runtime that manages residency the way an OS manages virtual memory — but with object semantics instead of page semantics.

Summary in six lines

  • HBM / VRAM is limited, high-bandwidth, and feeds compute directly.
  • Weights, KV cache, and buffers compete for it. KV cache drives revenue; weights are just a pre-condition.
  • PagedAttention showed what active KV cache management buys you. The answer was: a lot.
  • Naive weight offload helps with capacity but creates a bandwidth and stall problem.
  • Scheduled streaming solves the stall problem by prefetching ahead and overlapping transfer with compute.
  • The right abstraction: VRAM is a bounded execution working set. Weights pass through it. KV cache lives in it.
© 2026 Manish KL