For a long time the dominant mental model for AI inference was simple: load the model weights into GPU memory, keep them there, and run inference as fast as the hardware allows. That model worked when models were smaller and GPU memory was mostly treated as a place to store the model. Modern inference systems expose a different reality.
Part 1 — What people mean by VRAM and HBM
In casual conversation, people say "VRAM" to mean "GPU memory." That's fine informally, but for systems work the distinction matters.
VRAM is the general term for any memory attached to a GPU. HBM — High Bandwidth Memory — is the specific technology used in modern AI accelerators. It's not just fast memory: it's DRAM dies stacked vertically and placed directly adjacent to the GPU die, connected through a wide silicon interposer rather than conventional PCB traces. That physical proximity, and the enormously wide interface it enables, is what produces the bandwidth.
HBM matters not just because it's memory, but because it can sustain the data rate that modern GPU tensor cores need. A100 and H100 tensor cores can perform hundreds of dense BF16 TFLOP/s (roughly 312 on an A100, close to 1,000 on an H100), but only if they're continuously fed with data. If weights and activations arrive too slowly, cores stall and that compute goes to waste. At scale, the memory system is the bottleneck.
Part 2 — What actually lives in GPU memory
During inference, GPU memory is not consumed by one thing. It's shared across several major tenants — and they compete with each other for the same scarce bytes.
The three tenants and their characters
- Weights — static, large (~35 GB for a 70B model in INT4, ~140 GB in BF16), reused every forward pass. They're the most evictable: read-only, with a fully predictable layer-by-layer access pattern.
- KV cache — dynamic, grows with context and concurrency, directly drives serving throughput. Losing KV cache means recomputation. It's the highest-value resident per byte.
- Activations and buffers — ephemeral per token/step, generally small but non-negotiable. Workspace for the active forward pass.
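A back-of-envelope budget for a single 80 GB GPU makes the competition concrete; every figure below is an illustrative assumption, not a measurement:

```python
# Back-of-envelope HBM budget for one 80 GB GPU serving a 70B model.
HBM_TOTAL_GB = 80.0
weights_gb = 35.0      # 70B parameters at ~4 bits each (INT4)
buffers_gb = 2.0       # activations + workspace for the active pass (assumed)

kv_budget_gb = HBM_TOTAL_GB - weights_gb - buffers_gb
print(f"left for KV cache: {kv_budget_gb:.0f} GB")

# Assuming ~0.31 MB of KV cache per token (70B-class model with GQA, BF16),
# a 4K-token sequence holds ~1.27 GB, so roughly 33 such requests fit at once.
kv_per_4k_seq_gb = 0.31e-3 * 4096
print(f"concurrent 4K-token requests: {int(kv_budget_gb / kv_per_4k_seq_gb)}")
```

The point is not the exact numbers but the shape: the static tenant takes its cut first, and everything that actually scales with load fights over the remainder.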
Part 3 — Why KV cache is the crux
The KV cache is the most misunderstood concept in LLM inference. People reach for it to explain memory pressure, but rarely work through the math to understand why it dominates at scale.
The mathematics of KV cache size
For a transformer with n_layers layers, n_kv_heads key/value heads (fewer than the query heads under grouped-query attention), head dimension d_head, and b bytes per element, every token stores one key vector and one value vector in every layer:

KV cache bytes per token = 2 × n_layers × n_kv_heads × d_head × b

The cache therefore grows linearly with context length and linearly with concurrency, and unlike weights it cannot be shared across requests.
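A sketch of that arithmetic in code, using Llama-2-70B-class numbers as assumptions (80 layers, 8 KV heads under GQA, head dimension 128, BF16):

```python
def kv_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_elem):
    """KV cache cost of one token: a K and a V vector for every layer."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_elem

# Llama-2-70B-class assumptions: 80 layers, 8 KV heads (GQA), d_head 128, BF16.
per_token = kv_bytes_per_token(80, 8, 128, 2)
print(per_token / 2**20)               # 0.3125 MiB per token
print(per_token * 4096 / 2**30)        # 1.25 GiB per 4K-token sequence
print(per_token * 4096 * 64 / 2**30)   # 80.0 GiB for 64 concurrent 4K requests
```

Sixty-four concurrent 4K-token sequences consume an entire 80 GB GPU with KV cache alone, before a single weight byte is placed. That is why the cache, not the weights, dominates at scale.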
This is the core tension: KV cache is unbounded and valuable. Weights are large and static. The naive approach — "fit everything in VRAM simultaneously" — simply doesn't scale to modern serving requirements.
Part 4 — What vLLM and PagedAttention actually solved
vLLM emerged from the observation that naive LLM serving wastes enormous GPU memory — not because there isn't enough, but because it's managed badly.
Before PagedAttention, serving frameworks pre-allocated a contiguous KV cache buffer for each request at its maximum possible length. If a request might reach 4K tokens, 4K tokens of HBM were reserved — even if the request only used 300. The rest was stranded. Under high concurrency, this fragmentation meant GPUs could be memory-bound even when utilization metrics looked fine.
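The core idea can be caricatured in a few lines of Python. This is a toy sketch in the spirit of PagedAttention, not vLLM's actual block manager: KV storage is carved into fixed-size blocks that are mapped to a sequence only as it actually grows.

```python
class ToyPagedKV:
    """Toy page-table allocator: blocks are mapped only as a sequence grows."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free physical block ids
        self.tables = {}                      # seq_id -> list of block ids

    def append_token(self, seq_id, n_tokens_so_far):
        table = self.tables.setdefault(seq_id, [])
        if n_tokens_so_far % self.block_size == 0:   # current block is full
            table.append(self.free.pop())            # map one more block
        return table

    def release(self, seq_id):
        self.free.extend(self.tables.pop(seq_id, []))

kv = ToyPagedKV(num_blocks=256)
for t in range(300):                  # a request that stops at 300 tokens
    kv.append_token("req-1", t)
print(len(kv.tables["req-1"]))        # 19 blocks = 304 tokens of capacity
```

The request holds 19 blocks instead of a 4K-token contiguous reservation; everything it didn't use stays in the free list for other requests. That recovered slack is where vLLM's throughput gains come from.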
PagedAttention matters here not just as a KV cache trick, but as a signal that the field was beginning to treat GPU memory as something to be orchestrated rather than simply allocated. It is the first major production-grade step away from "GPU memory as warehouse."
Part 5 — Weight offload: the right problem, imperfect solution
If weights consume the largest static slice of VRAM, the most obvious solution is: move them somewhere cheaper. Keep them in host RAM (CPU DRAM) and stream them to the GPU on demand. This is the intuition behind weight offload approaches used in frameworks like Hugging Face Accelerate, DeepSpeed Inference, and others.
Why it helps
Weights stay in HBM
The full ~35 GB of a 70B model in INT4 occupies HBM permanently. On an 80 GB A100 or H100, that leaves ~45 GB for KV cache: enough for moderate concurrency at moderate context length, but you hit the wall fast.
Weights move to host RAM
Weights live in CPU DRAM, where terabytes are available. The GPU fetches each layer's weights on demand, so nearly the full 80 GB of HBM is theoretically available for KV cache. But now execution stalls waiting for layer weights to arrive over PCIe.
Why naive offload fails at scale
The fatal problem with simple offload is that it's reactive. The runtime discovers it needs layer N's weights, then issues a PCIe DMA to fetch them from host memory. While that transfer is in flight, the GPU's tensor cores sit idle. This is the worst possible outcome for an expensive accelerator.
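A simulated timeline shows what reactive fetching costs. All numbers are illustrative assumptions: 80 layers of a 70B model in BF16 (~1.75 GB each), PCIe Gen4 x16 at ~32 GB/s, and 5 ms of compute per layer at small batch size.

```python
# Reactive offload: fetch layer N's weights only when the GPU asks for them,
# so transfer and compute happen strictly one after the other.
LAYER_GB, PCIE_GBPS, COMPUTE_MS, N_LAYERS = 1.75, 32.0, 5.0, 80

transfer_ms = LAYER_GB / PCIE_GBPS * 1000            # ~55 ms to fetch one layer
reactive_ms = N_LAYERS * (transfer_ms + COMPUTE_MS)  # serial fetch-then-compute
idle_frac = transfer_ms / (transfer_ms + COMPUTE_MS)

print(f"forward pass: {reactive_ms / 1000:.1f} s")   # ~4.8 s per pass
print(f"GPU idle {idle_frac:.0%} of the time")       # ~92% idle
```

Under these assumptions the accelerator spends over 90% of the forward pass waiting on PCIe. The hardware isn't too slow; the schedule is.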
Part 6 — The bandwidth wall
Once you commit to moving weights off-chip, the bottleneck shifts from capacity to bandwidth and scheduling. Understanding this precisely is the key to building a system that actually works.
The numbers are stark: PCIe Gen4 x16 moves roughly 32 GB/s (Gen5 about doubles that), while H100 HBM3 delivers over 3 TB/s, a gap of roughly two orders of magnitude. The key constraint follows: the time to transfer layer N+1's weights must fit inside the compute window of layer N. If it doesn't, the GPU stalls. Three independent levers bring transfer time down relative to compute time: quantization (fewer bytes to move), batching (more compute time per layer pass), and prefetch scheduling (transfers start before they're urgently needed).
Part 7 — Scheduled weight streaming: the stronger model
This is the architectural leap. Instead of "weights don't fit, so offload them," think: weights shouldn't be assumed to live permanently in HBM at all. Under this model, HBM is a bounded execution working set. Weights pass through it on schedule. They don't warehouse in it.
What changes under the streaming model
- KV cache becomes the primary VRAM resident — not weights. The hot-tier assignment flips.
- Only the active layer window lives in HBM — typically 1–2 layers (~0.5–1 GB for a 70B model in INT4), not the full ~35 GB.
- Prefetch scheduling is now a first-class system concern — the runtime must know which layer comes next and start the transfer at the right time.
- Quantization is no longer optional — INT4 or INT8 is needed to keep the transfer budget inside the compute window.
- Batch size is a performance lever, not just a throughput knob — larger batches mean longer compute time per layer, which gives the prefetch more time to complete.
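The resulting execution loop is a classic double buffer: while layer N computes, layer N+1 streams in. Below is a schematic sketch in plain Python; a real implementation would use pinned host memory and a dedicated CUDA copy stream, not threads and a queue.

```python
import queue
import threading

def stream_forward(layers, load_layer, run_layer):
    """Double-buffered layer streaming: prefetch layer i+1 while layer i computes.
    load_layer(i) stands in for a host->HBM copy; run_layer(i, w) for the math."""
    prefetched = queue.Queue(maxsize=1)          # the "next layer" slot in HBM

    def prefetcher():
        for i in range(len(layers)):
            prefetched.put((i, load_layer(i)))   # blocks until the slot frees up

    threading.Thread(target=prefetcher, daemon=True).start()
    for _ in layers:
        i, weights = prefetched.get()            # ready now if transfer beat compute
        run_layer(i, weights)                    # the slot is recycled after this

# Toy usage: "loading" just returns the layer index; "running" records the order.
order = []
stream_forward(list(range(4)), load_layer=lambda i: i,
               run_layer=lambda i, w: order.append((i, w)))
print(order)   # [(0, 0), (1, 1), (2, 2), (3, 3)]
```

The `maxsize=1` queue is the whole architecture in one line: HBM holds one in-flight layer beyond the one computing, and the producer is forced to pace itself against the consumer.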
Part 8 — VRAM as a working set, not a warehouse
The conceptual shift deserves to be stated clearly, because it changes how you think about every memory decision in the system.
Store everything, use what you need
Weights are loaded at startup and kept permanently. VRAM is sized to fit the whole model. Memory pressure comes from fitting everything at once. Optimization target: model fits → done.
Keep only what's needed now
VRAM holds only the active execution window: current + next layer's weights, KV cache, buffers. Everything else is staged elsewhere. Optimization target: maximize useful residency, minimize stall cycles.
| Question | Warehouse model | Working set model |
|---|---|---|
| What is HBM for? | Store the whole model | Hold the active execution window |
| Primary HBM resident? | Weights (static, idle most of the time) | KV cache (drives throughput now) |
| Optimization target? | Fit the model | Maximize useful residency + overlap movement with compute |
| What happens when context grows? | OOM — can't serve | KV cache expands into freed weight space |
| Bottleneck? | Memory capacity | Bandwidth + scheduling quality |
| Quantization role? | Optional (accuracy tradeoff) | Required (enables transfer/compute parity) |
Part 9 — Why this architectural shift matters
There's a deeper pattern here worth naming. We're watching GPU memory undergo the same conceptual evolution that CPU memory went through decades ago — from a fixed allocation ("load your program, run it") to a dynamically managed working set (virtual memory, paging, OS-level orchestration).
PagedAttention was the first production instance of this thinking applied to AI inference. Scheduled weight streaming is the next. The long arc is toward a memory runtime that treats every byte of HBM as deliberately placed — not accidentally resident.
The second-order implication
Once memory stops being a passive container and becomes an actively orchestrated execution resource, the system has opinions about what matters. That's the same architectural moment that justified operating systems, file systems, and virtual memory managers. For AI infrastructure, it suggests the next layer of the stack is not another offload library. It's a memory runtime with explicit residency semantics — one that speaks the language of weights, KV blocks, and attention state, not pages and virtual addresses.
That runtime needs to know what an object is, how long it lives, how urgently it's needed, and what it costs to evict or recompute. Readers of this series will recognize this as precisely the problem the Memory Intent IR was designed to solve.
Summary in six lines
- HBM / VRAM is limited, high-bandwidth, and feeds compute directly.
- Weights, KV cache, and buffers compete for it. KV cache drives revenue; weights are just a precondition.
- PagedAttention showed what active KV cache management buys you. The answer was: a lot.
- Naive weight offload helps with capacity but creates a bandwidth and stall problem.
- Scheduled streaming solves the stall problem by prefetching ahead and overlapping transfer with compute.
- The right abstraction: VRAM is a bounded execution working set. Weights pass through it. KV cache lives in it.