1. The framing everyone uses — and what it misses
The standard explanation of speculative decoding goes roughly like this: a small, fast draft model generates a window of γ candidate tokens autoregressively. The large target model then verifies all γ candidates in a single parallel forward pass, at roughly the cost of generating one token, but potentially accepting several. If the acceptance rate α is high enough, the system emits multiple tokens per target forward pass and delivers a real wall-clock speedup.
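Under the usual i.i.d. acceptance model from the literature, the expected number of tokens emitted per verify pass is (1 − α^(γ+1)) / (1 − α); a quick sketch (the function name is mine):

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target forward pass, assuming each
    draft token is accepted independently with probability alpha.
    Includes the one token the target always contributes (resampled
    on rejection, or a bonus token on full acceptance)."""
    if alpha == 1.0:
        return gamma + 1
    # Prefix of length k survives with probability alpha^k * (1 - alpha).
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# With alpha = 0.8 and gamma = 6, roughly 4 tokens land per verify pass.
```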
This framing is correct. The throughput argument is real and the math works. But it describes the mechanism entirely in terms of arithmetic: FLOPs saved, parallel verify step, acceptance rate. It says almost nothing about what happens to memory.
That matters because modern LLM serving is not usually compute-bound in the way the framing implies. For standard autoregressive decode, the bottleneck is memory bandwidth — reading weights and KV cache at each step. Speculative decoding adds a draft model with its own weights, its own KV state, and its own access patterns. It changes the memory system's workload before it changes the compute system's workload.
2. The dual KV cache problem
In standard autoregressive decode, the KV cache for a request grows at one token per step: attention keys and values for all layers, for all tokens seen so far. For a 70B-class model at 128K context, this is already tens of gigabytes per session.
Speculative decoding adds a second KV cache: one for the draft model. The draft model also needs to attend to the full prior context at each draft step. Its KV cache is smaller per layer (the draft model has fewer and narrower layers), but it is not zero. And it has to stay resident while the draft window is being generated, while the target model is verifying, and through potential rollback to an earlier position.
The memory math is concrete. A 7B draft model in BF16 occupies roughly 14GB of weight storage. Its per-token, per-layer KV footprint is smaller than the target model's, but across a 128K-token context it accumulates meaningfully. At γ = 6 draft tokens per verify step, the draft model must maintain its KV state across those 6 pending positions while the target model runs verification. Neither cache can be discarded until the verify result is known.
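The accumulation is easy to sanity-check. A sketch with assumed shapes (a GQA draft with 8 KV heads and a Llama-70B-like target; the exact layer and head counts are illustrative):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV cache bytes per token: keys plus values across all layers."""
    return layers * 2 * kv_heads * head_dim * dtype_bytes

# Hypothetical 7B-class draft with GQA (8 KV heads), BF16:
draft = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)   # 128 KiB
# Hypothetical 70B-class target with GQA (8 KV heads), BF16:
target = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)  # 320 KiB

context = 128 * 1024
print(f"draft KV @128K:  {draft * context / 2**30:.1f} GiB")   # ~16 GiB
print(f"target KV @128K: {target * context / 2**30:.1f} GiB")  # ~40 GiB
```

Smaller per layer than the target's, as the text says, but far from negligible once both caches sit in the same HBM pool.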
3. Rollback mechanics and their memory cost
When the target model rejects the draft token at position k, tokens before k are committed, the target supplies its own corrected token at k, and the KV state provisionally extended for positions k through γ must be discarded. This is not free. The target's KV cache must be truncated, the draft model's own KV cache must be wound back to the same point, and the position bookkeeping for both models must be updated.
In practice, rollback behavior has three memory consequences that are easy to underestimate:
Speculative KV pages cannot be eagerly freed
During the draft phase, the KV cache is extended speculatively for each draft token. These pages cannot be freed until the verify step completes and returns an accept/reject decision for each position. The system must hold the speculative pages in a pending state — consuming HBM — for the full duration of the verify pass. At γ = 8 and a target model verify pass that takes several milliseconds, this holding cost is non-trivial at scale.
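A back-of-envelope estimate of that holding cost, with an assumed 70B-class target KV footprint (320 KiB per token) and an illustrative fleet size:

```python
def pending_spec_bytes(gamma: int, kv_token_bytes: int, sessions: int) -> int:
    """HBM held in unverified KV positions while verify passes are in
    flight: every session keeps its gamma speculative positions resident
    until the accept/reject decision returns."""
    return gamma * kv_token_bytes * sessions

# Assumed shapes: 70B-class target, GQA-8, BF16 -> 320 KiB of KV per token.
held = pending_spec_bytes(gamma=8, kv_token_bytes=327_680, sessions=256)
print(f"{held / 2**20:.0f} MiB pinned fleet-wide per verify round")  # 640 MiB
```

Hundreds of megabytes pinned for every in-flight verify round, recurring every few milliseconds, before counting the draft model's own pending KV.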
Rollback creates non-contiguous free space
Partial accepts — the most common outcome, where the target accepts tokens 0..k and rejects token k+1 — leave KV allocations that are partially valid and partially stale. Freeing only the rejected portion creates KV page fragmentation of exactly the kind the HBM Fragmentation Guard essay described. The allocator now has to manage mixed-validity pages across active sessions.
Frequent rollbacks amplify bandwidth consumption
Every rollback that touches HBM — to truncate the KV cache, reload the draft model's position embeddings, and re-read the last accepted token's attention state — burns bandwidth. When acceptance rates are low (α < 0.7), the bandwidth spent on rollbacks can exceed the bandwidth saved by avoiding full autoregressive steps in the target model.
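This break-even can be sketched with a deliberately coarse bandwidth model; the function and its cost assumptions are mine, not a measured profile:

```python
def hbm_bytes_per_token(alpha: float, gamma: int,
                        target_w: float, draft_w: float, kv_read: float):
    """Coarse per-emitted-token bandwidth model (an assumption, not a
    measurement): each forward pass reads the model's weights once, plus
    the target's KV history once per attention query. Draft-model KV
    reads and rollback truncation traffic are ignored, which actually
    flatters speculation slightly."""
    expected = (1 - alpha ** (gamma + 1)) / (1 - alpha)  # tokens per cycle
    spec = (gamma * draft_w + target_w + (gamma + 1) * kv_read) / expected
    auto = target_w + kv_read                            # one token per pass
    return spec, auto

# 70B BF16 target (~140 GB weights), 7B BF16 draft (~14 GB), ~40 GB KV read:
spec, auto = hbm_bytes_per_token(0.5, 6, 140, 14, 40)
# At alpha = 0.5, spec > auto: speculation moves MORE bytes per emitted
# token than plain decode. At alpha = 0.8 the inequality flips.
```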
4. Verify-phase bandwidth spikes
The verify pass is where speculative decoding achieves its theoretical speedup: the target model processes γ draft tokens in parallel, essentially for free compared to generating them autoregressively. But this parallelism comes at a memory bandwidth cost that standard decode does not have.
In standard autoregressive decode, the target model reads its KV cache for one new token position per step. The reads are focused: fetch keys and values for all layers, for all prior positions, to compute attention for one new query. The memory access pattern is regular and predictable.
In the verify pass, the target model reads its KV cache for γ new positions simultaneously. All γ query vectors need to attend to the full prior context. The access pattern expands: instead of one query over N prior KV entries, there are γ queries over N prior entries in parallel. In the worst case this is a γ× increase in attention read volume in a single forward pass; a well-fused kernel can amortize KV tile reads across the γ queries, but the burst still lands in one step.
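A toy accounting of the two regimes (naive per-query reads vs. a fused kernel that shares each KV tile across the γ queries); the function is illustrative:

```python
def verify_read_bytes(n_ctx: int, gamma: int, kv_token_bytes: int,
                      fused: bool = False) -> int:
    """KV bytes read by one verify pass over n_ctx prior tokens.
    Naive: each of the gamma queries re-reads the full KV history.
    Fused: a tiled kernel reads each KV tile once and applies all gamma
    queries to it (gamma is small enough to share one query tile)."""
    passes_over_history = 1 if fused else gamma
    return passes_over_history * n_ctx * kv_token_bytes
```

Even in the fused case the read happens in one burst rather than being spread across γ sequential decode steps, which is what produces the spiky aggregate profile described below.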
This bandwidth spike is not constant — it occurs at the verify step and then returns to normal. But at high concurrency, with many sessions running verify passes simultaneously or in rapid succession, the aggregate bandwidth pattern shifts from smooth to bursty. Systems optimized for smooth bandwidth loads perform worse under this bursty profile.
5. Draft model residency strategy
The draft model's weights are always-resident throughout speculative serving — you cannot page them out between draft and verify because the draft step and verify step happen in rapid alternation. This creates a residency commitment that does not exist in standard decode: a fixed additional HBM allocation for the draft model that reduces the space available for KV caches, which in turn reduces maximum concurrency.
There are three deployment strategies for the draft model's weights, each with different memory tradeoffs:
- Co-resident: draft weights live in HBM alongside the target model. Lowest draft/verify handoff latency, but the weights permanently subtract from KV cache headroom.
- Separate GPU: the draft runs on its own device, recovering target-GPU HBM at the cost of an interconnect hop on every draft/verify handoff.
- Draft-free: speculation comes from the target model itself (extra decoding heads, early exit, or n-gram lookup), so there are no separate draft weights to keep resident.
The co-resident strategy is most common in published benchmarks because it gives the cleanest speedup numbers. But those benchmarks typically run at low concurrency on otherwise-idle GPUs, exactly the conditions where HBM pressure is least visible. Production serving under real concurrency pressure often favors the separate-GPU or draft-free approaches to recover batch size.
6. Acceptance rate as a memory scheduling signal
The acceptance rate α is the most-cited metric in speculative decoding literature because it directly determines the expected tokens-per-step gain. But α is also a memory signal: when α drops, the system executes more rollbacks, holds speculative KV pages for longer on average, and wastes verify-phase bandwidth on tokens that will be discarded.
A memory-aware serving stack should monitor α per session and adjust speculation parameters dynamically:
- When α is high (>0.85): increase γ, keep the draft model co-resident, allow longer speculative windows. The memory cost is justified by the acceptance revenue.
- When α is moderate (0.70–0.85): hold γ steady. The speedup is real but the memory overhead is meaningful. Consider reducing concurrent sessions if HBM is tight.
- When α drops below 0.70: reduce γ or suspend speculative decoding for that session. The rollback overhead and speculative KV holding cost may exceed the gains. Fall back to standard autoregressive decode.
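The three bands fold naturally into a per-session policy hook. Thresholds follow the list above; the step sizes and bounds are illustrative:

```python
def speculation_policy(alpha: float, gamma: int,
                       gamma_min: int = 2, gamma_max: int = 8):
    """Map a session's observed acceptance rate to (speculate?, new_gamma).
    Bands match the text; +/-1 steps and the gamma bounds are assumptions."""
    if alpha > 0.85:
        return True, min(gamma + 1, gamma_max)   # acceptance pays for memory
    if alpha >= 0.70:
        return True, gamma                       # hold steady
    # Below break-even: shrink the window, then fall back entirely.
    if gamma > gamma_min:
        return True, gamma - 1
    return False, 0                              # standard autoregressive decode
```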
7. Why this is a scheduler problem
The memory challenges in speculative decoding do not solve themselves — they require the serving scheduler to understand speculation mechanics explicitly. A scheduler that treats speculative and non-speculative sessions identically will over-allocate concurrency, trigger KV evictions that kill acceptance rate, and end up with worse throughput than naive autoregressive decode.
The scheduling decisions that matter:
Per-session speculation eligibility
Not every session benefits from speculation. Sessions with low acceptance rate (monitored per-session), very short remaining context, or active prefix-sharing should fall back to standard decode. The scheduler must track α per session and toggle speculation dynamically.
Verify-phase batching discipline
Running many verify passes simultaneously causes bandwidth spikes that compete with draft generation for other sessions. The scheduler should stagger verify calls to smooth the bandwidth profile, or batch them deliberately to maximize GPU utilization during the verify burst rather than spreading them across the critical path of unrelated sessions.
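A minimal sketch of the stagger option, assuming a tick-based scheduler; the function and phase count are illustrative:

```python
def staggered_verify_phases(sessions: list, n_phases: int = 4) -> list:
    """Partition sessions into n_phases groups; each group runs its verify
    pass in its own scheduler tick instead of all verifies landing in one.
    The 'batch deliberately' alternative would instead merge one group per
    tick into a single fused verify call."""
    return [sessions[i::n_phases] for i in range(n_phases)]
```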
Speculative KV page admission control
The scheduler must track how many "pending speculative pages" — KV allocations for unverified positions — are outstanding at any time. When that total nears its HBM budget, it should reduce γ or pause new draft steps rather than allow speculative page accumulation to crowd out committed KV pages from other sessions.
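One way to sketch that admission control, assuming page-granular KV accounting; the class and method names are mine:

```python
class SpeculativePageBudget:
    """Global admission control for unverified ('pending') KV pages.
    A session may start drafting only if its gamma-token extension fits
    under the budget; its pages are released when the verify completes."""

    def __init__(self, max_pending_pages: int):
        self.max_pending = max_pending_pages
        self.pending = {}                 # session_id -> pages held

    def try_admit(self, session_id: str, pages: int) -> bool:
        if sum(self.pending.values()) + pages > self.max_pending:
            return False                  # caller should shrink gamma or wait
        self.pending[session_id] = self.pending.get(session_id, 0) + pages
        return True

    def release(self, session_id: str) -> None:
        self.pending.pop(session_id, None)
```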
Rollback-aware KV page cleanup
After each verify step, the scheduler needs to release rejected speculative pages promptly and compact the KV allocator's free list if fragmentation has accumulated. Deferred cleanup leaves speculative pages occupying HBM that should be available to new requests. This is the same fragmentation problem the HBM Fragmentation Guard addressed, now specifically induced by rollback patterns.
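A minimal sketch of eager post-verify cleanup, assuming an illustrative one-page-per-speculative-position layout:

```python
def apply_verify_result(kv_pages: list, accepted: int, free_list: list) -> int:
    """After a verify pass that accepted `accepted` draft positions,
    release the stale speculative pages immediately instead of deferring.
    kv_pages holds one page id per speculative position (illustrative);
    returns the number of pages released."""
    stale = kv_pages[accepted:]
    del kv_pages[accepted:]
    free_list.extend(stale)   # eager release; free-list compaction can run later
    return len(stale)
```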
Draft model weight tier management
Under high load, the draft model's weights should be eligible for demotion to host DRAM if concurrency pressure turns speculation into a net loss. The scheduler should re-activate draft serving when load drops and HBM pressure relaxes. This is a weight orchestration decision, not just a static deployment choice.
8. Tree speculation makes the memory problem harder
Standard speculative decoding generates a single linear draft sequence. Tree-based speculation — where the draft model generates multiple branching candidate sequences simultaneously — amplifies throughput gains further but multiplies the memory cost.
With a candidate tree of branching factor b and depth γ, the number of speculative KV extensions held pending grows as O(b × γ). A branching factor of 4 at depth 8 means 32 simultaneous speculative extensions per session, each requiring KV entries across all layers. The node KV alone is modest, but page-granular allocation pins a whole page per divergent branch, and implementations that materialize each branch as a separate sequence (instead of sharing the prefix through tree attention) duplicate context KV per branch; at long context, that duplication can reach gigabytes per session even before the target model's committed cache is counted.
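Counting the pending extensions with assumed shapes (16-token pages, a 70B-class GQA-8 BF16 target); note this covers only the node KV pinned at page granularity, and implementations that materialize each branch as a full separate sequence pay far more:

```python
b, depth, page_tokens = 4, 8, 16        # branching 4, depth 8, 16-token pages
nodes = b * depth                       # 32 simultaneous extensions
kv_per_token = 80 * 2 * 8 * 128 * 2    # assumed 70B-class GQA-8 BF16 shapes

# Worst case: every divergent node pins its own page.
pinned = nodes * page_tokens * kv_per_token
print(f"{nodes} extensions, {pinned / 2**20:.0f} MiB pinned per session")
```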
9. The systems view: what actually determines whether speculation pays off
Published speculative decoding papers almost always measure speedup on isolated, single-session benchmarks with high acceptance rates and fresh GPUs. That is not what production serving looks like.
In production, speculative decoding pays off when and only when:
- The acceptance rate is consistently above the break-even threshold (~0.70) for the actual production input distribution — not the curated benchmark distribution.
- The additional HBM pressure from dual KV caches, speculative pages, and draft weights does not reduce maximum concurrency below the level that saturates the GPU's compute utilization.
- The verify-phase bandwidth spikes are absorbed without queuing delays on other sessions in the batch.
- The scheduler actively manages γ, speculation eligibility, and speculative page cleanup rather than treating speculation as a static configuration.
When all four conditions hold, speculative decoding compounds with other optimizations — sparse attention, KV quantization, prefill offload — to produce substantial throughput gains. When they do not, it reduces effective concurrency and can produce negative speedup relative to naive autoregressive decode.
10. What the right architecture looks like
A well-designed speculative serving system has several properties that go beyond the academic paper description:
Memory-aware speculation policy
γ is not a constant. It adapts per-session based on acceptance rate, current HBM pressure, and concurrent verify-phase count. The policy treats speculation as a resource-consuming optimization, not a free speedup.
Speculative KV page accounting
The allocator tracks draft-speculative pages separately from committed pages. Speculative pages are tagged with their verification deadline, released immediately on rollback, and counted against a global speculation budget.
Draft weight tier control
Draft model weights are treated as a high-priority, always-warm resident under normal conditions, but are eligible for DRAM demotion when batch size is constrained by HBM. Re-activation is triggered by a throughput-degradation signal.
Verify-phase schedule coordination
Verify passes are scheduled to avoid simultaneous bandwidth spikes across multiple sessions: either staggered (to smooth bandwidth) or batched together (to minimize total verify time), depending on whether throughput or latency is the primary objective.