1. The framing everyone uses — and what it misses
The standard explanation of speculative decoding goes roughly like this: a small, fast draft model generates a window of γ candidate tokens autoregressively. The large target model then verifies all γ candidates in a single parallel forward pass, at roughly the cost of generating one token, but potentially accepting several. If the acceptance rate α is high enough, the system emits multiple tokens per target forward pass and delivers a real wall-clock speedup.
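Under the usual i.i.d. acceptance model from the literature, the expected number of tokens emitted per verify pass is (1 − α^(γ+1)) / (1 − α); a quick sketch (the function name is mine):

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target forward pass, assuming each
    draft token is accepted independently with probability alpha.
    Includes the one token the target always contributes (resampled
    on rejection, or a bonus token on full acceptance)."""
    if alpha == 1.0:
        return gamma + 1
    # Prefix of length k survives with probability alpha^k * (1 - alpha).
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# With alpha = 0.8 and gamma = 6, roughly 4 tokens land per verify pass.
```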
This framing is correct. The throughput argument is real and the math works. But it describes the mechanism entirely in terms of arithmetic: FLOPs saved, parallel verify step, acceptance rate. It says almost nothing about what happens to memory.
That matters because modern LLM serving is not usually compute-bound in the way the framing implies. For standard autoregressive decode, the bottleneck is memory bandwidth — reading weights and KV cache at each step. Speculative decoding adds a draft model with its own weights, its own KV state, and its own access patterns. It changes the memory system's workload before it changes the compute system's workload.
2. The dual KV cache problem
In standard autoregressive decode, the KV cache for a request grows at one token per step: attention keys and values for all layers, for all tokens seen so far. For a 70B-class model at 128K context, this is already tens of gigabytes per session.
Speculative decoding adds a second KV cache: one for the draft model. The draft model also needs to attend to the full prior context at each draft step. Its KV cache is smaller per layer (the draft model has fewer and narrower layers), but it is not zero. And it has to stay resident while the draft window is being generated, while the target model is verifying, and through potential rollback to an earlier position.
The memory math is concrete. A 7B draft model in BF16 occupies roughly 14GB of weight storage. Its per-token, per-layer KV footprint is smaller than the target model's, but across a 128K-token context it accumulates meaningfully. At γ = 6 draft tokens per verify step, the draft model must maintain its KV state across those 6 pending positions while the target model runs verification. Neither cache can be discarded until the verify result is known.
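The accumulation is easy to sanity-check. A sketch with assumed shapes (a GQA draft with 8 KV heads and a Llama-70B-like target; the exact layer and head counts are illustrative):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV cache bytes per token: keys plus values across all layers."""
    return layers * 2 * kv_heads * head_dim * dtype_bytes

# Hypothetical 7B-class draft with GQA (8 KV heads), BF16:
draft = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)   # 128 KiB
# Hypothetical 70B-class target with GQA (8 KV heads), BF16:
target = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)  # 320 KiB

context = 128 * 1024
print(f"draft KV @128K:  {draft * context / 2**30:.1f} GiB")   # ~16 GiB
print(f"target KV @128K: {target * context / 2**30:.1f} GiB")  # ~40 GiB
```

Smaller per layer than the target's, as the text says, but far from negligible once both caches sit in the same HBM pool.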
3. Rollback mechanics and their memory cost
When the target model rejects the draft token at position k, tokens before k are committed, the target supplies its own corrected token at k, and the KV state provisionally extended for positions k through γ must be discarded. This is not free. The target's KV cache must be truncated, the draft model's own KV cache must be wound back to the same point, and the position bookkeeping for both models must be updated.
In practice, rollback behavior has three memory consequences that are easy to underestimate:
Speculative KV pages cannot be eagerly freed
During the draft phase, the KV cache is extended speculatively for each draft token. These pages cannot be freed until the verify step completes and returns an accept/reject decision for each position. The system must hold the speculative pages in a pending state — consuming HBM — for the full duration of the verify pass. At γ = 8 and a target model verify pass that takes several milliseconds, this holding cost is non-trivial at scale.
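A back-of-envelope estimate of that holding cost, with an assumed 70B-class target KV footprint (320 KiB per token) and an illustrative fleet size:

```python
def pending_spec_bytes(gamma: int, kv_token_bytes: int, sessions: int) -> int:
    """HBM held in unverified KV positions while verify passes are in
    flight: every session keeps its gamma speculative positions resident
    until the accept/reject decision returns."""
    return gamma * kv_token_bytes * sessions

# Assumed shapes: 70B-class target, GQA-8, BF16 -> 320 KiB of KV per token.
held = pending_spec_bytes(gamma=8, kv_token_bytes=327_680, sessions=256)
print(f"{held / 2**20:.0f} MiB pinned fleet-wide per verify round")  # 640 MiB
```

Hundreds of megabytes pinned for every in-flight verify round, recurring every few milliseconds, before counting the draft model's own pending KV.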
Rollback creates non-contiguous free space
Partial accepts — the most common outcome, where the target accepts tokens 0..k and rejects token k+1 — leave KV allocations that are partially valid and partially stale. Freeing only the rejected portion creates KV page fragmentation of exactly the kind the HBM Fragmentation Guard essay described. The allocator now has to manage mixed-validity pages across active sessions.
Frequent rollbacks amplify bandwidth consumption
Every rollback that touches HBM — to truncate the KV cache, reload the draft model's position embeddings, and re-read the last accepted token's attention state — burns bandwidth. When acceptance rates are low (α < 0.7), the bandwidth spent on rollbacks can exceed the bandwidth saved by avoiding full autoregressive steps in the target model.
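This break-even can be sketched with a deliberately coarse bandwidth model; the function and its cost assumptions are mine, not a measured profile:

```python
def hbm_bytes_per_token(alpha: float, gamma: int,
                        target_w: float, draft_w: float, kv_read: float):
    """Coarse per-emitted-token bandwidth model (an assumption, not a
    measurement): each forward pass reads the model's weights once, plus
    the target's KV history once per attention query. Draft-model KV
    reads and rollback truncation traffic are ignored, which actually
    flatters speculation slightly."""
    expected = (1 - alpha ** (gamma + 1)) / (1 - alpha)  # tokens per cycle
    spec = (gamma * draft_w + target_w + (gamma + 1) * kv_read) / expected
    auto = target_w + kv_read                            # one token per pass
    return spec, auto

# 70B BF16 target (~140 GB weights), 7B BF16 draft (~14 GB), ~40 GB KV read:
spec, auto = hbm_bytes_per_token(0.5, 6, 140, 14, 40)
# At alpha = 0.5, spec > auto: speculation moves MORE bytes per emitted
# token than plain decode. At alpha = 0.8 the inequality flips.
```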
4. Verify-phase bandwidth spikes
The verify pass is where speculative decoding achieves its theoretical speedup: the target model processes γ draft tokens in parallel, essentially for free compared to generating them autoregressively. But this parallelism comes at a memory bandwidth cost that standard decode does not have.
In standard autoregressive decode, the target model reads its KV cache for one new token position per step. The reads are focused: fetch keys and values for all layers, for all prior positions, to compute attention for one new query. The memory access pattern is regular and predictable.
In the verify pass, the target model reads its KV cache for γ new positions simultaneously. All γ query vectors need to attend to the full prior context. The access pattern expands: instead of one query over N prior KV entries, there are γ queries over N prior entries in parallel. In the worst case this is a γ× increase in attention read volume in a single forward pass; a well-fused kernel can amortize KV tile reads across the γ queries, but the burst still lands in one step.
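A toy accounting of the two regimes (naive per-query reads vs. a fused kernel that shares each KV tile across the γ queries); the function is illustrative:

```python
def verify_read_bytes(n_ctx: int, gamma: int, kv_token_bytes: int,
                      fused: bool = False) -> int:
    """KV bytes read by one verify pass over n_ctx prior tokens.
    Naive: each of the gamma queries re-reads the full KV history.
    Fused: a tiled kernel reads each KV tile once and applies all gamma
    queries to it (gamma is small enough to share one query tile)."""
    passes_over_history = 1 if fused else gamma
    return passes_over_history * n_ctx * kv_token_bytes
```

Even in the fused case the read happens in one burst rather than being spread across γ sequential decode steps, which is what produces the spiky aggregate profile described below.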
This bandwidth spike is not constant — it occurs at the verify step and then returns to normal. But at high concurrency, with many sessions running verify passes simultaneously or in rapid succession, the aggregate bandwidth pattern shifts from smooth to bursty. Systems optimized for smooth bandwidth loads perform worse under this bursty profile.
5. Draft model residency strategy
The draft model's weights are always-resident throughout speculative serving — you cannot page them out between draft and verify because the draft step and verify step happen in rapid alternation. This creates a residency commitment that does not exist in standard decode: a fixed additional HBM allocation for the draft model that reduces the space available for KV caches, which in turn reduces maximum concurrency.
There are three deployment strategies for the draft model's weights, each with different memory tradeoffs:
- Co-resident: draft weights live in HBM alongside the target model. Lowest draft/verify handoff latency, but the weights permanently subtract from KV cache headroom.
- Separate GPU: the draft runs on its own device, recovering target-GPU HBM at the cost of an interconnect hop on every draft/verify handoff.
- Draft-free: speculation comes from the target model itself (extra decoding heads, early exit, or n-gram lookup), so there are no separate draft weights to keep resident.
The co-resident strategy is most common in published benchmarks because it gives the cleanest speedup numbers. But those benchmarks typically run at low concurrency on otherwise-idle GPUs, exactly the conditions where HBM pressure is least visible. Production serving under real concurrency pressure often favors the separate-GPU or draft-free approaches to recover batch size.
6. Acceptance rate as a memory scheduling signal
The acceptance rate α is the most-cited metric in speculative decoding literature because it directly determines the expected tokens-per-step gain. But α is also a memory signal: when α drops, the system executes more rollbacks, holds speculative KV pages for longer on average, and wastes verify-phase bandwidth on tokens that will be discarded.
A memory-aware serving stack should monitor α per session and adjust speculation parameters dynamically:
- When α is high (>0.85): increase γ, keep the draft model co-resident, allow longer speculative windows. The memory cost is justified by the acceptance revenue.
- When α is moderate (0.70–0.85): hold γ steady. The speedup is real but the memory overhead is meaningful. Consider reducing concurrent sessions if HBM is tight.
- When α drops below 0.70: reduce γ or suspend speculative decoding for that session. The rollback overhead and speculative KV holding cost may exceed the gains. Fall back to standard autoregressive decode.
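The three bands fold naturally into a per-session policy hook. Thresholds follow the list above; the step sizes and bounds are illustrative:

```python
def speculation_policy(alpha: float, gamma: int,
                       gamma_min: int = 2, gamma_max: int = 8):
    """Map a session's observed acceptance rate to (speculate?, new_gamma).
    Bands match the text; +/-1 steps and the gamma bounds are assumptions."""
    if alpha > 0.85:
        return True, min(gamma + 1, gamma_max)   # acceptance pays for memory
    if alpha >= 0.70:
        return True, gamma                       # hold steady
    # Below break-even: shrink the window, then fall back entirely.
    if gamma > gamma_min:
        return True, gamma - 1
    return False, 0                              # standard autoregressive decode
```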
7. Why this is a scheduler problem
The memory challenges in speculative decoding do not solve themselves — they require the serving scheduler to understand speculation mechanics explicitly. A scheduler that treats speculative and non-speculative sessions identically will over-allocate concurrency, trigger KV evictions that kill acceptance rate, and end up with worse throughput than naive autoregressive decode.
The scheduling decisions that matter:
Per-session speculation eligibility
Not every session benefits from speculation. Sessions with low acceptance rate (monitored per-session), very short remaining context, or active prefix-sharing should fall back to standard decode. The scheduler must track α per session and toggle speculation dynamically.
Verify-phase batching discipline
Running many verify passes simultaneously causes bandwidth spikes that compete with draft generation for other sessions. The scheduler should stagger verify calls to smooth the bandwidth profile, or batch them deliberately to maximize GPU utilization during the verify burst rather than spreading them across the critical path of unrelated sessions.
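A minimal sketch of the stagger option, assuming a tick-based scheduler; the function and phase count are illustrative:

```python
def staggered_verify_phases(sessions: list, n_phases: int = 4) -> list:
    """Partition sessions into n_phases groups; each group runs its verify
    pass in its own scheduler tick instead of all verifies landing in one.
    The 'batch deliberately' alternative would instead merge one group per
    tick into a single fused verify call."""
    return [sessions[i::n_phases] for i in range(n_phases)]
```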
Speculative KV page admission control
The scheduler must track how many "pending speculative pages" — KV allocations for unverified positions — are outstanding at any time. When that total nears its HBM budget, it should reduce γ or pause new draft steps rather than allow speculative page accumulation to crowd out committed KV pages from other sessions.
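One way to sketch that admission control, assuming page-granular KV accounting; the class and method names are mine:

```python
class SpeculativePageBudget:
    """Global admission control for unverified ('pending') KV pages.
    A session may start drafting only if its gamma-token extension fits
    under the budget; its pages are released when the verify completes."""

    def __init__(self, max_pending_pages: int):
        self.max_pending = max_pending_pages
        self.pending = {}                 # session_id -> pages held

    def try_admit(self, session_id: str, pages: int) -> bool:
        if sum(self.pending.values()) + pages > self.max_pending:
            return False                  # caller should shrink gamma or wait
        self.pending[session_id] = self.pending.get(session_id, 0) + pages
        return True

    def release(self, session_id: str) -> None:
        self.pending.pop(session_id, None)
```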
Rollback-aware KV page cleanup
After each verify step, the scheduler needs to release rejected speculative pages promptly and compact the KV allocator's free list if fragmentation has accumulated. Deferred cleanup leaves speculative pages occupying HBM that should be available to new requests. This is the same fragmentation problem the HBM Fragmentation Guard addressed, now specifically induced by rollback patterns.
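A minimal sketch of eager post-verify cleanup, assuming an illustrative one-page-per-speculative-position layout:

```python
def apply_verify_result(kv_pages: list, accepted: int, free_list: list) -> int:
    """After a verify pass that accepted `accepted` draft positions,
    release the stale speculative pages immediately instead of deferring.
    kv_pages holds one page id per speculative position (illustrative);
    returns the number of pages released."""
    stale = kv_pages[accepted:]
    del kv_pages[accepted:]
    free_list.extend(stale)   # eager release; free-list compaction can run later
    return len(stale)
```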
Draft model weight tier management
Under high load, the draft model's weights should be eligible for demotion to host DRAM if concurrency pressure turns speculation into a net loss. The scheduler should re-activate draft serving when load drops and HBM pressure relaxes. This is a weight orchestration decision, not just a static deployment choice.
8. Tree speculation makes the memory problem harder
Standard speculative decoding generates a single linear draft sequence. Tree-based speculation — where the draft model generates multiple branching candidate sequences simultaneously — amplifies throughput gains further but multiplies the memory cost.
With a candidate tree of branching factor b and depth γ, the number of speculative KV extensions held pending grows as O(b × γ). A branching factor of 4 at depth 8 means 32 simultaneous speculative extensions per session, each requiring KV entries across all layers. The node KV alone is modest, but page-granular allocation pins a whole page per divergent branch, and implementations that materialize each branch as a separate sequence (instead of sharing the prefix through tree attention) duplicate context KV per branch; at long context, that duplication can reach gigabytes per session even before the target model's committed cache is counted.
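Counting the pending extensions with assumed shapes (16-token pages, a 70B-class GQA-8 BF16 target); note this covers only the node KV pinned at page granularity, and implementations that materialize each branch as a full separate sequence pay far more:

```python
b, depth, page_tokens = 4, 8, 16        # branching 4, depth 8, 16-token pages
nodes = b * depth                       # 32 simultaneous extensions
kv_per_token = 80 * 2 * 8 * 128 * 2    # assumed 70B-class GQA-8 BF16 shapes

# Worst case: every divergent node pins its own page.
pinned = nodes * page_tokens * kv_per_token
print(f"{nodes} extensions, {pinned / 2**20:.0f} MiB pinned per session")
```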
9. The systems view: what actually determines whether speculation pays off
Published speculative decoding papers almost always measure speedup on isolated, single-session benchmarks with high acceptance rates and fresh GPUs. That is not what production serving looks like.
In production, speculative decoding pays off when and only when:
- The acceptance rate is consistently above the break-even threshold (~0.70) for the actual production input distribution — not the curated benchmark distribution.
- The additional HBM pressure from dual KV caches, speculative pages, and draft weights does not reduce maximum concurrency below the level that saturates the GPU's compute utilization.
- The verify-phase bandwidth spikes are absorbed without queuing delays on other sessions in the batch.
- The scheduler actively manages γ, speculation eligibility, and speculative page cleanup rather than treating speculation as a static configuration.
When all four conditions hold, speculative decoding compounds with other optimizations — sparse attention, KV quantization, prefill offload — to produce substantial throughput gains. When they do not, it reduces effective concurrency and can produce negative speedup relative to naive autoregressive decode.
10. What the right architecture looks like
A well-designed speculative serving system has several properties that go beyond the academic paper description:
Memory-aware speculation policy
γ is not a constant. It adapts per-session based on acceptance rate, current HBM pressure, and concurrent verify-phase count. The policy treats speculation as a resource-consuming optimization, not a free speedup.
Speculative KV page accounting
The allocator tracks draft-speculative pages separately from committed pages. Speculative pages are tagged with their verification deadline, released immediately on rollback, and counted against a global speculation budget.
Draft weight tier control
Draft model weights are treated as a high-priority, always-warm resident under normal conditions, but are eligible for DRAM demotion when batch size is constrained by HBM. Re-activation is triggered by a throughput-degradation signal.
Verify-phase schedule coordination
Verify passes are scheduled to avoid simultaneous bandwidth spikes across multiple sessions: either staggered (to smooth bandwidth) or batched together (to minimize total verify time), depending on whether throughput or latency is the primary objective.