The Overhead Nobody Talks About

The benchmark tables for large language model inference typically measure tokens per second, time to first token, and model perplexity. What they rarely measure — because it is embarrassingly wasteful — is how much time the GPU spends waiting for the CPU to tell it what to do next.

In a conventional decode loop, every single token generation cycle involves a round-trip between host and device. The CPU launches a kernel. The GPU runs it. The CPU synchronizes, reads back the result, and decides what to launch next. For a 256-token response, that is 256 kernel launches, 256 host synchronizations, and 256 PCIe round-trips. At 5–10 microseconds per launch, the orchestration overhead alone can consume 1–2 milliseconds per response — a significant fraction of total decode time in the small-batch, low-latency regime that interactive products require.

conventional host-driven decode loop C++ / CUDA
// CPU orchestrates every token. GPU waits between each launch.
for (int t = 0; t < max_tokens; t++) {
    launch_kernel(attention, stream);    // ~5–10 µs overhead
    cudaStreamSynchronize(stream);       // host stalls
    launch_kernel(projection, stream);
    cudaStreamSynchronize(stream);
    launch_kernel(sample_token, stream);
    cudaStreamSynchronize(stream);
    int next_tok = read_from_device();   // PCIe read
    if (next_tok == eos_id) break;
}

Speculative decoding was supposed to help. The idea — propose multiple tokens with a cheap draft model, verify them in parallel with the target model — dramatically reduces the number of target model calls. But the canonical speculative decode implementation still operates within this host-driven loop. The draft model is launched by the host. The verification kernel is launched by the host. The accepted count is read back by the host. The block size for the next iteration is decided by the host.

"You replaced the per-token round-trip with a per-block round-trip. The fundamental problem — the GPU waits for the CPU — is unchanged."

This is precisely the problem that the persistent mega-kernel architecture addresses — and it is the architectural context in which Adaptive Speculative Block Sizing becomes not just useful but necessary.


The Persistent Mega-Kernel: One Launch to Rule Them All

A persistent mega-kernel launches once per batch and never returns to the host until every request in the batch is complete. The token loop — prefill, decode, speculative verify, commit, repeat — executes entirely on the device. The GPU owns the control flow. The CPU polls a completion flag and does nothing else.

BASELINE (HOST-LAUNCHED) t₀ t₁ t₂ t₃ t₄ ··· tₙ ↑sync ↑sync ↑sync ↑sync ↑sync ··· ↑sync N kernel launches N synchronizations PERSISTENT MEGA-KERNEL xl_persistent_megakernel <<<N, block>>> t₀ t₁ t₂ t₃ t₄ ··· tₙ — all on-device ↑ 1 sync 1 kernel launch 1 synchronization
Fig. 2 — Baseline vs persistent mega-kernel. Launch count: O(tokens) → 1.

The XL-Persistent-Kernel project implements this architecture as a correctness-first scaffold. The kernel — xl_persistent_megakernel.cu — launches with one block per request. Inside each block, a device loop calls six inline stage helpers:

xl_persistent_megakernel.cu — device resident loop CUDA C++
__global__ void xl_persistent_megakernel(
    RequestDescriptor* requests, int num_requests,
    KVPageTable kv_table, int* draft_tokens,
    int* shutdown_flag, int max_iterations, int block_size
) {
    RequestDescriptor* req = &requests[blockIdx.x];

    for (int iter = 0; !(*shutdown_flag) && iter < max_iterations; iter++) {
        if (!req->is_done()) {
            stage_prefill(req, &kv_table);        // PREFILL_READY → DECODE_READY
            stage_decode(req, draft_tokens, block_size); // propose B tokens
            stage_spec_verify(req, draft_tokens);  // accept prefix
            stage_commit(req, &kv_table);          // commit accepted
        }
        // block 0 polls all-done; sets *shutdown_flag
    }
}

The stage helpers are __forceinline__ __device__ functions — they compile directly into the mega-kernel body, not as separate launches. stage_prefill, stage_decode, stage_spec_verify, stage_commit, stage_kv, and stage_scheduler all live in the device loop. This is a single resident kernel that carries the full inference pipeline.

Token Parallelization Within the Mega-Kernel

The mega-kernel architecture enables a form of token-level parallelization that host-launched approaches simply cannot achieve. In the baseline path, requests are processed one token step at a time, sequentially controlled by the host. In the persistent kernel, N blocks execute simultaneously — one per request — and their decode iterations are fully parallel. Requests that finish prefill early begin decoding while others are still being filled. There is no rendezvous at the host boundary.

INTRA-BATCH TOKEN PARALLELISM INSIDE THE PERSISTENT KERNEL time → REQ 0 PREFILL B=8 B=8 B=4 done REQ 1 PREFILL (longer) B=8 B=6 B=4 done REQ 2 PREFILL B=8 2 ← KV pressure: B capped at 2 done REQ 3 PREFILL B=8 done prefill speculative decode (block B) KV-pressure capped (B≤2) complete
Fig. 3 — All 4 requests execute in parallel inside the mega-kernel. Each block independently adapts its speculative block size. REQ 2 hits KV pressure and caps at B=2 while others continue at B=8.

Each CUDA block executing a request is fully independent. Block 0 running request 0 never waits for block 1 running request 1 — they interleave on the GPU's SM scheduler freely. A request that finishes prefill early immediately begins speculative decode. A request under KV pressure silently reduces its block size. None of this requires any host-side coordination.

This independence is the key architectural property that makes ASBS not just possible but natural inside a mega-kernel. In a host-driven loop, the host picks a single block size for the batch. In a mega-kernel, each request can adapt independently on every iteration.

Why Fixed Block Sizes Are Leaving Performance on the Table

In a persistent mega-kernel, the GPU never reports back to the host between decode iterations. This means any intelligence about block sizing must live on the device. Fixed block sizes are the absence of that intelligence.

Adaptive Speculative Block Sizing

The observation is straightforward: speculative decoding acceptance rate is not constant across a request's lifetime, not constant across requests in a batch, and not constant under varying memory conditions. A fixed block size optimized for one condition is suboptimal in all others.

What changes across a request's lifetime

Consider a typical chat completion request. After a structured system prompt, the model's distribution is strongly peaked — the next several tokens are nearly deterministic, and a draft model will predict them with high accuracy. Acceptance rates of 80–90% are common in this regime. Proposing 8 tokens at once costs almost nothing and commits almost all of them.

As the response becomes more creative — narrative, opinion, code — the distribution flattens. The draft model begins making choices the target model disagrees with. Acceptance rates drop toward 50–60%. A block of 8 proposed tokens now yields 4–5 committed tokens on average, but the cost of the verify pass is the same. And critically, the KV cache has to hold draft pages for all 8 proposed tokens until verification completes — 8 pages that are likely to be wasted.

Near the end of generation — trailing punctuation, EOS approach — acceptance rates can drop further still. Proposing a large block here burns KV pages and verify compute for tokens that will almost never be accepted.

What changes across requests in a batch

In a continuous batching scenario, different requests have completely different acceptance rate profiles at any given moment. Request A may be in its high-acceptance structured zone while Request B is in a low-acceptance creative zone. A single batch-level block size is a compromise that serves neither well.

What changes under KV pressure

The XL-Persistent-Kernel KV cache distinguishes between committed pages (finalized token data, protected from eviction) and draft pages (speculative, discardable on rejection). The flag REQUEST_FLAG_KV_PRESSURE is already defined in kernel_status.h. When the cache is near capacity, every draft page allocated for a speculative block is a page that cannot hold committed data for another request. Under pressure, a block of 8 draft tokens occupies 8 × (layer_count × bytes_per_kv_token) bytes of cache that may yield nothing if the block is rejected.

The Core Insight

In a host-launched system, the host could inspect acceptance rates between rounds and adjust block size. In a persistent mega-kernel, the host never sees per-token results during decode. Any adaptive policy must be computed on the device, inline, during the verify stage. ASBS is the natural first inference optimization that is only possible inside a persistent kernel.

The EMA model

ASBS tracks a per-request exponential moving average of the observed acceptance rate. After each verify pass, the EMA is updated inline in stage_spec_verify.cuh before the kernel's next iteration. The EMA-smoothed estimate drives the block size for the next proposal.

ε' = α · r + (1 − α) · ε where r = accepted_count / proposed_count, α = 0.2, ε = current EMA

B_next = max_block     if ε' ≥ 0.80
B_next = mid_block     if 0.50 ≤ ε' < 0.80
B_next = min_block     if ε' < 0.50
B_next = min(B_next, 2) if KV_PRESSURE flag is set

The EMA smoothing factor of 0.2 was chosen to give meaningful responsiveness — a single bad iteration moves the estimate by 20% of the gap — while avoiding overreaction to transient noise. A single rejection after ten acceptances should not immediately collapse block size to 1; the EMA absorbs the shock.

The CUDA implementation

On the CUDA side, the change touches three files. Two fields are added to RequestDescriptor:

request_desc.h — two new fields CUDA C++
struct RequestDescriptor {
    // ... existing fields ...
    float  ema_acceptance_rate;   // initialized to 0.8f (optimistic)
    int    current_block_size;    // initialized to launch-time block_size
};

Then stage_spec_verify.cuh gains an inline update after it computes accepted_prefix_len:

stage_spec_verify.cuh — inline ASBS update CUDA C++
__device__ __forceinline__ void update_block_size_device(
    RequestDescriptor* req, int accepted, int proposed,
    int min_block, int max_block, int mid_block
) {
    float r = (proposed > 0) ? (float)accepted / (float)proposed : 0.0f;
    constexpr float alpha = 0.2f;
    req->ema_acceptance_rate = alpha * r + (1.0f - alpha) * req->ema_acceptance_rate;

    int next_b;
    if      (req->ema_acceptance_rate >= 0.80f) next_b = max_block;
    else if (req->ema_acceptance_rate >= 0.50f) next_b = mid_block;
    else                                          next_b = min_block;

    // Cap under KV pressure — protect committed pages
    if (req->has_flag(REQUEST_FLAG_KV_PRESSURE))
        next_b = min(next_b, 2);

    req->current_block_size = next_b;
}

// Called at the end of stage_spec_verify, before returning:
update_block_size_device(req, req->accepted_prefix_len, req->draft_len,
                          1, 8, 4);
req->set_state(REQUEST_COMMIT_READY);

Finally, stage_decode.cuh uses req->current_block_size instead of the kernel parameter, and the block_size parameter is removed from the mega-kernel signature — replaced by per-request initialization at launch time.

The Python specification

The Python simulator in src/megakernel_lab/ is the correctness reference for the CUDA layer. ASBS maps cleanly into it. RequestState gains ema_acceptance_rate and current_block_size fields. An AdaptiveBlockPolicy dataclass captures the EMA parameters. The update_block_size() function runs in DecodeWorker.process() after each verify step, reading KVCache.is_under_pressure() to check the pressure flag. Crucially, the Python implementation uses identical thresholds (0.80 / 0.50 / α = 0.2) as the CUDA implementation, preserving the simulator-as-specification guarantee.


What ASBS Unlocks — and What to Measure

KV cache efficiency

The most immediate win is KV page utilization. Under a fixed block size of 8, a request with 40% acceptance rate allocates 8 draft pages per iteration but commits only 3. Five pages are allocated, touch the LRU order, and are then discarded — forcing real committed pages of other requests out of the cache. With ASBS, the same request drops to block size 1 once its EMA falls below 0.50, allocating only 1 draft page per iteration. Cache fragmentation decreases. The fragmentation_ratio metric already exposed by kv_cache.residency_report() is the direct measurement.

Verify pass efficiency

The speculative verify pass has O(block_size) cost in the CUDA stub (it loops over draft_len). With real transformer math in Phase 3, verification involves a forward pass over the proposed tokens — cost proportional to block length. ASBS reduces the average verify cost for struggling requests without touching high-acceptance requests, where the large block size is well-justified.

What to measure in the benchmark harness

The existing BenchmarkRunner in bench.py already tracks TTFT, inter-token latency percentiles (p50/p95/p99), acceptance rate, and KV hit rate. ASBS adds two new observables:

Metric Expected with fixed B Expected with ASBS
mean_block_size (high-accept request) fixed at 4 or 8 converges to max_block (8)
mean_block_size (low-accept request) wasteful at 8 converges to min_block (1–2)
KV fragmentation_ratio high under low acceptance reduced — fewer wasted drafts
kv_hit_rate under pressure degrades as draft pages evict committed stabilizes — B capped at 2
p95 inter-token latency dominated by rejected blocks smoothed by adaptive sizing

The block_size_history: list[int] field added to DecodeStepTrace captures the per-iteration block size trajectory. Plot it for a batch of requests with varied acceptance profiles: the expected signature is a high block-size plateau early, potential mid-decode dip for creative content, and a sharp collapse to minimum near generation end and under pressure events. That signature, visible in the Python simulator before a single CUDA line changes, validates the policy.

Why This Only Works in a Mega-Kernel

In a host-launched design, acceptance rates are available to the host — it reads them back after each verify pass. But that readback costs a PCIe round-trip, and computing the EMA and choosing the next block size adds CPU-side latency. More critically, the decision happens between launches, which means the GPU is idle while the CPU decides. In a mega-kernel, the EMA update and block size decision execute in a handful of arithmetic instructions at the end of stage_spec_verify, with the GPU never pausing. ASBS has zero overhead in a mega-kernel and non-trivial overhead in a host-launched loop.


How to Build It

ASBS slots into the XL-Persistent-Kernel roadmap between Phase 2B (measurement harness) and Phase 3 (real kernels). It is entirely a control-flow and state change — no real math required. The Python simulator validates it; the CUDA scaffold carries it to hardware.

1

Extend RequestState in state.py

Add ema_acceptance_rate: float = 0.8, current_block_size: int, and block_size_history: list[int]. Initialize current_block_size from the runtime config's block_size. No other state changes needed.

2

Add AdaptiveBlockPolicy and update_block_size() to spec_decode.py

Separate from AcceptancePolicy — they control different things. The update function must be pure (no side effects beyond mutating request) so it can be called directly in tests. Implement the EMA formula and three-tier mapping exactly.

3

Wire update_block_size into DecodeWorker.process()

Call after the verify step (and fallback serial step if triggered). Add is_under_pressure() to KVCache — returns True if live pages exceed 85% of max_pages. Replace the fixed block_size=self.block_size arg in propose() with request.current_block_size.

4

Add request_desc.h fields and update_block_size_device()

Two new fields in RequestDescriptor. One new __device__ __forceinline__ function in stage_spec_verify.cuh. Change stage_decode.cuh to read req->current_block_size. Remove block_size from the mega-kernel signature.

5

Write four targeted tests

Test: EMA rises on sustained acceptance. Test: EMA falls on rejection, block size reaches min_block. Test: KV pressure cap is enforced and never exceeded. Test: single rejection after 10 acceptances does not immediately collapse block size (EMA smoothing).

6

Update BenchmarkRunner and visualize block_size_history

Add mean_block_size and block_size_variance to benchmark output. Plot per-request block size over decode time for a mixed-profile batch. The convergence curves are the key validation artifact for the policy before Phase 3.


ASBS as the Foundation for On-Device Inference Intelligence

ASBS is the first in a family of innovations that become possible when inference control flow moves to the device. The EMA acceptance rate is a per-request signal that the device now owns continuously. What else can be built on it?

Priority-Aware KV Eviction

The KV cache already has priority fields. The eviction policy is currently pure LRU. A request with low EMA acceptance rate is a worse cache citizen than one with high acceptance — its draft pages are more likely to be discarded. Sorting eviction candidates by (priority ASC, ema_acceptance_rate ASC, lru_order) makes the eviction policy speculative-decode-aware with one additional field read per eviction candidate.

Request Grafting

Multiple requests sharing a long system prompt can share KV pages for that prefix instead of independently re-computing it. A new graft_source_id field in RequestDescriptor causes stage_prefill to skip the shared prefix and borrow the source request's pages directly. Reference-counting on top of the existing pinning mechanism prevents eviction of borrowed pages until all borrowers complete. This reduces prefill latency and cache pressure for the common system-prompt-heavy workload.

The Trajectory

Each of these innovations follows the same pattern as ASBS: a small on-device state addition, an inline update in one stage helper, and a corresponding Python change that preserves the simulator-as-specification guarantee. The architecture XL-Persistent-Kernel is building — clean stage helpers, explicit request descriptors, typed Python↔CUDA interface — is designed to absorb exactly these kinds of incremental innovations without architectural surgery.

"The persistent mega-kernel doesn't just eliminate launch overhead. It creates a new surface for inference intelligence — signals and decisions that previously required host round-trips now execute in nanoseconds, inline, on the device that is doing the work."

ASBS is the proof of concept for that surface. It costs a float and an int per request descriptor, a handful of arithmetic instructions per verify pass, and zero host overhead. In return, it makes the speculative decode loop self-tuning for the first time — adapting to the statistical reality of each individual request as it unfolds, entirely on the GPU, at the speed of computation rather than the speed of PCIe.