deep systems · LLM inference · GPU architecture
// xl-persistent-kernel

One Kernel.
Zero Round-Trips.

How a persistent mega-kernel eliminates the per-token launch tax in large language model serving — from CPU state machine to CUDA scaffold.

GPU Systems
LLM Inference
# Speculative Decode
↗ View on GitHub
Code: github.com/manishklach/gpu-resident-inference-lab

LLM serving has a launch overhead problem

Every token a large language model generates costs two things: the compute to actually run the transformer, and the orchestration overhead to ask it to do so. At small batch sizes — the regime that matters most for interactive inference — orchestration overhead can become a first-order latency term.

The conventional decode loop looks like this:

traditional host-launched decode pseudocode
// CPU controls every token step
for each token:
    launch_kernel(attention)       // ~5–10 µs overhead
    cudaDeviceSynchronize()        // host stalls, waits for GPU
    launch_kernel(projection)      // another 5–10 µs
    cudaDeviceSynchronize()
    launch_kernel(sampling)        // another 5–10 µs
    cudaDeviceSynchronize()
    inspect_result()               // PCIe read back to host

For a 128-token response, that's hundreds of kernel launches and synchronizations. Each one incurs host-device round-trip latency, PCIe traffic, and memory fence costs. The GPU sits idle while the CPU decides what to do next.

Key insight: In some serving configurations, launch and synchronization overhead becomes large enough to affect p50 and p99 latency, especially for small-batch decode. The GPU is fast. The decision loop around it is not.

What if the GPU never came home?

The insight behind GPU Resident Inference Lab — and the production systems that inspired it, including Mirage/MPK (OSDI '25) and TileRT — is that the token loop belongs on the device, not the host.

"Launch once. Let the GPU own the loop. The host just polls for completion."

A persistent mega-kernel is a single GPU kernel launch that never returns to the host until all requests are done. Instead of the CPU driving each decode step, the kernel carries an internal loop: it prefills, decodes, speculatively verifies draft tokens, commits accepted output, and advances to the next iteration — all on-device, with no host involvement until the work is complete.

persistent mega-kernel (xl_persistent_megakernel.cu) CUDA C++
// Host launches once. GPU owns the rest.
xl_persistent_megakernel<<<N, block>>>(
    requests, N, kv_table, draft_tokens,
    &shutdown_flag, max_iterations, block_size
);
// ↑ One launch. One sync. N tokens × M requests.

// Inside the kernel — device loop, no host round-trips:
for (int iter = 0; !(*shutdown_flag) && iter < max_iterations; iter++) {
    stage_prefill(req, &kv_table);       // PREFILL_READY → DECODE_READY
    stage_decode(req, draft, block_size); // propose draft block
    stage_spec_verify(req, draft);        // accept/reject prefix
    stage_commit(req, &kv_table);         // commit accepted tokens
}

The entire inference pipeline — scheduling, prefill, decode, speculative verification, KV page management, commit — becomes a single resident loop on the GPU. The orchestration overhead collapses from O(tokens) launches to exactly one.

Architecture flow

request flow: host → GPU mega-kernel → completion diagram
CPU request submit
        ↓
Request descriptors (device memory)
        ↓
Persistent GPU mega-kernel
        ↓
  ┌─────────────────────────────────┐
  │  stage_prefill()   inline       │
  │  stage_decode()    inline       │
  │  stage_spec_verify() inline     │
  │  stage_commit()    inline       │
  │  stage_kv()        inline       │
  └─────────────────────────────────┘
        ↓
Completion queue (device memory)
        ↓
CPU polls completion

Execution model comparison

Path Host launches Host syncs Control owner Execution model
Baseline decode O(tokens) O(tokens) CPU Repeated kernel launches
Persistent mega-kernel 1 1 (final) GPU Resident control loop

Building the right scaffold first

GPU Resident Inference Lab takes a principled approach: before writing real transformer math in CUDA, build the control-flow skeleton precisely. Get the state machine right. Know exactly which buffers are authoritative, when draft tokens become committed tokens, and what must remain on-device in a real persistent kernel.

Many logical stages, one resident kernel. GPU Resident Inference Lab is not a bag of independent CUDA kernels. The opposite is the point. The repo models prefill, decode, speculative verification, commit, and KV lifecycle management as logical stages inside one persistent GPU mega-kernel. The stage helper files exist for readability, but the execution model is one resident kernel that keeps request state and control flow on the GPU.

The repo is organized in two layers that mirror how the final system will be built:

Layer 1 — Python CPU simulator

A complete, runnable Python simulator in src/megakernel_lab/ that models the full control flow without touching CUDA. This is not a throwaway prototype — it's a correctness specification.

The simulator implements PrefillWorker and DecodeWorker with explicit handoff, a paged KV-cache with LRU eviction and pinning, a speculative block proposer and verifier, and a CPUStubBackend that satisfies the same AbstractKernelBackend interface that future CUDA kernels will implement. Swap the stub for real GPU kernels and the runtime doesn't change.

kv_cache.py — paged KV with memory accounting Python
def evict_lru(self, n_pages: int) -> list[int]:
    """Draft pages evicted before committed. Pinned pages never evicted."""
    evicted: list[int] = []
    # First pass: prefer draft (speculative, discardable)
    for page_id in list(self._lru.keys()):
        if len(evicted) >= n_pages: break
        page = self._pages[page_id]
        if page.pinned or not page.is_draft: continue
        self._evict_single_page(page_id)
        evicted.append(page_id)
    # Second pass: committed pages if still needed
    for page_id in list(self._lru.keys()):
        if len(evicted) >= n_pages: break
        page = self._pages[page_id]
        if page.pinned or page.is_draft: continue
        self._evict_single_page(page_id)
        evicted.append(page_id)
    return evicted

The KV cache distinguishes between committed pages (finalized token data, protected) and draft pages (speculative, discardable on rejection). Active-decode pages are pinned and never evicted. The residency_report() method exposes live bytes, pinned bytes, evicted bytes, and fragmentation ratio — the same metrics a CUDA implementation will eventually need to expose to a host-side memory manager.

Layer 2 — CUDA control-flow scaffold

In cuda/, the repo builds the GPU-side skeleton. No real transformer math yet — token generation is a deterministic offset formula — but the architectural shape is correct.

The stage helpers (stage_prefill.cuh, stage_decode.cuh, stage_spec_verify.cuh, stage_commit.cuh, stage_kv.cuh) are __forceinline__ __device__ functions — they are not separately launched kernels. They compile inline into the one persistent mega-kernel. This is the critical architectural choice: many logical stages, one resident GPU loop.

The request lifecycle maps to 9 device-side states with 6 flag bits. The KV page table tracks 5 page states. Every state transition that the Python simulator validated is mirrored exactly at the CUDA level.

Measurement note: The current CUDA scaffold does not measure real transformer math, model quality, or production LLM throughput. It measures orchestration structure: host launch count, host synchronization count, request lifecycle progress, and the difference between a CPU-driven token loop and one GPU-resident mega-kernel launch. All token generation uses fake deterministic math.

Measurement harness

host_launcher.cpp is the measurement centerpiece. It runs both the baseline host-launched path and the persistent mega-kernel path, times them with CUDA events, and exports a CSV with host_kernel_launches, host_synchronizations, elapsed_ms, and tokens_per_second. Run it with make cuda-smoke — no GPU required; the target gracefully skips if nvcc is absent.

Expected output on GPU hardware:

Baseline: host_kernel_launches = 128  |  host_synchronizations = 128
Mega-kernel: host_kernel_launches = 1  |  host_synchronizations = 1

The first measurable win is not model FLOPs. It's the 128:1 reduction in orchestration overhead.

What is real today vs. future

Component Today Future
Python simulator Lifecycle correctness model Scheduling experiments
CUDA mega-kernel Fake deterministic control flow Real fused inference path
Stage helpers Inline logical stages Real prefill/decode/verify logic
KV metadata Fake page lifecycle Real KV tensors and movement
CUDA measurement Launch/sync counts Profiler-backed latency analysis

Draft, verify, commit — on-device

One reason the persistent kernel architecture is particularly compelling is its interaction with speculative decoding. In a conventional host-driven loop, running a draft model and then verifying against the target model still requires multiple launch/sync cycles per token block. In a mega-kernel, the entire draft → verify → commit loop is a tight on-device cycle.

The simulator's DraftBlockProposer generates a block of candidate tokens. The SpeculativeVerifier computes an acceptance mask. Accepted tokens are committed to the KV cache; rejected tokens trigger either a fallback to serial decode or a fresh draft. All of this is a single tight loop — no host intervention between proposal and commit.

The Python benchmark harness sweeps batch sizes from 1 to 32 and block sizes from 1 to 8, reporting TTFT, inter-token latency percentiles (p50/p95/p99), and speculative acceptance rate. The kv_pressure mode intentionally undersizes the page pool to drive evictions and measure the fragmentation cost. These baselines will anchor Phase 3 when real transformer math arrives.


Roadmap to a real fused kernel

The repo follows a deliberate phase structure. Each phase de-risks the next by validating the state machine at a lower cost before committing to harder implementation work.

Phase 1
complete

CPU control-flow simulator

Full Python runtime with prefill/decode workers, paged KV cache, speculative block flow, backend abstraction, memory accounting, benchmark harness, and 25+ tests.

Phase 2A
in progress

CUDA control-flow scaffold

Persistent mega-kernel with fake math, request descriptors, KV page table, all 6 stage helpers, baseline comparison kernel, CLI measurement harness with CUDA event timing.

Phase 2B
in progress

Measured orchestration overhead

Sweep harness across (requests × tokens × draft_len) configs, CSV export with launch/sync reduction ratios, speedup column, make cuda-bench targets, summarize script.

Phase 2C
planned

NVTX / profiler visibility

NVTX range annotations around baseline loop and mega-kernel launch. Nsight Systems trace documentation showing one large range (mega-kernel) vs many small ranges (baseline).

Phase 3
planned

Real fused decode/verify kernels

Replace stubs with actual fused attention, projection, sampling, KV tensors, speculative verification, and continuous batching.

Phase 4–5
planned

Multi-GPU / NVLink / communication overlap

Tensor and pipeline parallelism, dynamic request admission, NVLink communication overlap, load balancing across devices.


Deep dives

Detailed posts exploring specific ideas in the repository.

Diffusion-Style Token Refinement on a Persistent Mega-Kernel

A stage-by-stage walkthrough of cuda/examples/diffusion_refinement_megakernel_sketch.cu — mapping the parallel refinement pipeline of diffusion-based language models to a single GPU-resident control loop. Five-stage breakdown, autoregressive comparison table, and the common orchestration thesis.


Context: the 1T-class serving problem

For 1T-class models, especially sparse or MoE systems, throughput is not limited only by FLOPs. It is also limited by orchestration: token-by-token launch overhead, fragmented decode stages, KV-cache residency, inter-GPU communication, and speculative verification/commit overhead. A persistent mega-kernel is one execution-control technique for pushing these systems toward 1K+ tokens/sec when combined with MoE, quantization, speculative decoding, paged KV cache, continuous batching, and multi-GPU communication overlap.

Recent mega-kernel research such as Mirage Persistent Kernel shows that this pattern is not theoretical: compiling multi-GPU model inference into a single mega-kernel can reduce end-to-end latency by reducing host orchestration and improving cross-stage scheduling. Related work such as TileRT demonstrates production-adjacent implementations of this approach.

"Before writing a giant CUDA kernel, be precise about which buffers are authoritative, when draft tokens become committed tokens, and what state must remain on device. That control flow is what this repo captures first."

That discipline — get the state machine right before spending cycles on the hardware path — is what makes this codebase worth following. The Python simulator is not a stepping stone to be discarded; it's a living correctness specification. Every behavior the CUDA scaffold implements was validated in Python first.

"LLM inference is not just a math problem anymore. At extreme scale, it becomes a scheduling, residency, and orchestration problem. GPU Resident Inference Lab explores what happens when that orchestration moves from the CPU into a persistent GPU-resident execution loop."