LLM serving has a launch overhead problem
Every token a large language model generates costs two things: the compute to actually run the transformer, and the orchestration overhead to ask it to do so. At small batch sizes — the regime that matters most for interactive inference — orchestration overhead can become a first-order latency term.
The conventional decode loop looks like this:
// CPU controls every token step
for each token:
launch_kernel(attention) // ~5–10 µs overhead
cudaDeviceSynchronize() // host stalls, waits for GPU
launch_kernel(projection) // another 5–10 µs
cudaDeviceSynchronize()
launch_kernel(sampling) // another 5–10 µs
cudaDeviceSynchronize()
inspect_result() // PCIe read back to host
For a 128-token response, that's hundreds of kernel launches and synchronizations. Each one incurs host-device round-trip latency, PCIe traffic, and memory fence costs. The GPU sits idle while the CPU decides what to do next.
What if the GPU never came home?
The insight behind GPU Resident Inference Lab — and the production systems that inspired it, including Mirage/MPK (OSDI '25) and TileRT — is that the token loop belongs on the device, not the host.
"Launch once. Let the GPU own the loop. The host just polls for completion."
A persistent mega-kernel is a single GPU kernel launch that never returns to the host until all requests are done. Instead of the CPU driving each decode step, the kernel carries an internal loop: it prefills, decodes, speculatively verifies draft tokens, commits accepted output, and advances to the next iteration — all on-device, with no host involvement until the work is complete.
// Host launches once. GPU owns the rest.
xl_persistent_megakernel<<<N, block>>>(
requests, N, kv_table, draft_tokens,
&shutdown_flag, max_iterations, block_size
);
// ↑ One launch. One sync. N tokens × M requests.
// Inside the kernel — device loop, no host round-trips:
for (int iter = 0; !(*shutdown_flag) && iter < max_iterations; iter++) {
stage_prefill(req, &kv_table); // PREFILL_READY → DECODE_READY
stage_decode(req, draft, block_size); // propose draft block
stage_spec_verify(req, draft); // accept/reject prefix
stage_commit(req, &kv_table); // commit accepted tokens
}
The entire inference pipeline — scheduling, prefill, decode, speculative verification, KV page management,
commit — becomes a single resident loop on the GPU. The orchestration overhead collapses from
O(tokens) launches to exactly one.
Architecture flow
CPU request submit
↓
Request descriptors (device memory)
↓
Persistent GPU mega-kernel
↓
┌─────────────────────────────────┐
│ stage_prefill() inline │
│ stage_decode() inline │
│ stage_spec_verify() inline │
│ stage_commit() inline │
│ stage_kv() inline │
└─────────────────────────────────┘
↓
Completion queue (device memory)
↓
CPU polls completion
Execution model comparison
| Path | Host launches | Host syncs | Control owner | Execution model |
|---|---|---|---|---|
| Baseline decode | O(tokens) | O(tokens) | CPU | Repeated kernel launches |
| Persistent mega-kernel | 1 | 1 (final) | GPU | Resident control loop |
Building the right scaffold first
GPU Resident Inference Lab takes a principled approach: before writing real transformer math in CUDA, build the control-flow skeleton precisely. Get the state machine right. Know exactly which buffers are authoritative, when draft tokens become committed tokens, and what must remain on-device in a real persistent kernel.
The repo is organized in two layers that mirror how the final system will be built:
Layer 1 — Python CPU simulator
A complete, runnable Python simulator in src/megakernel_lab/ that models the full
control flow without touching CUDA. This is not a throwaway prototype — it's a correctness specification.
The simulator implements PrefillWorker and DecodeWorker
with explicit handoff, a paged KV-cache with LRU eviction and pinning, a speculative block proposer and verifier,
and a CPUStubBackend that satisfies the same AbstractKernelBackend
interface that future CUDA kernels will implement. Swap the stub for real GPU kernels and the runtime doesn't change.
def evict_lru(self, n_pages: int) -> list[int]:
"""Draft pages evicted before committed. Pinned pages never evicted."""
evicted: list[int] = []
# First pass: prefer draft (speculative, discardable)
for page_id in list(self._lru.keys()):
if len(evicted) >= n_pages: break
page = self._pages[page_id]
if page.pinned or not page.is_draft: continue
self._evict_single_page(page_id)
evicted.append(page_id)
# Second pass: committed pages if still needed
for page_id in list(self._lru.keys()):
if len(evicted) >= n_pages: break
page = self._pages[page_id]
if page.pinned or page.is_draft: continue
self._evict_single_page(page_id)
evicted.append(page_id)
return evicted
The KV cache distinguishes between committed pages (finalized token data, protected)
and draft pages (speculative, discardable on rejection). Active-decode pages are pinned and
never evicted. The residency_report() method exposes live bytes, pinned bytes,
evicted bytes, and fragmentation ratio — the same metrics a CUDA implementation will eventually need to
expose to a host-side memory manager.
Layer 2 — CUDA control-flow scaffold
In cuda/, the repo builds the GPU-side skeleton. No real transformer math yet
— token generation is a deterministic offset formula — but the architectural shape is correct.
The stage helpers (stage_prefill.cuh, stage_decode.cuh,
stage_spec_verify.cuh, stage_commit.cuh,
stage_kv.cuh) are __forceinline__ __device__ functions —
they are not separately launched kernels. They compile inline into the one persistent mega-kernel. This is the
critical architectural choice: many logical stages, one resident GPU loop.
The request lifecycle maps to 9 device-side states with 6 flag bits. The KV page table tracks 5 page states. Every state transition that the Python simulator validated is mirrored exactly at the CUDA level.
Measurement harness
host_launcher.cpp is the measurement centerpiece. It runs both the baseline
host-launched path and the persistent mega-kernel path, times them with CUDA events, and exports a CSV
with host_kernel_launches, host_synchronizations,
elapsed_ms, and tokens_per_second. Run it
with make cuda-smoke — no GPU required; the target gracefully skips if
nvcc is absent.
Baseline: host_kernel_launches = 128 | host_synchronizations = 128
Mega-kernel: host_kernel_launches = 1 | host_synchronizations = 1
The first measurable win is not model FLOPs. It's the 128:1 reduction in orchestration overhead.
What is real today vs. future
| Component | Today | Future |
|---|---|---|
| Python simulator | Lifecycle correctness model | Scheduling experiments |
| CUDA mega-kernel | Fake deterministic control flow | Real fused inference path |
| Stage helpers | Inline logical stages | Real prefill/decode/verify logic |
| KV metadata | Fake page lifecycle | Real KV tensors and movement |
| CUDA measurement | Launch/sync counts | Profiler-backed latency analysis |
Draft, verify, commit — on-device
One reason the persistent kernel architecture is particularly compelling is its interaction with speculative decoding. In a conventional host-driven loop, running a draft model and then verifying against the target model still requires multiple launch/sync cycles per token block. In a mega-kernel, the entire draft → verify → commit loop is a tight on-device cycle.
The simulator's DraftBlockProposer generates a block of candidate tokens.
The SpeculativeVerifier computes an acceptance mask. Accepted tokens are
committed to the KV cache; rejected tokens trigger either a fallback to serial decode or a fresh draft.
All of this is a single tight loop — no host intervention between proposal and commit.
The Python benchmark harness sweeps batch sizes from 1 to 32 and block sizes from 1 to 8, reporting
TTFT, inter-token latency percentiles (p50/p95/p99), and speculative acceptance rate. The
kv_pressure mode intentionally undersizes the page pool to drive evictions
and measure the fragmentation cost. These baselines will anchor Phase 3 when real transformer math arrives.
Roadmap to a real fused kernel
The repo follows a deliberate phase structure. Each phase de-risks the next by validating the state machine at a lower cost before committing to harder implementation work.
CPU control-flow simulator
Full Python runtime with prefill/decode workers, paged KV cache, speculative block flow, backend abstraction, memory accounting, benchmark harness, and 25+ tests.
CUDA control-flow scaffold
Persistent mega-kernel with fake math, request descriptors, KV page table, all 6 stage helpers, baseline comparison kernel, CLI measurement harness with CUDA event timing.
Measured orchestration overhead
Sweep harness across (requests × tokens × draft_len) configs, CSV export with launch/sync reduction ratios, speedup column, make cuda-bench targets, summarize script.
NVTX / profiler visibility
NVTX range annotations around baseline loop and mega-kernel launch. Nsight Systems trace documentation showing one large range (mega-kernel) vs many small ranges (baseline).
Real fused decode/verify kernels
Replace stubs with actual fused attention, projection, sampling, KV tensors, speculative verification, and continuous batching.
Multi-GPU / NVLink / communication overlap
Tensor and pipeline parallelism, dynamic request admission, NVLink communication overlap, load balancing across devices.
Deep dives
Detailed posts exploring specific ideas in the repository.
Diffusion-Style Token Refinement on a Persistent Mega-Kernel
A stage-by-stage walkthrough of cuda/examples/diffusion_refinement_megakernel_sketch.cu — mapping the parallel refinement pipeline of diffusion-based language models to a single GPU-resident control loop. Five-stage breakdown, autoregressive comparison table, and the common orchestration thesis.
Context: the 1T-class serving problem
For 1T-class models, especially sparse or MoE systems, throughput is not limited only by FLOPs. It is also limited by orchestration: token-by-token launch overhead, fragmented decode stages, KV-cache residency, inter-GPU communication, and speculative verification/commit overhead. A persistent mega-kernel is one execution-control technique for pushing these systems toward 1K+ tokens/sec when combined with MoE, quantization, speculative decoding, paged KV cache, continuous batching, and multi-GPU communication overlap.
Recent mega-kernel research such as Mirage Persistent Kernel shows that this pattern is not theoretical: compiling multi-GPU model inference into a single mega-kernel can reduce end-to-end latency by reducing host orchestration and improving cross-stage scheduling. Related work such as TileRT demonstrates production-adjacent implementations of this approach.
"Before writing a giant CUDA kernel, be precise about which buffers are authoritative, when draft tokens become committed tokens, and what state must remain on device. That control flow is what this repo captures first."
That discipline — get the state machine right before spending cycles on the hardware path — is what makes this codebase worth following. The Python simulator is not a stepping stone to be discarded; it's a living correctness specification. Every behavior the CUDA scaffold implements was validated in Python first.
"LLM inference is not just a math problem anymore. At extreme scale, it becomes a scheduling, residency, and orchestration problem. GPU Resident Inference Lab explores what happens when that orchestration moves from the CPU into a persistent GPU-resident execution loop."