Conceptual sketch

Diffusion-Style Token Refinement on a Persistent Mega-Kernel

Mapping the parallel refinement pipeline of diffusion-based language models to a single GPU-resident control loop — same architectural pattern, different generation paradigm.

Filed under cuda/examples/ · Built for systems discussion, not reproduction

Why diffusion-style generation?

Autoregressive decoding dominates LLM serving today: generate one token at a time, left to right, each conditioned on all previous tokens. But a parallel body of research explores diffusion-based language modeling, where the model starts with a canvas of random tokens and iteratively refines them in parallel — denoising toward a coherent sequence. This is the approach behind models like Diffusion-LM, SSD-LM, and Google's DiffusionGemma.

From a systems perspective, the difference is stark. Autoregressive generation is inherently serial: each step emits one token, and that token depends on every token before it. Diffusion-style generation is parallel within a step: every refinement step processes the entire canvas, although the steps themselves still run one after another. That is why the orchestration problem remains the same: do we drive these steps from a host-side loop, launching a new kernel each time, or do we keep the control loop on the GPU and eliminate the round-trips?

"The generation paradigm changes, but the orchestration bottleneck is identical. That makes diffusion-style inference a natural second application for the persistent mega-kernel thesis."

This sketch (cuda/examples/diffusion_refinement_megakernel_sketch.cu) is a conceptual mapping — not a working diffusion model, but a structural demonstration of how a persistent mega-kernel adapts to a parallel refinement pipeline.


The resident refinement loop

The kernel follows the same pattern as the repo's main xl_persistent_megakernel.cu: one thread block per request, a persistent while loop on device, and a shared shutdown flag. But instead of a linear prefill → decode → verify pipeline, the stages are mapped to diffusion-style refinement operations:

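// Excerpt: r, canvas, and state belong to this block's request.
// idx is assumed to be the block index (blockIdx.x), and shutdown a
// volatile int* in global memory so every iteration re-reads it.
while (!(*shutdown) && !r->done) {
    denoise_canvas_step(r, canvas);      // 1. refine every canvas token
    update_confidence_mask(r, canvas);   // 2. score every position
    verify_or_resample(r, canvas);       // 3. re-predict weak positions
    commit_ready_tokens(r, canvas);      // 4. count stable positions
    update_resident_state(r, state);     // 5. persist step metadata

    // Per-request exit: step budget spent or the whole canvas is stable.
    if (r->step >= r->max_steps || r->stable_tokens >= r->canvas_len)
        r->done = 1;

    __syncthreads();

    // Block 0 raises the shared shutdown flag once every request is done.
    if (idx == 0 && all_done(reqs, num_reqs))
        *shutdown = 1;
}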

Each stage is a separate device helper with deterministic fake math — the logic is a placeholder, but the control flow is the point.
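To make the exit conditions concrete, here is a minimal host-side C++ stand-in for the loop. It is a sketch under assumptions, not the code from the repo: the struct fields follow the names in the excerpt, run_all_stages is a hypothetical placeholder that collapses the five stages into one call, and the rule that one more token becomes stable per step is invented purely for illustration.

```cpp
// Hypothetical request layout: field names follow the loop excerpt,
// but the real struct in the sketch may differ.
struct DiffusionRequest {
    int step = 0;
    int max_steps = 0;
    int canvas_len = 0;
    int stable_tokens = 0;
    int done = 0;
};

// Placeholder for the five stages (assumption): advance the step counter and
// mark one more token stable per step until the canvas is full.
inline void run_all_stages(DiffusionRequest& r) {
    r.step += 1;
    if (r.stable_tokens < r.canvas_len) r.stable_tokens += 1;
}

// Host-side stand-in for the resident loop. Runs until the request is done or
// the shutdown flag is set, and returns the number of iterations executed.
inline int run_resident_loop(DiffusionRequest& r, const int& shutdown) {
    int iterations = 0;
    while (!shutdown && !r.done) {
        run_all_stages(r);
        // Same exit test as the excerpt: budget spent or canvas fully stable.
        if (r.step >= r.max_steps || r.stable_tokens >= r.canvas_len)
            r.done = 1;
        ++iterations;
    }
    return iterations;
}
```

With max_steps = 10 and canvas_len = 4, the loop stops after four iterations because the canvas becomes stable first. With max_steps = 3 and canvas_len = 100, it stops after three because the step budget runs out.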


Stage by stage

Each refinement step runs five stages, and each stage maps to one device function. The naming convention mirrors the diffusion literature while the structure mirrors the repo's existing mega-kernel pattern.

1. denoise_canvas_step

Updates every token on the canvas toward a less-noisy state. In a real implementation this would invoke a diffusion model forward pass over the entire canvas. Here, each token is deterministically replaced by (token + request_id + step) % 32000, a placeholder with no modeling meaning that preserves the loop structure.

denoise_canvas_step(DiffusionRequest* r, Canvas* c)
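The placeholder update can be written out as a small host-side C++ function. This is an illustrative mirror of the formula quoted above, not the device helper itself: the function name, the std::vector canvas, and the assumption that tokens are non-negative ids below 32000 are all mine.

```cpp
#include <cstdint>
#include <vector>

constexpr int32_t kVocabSize = 32000;  // modulus quoted in the text

// Applies (token + request_id + step) % 32000 to every canvas position.
// Assumes token ids, request_id, and step are non-negative.
inline void denoise_canvas_step_host(std::vector<int32_t>& tokens,
                                     int32_t request_id, int32_t step) {
    for (int32_t& t : tokens) {
        t = (t + request_id + step) % kVocabSize;
    }
}
```

For example, with request_id = 1 and step = 1, token 31999 wraps around to 1 and token 100 becomes 102. In a device version each thread would presumably handle a strided subset of positions instead of looping over all of them.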

2. update_confidence_mask

Assigns a confidence score to each canvas position. In a diffusion pipeline, confidence corresponds to how close a token is to its final predicted value — low-confidence positions may need more refinement steps. Here, confidence is a simple linear ramp: step / max_steps, capped at 0.99 beyond 95% progress.

update_confidence_mask(DiffusionRequest* r, Canvas* c)
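The description of the ramp leaves some room for interpretation. The sketch below takes one reading: confidence equals step / max_steps until progress exceeds 0.95, after which it is pinned to 0.99. The function names, the guard for a non-positive max_steps, and the choice to give every position the same value are assumptions, not details taken from the source file.

```cpp
#include <vector>

// One reading of the ramp: linear in progress, pinned to 0.99 once
// progress exceeds 95%. Returns 0 for a non-positive step budget.
inline float confidence_at(int step, int max_steps) {
    if (max_steps <= 0) return 0.0f;
    const float progress =
        static_cast<float>(step) / static_cast<float>(max_steps);
    return progress > 0.95f ? 0.99f : progress;
}

// Writes the same confidence into every canvas position (assumption: the
// placeholder ramp has no per-position component).
inline void update_confidence_mask_host(std::vector<float>& confidence,
                                        int step, int max_steps) {
    const float c = confidence_at(step, max_steps);
    for (float& slot : confidence) slot = c;
}
```

Under this reading every position crosses the 0.5 and 0.9 thresholds at the same step, so the later stages switch behavior for the whole canvas at once.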

3. verify_or_resample

Positions whose confidence is below 0.5 are resampled with a deterministic pseudorandom function. This mirrors the diffusion process where low-confidence tokens are re-predicted by the model. In a speculative-decoding context, this stage is analogous to rejection sampling against a draft proposal.

verify_or_resample(DiffusionRequest* r, Canvas* c)
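A host-side C++ sketch of the resampling rule follows. The 0.5 threshold comes from the text; the mixing function pseudo_token is hypothetical, since the source does not show which deterministic function the sketch uses, and the 32000 bound is borrowed from the denoise placeholder.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t kVocabSize = 32000;   // bound borrowed from the denoise stage
constexpr float kResampleBelow = 0.5f;   // threshold quoted in the text

// Hypothetical deterministic mix of (request, step, position). Unsigned
// arithmetic wraps on overflow, so the result is well defined.
inline uint32_t pseudo_token(uint32_t request_id, uint32_t step, uint32_t pos) {
    uint32_t h = pos * 2654435761u + step * 40503u + request_id;
    h ^= h >> 15;
    return h % kVocabSize;
}

// Replaces every position whose confidence is below 0.5 and returns how many
// positions were resampled. Positions at or above 0.5 are left untouched.
inline int verify_or_resample_host(std::vector<uint32_t>& tokens,
                                   const std::vector<float>& confidence,
                                   uint32_t request_id, uint32_t step) {
    int resampled = 0;
    for (std::size_t i = 0; i < tokens.size() && i < confidence.size(); ++i) {
        if (confidence[i] < kResampleBelow) {
            tokens[i] = pseudo_token(request_id, step, static_cast<uint32_t>(i));
            ++resampled;
        }
    }
    return resampled;
}
```

Because pseudo_token depends only on its arguments, running the same request twice produces the same canvas, which keeps the placeholder reproducible without any random number generator state.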

4. commit_ready_tokens

Tokens with confidence above 0.9 are counted as stable/committed. This is the diffusion equivalent of committing accepted speculation tokens to the KV cache. When the number of stable tokens equals the full canvas length, the request is done.

commit_ready_tokens(DiffusionRequest* r, Canvas* c)
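The commit rule reduces to a count over the confidence mask. The following host-side C++ sketch assumes the mask is a plain array of floats and uses hypothetical function names; only the 0.9 threshold and the "stable count reaches canvas length" test come from the text.

```cpp
#include <vector>

constexpr float kCommitAbove = 0.9f;  // threshold quoted in the text

// Counts positions whose confidence is strictly above 0.9.
inline int count_stable_tokens(const std::vector<float>& confidence) {
    int stable = 0;
    for (float c : confidence) {
        if (c > kCommitAbove) ++stable;
    }
    return stable;
}

// Mirrors the loop's exit test: the canvas is stable once the stable count
// reaches the canvas length.
inline bool canvas_is_stable(const std::vector<float>& confidence) {
    return count_stable_tokens(confidence) >=
           static_cast<int>(confidence.size());
}
```

Note that the comparison is strict, so a position sitting exactly at 0.9 is not yet counted as committed. On device, the count would more likely be a block-level reduction than a serial loop.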

5. update_resident_state

Persists the current step number to the GPU-resident metadata array — matching the update_resident_state pattern from the autoregressive kernel. This state is available for host inspection or multi-GPU coordination without additional kernel launches.

update_resident_state(DiffusionRequest* r, ResidentState* s)

Autoregressive vs. diffusion mapping

The structural symmetry between the two generation paradigms is visible when you compare their stage pipelines:

| Autoregressive (main kernel) | Diffusion (this sketch) | Role |
|---|---|---|
| prefill | denoise | Generate or refine the next state of the sequence |
| decode | update_confidence | Evaluate current state quality |
| spec_verify | verify_or_resample | Accept, reject, or re-predict low-quality positions |
| commit | commit_ready_tokens | Mark positions as finalized |
| kv state update | update_resident_state | Persist request-level metadata on device |

The pipeline shapes differ — autoregressive stages are serial (one token at a time), diffusion stages are parallel (the whole canvas) — but the resident loop architecture is identical. Same shutdown protocol, same block-per-request pattern, same elimination of CPU round-trips between stages.


Key design elements

Block-per-request topology

Each thread block claims one DiffusionRequest and its associated Canvas. This matches the autoregressive kernel's block-per-request layout exactly. The total number of blocks equals the batch size, and each block loops until its request reaches max_steps or the entire canvas is stable.

Resident shutdown with block-0 responsibility

Block 0 checks all_done() after every iteration and sets the shared shutdown flag. This is the same pattern used in xl_persistent_megakernel.cu — one clean coordination primitive, no host-side polling, no atomic arbitration across blocks.
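The all_done check itself is a linear scan over the per-request done flags. The sketch below is a host-side C++ mirror with the same call shape as the loop excerpt; the single-field struct is a stand-in for whatever the real DiffusionRequest contains.

```cpp
// Stand-in request type: only the done flag matters for this check.
struct DiffusionRequest {
    int done = 0;
};

// Returns true when every request has set its done flag.
// An empty batch counts as done.
inline bool all_done(const DiffusionRequest* reqs, int num_reqs) {
    for (int i = 0; i < num_reqs; ++i) {
        if (!reqs[i].done) return false;
    }
    return true;
}
```

In device code the flags are written by other blocks, so the reads would likely need volatile or atomic loads to observe those writes. The excerpt also leaves open what happens when block 0's own request finishes before the others: as written, block 0 would leave the loop at that point and stop checking.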

128:1 reduction. If a diffusion refinement takes 128 steps and each step is launched separately from the host, that's 128 kernel launches and 128 host-device synchronizations. With the persistent mega-kernel, it's one launch and one synchronization — regardless of step count. The orchestration savings grow with the number of refinement steps.
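The arithmetic behind that ratio is simple enough to write down. The helper names below are hypothetical, and the kernels_per_step parameter is an added assumption that covers the case where a host-driven pipeline launches each stage as its own kernel; with one kernel per step it reproduces the figure in the text.

```cpp
// Launch and synchronization counts for one request.
struct OrchestrationCost {
    long launches;
    long syncs;
};

// Host-driven loop: one launch per kernel per step, one sync per step.
inline OrchestrationCost host_driven_cost(long steps, long kernels_per_step) {
    return {steps * kernels_per_step, steps};
}

// Persistent mega-kernel: one launch and one sync, whatever the step count.
inline OrchestrationCost persistent_cost() {
    return {1, 1};
}

// Ratio of host-driven launches to persistent launches.
inline long launch_reduction(long steps, long kernels_per_step) {
    return host_driven_cost(steps, kernels_per_step).launches /
           persistent_cost().launches;
}
```

For 128 steps with one kernel per step the ratio is 128. If each of the five stages were a separate kernel, the same request would need 640 launches on the host-driven path.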

Fake math by design

All arithmetic is deterministic and free of model state: no random number generator, no model weights, no attention mechanism. Even the resampling in verify_or_resample is a fixed function of its inputs. The point is to demonstrate the control-flow scaffold without the cost of implementing a model. A real diffusion model would plug real math into each stage without changing the loop structure, the shutdown protocol, or the block topology.


What this is not

This file is not an implementation of DiffusionGemma.
This file is not compatible with Google's DiffusionGemma implementation.
This file is not a working diffusion language model.

It is a systems-level mapping sketch — a conceptual demonstration of how the persistent mega-kernel pattern (one resident GPU loop, many logical stages) applies to diffusion-style parallel token refinement. All math is fake. The purpose is structural comparison and architectural discussion.

The common thesis

Whether generation is autoregressive (one token at a time) or diffusion-based (refine the whole canvas in parallel), the orchestration problem is the same: how many host round-trips does your inference pipeline require per request, and can you eliminate them?

The persistent mega-kernel says: move the control loop to the GPU. Keep the device resident with a persistent while-loop, sequence stages as inline device functions, and communicate coordination state (done flags, shutdown) through device-visible memory. The host launches once and synchronizes once — regardless of how many stages or steps the pipeline requires.

"The generation paradigm changes. The orchestration savings do not."

This is the core insight that makes the sketch worth reading alongside the main repo. The autoregressive and diffusion mappings are two faces of the same architectural decision: keep the loop on device, and let the host return to doing what it does best — scheduling, memory management, and high-level coordination — rather than micro-managing every step of generation.