Follow-up systems note · v0.2

From Doorbells
to Rollouts

How gb300-rl-runtime v0.2 evolved from a close-to-metal idea into a measurable C/CUDA runtime skeleton for RL inference.

Core thesis

RL inference at scale is not just token serving. It is a state-machine runtime for generating, scoring, storing, and feeding back experience — while keeping accelerators continuously useful.

Rollout state machine §1
Six-queue pipeline §2
Command rings & fast path §3
Copy-on-write prefix KV §4
Tracing & latency taxonomy §5
Control-plane tax §6

rollout lifecycle

FREE · slot available

PREFILL_READY ↓ context loaded

DECODING ↓ token loop

REWARD_PENDING ↓ scoring boundary

TRAJECTORY_READY ↓ appending result

DONE · complete

Why this runtime exists

The per-token path must be protected.

The original premise was direct: for close-to-metal RL inference at GB300 scale, every per-token step should be free of page faults, malloc/free, syscalls, scheduler wakeups, CPU round-trips, and KV migration.

Traditional serving

Request → response

Inference is framed as a user request entering prefill and decode, then returning an answer. The unit of work is the request.

RL inference

Rollout → reward → update

The model generates experience, hits reward and verifier boundaries, stores trajectories, and feeds training loops. The unit of work is the rollout.

Stage 1

Rollout generation

Many concurrent streams decode tokens, actions, tool calls, or simulator steps.

→

Stage 2

Reward / verifier

Outputs are scored, validated, filtered, and converted into trajectory data.

→

Stage 3

Policy handoff

Completed trajectories feed updates and generate more inference work.

The hard part is not writing faster CUDA kernels. The hard part is designing the runtime around the kernels so the accelerator is never waiting on host orchestration.

v0.2 architecture

Rollouts become explicit state machines.

A rollout is not just a request. It is an object moving through a lifecycle: admission, decode, reward, trajectory, and completion. Naming the states gives the runtime a concrete unit of scheduling.

FREE

available slot

PREFILL

context loaded

DECODING

token loop active

REWARD

scoring boundary

TRAJECTORY

appending result

DONE

complete

C — rollout_t

typedef enum {
    ROLL_FREE             = 0,
    ROLL_PREFILL_READY,
    ROLL_DECODING,
    ROLL_REWARD_PENDING,
    ROLL_TRAJECTORY_READY,
    ROLL_DONE
} rollout_state_t;

typedef struct {
    uint32_t rollout_id;
    uint32_t state;
    uint32_t kv_arena_id;
    uint32_t kv_offset;
    uint32_t seq_len;
    uint32_t max_tokens;
    uint32_t reward_id;
    uint32_t flags;
} rollout_t;

C — atomic transition

int rollout_transition(rollout_t *r,
                       rollout_state_t from,
                       rollout_state_t to)
{
    uint32_t expected = (uint32_t)from;
    return __atomic_compare_exchange_n(
        &r->state, &expected, (uint32_t)to,
        0, __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}

Design contract: rollout progression is explicit, atomic, and visible to the runtime. That makes scheduling, backpressure, tracing, and KV ownership easier to reason about — and easier to test.

Multi-queue pipeline

Six queues replace one generic scheduler.

v0.2 moves rollouts through six lock-free SPSC queues. RL inference becomes a flow of compact IDs — no heap allocations, no callback dispatch, no Python object lifecycle.

C — rollout_pipeline_t

typedef struct {
    ring_t  free_q;
    ring_t  prefill_q;
    ring_t  decode_q;
    ring_t  reward_q;
    ring_t  trajectory_q;
    ring_t  done_q;
} rollout_pipeline_t;

Visibility

Where is the work?

The runtime can answer which queue owns any rollout at any point in time.

Backpressure

Which stage is full?

Queue depth becomes a first-class signal for admission control and scheduling decisions.

Measurement

Where is the latency?

Tracepoints can measure time between queue boundaries — revealing exactly which stage hides p99.

flow control

Fast pipelines still need brakes. v0.2 adds credit-based backpressure so decode cannot outrun reward, and reward cannot outrun trajectory storage.

C — pipeline_credits_t

typedef struct {
    uint32_t decode_credits;
    uint32_t reward_credits;
    uint32_t trajectory_credits;
    uint32_t kv_blocks_free;
} pipeline_credits_t;

SCHED_FIFO

Baseline fairness

Simple and predictable. May not minimise memory pressure or p99 latency under skewed workloads.

Shortest remaining

Finish short work first

Reduces queue residency and can improve median completion latency under admission pressure.

Prefix sharing

Exploit KV locality

Prioritise rollouts that share prefix state, reducing duplicate KV pressure and increasing effective throughput.

Runtime fast path

Command rings publish work orders, not per-token instructions.

The CPU should not wake up and schedule every token. It should post compact descriptors and let persistent GPU workers advance execution without further host involvement.

C — doorbell fast path (3 lines)

q->entries[tail & RING_MASK] = desc;
__atomic_store_n(&q->tail, tail + 1, __ATOMIC_RELEASE);
*(volatile uint32_t *)gpu_doorbell = tail + 1;

The CPU writes an order ticket, places it on the queue, rings the bell, and walks away. The accelerator side picks it up on its own schedule.

C — cacheline-aware SPSC ring

#define RING_SIZE  (1u << 16)
#define RING_MASK  (RING_SIZE - 1)

struct ring {
    _Alignas(64) volatile uint32_t head;
    char pad1[60];

    _Alignas(64) volatile uint32_t tail;
    char pad2[60];

    struct rollout_desc entries[RING_SIZE];
};

Why the padding matters: head and tail must live on separate cachelines. A false-sharing miss between producer and consumer would add ~200 cycles per operation — defeating the entire purpose of a lock-free ring.

Avoid

CPU micromanagement

Wake CPU per token
Schedule every decode step from host
Allocate task objects at runtime
Enter kernel repeatedly via eventfd
Move KV ownership across threads

Prefer

Descriptor ownership

Post compact rollout descriptors once
Publish with acquire/release ordering
Ring GPU doorbell from coherent memory
Persistent worker drains queue autonomously
Completions observed in batches, not per-token

Memory runtime

Copy-on-write prefix KV makes rollouts memory-aware.

RL workloads routinely create many rollouts from the same prompt. Duplicating the full prefix KV cache for every branch wastes memory proportional to rollout count. v0.2 models a shared prefix plus per-rollout delta KV.

shared prompt prefix KV rollout A — delta KV rollout B — delta KV rollout C — delta KV rollout D — delta KV

C — kv_prefix_t / kv_branch_t

typedef struct {
    uint32_t prefix_id;
    uint32_t refcnt;
    uint32_t kv_arena_id;
    uint64_t kv_offset;
    uint32_t token_len;
    uint32_t flags;
} kv_prefix_t;

typedef struct {
    uint32_t rollout_id;
    uint32_t prefix_id;
    uint64_t delta_kv_offset;
    uint32_t delta_len;
} kv_branch_t;

Why this matters: the runtime is no longer treating the KV cache as anonymous memory. It understands that many rollout branches share a prefix and can schedule around that fact — batching rollouts with the same prefix, delaying eviction of hot prefixes, and allocating only delta storage per branch.

Measurement

Tracing turns the thesis into a latency taxonomy.

Counters show totals. Traces show where time goes. If the claim is "remove control-plane overhead," the runtime needs timestamps at stage boundaries — not just aggregate throughput.

Trace boundary	Question answered	Why it matters
descriptor posted → descriptor consumed	Is queue visibility slow?	Measures command-ring latency end-to-end.
reward posted → reward scored	Is reward the bottleneck?	Shows verifier backlog and p99 scoring delay.
decode complete → trajectory done	Is logging in the critical path?	Prevents trajectory storage from silently stalling decode.
completion visible → host observed	Is polling too coarse?	Measures completion observation delay on the CPU side.

C — trace_event_t

typedef struct {
    uint64_t ts_ns;
    uint32_t event_type;
    uint32_t rollout_id;
    uint32_t stage;
} trace_event_t;

The trace ring is separate from the pipeline rings — it is never in the hot path. Events are collected off-thread and reported as p50/p90/p99 pair latencies after each benchmark run.

Benchmarks

v0.2 makes the claims testable.

The milestone is not beating production inference stacks. The milestone is turning "avoid CPU round-trips" into queues, states, counters, guards, traces, and benchmarks that can be measured and falsified.

Benchmark	What it proves	Key signals
`bench-pipeline`	End-to-end rollout flow through all six queues	rollouts/sec, tokens/sec, queue depth
`bench-trace`	Latency between runtime stage boundaries	p50 / p90 / p99 for 8 event pairs
`bench-cow`	Memory savings from shared-prefix KV	memory saved %, alloc time, refcount overhead
`bench-tax`	Control-plane overhead comparison	eventfd vs userspace polling vs persistent worker
`bench-all`	Full suite in one command	reproducible snapshot for a given hardware config

Mode A

eventfd / syscall

Simple to implement. Pays full kernel entry and scheduler wakeup cost on every step.

Mode B

Userspace polling

Lower latency. Burns a CPU core and requires careful core ownership to avoid NUMA effects.

Mode C — this runtime

Persistent worker + ring

Keeps execution near the queue. No per-step host orchestration. No kernel entry.

v0.2 turns the idea from "avoid CPU round-trips" into queues, states, counters, guards, traces, and benchmarks that make the claim testable.

Next milestone

v0.3 pushes rollout progression onto the GPU.

The CPU should admit work and observe completions in batches. The GPU-side worker should progress rollouts, update KV, sample tokens, and post reward and completion descriptors — without a CPU round-trip per token.

CUDA — conceptual rollout_worker

__global__ void rollout_worker(
    device_ring_t *decode_q,
    device_ring_t *reward_q,
    device_ring_t *done_q,
    kv_arena_t   *kv)
{
    while (1) {
        rollout_desc_t r;
        if (!device_ring_pop(decode_q, &r)) continue;

        while (r.seq_len < r.max_tokens) {
            int tok = decode_one_token(kv, r.kv_offset, r.seq_len);
            append_kv(kv, r.kv_offset, r.seq_len, tok);
            r.seq_len++;

            if (needs_reward(tok)) {
                device_ring_push(reward_q, r);
                break;
            }
            if (is_done(tok)) {
                device_ring_push(done_q, r);
                break;
            }
        }
    }
}

v0.2

Measurable runtime skeleton

Rollout state, pipeline queues, hot-path guards, metrics, COW KV, reward handoff, tracing, backpressure, scheduling, and benchmarks — all on CPU.

v0.3

GPU-resident progression

Persistent device-side workers move descriptors forward. CPU handles admission, placement, and coarse-grained observation only.

Final takeaway

The runtime is the product.

The point is not simply that C is fast. The point is that at large-scale RL inference, the runtime decides where rollout state lives, who owns the KV cache, when reward work runs, which queues are saturated, which rollouts are scheduled next, and where p99 latency is hiding.

v0.1 established the primitives. v0.2 made the runtime measurable. v0.3 should push rollout progression onto the GPU. That is the path from doorbells to rollouts.

From Doorbellsto Rollouts

The per-token path must be protected.

Request → response

Rollout → reward → update

Rollout generation

Reward / verifier

Policy handoff

Rollouts become explicit state machines.

Six queues replace one generic scheduler.

Where is the work?

Which stage is full?

Where is the latency?

Baseline fairness

Finish short work first

Exploit KV locality

Command rings publish work orders, not per-token instructions.

CPU micromanagement

Descriptor ownership

Copy-on-write prefix KV makes rollouts memory-aware.

Tracing turns the thesis into a latency taxonomy.

v0.2 makes the claims testable.

eventfd / syscall

Userspace polling

Persistent worker + ring

v0.3 pushes rollout progression onto the GPU.

Measurable runtime skeleton

GPU-resident progression

The runtime is the product.

From Doorbells
to Rollouts