From Doorbells
to Rollouts
How gb300-rl-runtime v0.2 evolved from a close-to-metal idea into a measurable C/CUDA runtime skeleton for RL inference.
RL inference at scale is not just token serving. It is a state-machine runtime for generating, scoring, storing, and feeding back experience — while keeping accelerators continuously useful.
The per-token path must be protected.
The original premise was direct: for close-to-metal RL inference at GB300 scale, every per-token step should be free of page faults, malloc/free, syscalls, scheduler wakeups, CPU round-trips, and KV migration.
Request → response
Inference is framed as a user request entering prefill and decode, then returning an answer. The unit of work is the request.
Rollout → reward → update
The model generates experience, hits reward and verifier boundaries, stores trajectories, and feeds training loops. The unit of work is the rollout.
Rollout generation
Many concurrent streams decode tokens, actions, tool calls, or simulator steps.
Reward / verifier
Outputs are scored, validated, filtered, and converted into trajectory data.
Policy handoff
Completed trajectories feed updates and generate more inference work.
The hard part is not writing faster CUDA kernels. The hard part is designing the runtime around the kernels so the accelerator is never waiting on host orchestration.
Rollouts become explicit state machines.
A rollout is not just a request. It is an object moving through a lifecycle: admission, decode, reward, trajectory, and completion. Naming the states gives the runtime a concrete unit of scheduling.
typedef enum {
ROLL_FREE = 0,
ROLL_PREFILL_READY,
ROLL_DECODING,
ROLL_REWARD_PENDING,
ROLL_TRAJECTORY_READY,
ROLL_DONE
} rollout_state_t;
typedef struct {
uint32_t rollout_id;
uint32_t state;
uint32_t kv_arena_id;
uint32_t kv_offset;
uint32_t seq_len;
uint32_t max_tokens;
uint32_t reward_id;
uint32_t flags;
} rollout_t;
int rollout_transition(rollout_t *r,
rollout_state_t from,
rollout_state_t to)
{
uint32_t expected = (uint32_t)from;
return __atomic_compare_exchange_n(
&r->state, &expected, (uint32_t)to,
0, __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}
Six queues replace one generic scheduler.
v0.2 moves rollouts through six lock-free SPSC queues. RL inference becomes a flow of compact IDs — no heap allocations, no callback dispatch, no Python object lifecycle.
typedef struct {
ring_t free_q;
ring_t prefill_q;
ring_t decode_q;
ring_t reward_q;
ring_t trajectory_q;
ring_t done_q;
} rollout_pipeline_t;
Where is the work?
The runtime can answer which queue owns any rollout at any point in time.
Which stage is full?
Queue depth becomes a first-class signal for admission control and scheduling decisions.
Where is the latency?
Tracepoints can measure time between queue boundaries — revealing exactly which stage hides p99.
Fast pipelines still need brakes. v0.2 adds credit-based backpressure so decode cannot outrun reward, and reward cannot outrun trajectory storage.
typedef struct {
uint32_t decode_credits;
uint32_t reward_credits;
uint32_t trajectory_credits;
uint32_t kv_blocks_free;
} pipeline_credits_t;
Baseline fairness
Simple and predictable. May not minimise memory pressure or p99 latency under skewed workloads.
Finish short work first
Reduces queue residency and can improve median completion latency under admission pressure.
Exploit KV locality
Prioritise rollouts that share prefix state, reducing duplicate KV pressure and increasing effective throughput.
Command rings publish work orders, not per-token instructions.
The CPU should not wake up and schedule every token. It should post compact descriptors and let persistent GPU workers advance execution without further host involvement.
q->entries[tail & RING_MASK] = desc;
__atomic_store_n(&q->tail, tail + 1, __ATOMIC_RELEASE);
*(volatile uint32_t *)gpu_doorbell = tail + 1;
The CPU writes an order ticket, places it on the queue, rings the bell, and walks away. The accelerator side picks it up on its own schedule.
#define RING_SIZE (1u << 16)
#define RING_MASK (RING_SIZE - 1)
struct ring {
_Alignas(64) volatile uint32_t head;
char pad1[60];
_Alignas(64) volatile uint32_t tail;
char pad2[60];
struct rollout_desc entries[RING_SIZE];
};
CPU micromanagement
- Wake CPU per token
- Schedule every decode step from host
- Allocate task objects at runtime
- Enter kernel repeatedly via eventfd
- Move KV ownership across threads
Descriptor ownership
- Post compact rollout descriptors once
- Publish with acquire/release ordering
- Ring GPU doorbell from coherent memory
- Persistent worker drains queue autonomously
- Completions observed in batches, not per-token
Copy-on-write prefix KV makes rollouts memory-aware.
RL workloads routinely create many rollouts from the same prompt. Duplicating the full prefix KV cache for every branch wastes memory proportional to rollout count. v0.2 models a shared prefix plus per-rollout delta KV.
typedef struct {
uint32_t prefix_id;
uint32_t refcnt;
uint32_t kv_arena_id;
uint64_t kv_offset;
uint32_t token_len;
uint32_t flags;
} kv_prefix_t;
typedef struct {
uint32_t rollout_id;
uint32_t prefix_id;
uint64_t delta_kv_offset;
uint32_t delta_len;
} kv_branch_t;
Tracing turns the thesis into a latency taxonomy.
Counters show totals. Traces show where time goes. If the claim is "remove control-plane overhead," the runtime needs timestamps at stage boundaries — not just aggregate throughput.
| Trace boundary | Question answered | Why it matters |
|---|---|---|
| descriptor posted → descriptor consumed | Is queue visibility slow? | Measures command-ring latency end-to-end. |
| reward posted → reward scored | Is reward the bottleneck? | Shows verifier backlog and p99 scoring delay. |
| decode complete → trajectory done | Is logging in the critical path? | Prevents trajectory storage from silently stalling decode. |
| completion visible → host observed | Is polling too coarse? | Measures completion observation delay on the CPU side. |
typedef struct {
uint64_t ts_ns;
uint32_t event_type;
uint32_t rollout_id;
uint32_t stage;
} trace_event_t;
The trace ring is separate from the pipeline rings — it is never in the hot path. Events are collected off-thread and reported as p50/p90/p99 pair latencies after each benchmark run.
v0.2 makes the claims testable.
The milestone is not beating production inference stacks. The milestone is turning "avoid CPU round-trips" into queues, states, counters, guards, traces, and benchmarks that can be measured and falsified.
| Benchmark | What it proves | Key signals |
|---|---|---|
bench-pipeline |
End-to-end rollout flow through all six queues | rollouts/sec, tokens/sec, queue depth |
bench-trace |
Latency between runtime stage boundaries | p50 / p90 / p99 for 8 event pairs |
bench-cow |
Memory savings from shared-prefix KV | memory saved %, alloc time, refcount overhead |
bench-tax |
Control-plane overhead comparison | eventfd vs userspace polling vs persistent worker |
bench-all |
Full suite in one command | reproducible snapshot for a given hardware config |
eventfd / syscall
Simple to implement. Pays full kernel entry and scheduler wakeup cost on every step.
Userspace polling
Lower latency. Burns a CPU core and requires careful core ownership to avoid NUMA effects.
Persistent worker + ring
Keeps execution near the queue. No per-step host orchestration. No kernel entry.
v0.2 turns the idea from "avoid CPU round-trips" into queues, states, counters, guards, traces, and benchmarks that make the claim testable.
v0.3 pushes rollout progression onto the GPU.
The CPU should admit work and observe completions in batches. The GPU-side worker should progress rollouts, update KV, sample tokens, and post reward and completion descriptors — without a CPU round-trip per token.
__global__ void rollout_worker(
device_ring_t *decode_q,
device_ring_t *reward_q,
device_ring_t *done_q,
kv_arena_t *kv)
{
while (1) {
rollout_desc_t r;
if (!device_ring_pop(decode_q, &r)) continue;
while (r.seq_len < r.max_tokens) {
int tok = decode_one_token(kv, r.kv_offset, r.seq_len);
append_kv(kv, r.kv_offset, r.seq_len, tok);
r.seq_len++;
if (needs_reward(tok)) {
device_ring_push(reward_q, r);
break;
}
if (is_done(tok)) {
device_ring_push(done_q, r);
break;
}
}
}
}
Measurable runtime skeleton
Rollout state, pipeline queues, hot-path guards, metrics, COW KV, reward handoff, tracing, backpressure, scheduling, and benchmarks — all on CPU.
GPU-resident progression
Persistent device-side workers move descriptors forward. CPU handles admission, placement, and coarse-grained observation only.
The runtime is the product.
v0.1 established the primitives. v0.2 made the runtime measurable. v0.3 should push rollout progression onto the GPU. That is the path from doorbells to rollouts.