MAN\SH AI / Writings
gb300-rl-runtime
Follow-up systems note · v0.2

From Doorbells
to Rollouts

How gb300-rl-runtime v0.2 evolved from a close-to-metal idea into a measurable C/CUDA runtime skeleton for RL inference.

Core thesis

RL inference at scale is not just token serving. It is a state-machine runtime for generating, scoring, storing, and feeding back experience — while keeping accelerators continuously useful.

rollout lifecycle
FREE · slot available
PREFILL_READY context loaded
DECODING token loop
REWARD_PENDING scoring boundary
TRAJECTORY_READY appending result
DONE · complete
Why this runtime exists

The per-token path must be protected.

The original premise was direct: for close-to-metal RL inference at GB300 scale, every per-token step should be free of page faults, malloc/free, syscalls, scheduler wakeups, CPU round-trips, and KV migration.

Traditional serving

Request → response

Inference is framed as a user request entering prefill and decode, then returning an answer. The unit of work is the request.

RL inference

Rollout → reward → update

The model generates experience, hits reward and verifier boundaries, stores trajectories, and feeds training loops. The unit of work is the rollout.

Stage 1

Rollout generation

Many concurrent streams decode tokens, actions, tool calls, or simulator steps.

Stage 2

Reward / verifier

Outputs are scored, validated, filtered, and converted into trajectory data.

Stage 3

Policy handoff

Completed trajectories feed updates and generate more inference work.

The hard part is not writing faster CUDA kernels. The hard part is designing the runtime around the kernels so the accelerator is never waiting on host orchestration.

v0.2 architecture

Rollouts become explicit state machines.

A rollout is not just a request. It is an object moving through a lifecycle: admission, decode, reward, trajectory, and completion. Naming the states gives the runtime a concrete unit of scheduling.

C — rollout_t
typedef enum {
    ROLL_FREE             = 0,
    ROLL_PREFILL_READY,
    ROLL_DECODING,
    ROLL_REWARD_PENDING,
    ROLL_TRAJECTORY_READY,
    ROLL_DONE
} rollout_state_t;

typedef struct {
    uint32_t rollout_id;
    uint32_t state;
    uint32_t kv_arena_id;
    uint32_t kv_offset;
    uint32_t seq_len;
    uint32_t max_tokens;
    uint32_t reward_id;
    uint32_t flags;
} rollout_t;
C — atomic transition
int rollout_transition(rollout_t *r,
                       rollout_state_t from,
                       rollout_state_t to)
{
    uint32_t expected = (uint32_t)from;
    return __atomic_compare_exchange_n(
        &r->state, &expected, (uint32_t)to,
        0, __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}
Design contract: rollout progression is explicit, atomic, and visible to the runtime. That makes scheduling, backpressure, tracing, and KV ownership easier to reason about — and easier to test.
Multi-queue pipeline

Six queues replace one generic scheduler.

v0.2 moves rollouts through six lock-free SPSC queues. RL inference becomes a flow of compact IDs — no heap allocations, no callback dispatch, no Python object lifecycle.

C — rollout_pipeline_t
typedef struct {
    ring_t  free_q;
    ring_t  prefill_q;
    ring_t  decode_q;
    ring_t  reward_q;
    ring_t  trajectory_q;
    ring_t  done_q;
} rollout_pipeline_t;
Visibility

Where is the work?

The runtime can answer which queue owns any rollout at any point in time.

Backpressure

Which stage is full?

Queue depth becomes a first-class signal for admission control and scheduling decisions.

Measurement

Where is the latency?

Tracepoints can measure time between queue boundaries — revealing exactly which stage hides p99.

flow control

Fast pipelines still need brakes. v0.2 adds credit-based backpressure so decode cannot outrun reward, and reward cannot outrun trajectory storage.

C — pipeline_credits_t
typedef struct {
    uint32_t decode_credits;
    uint32_t reward_credits;
    uint32_t trajectory_credits;
    uint32_t kv_blocks_free;
} pipeline_credits_t;
SCHED_FIFO

Baseline fairness

Simple and predictable. May not minimise memory pressure or p99 latency under skewed workloads.

Shortest remaining

Finish short work first

Reduces queue residency and can improve median completion latency under admission pressure.

Prefix sharing

Exploit KV locality

Prioritise rollouts that share prefix state, reducing duplicate KV pressure and increasing effective throughput.

Runtime fast path

Command rings publish work orders, not per-token instructions.

The CPU should not wake up and schedule every token. It should post compact descriptors and let persistent GPU workers advance execution without further host involvement.

C — doorbell fast path (3 lines)
q->entries[tail & RING_MASK] = desc;
__atomic_store_n(&q->tail, tail + 1, __ATOMIC_RELEASE);
*(volatile uint32_t *)gpu_doorbell = tail + 1;

The CPU writes an order ticket, places it on the queue, rings the bell, and walks away. The accelerator side picks it up on its own schedule.

C — cacheline-aware SPSC ring
#define RING_SIZE  (1u << 16)
#define RING_MASK  (RING_SIZE - 1)

struct ring {
    _Alignas(64) volatile uint32_t head;
    char pad1[60];

    _Alignas(64) volatile uint32_t tail;
    char pad2[60];

    struct rollout_desc entries[RING_SIZE];
};
Why the padding matters: head and tail must live on separate cachelines. A false-sharing miss between producer and consumer would add ~200 cycles per operation — defeating the entire purpose of a lock-free ring.
Avoid

CPU micromanagement

  • Wake CPU per token
  • Schedule every decode step from host
  • Allocate task objects at runtime
  • Enter kernel repeatedly via eventfd
  • Move KV ownership across threads
Prefer

Descriptor ownership

  • Post compact rollout descriptors once
  • Publish with acquire/release ordering
  • Ring GPU doorbell from coherent memory
  • Persistent worker drains queue autonomously
  • Completions observed in batches, not per-token
Memory runtime

Copy-on-write prefix KV makes rollouts memory-aware.

RL workloads routinely create many rollouts from the same prompt. Duplicating the full prefix KV cache for every branch wastes memory proportional to rollout count. v0.2 models a shared prefix plus per-rollout delta KV.

shared prompt prefix KV rollout A — delta KV rollout B — delta KV rollout C — delta KV rollout D — delta KV
C — kv_prefix_t / kv_branch_t
typedef struct {
    uint32_t prefix_id;
    uint32_t refcnt;
    uint32_t kv_arena_id;
    uint64_t kv_offset;
    uint32_t token_len;
    uint32_t flags;
} kv_prefix_t;

typedef struct {
    uint32_t rollout_id;
    uint32_t prefix_id;
    uint64_t delta_kv_offset;
    uint32_t delta_len;
} kv_branch_t;
Why this matters: the runtime is no longer treating the KV cache as anonymous memory. It understands that many rollout branches share a prefix and can schedule around that fact — batching rollouts with the same prefix, delaying eviction of hot prefixes, and allocating only delta storage per branch.
Measurement

Tracing turns the thesis into a latency taxonomy.

Counters show totals. Traces show where time goes. If the claim is "remove control-plane overhead," the runtime needs timestamps at stage boundaries — not just aggregate throughput.

Trace boundary Question answered Why it matters
descriptor posted → descriptor consumed Is queue visibility slow? Measures command-ring latency end-to-end.
reward posted → reward scored Is reward the bottleneck? Shows verifier backlog and p99 scoring delay.
decode complete → trajectory done Is logging in the critical path? Prevents trajectory storage from silently stalling decode.
completion visible → host observed Is polling too coarse? Measures completion observation delay on the CPU side.
C — trace_event_t
typedef struct {
    uint64_t ts_ns;
    uint32_t event_type;
    uint32_t rollout_id;
    uint32_t stage;
} trace_event_t;

The trace ring is separate from the pipeline rings — it is never in the hot path. Events are collected off-thread and reported as p50/p90/p99 pair latencies after each benchmark run.

Benchmarks

v0.2 makes the claims testable.

The milestone is not beating production inference stacks. The milestone is turning "avoid CPU round-trips" into queues, states, counters, guards, traces, and benchmarks that can be measured and falsified.

Benchmark What it proves Key signals
bench-pipeline End-to-end rollout flow through all six queues rollouts/sec, tokens/sec, queue depth
bench-trace Latency between runtime stage boundaries p50 / p90 / p99 for 8 event pairs
bench-cow Memory savings from shared-prefix KV memory saved %, alloc time, refcount overhead
bench-tax Control-plane overhead comparison eventfd vs userspace polling vs persistent worker
bench-all Full suite in one command reproducible snapshot for a given hardware config
Mode A

eventfd / syscall

Simple to implement. Pays full kernel entry and scheduler wakeup cost on every step.

vs
Mode B

Userspace polling

Lower latency. Burns a CPU core and requires careful core ownership to avoid NUMA effects.

vs
Mode C — this runtime

Persistent worker + ring

Keeps execution near the queue. No per-step host orchestration. No kernel entry.

v0.2 turns the idea from "avoid CPU round-trips" into queues, states, counters, guards, traces, and benchmarks that make the claim testable.

Next milestone

v0.3 pushes rollout progression onto the GPU.

The CPU should admit work and observe completions in batches. The GPU-side worker should progress rollouts, update KV, sample tokens, and post reward and completion descriptors — without a CPU round-trip per token.

CUDA — conceptual rollout_worker
__global__ void rollout_worker(
    device_ring_t *decode_q,
    device_ring_t *reward_q,
    device_ring_t *done_q,
    kv_arena_t   *kv)
{
    while (1) {
        rollout_desc_t r;
        if (!device_ring_pop(decode_q, &r)) continue;

        while (r.seq_len < r.max_tokens) {
            int tok = decode_one_token(kv, r.kv_offset, r.seq_len);
            append_kv(kv, r.kv_offset, r.seq_len, tok);
            r.seq_len++;

            if (needs_reward(tok)) {
                device_ring_push(reward_q, r);
                break;
            }
            if (is_done(tok)) {
                device_ring_push(done_q, r);
                break;
            }
        }
    }
}
v0.2

Measurable runtime skeleton

Rollout state, pipeline queues, hot-path guards, metrics, COW KV, reward handoff, tracing, backpressure, scheduling, and benchmarks — all on CPU.

v0.3

GPU-resident progression

Persistent device-side workers move descriptors forward. CPU handles admission, placement, and coarse-grained observation only.

Final takeaway

The runtime is the product.

The point is not simply that C is fast. The point is that at large-scale RL inference, the runtime decides where rollout state lives, who owns the KV cache, when reward work runs, which queues are saturated, which rollouts are scheduled next, and where p99 latency is hiding.

v0.1 established the primitives. v0.2 made the runtime measurable. v0.3 should push rollout progression onto the GPU. That is the path from doorbells to rollouts.