MAN\SH AI / Writings
Systems · Close-to-Metal · RL Runtime

RL Inference
at GB300 Scale

A systems-level reference for eliminating the per-token control-plane tax — no page faults, no malloc/free, no syscalls, no scheduler wakeups, no CPU round-trips, and no KV migration in the hot decode path.

Hardware NVIDIA GB300 NVL72
Scale 72 × Grace Blackwell SMs
Stack C · CUDA · NVLink · NVSwitchFabric

Thesis. At massive RL inference scale, the bottleneck is not only faster matrix kernels. It is keeping decode, KV updates, sampling, reward scoring, and trajectory handoff moving continuously — with zero host-side orchestration stalls. The runtime must provide stable memory ownership, stable queue discipline, and minimal CPU intervention so the hardware never sits idle waiting for software bookkeeping.

CPU: descriptor publisher
GPU/NIC-visible rings
hugepage KV arenas
async reward handoff
Scale & Throughput

Why the CPU Becomes the Bottleneck

At GB300 NVL72 scale, the GPU rack delivers ~200 TFLOPS of decode compute. The host CPU cannot dispatch tokens fast enough without a close-to-metal runtime that eliminates every per-token kernel call, page fault, and scheduler intervention.

1.1M
tokens/s per rack (decode)
<5μs
per-token CPU dispatch
0
syscalls in hot path
8×144
GB HBM3e per GPU
System Layout

Rack-Scale Runtime Topology

Each GB300 tray holds one Grace CPU and four B300 GPUs. The Grace CPU runs the control-plane runtime; the GPUs own persistent decode workers that read commands from a lock-free SPSC ring shared via NVLink-C2C coherent memory.

CPU Control Plane
Enqueue descriptors into shared SPSC ring. Manage KV arena slab allocator. Poll completion events at ~100 Hz.
SPSC Descriptor Ring
NVLink-C2C coherent memory. Producers (CPU) write work descriptors; consumers (GPU SMs) pull them without kernel intervention.
Persistent GPU Workers
Each GPU SM runs a persistent warp that polls the ring, loads KV blocks, runs decode attention, and writes results back to the completion queue.
Completion + Reward
Reward model runs on GPU via NVLink. Results delivered via shared ring. No CPU copy or host wakeup per step.
Design Tenets

Six Pillars of Close-to-Metal RL Inference

The runtime eliminates every source of host-side latency that would otherwise stall the GPU decode pipeline. These six principles guide every design decision.

1

Zero page faults in the hot path

All memory — KV arena, scratch, output buffers — is pre-faulted with hugepages (2 MB or 1 GB). The GPU never waits for the CPU MMU to resolve a TLB miss.

2

No heap operations per token

KV slab allocator hands out fixed-size blocks at sequence start. No malloc, free, or cudaMalloc during decode. Memory ownership is static per trajectory.

3

Lock-free SPSC command dispatch

The CPU writes descriptors into an NVLink-C2C coherent ring; GPU SMs read them with load-acquire semantics. No mutexes, no syscalls, no wake-ups.

4

Persistent GPU workers

Each SM runs a persistent warp that polls the ring. No kernel launch overhead. The GPU effectively becomes a self-scheduled decode engine.

5

NUMA-local memory placement

KV arenas, page tables, NIC buffers, and command rings are allocated on the NUMA node closest to the GPU that owns them. No cross-socket traffic for memory.

6

GPU-resident reward scoring

The reward model runs on the GPU via NVLink. Results flow through a completion ring. The CPU never touches reward tensors in the hot path.

Memory Hierarchy

Where Data Lives at GB300 Scale

RL inference requires multiple memory tiers with explicit data placement. The runtime manages four distinct tiers, each with its own latency, capacity, and residency policy.

HBM3e (GPU)
~3 TB/s
144 GB
LPDDR5X (Grace)
~500 GB/s
~512 GB
NVLink-C2C Coherent
~900 GB/s
Shared pool
NVSwitch Fabric
1.8 TB/s
All-to-all

Key insight. The NVLink-C2C coherent window between Grace and Blackwell is the critical enabler. It lets the CPU publish work descriptors directly into GPU-visible memory without DMA, and lets GPU workers read CPU-managed KV arena metadata without cudaMemcpy.

The Bottleneck

RL Inference's Hidden Tax

Reinforcement learning inference compounds the standard serving problem: many independent trajectories, each with its own KV state, sampled tokens, and reward signal. Every step that touches the host OS or the heap adds latency that compounds across 10K+ concurrent sequences.

Traditional Serving

Per-token kernel launch

× cudaMalloc on each new sequence
× Page fault on first KV write
× CPU scheduler wakeup every decode step
× cudaMemcpy for reward tensor
× Kernel launch serialization on host thread
Close-to-Metal Runtime

Start-of-trajectory setup only

Pre-allocated hugepage KV arena
Hugepages: zero page faults at runtime
Persistent GPU workers: no wakeup per step
GPU-internal reward scoring via NVLink
SPSC descriptor rings: lock-free dispatch
GB300 NVL72

The Hardware We're Targeting

The NVIDIA GB300 NVL72 pairs 72 Grace Blackwell GPUs with Grace ARM CPUs over NVLink-C2C, delivering coherent CPU–GPU memory access and 1.8 TB/s of all-to-all GPU bandwidth through the NVSwitch fabric.

Compute
  • 72 × B300 GPUs — 144 GB HBM3e each
  • Grace ARM CPU — 72 ARM Neoverse V2 cores per tray
  • NVLink-C2C — coherent CPU–GPU at ~900 GB/s per tray
  • NVSwitch Gen5 — 1.8 TB/s all-to-all GPU fabric
Memory
  • 10.4 TB aggregate HBM3e across 72 GPUs
  • 2 TB/s LPDDR5X per Grace CPU (local DRAM)
  • NVLink-C2C coherent CPU–GPU memory pool
  • GB/s hugepage-backed KV arena pre-mapped at init
NUMA Topology

Memory Locality at Rack Scale

In a GB300 NVL72, each tray has a four-GPU pod connected to a single Grace CPU via NVLink-C2C. The runtime must pin every allocation to the correct NUMA node to avoid cross-socket bandwidth taxes.

CPU → GPU
  • NVLink-C2C provides cache-coherent shared memory
  • ~900 GB/s bidirectional bandwidth per tray
  • Latency ~200 ns for coherent loads
GPU → GPU (same tray)
  • NVLink 5 direct GPU–GPU links
  • ~900 GB/s per direction, no CPU hop
  • KV transfer for disaggregated prefill/decode
GPU → GPU (cross-tray)
  • NVSwitch Gen5 all-to-all fabric
  • 1.8 TB/s aggregate bandwidth
  • Trajectory replay across the rack

Critical. All persistent worker state, KV arena metadata, and command rings must reside in memory attached to the local Grace CPU's NUMA node. Cross-NUMA access adds 300–500 ns of latency per load — multiplied across every decode step.

Implementation

Close-to-Metal Runtime: Core Data Structures

The runtime is written in C with CUDA kernel entry points. All inter-process communication uses cache-coherent NVLink-C2C shared memory. No kernel launches, no syscalls, and no memory allocation occur in the per-token hot path.

Descriptor Ring — ring.h

The central dispatch structure. The CPU (producer) enqueues work descriptors; GPU SMs (consumers) dequeue them with load-acquire semantics. The ring lives in NVLink-C2C coherent memory and requires no cache flushes.

ring.h
/* Producer-consumer index pair, cache-line padded */
typedef struct __attribute__((packed)) {
  volatile uint32_t head;
  uint32_t           tail;
  uint8_t            pad[56];
} RingHeadTail;

typedef struct {
  RingHeadTail  hta __attribute__((aligned(64)));
  RingHeadTail  htb __attribute__((aligned(64)));
  Descriptor    slots[RING_SIZE] __attribute__((aligned(128)));
} CommandRing;

/* Acquire n contiguous slots; returns UINT32_MAX if full */
static inline uint32_t
ring_acquire(CommandRing *ring, uint32_t n) {
  uint32_t h = __atomic_load_n(&ring->hta.head, __ATOMIC_ACQUIRE);
  uint32_t t = __atomic_load_n(&ring->hta.tail, __ATOMIC_RELAXED);
  if (RING_SIZE - (h - t) < n) return UINT32_MAX;
  uint32_t pos = h & (RING_SIZE - 1);
  __atomic_store_n(&ring->hta.head, h + n, __ATOMIC_RELEASE);
  return pos;
}

Work Descriptor — descriptor.h

Each descriptor encodes a single decode step: which sequence, which KV block range, which attention parameters, and where to write the output token.

descriptor.h
typedef struct __attribute__((packed)) {
  uint64_t  seq_id;
  uint32_t  kv_block_offset;
  uint16_t  num_kv_blocks;
  uint8_t   attention_flags;
  uint8_t   pad;
  uint32_t  output_token_offset;
  uint64_t  reward_cookie;
} Descriptor;

static_assert(sizeof(Descriptor) == 24, "Descriptor must be 24 bytes");

Hugepage KV Arena — arena.h

The KV cache is pre-allocated as a contiguous hugepage-backed slab at process startup. Each trajectory receives a fixed-size block. The arena uses a bitmap allocator with O(1) acquire/release.

arena.h
#define GB(n) ((size_t)(n) << 30)

typedef struct {
  uint8_t  *base;
  size_t    capacity;
  size_t    block_size;
  uint64_t  bitmap[BITMAP_WORDS];
} KVArena;

/* Pre-fault all pages at init; guarantees no page faults at runtime */
void
arena_init(KVArena *a, size_t capacity, size_t block_size) {
  a->capacity   = capacity;
  a->block_size = block_size;
  a->base = mmap(NULL, capacity, PROT_READ | PROT_WRITE,
                 MAP_ANONYMOUS | MAP_PRIVATE | MAP_HUGETLB, -1, 0);
  if (a->base == MAP_FAILED) abort();
  memset(a->base, 0, capacity);
  memset(a->bitmap, 0, sizeof(a->bitmap));
}

/* Acquire one block; returns index or -1 */
int64_t
arena_acquire(KVArena *a) {
  for (int w = 0; w < BITMAP_WORDS; w++) {
    uint64_t bits = ~a->bitmap[w];
    if (bits) {
      int b = __builtin_ctzll(bits);
      a->bitmap[w] |= (1ULL << b);
      return w * 64 + b;
    }
  }
  return -1;
}

Persistent GPU Worker — worker.cu

The persistent decode loop runs on every participating SM. It polls the command ring, processes descriptors, and writes completion tokens. No host interaction occurs between start-of-trajectory and end-of-trajectory.

worker.cu
// Persistent decode worker kernel (one warp per SM)
__global__ void
decode_worker(CommandRing *ring, KVArena *arena,
              OutputRing *out, uint64_t *step_count) {
  uint32_t lane = threadIdx.x & 31;
  while (true) {
    if (lane == 0) {
      uint32_t t = __atomic_load_n(&ring->htb.tail, __ATOMIC_ACQUIRE);
      uint32_t h = __atomic_load_n(&ring->hta.head, __ATOMIC_RELAXED);
      if (t < h) {
        uint32_t pos = t & (RING_SIZE - 1);
        Descriptor d = ring->slots[pos];
        __atomic_store_n(&ring->htb.tail, t + 1, __ATOMIC_RELEASE);
        process_descriptor(d, arena, out);
        if (step_count) atomicAdd(step_count, 1ULL);
      }
    }
  }
}

NUMA-Pinned Allocation — numa.c

All runtime memory is bound to the nearest NUMA node. The helper below allocates and faults hugepages on a specific node.

numa.c
#include <numa.h>

void *
numa_alloc_hugepages(int node, size_t size) {
  void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                  MAP_ANONYMOUS | MAP_PRIVATE | MAP_HUGETLB,
                  -1, 0);
  if (p == MAP_FAILED) return NULL;
  struct bitmask *mask = numa_allocate_cpumask();
  numa_bitmask_setbit(mask, node);
  mbind(p, size, MPOL_BIND, mask->maskp, mask->size, MPOL_MF_MOVE | MPOL_MF_STRICT);
  memset(p, 0, size);
  numa_free_cpumask(mask);
  return p;
}

GPU-NIC Command Ring — atomic.h

When RL trajectories require network I/O (e.g., fetching checkpoints or sending rewards), the GPU communicates with the NIC via a dedicated atomic doorbell ring. The NIC polls a producer index; the GPU atomically advances it.

atomic.h
/* NIC-visible doorbell: GPU increments to notify NIC of new work */
typedef struct {
  volatile uint64_t doorbell __attribute__((aligned(64)));
  uint8_t            pad[56];
} NicDoorbell;

__device__ void
nic_ring_push(NicDoorbell *db, uint64_t cmd_desc_paddr) {
  *(uint64_t *)nic_cmd_slot = cmd_desc_paddr;
  __threadfence();
  atomicAdd((unsigned long long *)&db->doorbell, 1);
}
Bottleneck Map

What the Close-to-Metal Runtime Eliminates

Each design decision maps directly to a measurable bottleneck in the traditional RL inference pipeline. The table below summarizes the elimination strategy.

BottleneckTraditional CostClose-to-Metal SolutionImprovement
Page faults~5 μs per miss; TLB shootdown cascadesHugepage pre-fault at init; no runtime faults0 faults per token
Heap allocationmalloc/free per sequence; fragmentationSlab allocator over pre-mapped hugepagesO(1) acquire, zero fragmentation
Kernel launch~10 μs per CUDA launch; CPU serializationPersistent GPU workers polling a ring0 launches in hot path
CPU schedulerContext switch + wakeup latency (~4 μs)GPU self-scheduled; CPU only publishes descriptors0 wakeups per token
Reward copycudaMemcpy Device→Host per trajectory stepGPU-resident reward model; results via completion ring0 bytes across PCIe
NUMA remote access300–500 ns extra latency per loadExplicit mbind + NUMA-local allocationAll memory local to GPU
Rules of Thumb

Design Rules for Close-to-Metal RL

These five rules capture the essential engineering constraints. Violating any one re-introduces the host-side tax.

Rule 1

Pre-fault everything. If the GPU can trigger a page fault at runtime, the design is wrong. Every byte the GPU touches must be mapped, faulted, and pinned before the first decode step.

Rule 2

Never call cudaMalloc after init. All GPU memory is allocated once during startup. Trajectories recycle pre-assigned blocks. No malloc/free in the hot path.

Rule 3

Keep the CPU out of the data path. The CPU publishes descriptors and polls completions. It never touches KV cache entries, attention outputs, or reward tensors.

Rule 4

Pin memory to NUMA node 0 (or nearest). Use mbind(MPOL_BIND) with MPOL_MF_MOVE. Cross-NUMA latency is silent and deadly at scale.

Rule 5

Reward must be GPU-resident. The reward model runs on the same NVLink fabric as the policy model. Moving reward tensors through host DRAM burns bandwidth and adds latency.

Rule 6

NVLink-C2C is your friend. Coherent CPU–GPU memory eliminates DMA, cudaMemcpy, and driver-mediated transfers. Use it for command rings, metadata, and lightweight coordination.