MAN\SH AI / Writings

· RL Inference Systems · 20 min read

gb300-rl-runtime
Hardware / Software Co-Design · Systems Note

C to RTL:
Co-Simulating an RL
Inference Control Plane


A standalone C runtime can over-assume. A standalone RTL block can over-assume. Co-simulation is the executable proof that they agree — on descriptor layout, valid/ready handshakes, FSM progression, completion semantics, and backpressure. This is how gb300-rl-runtime builds that proof.

Verilator · Descriptor Contract · Ready/Valid · Worker FSM · Completion Ring · Backpressure · DPI Bridge · 64-byte Layout
Scope — This does not prove real transformer execution. It proves the narrower, important thing: that host software and the RTL control-plane engine share the same descriptor protocol, the same completion semantics, and the same backpressure behavior — verified by executable tests.
[Figure: co-simulation architecture. Host layer: RtlRuntimeBridge (submit_decode(), poll_completion(), tick(), reset()) maps C++ fields onto RTL port inputs and reads RTL port outputs. RTL layer: rl_runtime_top (Verilated C++ model) containing mmio_regs (doorbell_pulse), desc_ring (push_ready), worker_fsm (IDLE→DECODE→COMPLETE), and the completion ring (backpressure), with clk / rst_n shared across all sub-modules. desc_pkg.sv (desc_t, completion_t, desc_opcode_t) is imported by all RTL modules. Host to RTL: host_desc.*, push_valid. RTL to host: host_comp.*, pop_valid. Verilator C++ FFI: tick() → eval().]
Why co-simulation matters

A standalone C runtime and a standalone RTL block can both lie to themselves.

When you test them separately, each side controls its own assumptions. The C runtime can write a test that assumes the device always accepts descriptors. The RTL testbench can write a stimulus that drives host_comp_ready high every cycle. Both tests pass. Both are wrong.

Co-simulation removes that escape hatch. When the C++ bridge drives the Verilated RTL model directly, the C side must handle the case where push_ready is low. The RTL side must handle the case where the host holds pop_ready low for many cycles. Neither side can invent capacity the other doesn't have.

C-side over-assumptions

What software silently assumes

  • Always-ready ring: the device always has room for another descriptor — no check of push_ready
  • Instant completions: the completion appears within a bounded number of ticks — no timeout handling
  • Implicit encoding: the C struct field order and the RTL packed struct order are the same — no layout verification
  • Status constants: 0x01=DONE, 0x02=REWARD_NEEDED match on both sides — no enum validation
  • Backpressure doesn't happen: host_comp_ready=1 forever in unit tests — completion ring never fills
RTL-side over-assumptions

What hardware silently assumes

  • Well-formed push_valid: host never asserts valid with an incomplete or malformed descriptor
  • Ready respected: host checks push_ready before each submit — never fires when ring is full
  • Completion drained promptly: host holds pop_ready high — completion ring never stalls the worker FSM
  • Field widths correct: rollout_id fits in 16 bits, max_tokens is non-zero, opcode is a valid enum value
  • Reset before use: host always drives reset before the first descriptor submission
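The RTL-side assumptions about field widths and opcodes can be checked on the host before anything reaches the bridge. The sketch below is a minimal illustration under stated assumptions: DecodeRequest, Reject, and validate() are hypothetical names that do not appear in the repo, and the only values taken from the article are the 16-bit field widths and DESC_OP_DECODE = 8'd1.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side guard; names and structure are illustrative only.
constexpr uint8_t kOpDecode = 1;  // mirrors DESC_OP_DECODE = 8'd1 from the article

enum class Reject { None, BadOpcode, RolloutIdTooWide, BadMaxTokens };

// Fields are deliberately wider than the RTL ports so that an
// out-of-range value can be detected instead of silently truncated.
struct DecodeRequest {
    uint32_t rollout_id;
    uint32_t max_tokens;
    uint8_t  opcode;
};

// Returns the first violated RTL-side assumption, or Reject::None.
inline Reject validate(const DecodeRequest& r) {
    if (r.opcode != kOpDecode)  return Reject::BadOpcode;
    if (r.rollout_id > 0xFFFFu) return Reject::RolloutIdTooWide;  // must fit in 16 bits
    if (r.max_tokens == 0 || r.max_tokens > 0xFFFFu)
        return Reject::BadMaxTokens;                              // non-zero, 16 bits
    return Reject::None;
}
```

A caller would run validate() before submit_decode() and treat any result other than Reject::None as a host bug rather than something the RTL should tolerate.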
Co-simulation makes the interface executable. The host must drive the RTL correctly, and the RTL must expose completions and backpressure honestly — or the test fails.

The practical consequence: when the co-sim tests pass, you have a proof that the logical descriptor mapping is consistent (field names, widths, and opcode values), the valid/ready handshakes work in both directions, the FSM reaches every completion state the tests exercise, and backpressure propagates end-to-end. Bit-level physical layout is not yet covered; that is the next milestone. The proof is cheap to get now and expensive to discover missing in silicon.

The vertical stack

The repo now spans runtime code, RTL, and a co-simulation bridge.

Each layer has a single job. The C/CUDA runtime focuses on efficient descriptor construction and completion polling. The RTL engine owns queue mechanics and FSM progression. The Verilator bridge is the adhesive — a thin, boring piece of code whose only job is to faithfully map one side's types onto the other's port signals.

full data path — left to right
C/CUDA Runtime
Builds hw_desc_t. Calls infer_submit_decode(). Polls for completions. Handles reward checkpoints.
Verilator Bridge
Maps C++ struct fields onto RTL input ports. Calls eval() each tick. Reads output ports into RtlCompletion.
RTL Engine
mmio_regs → desc_ring → rollout_worker_fsm → completion_ring. All connected in rl_runtime_top.
Completion Path
Worker emits comp_valid. Ring stalls if host holds pop_ready=0. Host polls pop_valid via bridge.
Contract Tests
Decode → DONE. Decode → REWARD_NEEDED. Backpressure hold → no lost completions. All must pass.
    
co-sim path
sim/run_sim.cpp
    submit_decode(rollout_id, seq_len, max_tokens, ...)
      ↓ bridge maps C++ → RTL input ports
    Verilated rl_runtime_top::eval()        // advance one clock
      ↓ mmio_regs: doorbell_pulse asserted
      ↓ desc_ring: push_valid && push_ready → latch descriptor
      ↓ worker_fsm: S_IDLE → S_DECODE → token_count++
      ↓ worker_fsm: token_count >= max_tokens → S_COMPLETE
      ↓ completion_ring: comp_valid; stall if !comp_ready
      ↓ bridge reads output ports
    poll_completion() → RtlCompletion{ rollout_id, status, final_seq_len }
Repo layout: rtl/ holds all SystemVerilog source. sim/ holds the Verilator bridge (cosim_bridge.cpp), the C host API (infer_api.h), and the test driver (run_sim.cpp). tb/ holds the standalone SystemVerilog testbench for RTL-only verification. The bridge and the testbench test the same RTL, from opposite sides of the interface.
Bridge API

The bridge should be small, boring, and contract-focused.

The bridge class has five responsibilities and nothing else: initialize the Verilated model, drive reset, tick the clock, map descriptors onto input ports, and read completions from output ports. Any logic beyond that belongs in either the C runtime layer or the RTL layer.

    
C++
sim/cosim_bridge.h
    #include <cstdint>

    class Vrl_runtime_top;   // Verilator-generated model class
    class VerilatedContext;  // Verilator simulation context

    struct RtlCompletion {
        uint16_t rollout_id;
        uint8_t  status;         // 0x01=DONE, 0x02=REWARD_NEEDED, 0xFF=ERROR
        uint16_t final_seq_len;
        uint16_t reward_id;
    };

    class RtlRuntimeBridge {
    public:
        RtlRuntimeBridge();               // construct + allocate Verilated model
        ~RtlRuntimeBridge();              // delete Verilated model

        void reset(unsigned cycles = 5);  // drive rst_n=0 for N ticks, then release
        void tick();                      // advance clk 0→1→0, call eval() each edge

        // Returns false if push_ready=0 (ring full).
        bool submit_decode(
            uint16_t rollout_id,
            uint16_t seq_len,
            uint16_t max_tokens,
            uint16_t reward_model_id,
            uint16_t kv_arena_id  = 0,
            uint16_t prefix_id    = 0,
            uint32_t kv_offset    = 0,
            uint32_t delta_offset = 0);

        // Returns true and fills *out if pop_valid=1.
        bool poll_completion(RtlCompletion *out);

        uint64_t cycles() const;          // monotonic cycle counter for timing budgets

    private:
        Vrl_runtime_top  *top_;           // Verilator-generated model class
        VerilatedContext *ctx_;           // Verilator simulation context
        uint64_t          cycle_;
    };
reset()

Known state

Drives rst_n=0 for N rising edges. All RTL registers clear. Ring head/tail reset to zero. FSM returns to S_IDLE. Required before the first submit_decode call — the bridge constructor calls it automatically.

submit_decode()

Drive descriptor

Checks push_ready first. If low, returns false — caller must retry or yield. If high, maps each C++ argument onto its RTL input port, asserts push_valid for one cycle, then de-asserts. Returns true on successful handshake.
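Since a false return means "retry or yield", callers need some bounded retry policy. The following is a minimal sketch of one, not part of the bridge API described here: submit_with_retry and FakeBridge are hypothetical names, and the fake stands in for RtlRuntimeBridge only so the example compiles without Verilator.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper. Bridge must provide
//   bool submit_decode(uint16_t, uint16_t, uint16_t, uint16_t);
//   void tick();
// Retries while the ring is full, ticking the clock between attempts,
// and gives up after max_idle_ticks ticks without a successful push.
template <typename Bridge>
bool submit_with_retry(Bridge& bridge, uint16_t rollout_id, uint16_t seq_len,
                       uint16_t max_tokens, uint16_t reward_model_id,
                       unsigned max_idle_ticks) {
    for (unsigned waited = 0;; ++waited) {
        if (bridge.submit_decode(rollout_id, seq_len, max_tokens, reward_model_id))
            return true;                       // handshake succeeded
        if (waited == max_idle_ticks)
            return false;                      // budget exhausted, caller decides
        bridge.tick();                         // let the device drain one cycle
    }
}

// Stand-in for the real bridge (assumption): the ring reports "full"
// until busy_ticks clock cycles have elapsed.
struct FakeBridge {
    unsigned busy_ticks;
    unsigned ticks    = 0;
    unsigned accepted = 0;

    bool submit_decode(uint16_t, uint16_t, uint16_t, uint16_t) {
        if (ticks < busy_ticks) return false;  // models push_ready == 0
        ++accepted;
        return true;
    }
    void tick() { ++ticks; }
};
```

With FakeBridge{3} the helper succeeds after three idle ticks; with a budget smaller than busy_ticks it returns false without ever pushing, which is the behavior the "always-ready ring" assumption hides.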

poll_completion()

Observe result

Samples pop_valid. If low, returns false. If high, reads all host_comp_* output ports into *out, asserts pop_ready for one cycle to acknowledge, then returns true. Never invents a completion.

Bridge implementation

The inner loop is three lines. The discipline is in the handshakes.

Most of the bridge implementation is mechanical field mapping. The interesting parts are the handshake discipline in submit_decode, the clock-edge sequencing inside tick, and the backpressure behavior that prevents lost completions.

    
C++
sim/cosim_bridge.cpp — tick()
    // One full clock cycle: rising edge eval, then falling edge eval.
    // Verilator requires eval() on each edge for registered outputs to propagate.
    void RtlRuntimeBridge::tick() {
        top_->clk = 1;
        top_->eval();   // rising edge: flip-flops sample inputs
        top_->clk = 0;
        top_->eval();   // falling edge: combinational logic re-resolves
        ++cycle_;
    }
    
C++
sim/cosim_bridge.cpp — submit_decode()
    bool RtlRuntimeBridge::submit_decode(
            uint16_t rollout_id, uint16_t seq_len,
            uint16_t max_tokens, uint16_t reward_model_id,
            uint16_t kv_arena_id, uint16_t prefix_id,
            uint32_t kv_offset, uint32_t delta_offset) {
        // 1. Check backpressure: do not drive valid if ring is full
        if (!top_->push_ready) return false;

        // 2. Map C++ arguments onto RTL descriptor input ports
        top_->host_desc_opcode          = DESC_OP_DECODE;  // 8'd1
        top_->host_desc_flags           = 0;
        top_->host_desc_rollout_id      = rollout_id;
        top_->host_desc_kv_arena_id     = kv_arena_id;
        top_->host_desc_prefix_id       = prefix_id;
        top_->host_desc_kv_offset       = kv_offset;
        top_->host_desc_delta_offset    = delta_offset;
        top_->host_desc_seq_len         = seq_len;
        top_->host_desc_max_tokens      = max_tokens;
        top_->host_desc_reward_model_id = reward_model_id;
        top_->host_desc_reserved        = 0;

        // 3. Assert push_valid for one clock; descriptor is latched on rising edge
        top_->host_push_valid = 1;
        tick();
        top_->host_push_valid = 0;  // de-assert; ring owns the slot now
        tick();                     // one idle tick before next operation
        return true;
    }
    
C++
sim/cosim_bridge.cpp — poll_completion()
    bool RtlRuntimeBridge::poll_completion(RtlCompletion *out) {
        // 1. Check if RTL has a completion ready (non-blocking poll)
        if (!top_->host_comp_valid) return false;

        // 2. Read completion fields from RTL output ports
        out->rollout_id    = top_->host_comp_rollout_id;
        out->status        = top_->host_comp_status;
        out->final_seq_len = top_->host_comp_final_seq_len;
        out->reward_id     = top_->host_comp_reward_id;

        // 3. Acknowledge: assert pop_ready for one cycle to advance ring head
        top_->host_comp_ready = 1;
        tick();
        top_->host_comp_ready = 0;
        tick();

        // If this function is NOT called: comp_valid stays high, worker stays
        // in S_COMPLETE, desc_ring fills up, push_ready goes low. Correct.
        return true;
    }
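Why the idle tick after de-assert? Verilator is cycle-accurate, and the bridge only sees what the most recent eval() computed. The first tick() latches the descriptor on the rising edge; the second tick() runs one full cycle with host_push_valid low, so any ring state that updates a cycle after the push (for example, a registered full flag) is reflected in push_ready before the next port read. Skipping the idle tick can cause the bridge to see a stale push_ready value and submit to a full ring.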

Descriptor field mapping

| C++ argument    | RTL port                  | desc_pkg field          | Width | Notes                                                    |
|-----------------|---------------------------|-------------------------|-------|----------------------------------------------------------|
| rollout_id      | host_desc_rollout_id      | desc_t.rollout_id       | 16b   | trajectory identifier, echoed in completion              |
| seq_len         | host_desc_seq_len         | desc_t.seq_len          | 16b   | sequence length at dispatch time                         |
| max_tokens      | host_desc_max_tokens      | desc_t.max_tokens       | 16b   | generation budget; FSM terminates when reached           |
| reward_model_id | host_desc_reward_model_id | desc_t.reward_model_id  | 16b   | echoed in REWARD_NEEDED completion                       |
| kv_arena_id     | host_desc_kv_arena_id     | desc_t.kv_arena_id      | 16b   | KV cache slab selector                                   |
| kv_offset       | host_desc_kv_offset       | desc_t.kv_offset        | 32b   | byte offset within KV arena                              |
| delta_offset    | host_desc_delta_offset    | desc_t.delta_offset     | 32b   | new-token buffer offset                                  |
| (hardcoded)     | host_desc_opcode          | DESC_OP_DECODE = 8'd1   | 8b    | fixed for submit_decode; other ops need separate methods |
Physical layout target: This is currently a logical mapping — each field is wired individually onto separate RTL ports. The next step is a physical layout: pack all fields into a single 192-bit (current) or 512-bit (target) port so the bridge can do one memcpy from the C struct into the RTL input port and the bit layout is verified to be identical on both sides.
Co-simulation tests

The tests prove the control-plane loop — all the way around.

Three tests cover the three protocol behaviors that matter: the happy path, the RL-specific reward transition, and the backpressure case that software tests almost always skip.

test 1

Basic decode → DONE

Submit rollout 7 with max_tokens=10, reward_model_id=1. Tick until poll_completion returns true. Assert status==0x01, rollout_id==7, final_seq_len==10. Verifies the full IDLE→DECODE→COMPLETE path and that the bridge reads outputs correctly.

test 2

Reward boundary → REWARD_NEEDED

Submit with max_tokens=64. The FSM emits REWARD_NEEDED at the configured 32-token boundary before reaching max_tokens. Assert status==0x02 at final_seq_len==32, then a second completion with status==0x01 at final_seq_len==64. Verifies the RL-specific mid-rollout completion path.

test 3

Backpressure → no lost completions

Submit a descriptor. Hold host_comp_ready=0 for 20 ticks after the worker reaches S_COMPLETE. Worker must remain in S_COMPLETE; comp_valid must stay asserted. Release host_comp_ready. Assert completion is received correctly. Verifies that the completion ring never silently drops.
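The three behaviors can also be exercised against a small software reference model, which is useful for sanity-checking test expectations before the Verilated model is built. The sketch below is a hypothetical behavioral model, not the RTL: WorkerModel, its single-slot completion, the rule that decoding resumes after a REWARD_NEEDED acknowledge, and final_seq_len = seq_len + tokens generated are all assumptions inferred from the test descriptions.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint8_t kDone = 0x01, kRewardNeeded = 0x02;  // status values from the article

struct Completion { uint16_t rollout_id; uint8_t status; uint16_t final_seq_len; };

// Hypothetical single-worker model: Idle -> Decode -> Complete.
class WorkerModel {
public:
    explicit WorkerModel(uint16_t reward_interval = 32)
        : reward_interval_(reward_interval) { assert(reward_interval != 0); }

    bool push_ready() const { return state_ == State::Idle; }
    bool comp_valid() const { return state_ == State::Complete; }
    Completion peek() const { return pending_; }

    // Accepts a descriptor only when idle; otherwise nothing changes.
    bool submit(uint16_t rollout_id, uint16_t seq_len, uint16_t max_tokens) {
        if (!push_ready()) return false;
        rollout_id_ = rollout_id; seq_len_ = seq_len; max_tokens_ = max_tokens;
        tokens_ = 0;
        state_ = State::Decode;
        return true;
    }

    // One clock cycle; comp_ready is the host acknowledge for this cycle.
    void tick(bool comp_ready) {
        switch (state_) {
        case State::Idle:
            break;
        case State::Decode:
            ++tokens_;
            if (tokens_ >= max_tokens_)               emit(kDone);
            else if (tokens_ % reward_interval_ == 0) emit(kRewardNeeded);
            break;
        case State::Complete:
            // Without comp_ready the completion is held, never dropped.
            if (comp_ready)
                state_ = (pending_.status == kDone) ? State::Idle : State::Decode;
            break;
        }
    }

private:
    enum class State { Idle, Decode, Complete };

    void emit(uint8_t status) {
        pending_ = {rollout_id_, status, static_cast<uint16_t>(seq_len_ + tokens_)};
        state_ = State::Complete;
    }

    State state_ = State::Idle;
    uint16_t reward_interval_;
    uint16_t rollout_id_ = 0, seq_len_ = 0, max_tokens_ = 0, tokens_ = 0;
    Completion pending_{};
};
```

Submitting max_tokens=64 to this model yields REWARD_NEEDED at 32 and DONE at 64, and holding comp_ready low leaves comp_valid asserted for as long as the host stalls. If the Verilated RTL and a model like this disagree, one of them encodes an assumption the article's tests are meant to surface.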

    
C++
sim/run_sim.cpp — test runner sketch
    // test 1: basic decode
    bridge.reset();
    bool ok = bridge.submit_decode(/*rollout_id=*/7, /*seq_len=*/0,
                                   /*max_tokens=*/10, /*reward_model_id=*/1);
    ASSERT(ok, "submit failed: push_ready was low?");

    RtlCompletion c{};
    bool done = false;
    for (int i = 0; i < 200 && !(done = bridge.poll_completion(&c)); ++i)
        bridge.tick();                       // tick until completion or timeout
    ASSERT(done, "timed out waiting for completion");
    ASSERT(c.status == 0x01, "expected DONE");
    ASSERT(c.rollout_id == 7, "rollout_id mismatch");
    ASSERT(c.final_seq_len == 10, "seq_len mismatch");

    // test 3: backpressure, hold comp_ready low for 20 ticks
    // Note: top_ is private in cosim_bridge.h; this sketch assumes the test
    // is a friend of RtlRuntimeBridge or that top_ is exposed in test builds.
    bridge.reset();
    ok = bridge.submit_decode(42, 0, 5, 1);
    ASSERT(ok, "submit failed: push_ready was low?");
    for (int i = 0; i < 100; ++i)
        bridge.tick();                       // wait for worker to reach S_COMPLETE

    bridge.top_->host_comp_ready = 0;        // hold backpressure
    for (int i = 0; i < 20; ++i) {
        bridge.tick();
        ASSERT(bridge.top_->host_comp_valid, "comp_valid must stay high");
    }

    // Release: poll_completion() asserts host_comp_ready for one cycle itself.
    RtlCompletion c2{};
    bool got = bridge.poll_completion(&c2);
    ASSERT(got && c2.rollout_id == 42, "completion lost under backpressure");
expected output
// RTL co-simulation suite — gb300-rl-runtime
RTL co-sim [test 1]: basic decode — rollout_id=7, status=DONE, final_seq_len=10 — PASS
RTL co-sim [test 2]: reward boundary — REWARD_NEEDED@32, DONE@64 — PASS
RTL co-sim [test 3]: completion backpressure — comp_valid held 20 ticks, no loss — PASS
RTL bridge contract tests: 3/3 PASS
A SystemVerilog testbench asks: does the RTL work by itself? A C++ Verilator bridge asks: can host software drive the RTL the way it expects to drive hardware? You need both questions answered.
Honesty boundary

What this does and does not prove.

It is easy to over-claim what a co-simulation result means. The bridge tests the control-plane protocol, not the inference computation. Here is the precise scope.

Proven ✓

Control-plane contract

  • Logical descriptor mapping is correct: every field maps from its C++ argument to the matching RTL port (physical bit layout is listed under Not proven)
  • Valid/ready handshake works in the submit direction
  • Valid/ready handshake works in the completion direction
  • Worker FSM reaches S_COMPLETE for DECODE descriptors
  • Worker FSM emits REWARD_NEEDED at the correct token boundary
  • Completion ring retains completions under backpressure (no silent drops)
  • Reset correctly initializes all RTL state
  • C++ bridge correctly maps field widths and opcode constants
Not proven ✗

Real inference execution

  • Real transformer attention — no matrix math in RTL
  • Tensor-core or CUDA kernel scheduling
  • GB300 / Blackwell hardware behavior
  • NVIDIA MMIO doorbell semantics (NVLink, PCIe BAR)
  • HBM bandwidth or latency effects
  • NVSwitch topology or multi-GPU routing
  • Production inference serving throughput or tail latency
  • Physical descriptor layout (current gap — see §Next)
Next milestone

Unify the descriptor into a 64-byte shared physical contract.

Right now the bridge maps fields one-by-one onto individual RTL ports. This validates the logical contract — the field names, widths, and opcode values are consistent. What it does not validate is the physical layout: are the bits in the same position in the C struct as in the RTL packed descriptor?

In a real hardware system this matters. The CPU writes a cache-line-sized descriptor into DRAM. The DMA engine reads it. The hardware queue engine parses it. If the C struct and the RTL struct have different field ordering, endianness assumptions, or padding, the hardware will misparse every work order. The next milestone closes this gap.

    
target contract
shared physical layout — v0.6 goal
    // Both sides must agree on every bit of this layout
    C:   __attribute__((packed, aligned(64))) hw_desc_t   → 64 bytes, little-endian
    RTL: typedef struct packed { ... } desc_t             → 512 bits, same field order

    Validation strategy:
      1. Static assert: sizeof(hw_desc_t) == 64
      2. Static assert: offsetof(hw_desc_t, rollout_id) == 2   // after opcode + flags
      3. Verilator bridge: single DPI call with raw uint8_t[64] bus instead of N ports
      4. Co-sim test: write known byte pattern, assert RTL parses fields correctly
      5. Reverse test: RTL emits completion; assert C struct parses it correctly
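Steps 1, 2, and 4 of the validation strategy could look roughly like the following on the C side. This is a sketch under assumptions: only the 64-byte size, the opcode and flags bytes, and rollout_id at offset 2 come from the article; the order of the remaining fields, the reserved0 and pad members, and the put16/put32/serialize helpers are hypothetical. The attribute syntax is GCC/Clang-specific. One caveat to check on the RTL side: in a SystemVerilog packed struct the first declared member occupies the most significant bits, so "same field order" still needs an explicit convention for how byte 0 of the bus maps onto the 512-bit vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical 64-byte descriptor; field order after rollout_id is assumed.
struct __attribute__((packed, aligned(64))) hw_desc_t {
    uint8_t  opcode;
    uint8_t  flags;
    uint16_t rollout_id;
    uint16_t kv_arena_id;
    uint16_t prefix_id;
    uint32_t kv_offset;
    uint32_t delta_offset;
    uint16_t seq_len;
    uint16_t max_tokens;
    uint16_t reward_model_id;
    uint16_t reserved0;
    uint8_t  pad[40];          // fills the cache line
};

static_assert(sizeof(hw_desc_t) == 64, "descriptor must be one 64-byte line");
static_assert(offsetof(hw_desc_t, rollout_id) == 2, "rollout_id follows opcode + flags");
static_assert(offsetof(hw_desc_t, kv_offset) == 8, "kv_offset offset (assumed order)");

// Explicit little-endian stores, so the byte image does not depend on host endianness.
inline void put16(uint8_t* p, uint16_t v) {
    p[0] = static_cast<uint8_t>(v & 0xFF);
    p[1] = static_cast<uint8_t>(v >> 8);
}
inline void put32(uint8_t* p, uint32_t v) {
    for (int i = 0; i < 4; ++i) p[i] = static_cast<uint8_t>((v >> (8 * i)) & 0xFF);
}

// Builds the raw 64-byte image a single-bus bridge would hand to the RTL.
inline void serialize(const hw_desc_t& d, uint8_t out[64]) {
    std::memset(out, 0, 64);
    out[offsetof(hw_desc_t, opcode)] = d.opcode;
    out[offsetof(hw_desc_t, flags)]  = d.flags;
    put16(out + offsetof(hw_desc_t, rollout_id),      d.rollout_id);
    put16(out + offsetof(hw_desc_t, kv_arena_id),     d.kv_arena_id);
    put16(out + offsetof(hw_desc_t, prefix_id),       d.prefix_id);
    put32(out + offsetof(hw_desc_t, kv_offset),       d.kv_offset);
    put32(out + offsetof(hw_desc_t, delta_offset),    d.delta_offset);
    put16(out + offsetof(hw_desc_t, seq_len),         d.seq_len);
    put16(out + offsetof(hw_desc_t, max_tokens),      d.max_tokens);
    put16(out + offsetof(hw_desc_t, reward_model_id), d.reward_model_id);
}
```

A co-sim test for step 4 would serialize a descriptor with distinctive values (for example rollout_id = 0x1234), push the 64 bytes through the bridge, and assert that the RTL-side fields read back the same values.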
v0.5

C + Verilator co-simulation — current milestone

Logical field mapping. C++ bridge drives RTL ports directly. Three contract tests: decode, reward boundary, backpressure. All pass.

v0.6

64-byte physical descriptor layout

Shared bit-level layout between C struct and RTL packed struct. Single raw bus in bridge. Static-assert validation. C and RTL must agree on every byte position and endianness assumption.

v0.7

Multi-worker dispatch

Descriptor ring → round-robin dispatcher → N rollout worker FSMs → completion merge arbiter → completion ring. Per-worker token counters. M concurrent trajectories across N workers. Completion ordering tests.

v0.8

AXI4-Lite MMIO + MSI-X interrupt

Replace bare MMIO block with proper AXI4-Lite slave interface. Add interrupt generator: assert IRQ line when completion ring is non-empty and interrupt enable bit is set. Host can switch from polling to interrupt-driven completion handling.

Final mental model

The repo is becoming a contract lab.

After building the co-simulation bridge, the relationship between the C runtime and the RTL engine changes. They are no longer two separate things that happen to use the same vocabulary. They are two implementations of one contract — and the bridge is the test harness that proves they agree.

C runtime

Submit work

How do I submit rollout descriptors to a hardware queue without CPU involvement per token? Write the descriptor, check push_ready, assert push_valid, poll for completion.

RTL engine

Consume work

What hardware queue protocol should accept descriptors, simulate rollout progression, emit reward checkpoints, and propagate backpressure — all without software assistance mid-rollout?

Verilator bridge

Prove contract

Can the software runtime and the RTL engine agree on that protocol at the bit level, validated by executable tests that exercise the happy path, reward path, and backpressure path?

The bridge from runtime code to hardware-shaped inference design is the descriptor control-plane contract. Co-simulation is the proof it is real.