C to RTL: Co-Simulating an RL Inference Control Plane
A standalone C runtime can over-assume. A standalone RTL block can over-assume. Co-simulation is the executable proof that they agree — on descriptor layout, valid/ready handshakes, FSM progression, completion semantics, and backpressure. This is how gb300-rl-runtime builds that proof.
A standalone C runtime and a standalone RTL block can both lie to themselves.
When you test them separately, each side controls its own assumptions.
The C runtime can write a test that assumes the device always accepts descriptors.
The RTL testbench can write a stimulus that drives host_comp_ready
high every cycle. Both tests pass. Both are wrong.
Co-simulation removes that escape hatch. When the C++ bridge drives the Verilated RTL
model directly, the C side must handle the case where push_ready is low.
The RTL side must handle the case where the host holds pop_ready low for
many cycles. Neither side can invent capacity the other doesn't have.
What software silently assumes
- Always-ready ring: the device always has room for another descriptor, so nothing checks push_ready
- Instant completions: the completion appears within a bounded number of ticks, so there is no timeout handling
- Implicit encoding: the C struct field order and the RTL packed struct order are the same, so there is no layout verification
- Status constants: 0x01=DONE and 0x02=REWARD_NEEDED match on both sides, so there is no enum validation
- Backpressure doesn't happen: host_comp_ready=1 forever in unit tests, so the completion ring never fills
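The first two assumptions in this list can be dropped with a bounded retry on submit and a cycle budget on completion. The following is a minimal sketch, not the repo's code: `FakeDevice`, `submit_with_retry`, and `wait_completion` are hypothetical names, and the ring depth and per-descriptor latency are made-up values chosen only so the behavior is easy to trace.

```cpp
#include <cstdint>

// Hypothetical stand-in for the device side: a ring with fixed capacity that
// finishes one queued descriptor every `latency` ticks.
struct FakeDevice {
    unsigned capacity = 2;   // descriptor ring depth (assumed)
    unsigned latency  = 4;   // ticks of work per descriptor (assumed)
    unsigned queued   = 0;   // descriptors waiting or in flight
    unsigned progress = 0;   // ticks spent on the current descriptor
    unsigned done     = 0;   // completions not yet popped by the host

    bool push_ready() const { return queued < capacity; }
    bool pop_valid()  const { return done > 0; }
    void push() { ++queued; }
    void pop()  { --done; }
    void tick() {
        if (queued == 0) return;
        if (++progress >= latency) { progress = 0; --queued; ++done; }
    }
};

enum class SubmitResult { Accepted, RingFull };
enum class WaitResult   { Completed, TimedOut };

// Never assumes the ring has room: checks push_ready before every attempt and
// reports RingFull once the tick budget is spent.
inline SubmitResult submit_with_retry(FakeDevice &dev, unsigned max_ticks) {
    for (unsigned t = 0; t <= max_ticks; ++t) {
        if (dev.push_ready()) { dev.push(); return SubmitResult::Accepted; }
        dev.tick();
    }
    return SubmitResult::RingFull;
}

// Never assumes a completion is instant: polls pop_valid under a tick budget.
inline WaitResult wait_completion(FakeDevice &dev, unsigned max_ticks) {
    for (unsigned t = 0; t <= max_ticks; ++t) {
        if (dev.pop_valid()) { dev.pop(); return WaitResult::Completed; }
        dev.tick();
    }
    return WaitResult::TimedOut;
}
```

Both helpers return an explicit result instead of blocking, so the caller decides whether to retry, yield, or report an error.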
What hardware silently assumes
- Well-formed push_valid: host never asserts valid with an incomplete or malformed descriptor
- Ready respected: host checks push_ready before each submit and never fires when the ring is full
- Completion drained promptly: host holds pop_ready high, so the completion ring never stalls the worker FSM
- Field widths correct: rollout_id fits in 16 bits, max_tokens is non-zero, opcode is a valid enum value
- Reset before use: host always drives reset before the first descriptor submission
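Several of these hardware-side assumptions can be checked on the host before a descriptor is ever driven. The sketch below is one way to do that, under assumptions: `DecodeRequest`, `DescError`, and `validate` are hypothetical names, and the only values taken from the article are the DECODE opcode (1) and the status codes 0x01, 0x02, and 0xFF.

```cpp
#include <cstdint>

// Values quoted in the article; the enum names themselves are illustrative.
enum class DescOp : uint8_t { Decode = 1 };
enum class CompStatus : uint8_t { Done = 0x01, RewardNeeded = 0x02, Error = 0xFF };

// Pin the constants at compile time so a silent edit on one side is caught.
static_assert(static_cast<uint8_t>(DescOp::Decode) == 1, "opcode drifted");
static_assert(static_cast<uint8_t>(CompStatus::Done) == 0x01, "DONE drifted");
static_assert(static_cast<uint8_t>(CompStatus::RewardNeeded) == 0x02, "REWARD_NEEDED drifted");

enum class DescError { Ok, BadOpcode, RolloutIdTooWide, BadMaxTokens };

// Hypothetical host-side request, held in wide integers before it is
// narrowed to the 8-bit and 16-bit port widths.
struct DecodeRequest {
    uint32_t opcode;
    uint32_t rollout_id;
    uint32_t max_tokens;
};

// Checks the conditions the RTL is said to assume: valid opcode, rollout_id
// fits in 16 bits, max_tokens is non-zero and fits in 16 bits.
constexpr DescError validate(const DecodeRequest &r) {
    if (r.opcode != static_cast<uint32_t>(DescOp::Decode)) return DescError::BadOpcode;
    if (r.rollout_id > 0xFFFFu) return DescError::RolloutIdTooWide;
    if (r.max_tokens == 0 || r.max_tokens > 0xFFFFu) return DescError::BadMaxTokens;
    return DescError::Ok;
}
```

Rejecting a request here turns a hardware assumption into a software guarantee, which is the direction the co-simulation tests push both sides.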
The practical consequence: when the co-sim tests pass, you have executable evidence that the logical descriptor mapping (field names, widths, opcode values) is consistent, the valid/ready handshakes work in both directions, the FSM reaches every completion state, and backpressure propagates end-to-end. That evidence is cheap to get now and expensive to discover missing in silicon.
The repo now spans runtime code, RTL, and a co-simulation bridge.
Each layer has a single job. The C/CUDA runtime focuses on efficient descriptor construction and completion polling. The RTL engine owns queue mechanics and FSM progression. The Verilator bridge is the adhesive — a thin, boring piece of code whose only job is to faithfully map one side's types onto the other's port signals.
- C runtime: builds hw_desc_t. Calls infer_submit_decode(). Polls for completions. Handles reward checkpoints.
- Verilator bridge: calls eval() each tick. Reads output ports into RtlCompletion.
- RTL engine: mmio_regs → desc_ring → rollout_worker_fsm → completion_ring. All connected in rl_runtime_top.
- Completion path: the RTL asserts comp_valid. Ring stalls if host holds pop_ready=0. Host polls pop_valid via bridge.
co-sim path: sim/run_sim.cpp
submit_decode(rollout_id, seq_len, max_tokens, ...)
↓ bridge maps C++ → RTL input ports
Verilated rl_runtime_top::eval() // advance one clock
↓ mmio_regs: doorbell_pulse asserted
↓ desc_ring: push_valid && push_ready → latch descriptor
↓ worker_fsm: S_IDLE → S_DECODE → token_count++
↓ worker_fsm: token_count >= max_tokens → S_COMPLETE
↓ completion_ring: comp_valid; stall if !comp_ready
↓ bridge reads output ports
poll_completion() → RtlCompletion{ rollout_id, status, final_seq_len }
rtl/ holds all SystemVerilog source.
sim/ holds the Verilator bridge (cosim_bridge.cpp),
the C host API (infer_api.h), and the test driver (run_sim.cpp).
tb/ holds the standalone SystemVerilog testbench for RTL-only verification.
The bridge and the testbench test the same RTL, from opposite sides of the interface.
The bridge should be small, boring, and contract-focused.
The bridge class has five responsibilities and nothing else: initialize the Verilated model, drive reset, tick the clock, map descriptors onto input ports, and read completions from output ports. Any logic beyond that belongs in either the C runtime layer or the RTL layer.
C++: sim/cosim_bridge.h
#include <cstdint>

class Vrl_runtime_top;     // Verilator-generated model class
class VerilatedContext;    // Verilator simulation context

struct RtlCompletion {
    uint16_t rollout_id;
    uint8_t  status;         // 0x01=DONE, 0x02=REWARD_NEEDED, 0xFF=ERROR
    uint16_t final_seq_len;
    uint16_t reward_id;
};

class RtlRuntimeBridge {
public:
    RtlRuntimeBridge();                 // construct + allocate Verilated model
    ~RtlRuntimeBridge();                // delete Verilated model
    void reset(unsigned cycles = 5);    // drive rst_n=0 for N ticks, then release
    void tick();                        // advance clk 0→1→0, call eval() each edge
    bool submit_decode(
        uint16_t rollout_id,
        uint16_t seq_len,
        uint16_t max_tokens,
        uint16_t reward_model_id,
        uint16_t kv_arena_id  = 0,
        uint16_t prefix_id    = 0,
        uint32_t kv_offset    = 0,
        uint32_t delta_offset = 0
    );                                  // returns false if push_ready=0 (ring full)
    bool poll_completion(RtlCompletion *out);  // returns true + fills *out if pop_valid=1
    uint64_t cycles() const;            // monotonic cycle counter for timing budgets
private:
    Vrl_runtime_top  *top_;
    VerilatedContext *ctx_;
    uint64_t          cycle_;
};
reset(): known state
Drives rst_n=0 for N rising edges. All RTL registers clear. Ring head/tail reset to zero. FSM returns to S_IDLE. Required before the first submit_decode call; the bridge constructor calls it automatically.
submit_decode(): drive descriptor
Checks push_ready first. If low, returns false and the caller must retry or yield. If high, maps each C++ argument onto its RTL input port, asserts push_valid for one cycle, then de-asserts. Returns true on successful handshake.
poll_completion(): observe result
Samples pop_valid. If low, returns false. If high, reads all host_comp_* output ports into *out, asserts pop_ready for one cycle to acknowledge, then returns true. Never invents a completion.
The inner loop is three lines. The discipline is in the handshakes.
Most of the bridge implementation is mechanical field mapping. The interesting
parts are the handshake discipline in submit_decode, the
clock-edge sequencing inside tick, and the backpressure
behavior that prevents lost completions.
C++: sim/cosim_bridge.cpp, tick()
// One full clock cycle: rising edge eval, then falling edge eval.
// The model only reacts when eval() is called, so each clock edge needs its own eval().
void RtlRuntimeBridge::tick() {
    top_->clk = 1;
    top_->eval();   // rising edge: flip-flops sample inputs
    top_->clk = 0;
    top_->eval();   // falling edge: combinational logic re-resolves
    ++cycle_;
}
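To see why tick() evaluates on both edges, it helps to look at a model that detects a rising edge by comparing the clock against the value it saw at the previous eval(). The sketch below is a hand-written stand-in, not Verilator output: `ToyReg` and `ToyBench` are hypothetical names, and the single 8-bit register is only there to make the edge visible.

```cpp
#include <cstdint>

// Hand-written stand-in for a generated model: one 8-bit register that
// captures `d` on the rising edge of `clk`. Edge detection compares clk
// against the value seen at the previous eval() call.
struct ToyReg {
    uint8_t clk = 0, d = 0, q = 0;
    void eval() {
        if (clk && !prev_clk_) q = d;   // posedge: register samples its input
        prev_clk_ = clk;
    }
private:
    uint8_t prev_clk_ = 0;
};

// Same sequencing as the bridge's tick(): eval on the rising edge, eval on
// the falling edge, then count the cycle.
struct ToyBench {
    ToyReg   top;
    uint64_t cycle = 0;
    void tick() {
        top.clk = 1; top.eval();
        top.clk = 0; top.eval();
        ++cycle;
    }
};
```

If the falling-edge eval() is skipped, the model never observes clk=0, so the next clk=1 does not register as an edge and the input is silently missed.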
C++: sim/cosim_bridge.cpp, submit_decode()
bool RtlRuntimeBridge::submit_decode(
    uint16_t rollout_id, uint16_t seq_len, uint16_t max_tokens,
    uint16_t reward_model_id, uint16_t kv_arena_id,
    uint16_t prefix_id, uint32_t kv_offset, uint32_t delta_offset)
{
    // 1. Check backpressure: do not drive valid if ring is full
    if (!top_->push_ready) return false;

    // 2. Map C++ arguments onto RTL descriptor input ports
    top_->host_desc_opcode          = DESC_OP_DECODE;   // 8'd1
    top_->host_desc_flags           = 0;
    top_->host_desc_rollout_id      = rollout_id;
    top_->host_desc_kv_arena_id     = kv_arena_id;
    top_->host_desc_prefix_id       = prefix_id;
    top_->host_desc_kv_offset       = kv_offset;
    top_->host_desc_delta_offset    = delta_offset;
    top_->host_desc_seq_len         = seq_len;
    top_->host_desc_max_tokens      = max_tokens;
    top_->host_desc_reward_model_id = reward_model_id;
    top_->host_desc_reserved        = 0;

    // 3. Assert push_valid for one clock; descriptor is latched on rising edge
    top_->host_push_valid = 1;
    tick();
    top_->host_push_valid = 0;   // de-assert; ring owns the slot now
    tick();                      // one idle tick before next operation
    return true;
}
C++: sim/cosim_bridge.cpp, poll_completion()
// If the host never calls this function, comp_valid stays high, the worker
// stays in S_COMPLETE, desc_ring fills up, and push_ready goes low. That is
// the intended backpressure behavior.
bool RtlRuntimeBridge::poll_completion(RtlCompletion *out) {
    // 1. Check if RTL has a completion ready (non-blocking poll)
    if (!top_->host_comp_valid) return false;

    // 2. Read completion fields from RTL output ports
    out->rollout_id    = top_->host_comp_rollout_id;
    out->status        = top_->host_comp_status;
    out->final_seq_len = top_->host_comp_final_seq_len;
    out->reward_id     = top_->host_comp_reward_id;

    // 3. Acknowledge: assert pop_ready for one cycle to advance ring head
    top_->host_comp_ready = 1;
    tick();
    top_->host_comp_ready = 0;
    tick();
    return true;
}
Why the idle tick matters: after de-asserting host_push_valid, the RTL combinational logic needs one clock edge to settle before the next port read reflects the updated ring state. Skipping the idle tick can cause the bridge to see stale push_ready values and submit to a full ring.
Descriptor field mapping
| C++ argument | RTL port | desc_pkg field | Width | Notes |
|---|---|---|---|---|
| rollout_id | host_desc_rollout_id | desc_t.rollout_id | 16b | trajectory identifier, echoed in completion |
| seq_len | host_desc_seq_len | desc_t.seq_len | 16b | sequence length at dispatch time |
| max_tokens | host_desc_max_tokens | desc_t.max_tokens | 16b | generation budget; FSM terminates when reached |
| reward_model_id | host_desc_reward_model_id | desc_t.reward_model_id | 16b | echoed in REWARD_NEEDED completion |
| kv_arena_id | host_desc_kv_arena_id | desc_t.kv_arena_id | 16b | KV cache slab selector |
| kv_offset | host_desc_kv_offset | desc_t.kv_offset | 32b | byte offset within KV arena |
| delta_offset | host_desc_delta_offset | desc_t.delta_offset | 32b | new-token buffer offset |
| hardcoded | host_desc_opcode | DESC_OP_DECODE = 8'd1 | 8b | fixed for submit_decode; other ops need separate methods |
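Today each field is mapped individually. Once the descriptor becomes a shared packed struct, this table can collapse to a single memcpy from the C struct into the RTL input port, provided the bit layout is verified to be identical on both sides.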
The tests prove the control-plane loop — all the way around.
Three tests cover the three protocol behaviors that matter: the happy path, the RL-specific reward transition, and the backpressure case that software tests almost always skip.
Basic decode → DONE
Submit rollout 7 with max_tokens=10, reward_model_id=1. Tick until poll_completion returns true. Assert status==0x01, rollout_id==7, final_seq_len==10. Verifies the full IDLE→DECODE→COMPLETE path and that the bridge reads outputs correctly.
Reward boundary → REWARD_NEEDED
Submit with max_tokens=64. The FSM emits REWARD_NEEDED at the configured 32-token boundary before reaching max_tokens. Assert status==0x02 at final_seq_len==32, then a second completion with status==0x01 at final_seq_len==64. Verifies the RL-specific mid-rollout completion path.
Backpressure → no lost completions
Submit a descriptor. Hold host_comp_ready=0 for 20 ticks after the worker reaches S_COMPLETE. Worker must remain in S_COMPLETE; comp_valid must stay asserted. Release host_comp_ready. Assert completion is received correctly. Verifies that the completion ring never silently drops.
C++: sim/run_sim.cpp, test runner sketch
// Note: top_ is private in cosim_bridge.h. The direct port access in test 3
// assumes the test driver is granted access (for example via a friend
// declaration).

// test 1: basic decode
bridge.reset();
bool ok = bridge.submit_decode(/*rollout_id=*/7, /*seq_len=*/0, /*max_tokens=*/10, /*reward_model_id=*/1);
ASSERT(ok, "submit failed: push_ready was low?");
RtlCompletion c{};
for (int i = 0; i < 200 && !bridge.poll_completion(&c); ++i)
    bridge.tick();   // tick until completion or timeout
ASSERT(c.status == 0x01, "expected DONE");
ASSERT(c.rollout_id == 7, "rollout_id mismatch");
ASSERT(c.final_seq_len == 10, "seq_len mismatch");

// test 3: backpressure, hold comp_ready low for 20 ticks
bridge.reset();
ASSERT(bridge.submit_decode(42, 0, 5, 1), "submit failed: push_ready was low?");
for (int i = 0; i < 100; ++i) bridge.tick();   // wait for worker to reach S_COMPLETE
bridge.top_->host_comp_ready = 0;              // hold backpressure
for (int i = 0; i < 20; ++i) {
    bridge.tick();
    ASSERT(bridge.top_->host_comp_valid, "comp_valid must stay high");
}
bridge.top_->host_comp_ready = 1;              // release; poll_completion performs the handshake
RtlCompletion c2{};
bool got = bridge.poll_completion(&c2);
ASSERT(got && c2.rollout_id == 42, "completion lost under backpressure");
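The three test behaviors can also be written down as a small software reference model, which is handy for checking expectations before running against the Verilated RTL. This is a sketch under assumptions, not the repo's FSM: `WorkerModel` and `ModelCompletion` are hypothetical names, the 32-token reward interval is taken from the test description, the interval is assumed non-zero, and final_seq_len is assumed to be seq_len plus the tokens generated so far.

```cpp
#include <cstdint>

struct ModelCompletion {
    uint16_t rollout_id;
    uint8_t  status;          // 0x01=DONE, 0x02=REWARD_NEEDED
    uint16_t final_seq_len;
};

// Behavioral model of one rollout worker: IDLE -> DECODE -> COMPLETE, with a
// mid-rollout REWARD_NEEDED completion every `reward_interval` tokens.
class WorkerModel {
    enum class State { Idle, Decode, Complete };
public:
    explicit WorkerModel(uint16_t reward_interval = 32)
        : reward_interval_(reward_interval) {}

    bool idle() const { return state_ == State::Idle; }
    bool comp_valid() const { return state_ == State::Complete; }
    const ModelCompletion &comp() const { return comp_; }

    void start(uint16_t rollout_id, uint16_t seq_len, uint16_t max_tokens) {
        rollout_id_ = rollout_id; seq_len_ = seq_len; max_tokens_ = max_tokens;
        tokens_ = 0;
        state_ = State::Decode;
    }

    // One clock tick. comp_ready is the host's acknowledge for this cycle.
    void tick(bool comp_ready) {
        switch (state_) {
        case State::Idle:
            break;
        case State::Decode:
            ++tokens_;
            if (tokens_ >= max_tokens_)               post(0x01);
            else if (tokens_ % reward_interval_ == 0) post(0x02);
            break;
        case State::Complete:
            if (!comp_ready) break;   // backpressure: hold the completion
            state_ = (comp_.status == 0x01) ? State::Idle : State::Decode;
            break;
        }
    }
private:
    void post(uint8_t status) {
        comp_ = {rollout_id_, status, static_cast<uint16_t>(seq_len_ + tokens_)};
        state_ = State::Complete;
    }
    uint16_t reward_interval_;
    uint16_t rollout_id_ = 0, seq_len_ = 0, max_tokens_ = 0, tokens_ = 0;
    ModelCompletion comp_{};
    State state_ = State::Idle;
};
```

Running the same stimulus through this model and through the bridge gives a second opinion on the expected completion sequence: for max_tokens=64 the model posts REWARD_NEEDED at 32 and DONE at 64, and it holds a posted completion for as long as comp_ready stays low.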
What this does and does not prove.
It is easy to over-claim what a co-simulation result means. The bridge tests the control-plane protocol, not the inference computation. Here is the precise scope.
Proven: control-plane contract
- Descriptor layout is correct: all fields map correctly from C++ to RTL ports
- Valid/ready handshake works in the submit direction
- Valid/ready handshake works in the completion direction
- Worker FSM reaches S_COMPLETE for DECODE descriptors
- Worker FSM emits REWARD_NEEDED at the correct token boundary
- Completion ring retains completions under backpressure (no silent drops)
- Reset correctly initializes all RTL state
- C++ bridge correctly maps field widths and opcode constants
Not proven: real inference execution
- Real transformer attention — no matrix math in RTL
- Tensor-core or CUDA kernel scheduling
- GB300 / Blackwell hardware behavior
- NVIDIA MMIO doorbell semantics (NVLink, PCIe BAR)
- HBM bandwidth or latency effects
- NVSwitch topology or multi-GPU routing
- Production inference serving throughput or tail latency
- Physical descriptor layout (current gap — see §Next)
Unify the descriptor into a 64-byte shared physical contract.
Right now the bridge maps fields one-by-one onto individual RTL ports. This validates the logical contract — the field names, widths, and opcode values are consistent. What it does not validate is the physical layout: are the bits in the same position in the C struct as in the RTL packed descriptor?
In a real hardware system this matters. The CPU writes a cache-line-sized descriptor into DRAM. The DMA engine reads it. The hardware queue engine parses it. If the C struct and the RTL struct have different field ordering, endianness assumptions, or padding, the hardware will misparse every work order. The next milestone closes this gap.
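One way to take endianness and padding out of the picture is to serialize each field into a 64-byte buffer at an explicit offset, one byte at a time, instead of relying on how the compiler lays out a struct. The sketch below illustrates that idea under assumptions: the byte offsets are hypothetical and simply follow the field order the bridge uses today (opcode, flags, rollout_id, kv_arena_id, prefix_id, kv_offset, delta_offset, seq_len, max_tokens, reward_model_id, reserved), and `pack_decode`, `store_le16`, `store_le32`, and `load_le16` are illustrative helpers.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using DescBytes = std::array<uint8_t, 64>;

// Byte offsets are an assumption, derived from the bridge's field order.
namespace desc_off {
constexpr std::size_t opcode = 0, flags = 1, rollout_id = 2, kv_arena_id = 4,
                      prefix_id = 6, kv_offset = 8, delta_offset = 12,
                      seq_len = 16, max_tokens = 18, reward_model_id = 20;
}

// Explicit little-endian stores: the result is the same on any host CPU.
inline void store_le16(DescBytes &b, std::size_t off, uint16_t v) {
    b[off]     = static_cast<uint8_t>(v & 0xFF);
    b[off + 1] = static_cast<uint8_t>(v >> 8);
}
inline void store_le32(DescBytes &b, std::size_t off, uint32_t v) {
    for (std::size_t i = 0; i < 4; ++i)
        b[off + i] = static_cast<uint8_t>((v >> (8 * i)) & 0xFF);
}
inline uint16_t load_le16(const DescBytes &b, std::size_t off) {
    return static_cast<uint16_t>(b[off] | (b[off + 1] << 8));
}

// Builds a DECODE descriptor as raw bytes. Unset fields and the reserved
// tail stay zero.
inline DescBytes pack_decode(uint16_t rollout_id, uint16_t seq_len,
                             uint16_t max_tokens, uint16_t reward_model_id,
                             uint32_t kv_offset = 0) {
    DescBytes b{};
    b[desc_off::opcode] = 1;   // DESC_OP_DECODE
    store_le16(b, desc_off::rollout_id, rollout_id);
    store_le32(b, desc_off::kv_offset, kv_offset);
    store_le16(b, desc_off::seq_len, seq_len);
    store_le16(b, desc_off::max_tokens, max_tokens);
    store_le16(b, desc_off::reward_model_id, reward_model_id);
    return b;
}
```

A known byte pattern built this way is a natural input for a co-sim test: the RTL side either parses the same field values back out or the two layouts disagree.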
target contract: shared physical layout (v0.6 goal)
// Both sides must agree on every bit of this layout
C: __attribute__((packed, aligned(64))) hw_desc_t → 64 bytes, little-endian
RTL: typedef struct packed { ... } desc_t → 512 bits, same field order
Validation strategy:
1. Static assert: sizeof(hw_desc_t) == 64
2. Static assert: offsetof(hw_desc_t, rollout_id) == 2 // after opcode + flags
3. Verilator bridge: single DPI call with raw uint8_t[64] bus instead of N ports
4. Co-sim test: write known byte pattern, assert RTL parses fields correctly
5. Reverse test: RTL emits completion; assert C struct parses it correctly
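Steps 1 and 2 of the validation strategy can be sketched on the C side as follows. This is an assumed layout, not the final contract: the field order mirrors the ports the bridge drives today, the 42-byte reserved tail is whatever is left to reach 64 bytes, `desc_to_bus` and `desc_from_bus` are hypothetical helpers, and `__attribute__((packed, aligned(64)))` is GCC/Clang syntax.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumed field order; the real v0.6 layout may differ.
struct __attribute__((packed, aligned(64))) hw_desc_t {
    uint8_t  opcode;
    uint8_t  flags;
    uint16_t rollout_id;
    uint16_t kv_arena_id;
    uint16_t prefix_id;
    uint32_t kv_offset;
    uint32_t delta_offset;
    uint16_t seq_len;
    uint16_t max_tokens;
    uint16_t reward_model_id;
    uint8_t  reserved[42];    // pads the descriptor to one 64-byte cache line
};

// Compile-time layout checks: a reordered or resized field fails the build.
static_assert(sizeof(hw_desc_t) == 64, "descriptor must be exactly 64 bytes");
static_assert(alignof(hw_desc_t) == 64, "descriptor must be cache-line aligned");
static_assert(offsetof(hw_desc_t, rollout_id) == 2, "rollout_id follows opcode + flags");
static_assert(offsetof(hw_desc_t, kv_offset) == 8, "kv_offset offset changed");
static_assert(offsetof(hw_desc_t, reserved) == 22, "reserved offset changed");

// Raw 64-byte view of the descriptor, as a single-bus bridge would carry it.
// The byte order of multi-byte fields is the host's, which is little-endian
// on the platforms this sketch assumes.
inline void desc_to_bus(const hw_desc_t &d, uint8_t (&bus)[64]) {
    std::memcpy(bus, &d, sizeof d);
}
inline hw_desc_t desc_from_bus(const uint8_t (&bus)[64]) {
    hw_desc_t d;
    std::memcpy(&d, bus, sizeof d);
    return d;
}
```

The static asserts only prove the C side is self-consistent. Agreement with the RTL packed struct still needs the co-sim tests in steps 4 and 5.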
C + Verilator co-simulation — current milestone
Logical field mapping. C++ bridge drives RTL ports directly. Three contract tests: decode, reward boundary, backpressure. All pass.
64-byte physical descriptor layout
Shared bit-level layout between C struct and RTL packed struct. Single raw bus in bridge. Static-assert validation. C and RTL must agree on every byte position and endianness assumption.
Multi-worker dispatch
Descriptor ring → round-robin dispatcher → N rollout worker FSMs → completion merge arbiter → completion ring. Per-worker token counters. M concurrent trajectories across N workers. Completion ordering tests.
AXI4-Lite MMIO + MSI-X interrupt
Replace bare MMIO block with proper AXI4-Lite slave interface. Add interrupt generator: assert IRQ line when completion ring is non-empty and interrupt enable bit is set. Host can switch from polling to interrupt-driven completion handling.
The repo is becoming a contract lab.
After building the co-simulation bridge, the relationship between the C runtime and the RTL engine changes. They are no longer two separate things that happen to use the same vocabulary. They are two implementations of one contract — and the bridge is the test harness that proves they agree.
Submit work
How do I submit rollout descriptors to a hardware queue without CPU involvement per token? Write the descriptor, check push_ready, assert push_valid, poll for completion.
Consume work
What hardware queue protocol should accept descriptors, simulate rollout progression, emit reward checkpoints, and propagate backpressure — all without software assistance mid-rollout?
Prove contract
Can the software runtime and the RTL engine agree on that protocol at the bit level, validated by executable tests that exercise the happy path, reward path, and backpressure path?