Hardware / Software Co-Design · Systems Note

C to RTL:
Co-Simulating an RL
Inference Control Plane

gb300-rl-runtime | SystemVerilog + Verilator | ~20 min read

A standalone C runtime can over-assume. A standalone RTL block can over-assume. Co-simulation is the executable proof that they agree — on descriptor layout, valid/ready handshakes, FSM progression, completion semantics, and backpressure. This is how gb300-rl-runtime builds that proof.

Verilator · Descriptor Contract · Ready/Valid · Worker FSM · Completion Ring · Backpressure · DPI Bridge · 64-byte Layout

Scope — This does not prove real transformer execution. It proves the narrower, important thing: that host software and the RTL control-plane engine share the same descriptor protocol, the same completion semantics, and the same backpressure behavior — verified by executable tests.

co-simulation architecture

Why co-simulation matters

A standalone C runtime and a standalone RTL block can both lie to themselves.

When you test them separately, each side controls its own assumptions. The C runtime can write a test that assumes the device always accepts descriptors. The RTL testbench can write a stimulus that drives host_comp_ready high every cycle. Both tests pass. Both are wrong.

Co-simulation removes that escape hatch. When the C++ bridge drives the Verilated RTL model directly, the C side must handle the case where push_ready is low. The RTL side must handle the case where the host holds pop_ready low for many cycles. Neither side can invent capacity the other doesn't have.

C-side over-assumptions

What software silently assumes

Always-ready ring: the device always has room for another descriptor — no check of push_ready
Instant completions: the completion appears within a bounded number of ticks — no timeout handling
Implicit encoding: the C struct field order and the RTL packed struct order are the same — no layout verification
Status constants: 0x01=DONE, 0x02=REWARD_NEEDED match on both sides — no enum validation
Backpressure doesn't happen: host_comp_ready=1 forever in unit tests — completion ring never fills
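
One way to stop trusting the status encoding implicitly is to pin it in a single header that the bridge and the tests both include. The sketch below is illustrative: the names CompStatus and is_known_status are hypothetical, and only the three values (0x01, 0x02, 0xFF) come from the bridge's completion struct.

```cpp
#include <cstdint>

// Hypothetical shared status enum. The values mirror the status comment in
// RtlCompletion: 0x01=DONE, 0x02=REWARD_NEEDED, 0xFF=ERROR.
enum class CompStatus : std::uint8_t {
    Done         = 0x01,
    RewardNeeded = 0x02,
    Error        = 0xFF,
};

// True only for status bytes the host knows how to interpret.
constexpr bool is_known_status(std::uint8_t raw) {
    return raw == static_cast<std::uint8_t>(CompStatus::Done) ||
           raw == static_cast<std::uint8_t>(CompStatus::RewardNeeded) ||
           raw == static_cast<std::uint8_t>(CompStatus::Error);
}

// Compile-time checks: a silent renumbering now breaks the build.
static_assert(is_known_status(0x01) && is_known_status(0x02),
              "DONE and REWARD_NEEDED must stay at 0x01 and 0x02");
static_assert(!is_known_status(0x00),
              "an all-zero status byte must not decode as a valid completion");
```

A host that receives a byte for which is_known_status returns false can treat it as a contract violation instead of guessing.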

RTL-side over-assumptions

What hardware silently assumes

Well-formed push_valid: host never asserts valid with an incomplete or malformed descriptor
Ready respected: host checks push_ready before each submit — never fires when ring is full
Completion drained promptly: host holds pop_ready high — completion ring never stalls the worker FSM
Field widths correct: rollout_id fits in 16 bits, max_tokens is non-zero, opcode is a valid enum value
Reset before use: host always drives reset before the first descriptor submission
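
Some of these assumptions can be checked on the host before push_valid is ever asserted. A minimal sketch, assuming a hypothetical validate_decode helper; DESC_OP_DECODE = 1 is taken from the bridge code, and the error enum is illustrative only.

```cpp
#include <cstdint>

constexpr std::uint8_t DESC_OP_DECODE = 1;  // matches 8'd1 in the bridge code

// Hypothetical result type for the pre-submit check.
enum class DescCheck { Ok, BadOpcode, ZeroMaxTokens };

// Reject descriptors the RTL is assumed never to see: an unknown opcode or a
// zero generation budget.
constexpr DescCheck validate_decode(std::uint8_t opcode, std::uint16_t max_tokens) {
    return opcode != DESC_OP_DECODE ? DescCheck::BadOpcode
         : max_tokens == 0          ? DescCheck::ZeroMaxTokens
                                    : DescCheck::Ok;
}
```

Width checks such as "rollout_id fits in 16 bits" are already enforced by the uint16_t parameter types, so the helper only covers what the type system cannot.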

Co-simulation makes the interface executable. The host must drive the RTL correctly, and the RTL must expose completions and backpressure honestly — or the test fails.

The practical consequence: when the co-sim tests pass, you have a proof, for the cases the tests exercise, that the logical descriptor mapping is correct, the valid/ready handshakes work in both directions, the FSM reaches the DONE and REWARD_NEEDED completion states, and backpressure propagates end-to-end. That proof is cheap to get now and expensive to discover missing in silicon.

The vertical stack

The repo now spans runtime code, RTL, and a co-simulation bridge.

Each layer has a single job. The C/CUDA runtime focuses on efficient descriptor construction and completion polling. The RTL engine owns queue mechanics and FSM progression. The Verilator bridge is the adhesive — a thin, boring piece of code whose only job is to faithfully map one side's types onto the other's port signals.

full data path — left to right

① C/CUDA Runtime

Builds hw_desc_t. Calls infer_submit_decode(). Polls for completions. Handles reward checkpoints.

② Verilator Bridge

Maps C++ struct fields onto RTL input ports. Calls eval() each tick. Reads output ports into RtlCompletion.

③ RTL Engine

mmio_regs → desc_ring → rollout_worker_fsm → completion_ring. All connected in rl_runtime_top.

④ Completion Path

Worker emits comp_valid. Ring stalls if host holds pop_ready=0. Host polls pop_valid via bridge.

⑤ Contract Tests

Decode → DONE. Decode → REWARD_NEEDED. Backpressure hold → no lost completions. All must pass.

    co-sim path
sim/run_sim.cpp
    submit_decode(rollout_id, seq_len, max_tokens, ...)
  ↓  bridge maps C++ → RTL input ports
Verilated rl_runtime_top::eval()          // advance one clock
  ↓  mmio_regs: doorbell_pulse asserted
  ↓  desc_ring: push_valid && push_ready → latch descriptor
  ↓  worker_fsm: S_IDLE → S_DECODE → token_count++
  ↓  worker_fsm: token_count >= max_tokens → S_COMPLETE
  ↓  completion_ring: comp_valid; stall if !comp_ready
  ↓  bridge reads output ports
poll_completion() → RtlCompletion{ rollout_id, status, final_seq_len }
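
The FSM steps in the trace above can be mirrored by a small software model, which is handy as a reference when reading co-sim results. This is a toy sketch, not the RTL: the state names come from the trace, while WorkerModel, start, and step are hypothetical names, and the reward checkpoint path is left out.

```cpp
#include <cstdint>

// States named after the trace: S_IDLE, S_DECODE, S_COMPLETE.
enum class State { Idle, Decode, Complete };

struct WorkerModel {
    State         state       = State::Idle;
    std::uint16_t token_count = 0;
    std::uint16_t max_tokens  = 0;

    // Accept a descriptor only when idle (stands in for push_valid && push_ready).
    bool start(std::uint16_t budget) {
        if (state != State::Idle) return false;
        max_tokens  = budget;
        token_count = 0;
        state       = State::Decode;
        return true;
    }

    // comp_valid is high for as long as the worker sits in Complete.
    bool comp_valid() const { return state == State::Complete; }

    // Advance one clock. comp_ready is the downstream ready signal.
    void step(bool comp_ready) {
        switch (state) {
        case State::Idle:
            break;
        case State::Decode:
            ++token_count;
            if (token_count >= max_tokens) state = State::Complete;
            break;
        case State::Complete:
            if (comp_ready) state = State::Idle;  // stall until acknowledged
            break;
        }
    }
};
```

Calling step(false) repeatedly after the budget is reached leaves the model in Complete with comp_valid() still true, which is the behavior the backpressure test expects from the RTL.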

Repo layout: rtl/ holds all SystemVerilog source. sim/ holds the Verilator bridge (cosim_bridge.cpp), the C host API (infer_api.h), and the test driver (run_sim.cpp). tb/ holds the standalone SystemVerilog testbench for RTL-only verification. The bridge and the testbench test the same RTL, from opposite sides of the interface.

Bridge API

The bridge should be small, boring, and contract-focused.

The bridge class has five responsibilities and nothing else: initialize the Verilated model, drive reset, tick the clock, map descriptors onto input ports, and read completions from output ports. Any logic beyond that belongs in either the C runtime layer or the RTL layer.

    C++
sim/cosim_bridge.h
#pragma once
#include <cstdint>

class Vrl_runtime_top;    // Verilator-generated model class (forward declaration)
class VerilatedContext;   // Verilator simulation context (forward declaration)

struct RtlCompletion {
    uint16_t rollout_id;
    uint8_t  status;           // 0x01=DONE, 0x02=REWARD_NEEDED, 0xFF=ERROR
    uint16_t final_seq_len;
    uint16_t reward_id;
};

class RtlRuntimeBridge {
public:
    RtlRuntimeBridge();                           // allocate Verilated model, then reset()
    ~RtlRuntimeBridge();                          // delete Verilated model

    // The bridge owns raw pointers, so copying would double-delete the model.
    RtlRuntimeBridge(const RtlRuntimeBridge &) = delete;
    RtlRuntimeBridge &operator=(const RtlRuntimeBridge &) = delete;

    void reset(unsigned cycles = 5);              // drive rst_n=0 for N ticks, then release
    void tick();                                  // advance clk 0→1→0, call eval() each edge

    bool submit_decode(
        uint16_t rollout_id,
        uint16_t seq_len,
        uint16_t max_tokens,
        uint16_t reward_model_id,
        uint16_t kv_arena_id   = 0,
        uint16_t prefix_id     = 0,
        uint32_t kv_offset     = 0,
        uint32_t delta_offset  = 0
    );                                            // returns false if push_ready=0 (ring full)

    bool     poll_completion(RtlCompletion *out); // returns true + fills *out if pop_valid=1
    uint64_t cycles() const;                      // monotonic cycle counter for timing budgets
private:
    Vrl_runtime_top  *top_;                       // Verilator-generated model class
    VerilatedContext *ctx_;                       // Verilator simulation context
    uint64_t          cycle_;
};

reset()

Known state

Drives rst_n=0 for N rising edges. All RTL registers clear. Ring head/tail reset to zero. FSM returns to S_IDLE. Required before the first submit_decode call — the bridge constructor calls it automatically.

submit_decode()

Drive descriptor

Checks push_ready first. If low, returns false — caller must retry or yield. If high, maps each C++ argument onto its RTL input port, asserts push_valid for one cycle, then de-asserts. Returns true on successful handshake.
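
Because submit_decode can return false, callers need an explicit retry policy. A minimal sketch, assuming a hypothetical submit_with_retry helper; it is templated on the bridge type so it can be exercised without a Verilated model, and FakeBridge exists only for that purpose.

```cpp
#include <cstdint>

// Retry a submit until the ring accepts it or the attempt budget runs out.
// Bridge is any type with submit_decode(...) -> bool and tick().
template <typename Bridge>
bool submit_with_retry(Bridge &b, std::uint16_t rollout_id, std::uint16_t seq_len,
                       std::uint16_t max_tokens, std::uint16_t reward_model_id,
                       unsigned max_attempts) {
    for (unsigned attempt = 0; attempt < max_attempts; ++attempt) {
        if (b.submit_decode(rollout_id, seq_len, max_tokens, reward_model_id))
            return true;
        b.tick();  // ring was full: give the device a cycle to drain a slot
    }
    return false;  // caller decides whether to yield, drop, or report
}

// Stand-in bridge for illustration: rejects the first `busy` submits.
struct FakeBridge {
    explicit FakeBridge(unsigned busy_submits) : busy(busy_submits) {}
    unsigned busy;
    unsigned ticks = 0;

    bool submit_decode(std::uint16_t, std::uint16_t, std::uint16_t, std::uint16_t) {
        if (busy > 0) { --busy; return false; }
        return true;
    }
    void tick() { ++ticks; }
};
```

With FakeBridge fb(3), a call such as submit_with_retry(fb, 7, 0, 10, 1, 5) succeeds on the fourth attempt after three ticks; with ten busy submits and the same budget it gives up and returns false.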

poll_completion()

Observe result

Samples pop_valid (exposed at the top level as host_comp_valid). If low, returns false. If high, reads all host_comp_* output ports into *out, asserts pop_ready (host_comp_ready) for one cycle to acknowledge, then returns true. Never invents a completion.
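
Since poll_completion is non-blocking, waiting for a result needs a tick budget, otherwise a stuck FSM hangs the test. A hedged sketch: wait_for_completion and FakeCompBridge are hypothetical names, and RtlCompletion is repeated here only to keep the example self-contained.

```cpp
#include <cstdint>

struct RtlCompletion {            // same fields as in cosim_bridge.h
    std::uint16_t rollout_id;
    std::uint8_t  status;
    std::uint16_t final_seq_len;
    std::uint16_t reward_id;
};

// Poll, then tick, up to max_ticks times. Returns false on timeout and leaves
// *out untouched in that case.
template <typename Bridge>
bool wait_for_completion(Bridge &b, RtlCompletion *out, unsigned max_ticks) {
    for (unsigned i = 0; i < max_ticks; ++i) {
        if (b.poll_completion(out)) return true;
        b.tick();
    }
    return false;
}

// Stand-in bridge for illustration: a completion appears after `latency` ticks.
struct FakeCompBridge {
    explicit FakeCompBridge(unsigned latency) : remaining(latency) {}
    unsigned remaining;

    bool poll_completion(RtlCompletion *out) {
        if (remaining > 0) return false;
        *out = RtlCompletion{7, 0x01, 10, 0};
        return true;
    }
    void tick() { if (remaining > 0) --remaining; }
};
```

Returning a bool keeps the timeout case explicit, so a test cannot accidentally assert on a zero-initialized completion.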

Bridge implementation

The inner loop is three lines. The discipline is in the handshakes.

Most of the bridge implementation is mechanical field mapping. The interesting parts are the handshake discipline in submit_decode, the clock-edge sequencing inside tick, and the backpressure behavior that prevents lost completions.

    C++
sim/cosim_bridge.cpp — tick()
// One full clock cycle: rising edge eval, then falling edge eval.
// Verilator requires eval() on each edge for registered outputs to propagate.
void RtlRuntimeBridge::tick() {
    top_->clk = 1;
    top_->eval();                 // rising edge: flip-flops sample inputs
    top_->clk = 0;
    top_->eval();                 // falling edge: combinational logic re-resolves
    ++cycle_;
}
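
To see why both eval() calls matter, it helps to model how a clocked block can detect an edge: by comparing the clock input against the value it saw on the previous eval(). The toy below is an assumption-laden stand-in, not Verilator-generated code; ToyReg and its members are hypothetical.

```cpp
#include <cstdint>

// One 8-bit register that captures `d` on the rising edge of `clk`.
struct ToyReg {
    std::uint8_t clk = 0;  // input
    std::uint8_t d   = 0;  // input
    std::uint8_t q   = 0;  // registered output

    void eval() {
        if (clk && !prev_clk_) q = d;  // rising edge since the last eval()
        prev_clk_ = clk;
    }

private:
    std::uint8_t prev_clk_ = 0;
};

// Same two-phase sequence as the bridge's tick().
inline void tick(ToyReg &m) {
    m.clk = 1;
    m.eval();  // rising edge: q captures d
    m.clk = 0;
    m.eval();  // falling edge: model records clk low again
}
```

If the falling-edge eval() is skipped, the model never observes clk returning to 0, so the next attempt to raise clk is not seen as an edge and q stops updating.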

    C++
sim/cosim_bridge.cpp — submit_decode()
bool RtlRuntimeBridge::submit_decode(
    uint16_t rollout_id, uint16_t seq_len, uint16_t max_tokens,
    uint16_t reward_model_id, uint16_t kv_arena_id,
    uint16_t prefix_id, uint32_t kv_offset, uint32_t delta_offset)
{
    // 1. Check backpressure — do not drive valid if ring is full
    if (!top_->push_ready) return false;

    // 2. Map C++ arguments onto RTL descriptor input ports
    top_->host_desc_opcode          = DESC_OP_DECODE;   // 8'd1
    top_->host_desc_flags           = 0;
    top_->host_desc_rollout_id       = rollout_id;
    top_->host_desc_kv_arena_id      = kv_arena_id;
    top_->host_desc_prefix_id        = prefix_id;
    top_->host_desc_kv_offset        = kv_offset;
    top_->host_desc_delta_offset     = delta_offset;
    top_->host_desc_seq_len          = seq_len;
    top_->host_desc_max_tokens       = max_tokens;
    top_->host_desc_reward_model_id  = reward_model_id;
    top_->host_desc_reserved         = 0;

    // 3. Assert push_valid for one clock — descriptor is latched on rising edge
    top_->host_push_valid = 1;
    tick();
    top_->host_push_valid = 0;       // de-assert; ring owns the slot now
    tick();                           // one idle tick before next operation
    return true;
}

    C++
sim/cosim_bridge.cpp — poll_completion()
bool RtlRuntimeBridge::poll_completion(RtlCompletion *out) {
    // 1. Check if RTL has a completion ready — non-blocking poll
    if (!top_->host_comp_valid) return false;

    // 2. Read completion fields from RTL output ports
    out->rollout_id     = top_->host_comp_rollout_id;
    out->status         = top_->host_comp_status;
    out->final_seq_len  = top_->host_comp_final_seq_len;
    out->reward_id      = top_->host_comp_reward_id;

    // 3. Acknowledge — assert pop_ready for one cycle to advance ring head
    top_->host_comp_ready = 1;
    tick();
    top_->host_comp_ready = 0;
    tick();
    return true;
    // If this function is NOT called: comp_valid stays high, worker stays
    // in S_COMPLETE, desc_ring fills up, push_ready goes low. Correct.
}

Why an idle tick after de-assert? submit_decode ticks twice: once with host_push_valid high to latch the descriptor, and once with it low. Verilator models are cycle-based, so output ports only change when eval() runs. The idle tick gives any registered ring state one more clock edge to update before the next port read. Skipping it can cause the bridge to see a stale push_ready value and submit to a full ring.

Descriptor field mapping

C++ argument	RTL port	desc_pkg field	Width	Notes
`rollout_id`	`host_desc_rollout_id`	`desc_t.rollout_id`	16b	trajectory identifier, echoed in completion
`seq_len`	`host_desc_seq_len`	`desc_t.seq_len`	16b	sequence length at dispatch time
`max_tokens`	`host_desc_max_tokens`	`desc_t.max_tokens`	16b	generation budget; FSM terminates when reached
`reward_model_id`	`host_desc_reward_model_id`	`desc_t.reward_model_id`	16b	echoed in REWARD_NEEDED completion
`kv_arena_id`	`host_desc_kv_arena_id`	`desc_t.kv_arena_id`	16b	KV cache slab selector
`prefix_id`	`host_desc_prefix_id`	`desc_t.prefix_id`	16b	defaults to 0 in submit_decode
`kv_offset`	`host_desc_kv_offset`	`desc_t.kv_offset`	32b	byte offset within KV arena
`delta_offset`	`host_desc_delta_offset`	`desc_t.delta_offset`	32b	new-token buffer offset
hardcoded	`host_desc_opcode`	`DESC_OP_DECODE = 8'd1`	8b	fixed for submit_decode; other ops need separate methods

Physical layout target: This is currently a logical mapping — each field is wired individually onto separate RTL ports. The next step is a physical layout: pack all fields into a single 192-bit (current) or 512-bit (target) port so the bridge can do one memcpy from the C struct into the RTL input port and the bit layout is verified to be identical on both sides.
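
As a sketch of what the packed 192-bit form could look like on the bridge side, the function below packs the fields into six 32-bit words with word 0 holding bits [31:0], which is how Verilator typically presents ports wider than 64 bits. The field order (opcode in the lowest byte, then the order used in submit_decode) and the 16-bit reserved width are assumptions chosen so the total comes to 192 bits; DecodeDesc and pack_desc are hypothetical names.

```cpp
#include <array>
#include <cstdint>

// Logical descriptor fields, widths as in the mapping table.
struct DecodeDesc {
    std::uint8_t  opcode;
    std::uint8_t  flags;
    std::uint16_t rollout_id;
    std::uint16_t kv_arena_id;
    std::uint16_t prefix_id;
    std::uint32_t kv_offset;
    std::uint32_t delta_offset;
    std::uint16_t seq_len;
    std::uint16_t max_tokens;
    std::uint16_t reward_model_id;
    std::uint16_t reserved;  // assumed 16 bits wide
};

// Pack with explicit shifts, so the result does not depend on host endianness
// or compiler padding.
inline std::array<std::uint32_t, 6> pack_desc(const DecodeDesc &d) {
    std::array<std::uint32_t, 6> w{};
    w[0] = std::uint32_t(d.opcode) | (std::uint32_t(d.flags) << 8) |
           (std::uint32_t(d.rollout_id) << 16);
    w[1] = std::uint32_t(d.kv_arena_id) | (std::uint32_t(d.prefix_id) << 16);
    w[2] = d.kv_offset;
    w[3] = d.delta_offset;
    w[4] = std::uint32_t(d.seq_len) | (std::uint32_t(d.max_tokens) << 16);
    w[5] = std::uint32_t(d.reward_model_id) | (std::uint32_t(d.reserved) << 16);
    return w;
}
```

Explicit shifts are slower to write than a memcpy, but they make the bit positions part of the source, which is useful while the two sides are still converging on a layout.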

Co-simulation tests

The tests prove the control-plane loop — all the way around.

Three tests cover the three protocol behaviors that matter: the happy path, the RL-specific reward transition, and the backpressure case that software tests almost always skip.

test 1

Basic decode → DONE

Submit rollout 7 with max_tokens=10, reward_model_id=1. Tick until poll_completion returns true. Assert status==0x01, rollout_id==7, final_seq_len==10. Verifies the full IDLE→DECODE→COMPLETE path and that the bridge reads outputs correctly.

test 2

Reward boundary → REWARD_NEEDED

Submit with max_tokens=64. The FSM emits REWARD_NEEDED at the configured 32-token boundary before reaching max_tokens. Assert status==0x02 at final_seq_len==32, then a second completion with status==0x01 at final_seq_len==64. Verifies the RL-specific mid-rollout completion path.

test 3

Backpressure → no lost completions

Submit a descriptor. Hold host_comp_ready=0 for 20 ticks after the worker reaches S_COMPLETE. Worker must remain in S_COMPLETE; comp_valid must stay asserted. Release host_comp_ready. Assert completion is received correctly. Verifies that the completion ring never silently drops.
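
The completion sequences that tests 1 and 2 assert can be written down as a small reference model and compared against what the bridge returns. This sketch assumes a REWARD_NEEDED checkpoint at every multiple of the boundary strictly below max_tokens; the source only confirms the single 32-then-64 case, so treat the repeated-checkpoint behavior as a guess. Expected and expected_completions are hypothetical names.

```cpp
#include <cstdint>
#include <vector>

struct Expected {
    std::uint8_t  status;         // 0x01=DONE, 0x02=REWARD_NEEDED
    std::uint16_t final_seq_len;
};

// Build the ordered list of completions a decode starting at seq_len=0 is
// expected to produce.
inline std::vector<Expected> expected_completions(std::uint16_t max_tokens,
                                                  std::uint16_t reward_boundary) {
    std::vector<Expected> out;
    if (reward_boundary != 0) {
        // 32-bit loop counter avoids wrap-around near the top of the 16-bit range.
        for (std::uint32_t t = reward_boundary; t < max_tokens; t += reward_boundary)
            out.push_back({0x02, static_cast<std::uint16_t>(t)});
    }
    out.push_back({0x01, max_tokens});  // final DONE at the full budget
    return out;
}
```

For max_tokens=64 and a boundary of 32 this yields REWARD_NEEDED at 32 followed by DONE at 64; for max_tokens=10 it yields a single DONE at 10.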

    C++
sim/run_sim.cpp — test runner sketch
// test 1: basic decode
bridge.reset();
bool ok = bridge.submit_decode(/*rollout_id=*/7, /*seq_len=*/0, /*max_tokens=*/10, /*reward_model_id=*/1);
ASSERT(ok, "submit failed: push_ready was low?");

RtlCompletion c{};
bool done = false;
for (int i = 0; i < 200 && !(done = bridge.poll_completion(&c)); ++i)
    bridge.tick();                           // tick until completion or timeout
ASSERT(done,                    "timed out waiting for completion");
ASSERT(c.status        == 0x01, "expected DONE");
ASSERT(c.rollout_id    == 7,    "rollout_id mismatch");
ASSERT(c.final_seq_len == 10,   "seq_len mismatch");

// test 3: backpressure, hold comp_ready low for 20 ticks
// NOTE: top_ is private in cosim_bridge.h. This sketch assumes the test driver
// has been granted access to the raw ports (for example as a friend class).
bridge.reset();
ok = bridge.submit_decode(42, 0, 5, 1);
ASSERT(ok, "submit failed: push_ready was low?");
for (int i = 0; i < 100; ++i) bridge.tick();   // wait for worker to reach S_COMPLETE
bridge.top_->host_comp_ready = 0;               // hold backpressure
for (int i = 0; i < 20; ++i) {
    bridge.tick();
    ASSERT(bridge.top_->host_comp_valid, "comp_valid must stay high");
}
RtlCompletion c2{};
bool got = bridge.poll_completion(&c2);         // release: poll asserts host_comp_ready
ASSERT(got && c2.rollout_id == 42,             "completion lost under backpressure");
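
The property test 3 checks, that a full or stalled ring refuses new entries instead of overwriting old ones, can be illustrated with a software FIFO that exposes the same valid/ready view. This is a toy model under stated assumptions: CompletionRingModel is a hypothetical name, the depth is arbitrary, and only the rollout_id is stored.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

template <std::size_t Depth>
class CompletionRingModel {
public:
    bool push_ready() const { return count_ < Depth; }  // room for one more
    bool pop_valid()  const { return count_ > 0; }      // something to read

    // Producer side: stores nothing and returns false when the ring is full.
    bool push(std::uint16_t rollout_id) {
        if (!push_ready()) return false;
        slots_[(head_ + count_) % Depth] = rollout_id;
        ++count_;
        return true;
    }

    // Oldest entry; only meaningful while pop_valid() is true.
    std::uint16_t front() const { return slots_[head_]; }

    // Consumer side: advances the head only when an entry is present.
    bool pop() {
        if (!pop_valid()) return false;
        head_ = (head_ + 1) % Depth;
        --count_;
        return true;
    }

private:
    std::array<std::uint16_t, Depth> slots_{};
    std::size_t head_  = 0;
    std::size_t count_ = 0;
};
```

However long the consumer waits before calling pop(), front() keeps returning the same oldest entry, and a producer that sees push() fail has to hold its data, which is the stall the worker FSM is expected to perform.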

expected output

// RTL co-simulation suite — gb300-rl-runtime

RTL co-sim [test 1]: basic decode — rollout_id=7, status=DONE, final_seq_len=10 — PASS

RTL co-sim [test 2]: reward boundary — REWARD_NEEDED@32, DONE@64 — PASS

RTL co-sim [test 3]: completion backpressure — comp_valid held 20 ticks, no loss — PASS

RTL bridge contract tests: 3/3 PASS

A SystemVerilog testbench asks: does the RTL work by itself? A C++ Verilator bridge asks: can host software drive the RTL the way it expects to drive hardware? You need both questions answered.

Honesty boundary

What this does and does not prove.

It is easy to over-claim what a co-simulation result means. The bridge tests the control-plane protocol, not the inference computation. Here is the precise scope.

Proven ✓

Control-plane contract

Logical descriptor mapping is correct: every field maps from its C++ argument to the matching RTL port
Valid/ready handshake works in the submit direction
Valid/ready handshake works in the completion direction
Worker FSM reaches S_COMPLETE for DECODE descriptors
Worker FSM emits REWARD_NEEDED at the correct token boundary
Completion ring retains completions under backpressure (no silent drops)
Reset correctly initializes all RTL state
C++ bridge correctly maps field widths and opcode constants

Not proven ✗

Real inference execution

Real transformer attention — no matrix math in RTL
Tensor-core or CUDA kernel scheduling
GB300 / Blackwell hardware behavior
NVIDIA MMIO doorbell semantics (NVLink, PCIe BAR)
HBM bandwidth or latency effects
NVSwitch topology or multi-GPU routing
Production inference serving throughput or tail latency
Physical descriptor layout (current gap; see Next milestone)

Next milestone

Unify the descriptor into a 64-byte shared physical contract.

Right now the bridge maps fields one-by-one onto individual RTL ports. This validates the logical contract — the field names, widths, and opcode values are consistent. What it does not validate is the physical layout: are the bits in the same position in the C struct as in the RTL packed descriptor?

In a real hardware system this matters. The CPU writes a cache-line-sized descriptor into DRAM. The DMA engine reads it. The hardware queue engine parses it. If the C struct and the RTL struct have different field ordering, endianness assumptions, or padding, the hardware will misparse every work order. The next milestone closes this gap.

    target contract
shared physical layout — v0.6 goal
    // Both sides must agree on every bit of this layout

C:   __attribute__((packed, aligned(64))) hw_desc_t   → 64 bytes, little-endian
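RTL: typedef struct packed { ... } desc_t            → 512 bits, same bit offsets
     (a SystemVerilog packed struct places its first member in the most
      significant bits, so matching the C byte offsets means declaring the
      fields in reverse order or reversing bytes in the bridge)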

Validation strategy:
  1. Static assert: sizeof(hw_desc_t) == 64
  2. Static assert: offsetof(hw_desc_t, rollout_id) == 2  // after opcode + flags
  3. Verilator bridge: single DPI call with raw uint8_t[64] bus instead of N ports
  4. Co-sim test: write known byte pattern, assert RTL parses fields correctly
  5. Reverse test: RTL emits completion; assert C struct parses it correctly
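
Steps 1, 2, and 4 of the strategy above can be sketched on the C++ side as follows. The field order and the 40-byte tail padding are assumptions chosen to satisfy the two stated asserts; the attribute syntax is GCC/Clang specific, and the byte-level comments assume a little-endian host, as the target contract does.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical 64-byte descriptor. Only the 64-byte size and rollout_id at
// offset 2 come from the validation list; the rest is illustrative.
struct __attribute__((packed, aligned(64))) hw_desc_t {
    std::uint8_t  opcode;           // byte 0
    std::uint8_t  flags;            // byte 1
    std::uint16_t rollout_id;       // bytes 2..3
    std::uint16_t kv_arena_id;      // bytes 4..5
    std::uint16_t prefix_id;        // bytes 6..7
    std::uint32_t kv_offset;        // bytes 8..11
    std::uint32_t delta_offset;     // bytes 12..15
    std::uint16_t seq_len;          // bytes 16..17
    std::uint16_t max_tokens;       // bytes 18..19
    std::uint16_t reward_model_id;  // bytes 20..21
    std::uint16_t reserved;         // bytes 22..23
    std::uint8_t  pad[40];          // bytes 24..63, fills the cache line
};

static_assert(sizeof(hw_desc_t) == 64, "descriptor must be one 64-byte line");
static_assert(offsetof(hw_desc_t, rollout_id) == 2, "rollout_id follows opcode + flags");
static_assert(offsetof(hw_desc_t, kv_offset) == 8, "kv_offset must be 4-byte aligned");

// Raw byte view of a descriptor, the form a single wide bus would carry.
inline std::array<std::uint8_t, 64> to_bytes(const hw_desc_t &d) {
    std::array<std::uint8_t, 64> raw{};
    std::memcpy(raw.data(), &d, sizeof d);
    return raw;
}
```

A co-sim test can then write a known pattern such as rollout_id = 0x1234, inspect bytes 2 and 3 of to_bytes(), and compare them with what the RTL parses from the same bus.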

v0.5

C + Verilator co-simulation — current milestone

Logical field mapping. C++ bridge drives RTL ports directly. Three contract tests: decode, reward boundary, backpressure. All pass.

v0.6

64-byte physical descriptor layout

Shared bit-level layout between C struct and RTL packed struct. Single raw bus in bridge. Static-assert validation. C and RTL must agree on every byte position and endianness assumption.

v0.7

Multi-worker dispatch

Descriptor ring → round-robin dispatcher → N rollout worker FSMs → completion merge arbiter → completion ring. Per-worker token counters. M concurrent trajectories across N workers. Completion ordering tests.

v0.8

AXI4-Lite MMIO + MSI-X interrupt

Replace bare MMIO block with proper AXI4-Lite slave interface. Add interrupt generator: assert IRQ line when completion ring is non-empty and interrupt enable bit is set. Host can switch from polling to interrupt-driven completion handling.

Final mental model

The repo is becoming a contract lab.

After building the co-simulation bridge, the relationship between the C runtime and the RTL engine changes. They are no longer two separate things that happen to use the same vocabulary. They are two implementations of one contract — and the bridge is the test harness that proves they agree.

C runtime

Submit work

How do I submit rollout descriptors to a hardware queue without CPU involvement per token? Write the descriptor, check push_ready, assert push_valid, poll for completion.

RTL engine

Consume work

What hardware queue protocol should accept descriptors, simulate rollout progression, emit reward checkpoints, and propagate backpressure — all without software assistance mid-rollout?

Verilator bridge

Prove contract

Can the software runtime and the RTL engine agree on that protocol at the bit level, validated by executable tests that exercise the happy path, reward path, and backpressure path?

The bridge from runtime code to hardware-shaped inference design is the descriptor control-plane contract. Co-simulation is the proof it is real.
