From Runtime Rings to RTL
Building a hardware queue engine for RL inference — descriptor rings, doorbells, rollout worker FSMs, completion queues, backpressure, and Verilator co-simulation. A systems-level deep dive into why your RL inference runtime is already shaped like hardware.
RTL forces the runtime contract to become explicit.
Reinforcement learning inference has a different execution pattern than standard transformer serving. Rollouts are long, sequential, and reward-gated. The runtime cannot simply batch tokens once and return — it must manage per-rollout state across hundreds to thousands of decode steps, observe intermediate reward signals, and decide whether to continue, checkpoint, or terminate each trajectory.
The C/CUDA runtime already has opinions about how this should work: write a descriptor, publish a tail pointer, ring a doorbell, and avoid CPU micromanagement of every token step. But software conventions are fuzzy. The RTL model asks the harder question: what hardware queue protocol makes this execution model provably correct?
Who owns the queue?
Head and tail ownership become hardware-visible state, not just informal conventions inside helper functions. The ring invariant is enforced by RTL, not by programmer discipline.
When is work visible?
The doorbell is not magic. Descriptors must be fully written and released before the device observes the published tail. RTL makes the ordering contract concrete and testable.
What if completion is full?
RTL forces a real answer: stall the worker, route to overflow, or propagate ready/valid backpressure up the chain. Software can paper over this; hardware cannot.
Why RL inference specifically?
Standard inference (one prompt → one response) is stateless from the queue's perspective. RL rollouts are fundamentally different:
- Per-trajectory state — each rollout has a KV cache arena, a sequence length counter, and a reward model assignment that must persist across thousands of decode steps.
- Reward checkpoints — the worker must emit intermediate REWARD_NEEDED completions at configurable intervals (e.g. every 32 tokens), not just a final DONE.
- Heterogeneous work — a single descriptor ring may carry DECODE ops, REWARD scoring ops, and NOP barriers interleaved by the runtime scheduler.
- Long-horizon backpressure — if the reward model lags, the rollout worker must stall cleanly without corrupting queue state or dropping completions.
These requirements map directly onto hardware queue semantics: typed descriptors, FSM-driven progress, completion rings with backpressure, and a shared descriptor contract between CPU and device.
A descriptor engine, not a GPU.
The goal is not to implement transformer attention, tensor cores, HBM scheduling, NVLink arbitration, or GB300 fabric internals. The goal is to model the hardware queue protocol that the software fast path wants to drive — and to prove it is correct by construction.
desc_t with rollout_id, kv_arena, seq_len, max_tokens, reward_model_id

CONTROL-PLANE INVARIANT

    // The queue contract in one diagram:
    descriptor_in
      → ownership_transfer   (tail doorbell)
      → worker_state_machine (progress simulation)
      → completion_out       (done | reward_needed)
      → backpressure_if_full (stall, never drop)
This is exactly the contract that PCIe NVMe drives, GPUs, and DMA engines expose to their host software stacks. Building it in RTL — even as a simulation model — means the software contract is no longer a comment in a header file. It is a synthesizable specification.
The C fast path is already RTL-shaped.
Look at the runtime submit path and you will see that every software primitive maps directly onto a hardware concept. This is not coincidence — it is the consequence of optimizing software until the only overhead left is the minimum required by the hardware protocol.
C FAST PATH — SUBMIT

    // Three lines. This is the entire "hot path".
    q->entries[tail & RING_MASK] = desc;                     // write descriptor into ring slot
    __atomic_store_n(&q->tail, tail + 1, __ATOMIC_RELEASE);  // release ownership to device
    mmio_write32(doorbell, tail + 1);                        // pulse the doorbell — work is visible
The atomic release store is the software equivalent of a write-barrier before
the doorbell. Without it, the device may observe the doorbell increment before
the descriptor bytes are coherent. In RTL, this ordering is enforced structurally:
the doorbell_pulse signal is only asserted on the clock edge
after the MMIO write, and the ring consumer only samples
pop_valid after the pulse.
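The release/acquire pairing described above can be sketched in portable C11. This is a minimal illustration, not the runtime's actual code: `demo_desc_t`, `demo_ring_t`, `demo_push`, and `demo_pop` are hypothetical names, and the consumer here is ordinary software standing in for the device. On real hardware an additional platform-specific barrier between the normal-memory stores and the MMIO doorbell write may be required; that part is outside this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_DEPTH 16u                 /* assumed power-of-two ring depth */
#define DEMO_MASK  (DEMO_DEPTH - 1u)

/* Hypothetical two-field descriptor, just enough to show the ordering. */
typedef struct { uint16_t rollout_id; uint16_t max_tokens; } demo_desc_t;

typedef struct {
    demo_desc_t      entries[DEMO_DEPTH];
    _Atomic uint32_t head;             /* written only by the consumer */
    _Atomic uint32_t tail;             /* written only by the producer */
} demo_ring_t;

static void demo_ring_init(demo_ring_t *q) {
    atomic_init(&q->head, 0u);
    atomic_init(&q->tail, 0u);
}

/* Producer: fill the slot first, then publish the new tail with release. */
static int demo_push(demo_ring_t *q, demo_desc_t d) {
    uint32_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    assert(tail - head <= DEMO_DEPTH);                /* ring invariant */
    if (tail - head == DEMO_DEPTH) return 0;          /* full: caller retries */
    q->entries[tail & DEMO_MASK] = d;
    atomic_store_explicit(&q->tail, tail + 1u, memory_order_release);
    return 1;
}

/* Consumer: acquire the tail, so every slot below it is fully written. */
static int demo_pop(demo_ring_t *q, demo_desc_t *out) {
    uint32_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail) return 0;                       /* empty */
    *out = q->entries[head & DEMO_MASK];
    atomic_store_explicit(&q->head, head + 1u, memory_order_release);
    return 1;
}
```

The release store on `tail` pairs with the acquire load in `demo_pop`: a consumer that observes the new tail is guaranteed to observe the descriptor bytes written before it.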
| C Runtime | RTL Module | Semantic contract | Why it matters |
|---|---|---|---|
| hw_desc_t | desc_pkg::desc_t | Fixed-width work order | One cache-line → one ring slot. Atomic visibility. |
| hw_ring_push() | desc_ring | Producer/consumer ownership | Full/empty flags prevent overwrite and starvation. |
| mmio_write32(doorbell) | doorbell_pulse | Device-visible notification | Work is not visible until tail is published. |
| infer_submit_decode() | rollout_worker_fsm | Descriptor-driven FSM | Device progresses without CPU involvement per token. |
| completion poll loop | completion_ring | Host observes device progress | Backpressure propagates; host never misses a completion. |
| ATOMIC_RELEASE | doorbell clocked after write | Memory ordering | Descriptor bytes coherent before device sees tail. |
Small modules. Clear ownership. Composable interfaces.
Good RTL design favors narrow, well-defined modules with explicit port contracts over monolithic blocks. Each module below owns exactly one concept. Interfaces between modules are always ready/valid handshakes.
REPO LAYOUT

    rtl/
      desc_pkg.sv            // descriptor types, opcodes, completion types
      mmio_regs.sv           // MMIO write decoder, doorbell pulse generation
      desc_ring.sv           // parameterized FIFO/ring with push_ready/pop_valid
      completion_ring.sv     // completion FIFO; stalls worker if full
      rollout_worker_fsm.sv  // IDLE→DECODE→COMPLETE FSM; emits completions
      rl_runtime_top.sv      // top-level module; wires all submodules together
      tb_rl_runtime_top.sv   // SystemVerilog testbench
    sim/
      infer_api.h            // C host interface matching RTL descriptor contract
      cosim_bridge.cpp       // Verilator DPI bridge
      run_sim.cpp            // C test driver
desc_pkg.sv
Defines the shared vocabulary: desc_opcode_t enum, the desc_t packed struct (work order), and the completion_t packed struct (result). Every module imports this package. Changing a field here forces agreement everywhere — the compiler enforces the contract.
mmio_regs.sv
Decodes MMIO write strobe + address. When the CPU writes the doorbell address, asserts doorbell_pulse for exactly one clock cycle and latches doorbell_value (the new published tail). All other addresses go to status/config registers.
desc_ring.sv
Parameterized power-of-two FIFO. Exposes a producer-side push_valid/push_ready handshake and a consumer-side pop_valid/pop_ready handshake, with fully registered head/tail pointers. Each pointer carries one extra MSB, so full and empty stay distinguishable even when the slot indices are equal.
completion_ring.sv
Mirror image of desc_ring, but in the reverse direction. The worker pushes completions; the host polls and pops. When full, push_ready goes low and the worker FSM stalls in S_COMPLETE. No silent drops, ever.
rollout_worker_fsm.sv
The heart of the control plane. Accepts one descriptor at a time, simulates token-by-token progress (one token per clock in RTL — mapped to real GPU latency in co-simulation), and emits DONE or REWARD_NEEDED completions.
rl_runtime_top.sv
Structural module only — no logic, just port wiring. Connects MMIO doorbell to the descriptor ring push path, the ring pop path to the worker FSM, and the worker completion output to the completion ring. The architecture is visible at a glance.
The descriptor is the hardware work order.
A descriptor is a fixed-width, packed struct that fits in one cache line. It is the contract between the CPU runtime and the device queue. Every field has a meaning; no field is optional; all must be present when the descriptor is pushed onto the ring.
SYSTEMVERILOG — desc_pkg.sv

    package desc_pkg;

      // Opcodes identify the type of work in each descriptor
      typedef enum logic [7:0] {
        DESC_OP_NOP    = 8'd0,    // barrier / padding
        DESC_OP_DECODE = 8'd1,    // rollout decode step
        DESC_OP_REWARD = 8'd2,    // reward model evaluation
        DESC_OP_STOP   = 8'd255   // terminate rollout
      } desc_opcode_t;

      // Primary work descriptor — 192 bits (future: pad to 512b for cache-line alignment)
      typedef struct packed {
        logic [7:0]  opcode;          // operation type
        logic [7:0]  flags;           // stream bits, priority, stop-on-reward
        logic [15:0] rollout_id;      // trajectory identifier (host-assigned)
        logic [15:0] kv_arena_id;     // which KV cache slab owns this rollout
        logic [15:0] prefix_id;       // shared prompt prefix (dedup index)
        logic [31:0] kv_offset;       // byte offset into KV arena
        logic [31:0] delta_offset;    // delta / new-token buffer offset
        logic [15:0] seq_len;         // current sequence length at dispatch
        logic [15:0] max_tokens;      // generation budget (EOS or budget exceeded)
        logic [15:0] reward_model_id; // which reward head to evaluate against
        logic [15:0] reserved;        // pad; future: checkpoint interval
      } desc_t;

      // Completion payload — what the device returns to the host
      typedef struct packed {
        logic [15:0] rollout_id;      // echoed from descriptor
        logic [7:0]  status;          // 8'h01=done, 8'h02=reward_needed, 8'hFF=error
        logic [15:0] final_seq_len;   // sequence length at completion
        logic [15:0] reward_id;       // echoed reward_model_id
      } completion_t;

    endpackage
A future revision pads desc_t to a true 512-bit (64-byte) descriptor
so each ring slot occupies exactly one CPU cache line. This aligns RTL ring semantics with
the C hardware descriptor, prevents false sharing, and makes the MMIO write path
cache-friendly on the host side.
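On the host side, the padded layout could look like the sketch below. It assumes a 64-byte cache line and reuses the field order of the 192-bit descriptor; `hw_desc64_t`, `pad`, and `hw_desc64_slot_offset` are hypothetical names introduced only for illustration. The compile-time checks fail the build if the struct ever stops being exactly one cache line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u   /* assumed host cache-line size */

/* Hypothetical padded descriptor: 24 payload bytes + 40 pad bytes = 64. */
typedef struct {
    _Alignas(64) uint8_t opcode;   /* align the whole struct to a cache line */
    uint8_t  flags;
    uint16_t rollout_id;
    uint16_t kv_arena_id;
    uint16_t prefix_id;
    uint32_t kv_offset;
    uint32_t delta_offset;
    uint16_t seq_len;
    uint16_t max_tokens;
    uint16_t reward_model_id;
    uint16_t reserved;
    uint8_t  pad[40];
} hw_desc64_t;

_Static_assert(sizeof(hw_desc64_t) == CACHE_LINE_BYTES,
               "one descriptor must occupy exactly one cache line");
_Static_assert(offsetof(hw_desc64_t, pad) == 24,
               "payload must remain 192 bits");

/* Byte offset of a ring slot; depth is assumed to be a power of two. */
static size_t hw_desc64_slot_offset(uint32_t index, uint32_t depth) {
    assert(depth != 0u && (depth & (depth - 1u)) == 0u);
    return (size_t)(index & (depth - 1u)) * sizeof(hw_desc64_t);
}
```

Because every slot starts on a 64-byte boundary, two adjacent descriptors never share a cache line, which is what rules out false sharing between slots.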
Why kv_arena_id + kv_offset?
RL inference may run hundreds of concurrent trajectories sharing a large KV cache pool. The arena ID selects a pre-allocated slab; the offset indexes within it. This two-level addressing avoids per-trajectory memory management in hardware and mirrors how CUDA stream memory arenas work.
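As a rough host-side illustration of the two-level addressing, the sketch below resolves an (arena ID, offset) pair to a pointer with bounds checks. The `kv_arena_t` table and `kv_resolve` function are assumptions made for this example; the article does not define how arenas are tracked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical arena table entry: one pre-allocated slab per arena ID. */
typedef struct {
    uint8_t *base;        /* start of the slab */
    uint32_t size_bytes;  /* slab capacity */
} kv_arena_t;

/* Resolve (arena_id, offset) to an address, or NULL if the access of
 * `len` bytes would fall outside the slab. */
static uint8_t *kv_resolve(const kv_arena_t *arenas, uint16_t n_arenas,
                           uint16_t arena_id, uint32_t offset, uint32_t len) {
    assert(arenas != NULL);
    if (arena_id >= n_arenas) return NULL;            /* unknown arena */
    const kv_arena_t *a = &arenas[arena_id];
    if (offset > a->size_bytes) return NULL;          /* offset out of range */
    if (len > a->size_bytes - offset) return NULL;    /* would run past the end */
    return a->base + offset;
}
```

The descriptor only carries the two small integers; the slab table lives outside the queue, so no per-trajectory allocation state has to travel through the ring.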
Why reward_model_id in the descriptor?
Different rollout phases may use different reward heads (process reward vs. outcome reward vs. verifier). Encoding the model ID in the descriptor lets a future multi-worker dispatch engine route reward evaluation to the correct accelerator without re-reading host-side metadata.
A doorbell turns a write into work visibility.
The doorbell is one of the most important concepts in hardware queue design. It is not "pressing a button." It is a precise protocol: the host writes descriptors, releases them with a memory barrier, then writes the doorbell register. Only after the doorbell write does the device consider any of the new descriptors visible.
In NVMe, this is the Submission Queue y Tail Doorbell register at offset
0x1000 + (2y) * (4 << CAP.DSTRD). In GPU command channels, it is the
put pointer write. In this RTL model, it is a single MMIO write to address
0x10 that asserts doorbell_pulse for exactly one clock cycle.
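For reference, the NVMe doorbell layout can be computed as below. The sketch follows the register definitions in the NVMe base specification, where the stride between doorbells is 4 << CAP.DSTRD bytes, submission queue y uses slot 2y, and completion queue y uses slot 2y + 1. The function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define NVME_DOORBELL_BASE 0x1000u

/* Offset of the Submission Queue y Tail Doorbell. */
static uint32_t nvme_sq_tail_doorbell(uint16_t qid, uint8_t dstrd) {
    assert(dstrd <= 15u);                     /* CAP.DSTRD is a 4-bit field */
    return NVME_DOORBELL_BASE + (2u * qid) * (4u << dstrd);
}

/* Offset of the Completion Queue y Head Doorbell. */
static uint32_t nvme_cq_head_doorbell(uint16_t qid, uint8_t dstrd) {
    assert(dstrd <= 15u);
    return NVME_DOORBELL_BASE + (2u * qid + 1u) * (4u << dstrd);
}
```

With DSTRD = 0 the doorbells are packed 4 bytes apart, so queue 1's submission doorbell sits at 0x1008 and its completion doorbell at 0x100C.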
SYSTEMVERILOG — mmio_regs.sv

    module mmio_regs #(
      parameter int ADDR_W = 8
    )(
      input  logic              clk,
      input  logic              rst_n,
      input  logic              wr_en,
      input  logic [ADDR_W-1:0] wr_addr,
      input  logic [31:0]       wr_data,
      output logic              doorbell_pulse,  // one cycle strobe
      output logic [31:0]       doorbell_value,  // new published tail
      output logic [31:0]       status_reg       // read-back for host polling
    );
      localparam DOORBELL_ADDR = 8'h10;
      localparam STATUS_ADDR   = 8'h20;

      always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
          doorbell_pulse <= 1'b0;
          doorbell_value <= 32'd0;
          status_reg     <= 32'd0;
        end else begin
          doorbell_pulse <= 1'b0;               // default: not asserted
          if (wr_en) begin
            case (wr_addr)
              DOORBELL_ADDR: begin
                doorbell_value <= wr_data;
                doorbell_pulse <= 1'b1;         // assert for exactly one cycle
              end
              STATUS_ADDR: status_reg <= wr_data;
            endcase
          end
        end
      end
    endmodule
The v0.2 RTL improvement makes this more faithful to real hardware: descriptors
are written into a shared-memory region, and the ring consumer only advances past
descriptors with index < doorbell_value. This matches the NVMe
Submission Queue model exactly and prevents the device from consuming partially
written descriptors.
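One detail worth spelling out is that a plain `index < doorbell_value` comparison breaks when the free-running counters wrap. A wrap-safe version of the visibility check might look like the sketch below, assuming 32-bit free-running indices and fewer than 2^31 descriptors outstanding at any time. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* True if descriptor `index` lies below the published tail, using the
 * modular distance so the test survives 32-bit counter wraparound. */
static int desc_is_published(uint32_t index, uint32_t published_tail) {
    uint32_t ahead = published_tail - index;   /* distance modulo 2^32 */
    return ahead != 0u && ahead < 0x80000000u;
}

/* Number of descriptors the consumer may take right now. */
static uint32_t desc_consumable(uint32_t head, uint32_t published_tail,
                                uint32_t depth) {
    uint32_t n = published_tail - head;        /* modular difference */
    assert(n <= depth);                        /* host must not over-publish */
    return n;
}
```

The consumer advances its head only while `desc_is_published(head, doorbell_value)` holds, so a slot written after the last doorbell stays invisible until the next doorbell write.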
Queue ownership becomes hardware state.
The classic power-of-two FIFO ring is one of the most important data structures in hardware design. Its invariants are simple, its implementation is small, and its correctness properties are provable by inspection.
RING INVARIANT — ALWAYS TRUE

    // Producer owns: [tail .. wrap)  — slots it may write
    // Consumer owns: [head .. tail)  — slots it may read
    //
    // empty : head == tail
    // full  : (tail[PTR_W-1:0] == head[PTR_W-1:0]) && (tail[PTR_W] != head[PTR_W])
    // count : tail - head  (modulo 2^(PTR_W+1), but difference is always in [0, DEPTH])

    producer writes tail   // never reads head to compute write address
    consumer writes head   // never reads tail to compute read address
    producer reads  head   // to determine free slots before push
    consumer reads  tail   // to determine available work before pop
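The pointer arithmetic behind these invariants can be checked with a small C model. This is a sketch assuming DEPTH = 16 (so PTR_W = 4 and each pointer is 5 bits wide); the helper names are hypothetical. The full test uses the fact that "index bits equal, MSB different" is the same as head XOR tail being exactly DEPTH.

```c
#include <assert.h>
#include <stdint.h>

#define PTR_W    4u                            /* log2(DEPTH), assumed */
#define DEPTH    (1u << PTR_W)                 /* 16 slots */
#define PTR_MASK ((1u << (PTR_W + 1u)) - 1u)   /* pointers are PTR_W+1 bits */
#define IDX_MASK (DEPTH - 1u)                  /* low bits select the slot */

static int ring_empty(uint32_t head, uint32_t tail) {
    return ((head ^ tail) & PTR_MASK) == 0u;
}

/* Full: same slot index, opposite wrap bit. */
static int ring_full(uint32_t head, uint32_t tail) {
    return ((head ^ tail) & PTR_MASK) == DEPTH;
}

/* Occupancy, modulo 2^(PTR_W+1); always lands in [0, DEPTH]. */
static uint32_t ring_count(uint32_t head, uint32_t tail) {
    uint32_t n = (tail - head) & PTR_MASK;
    assert(n <= DEPTH);
    return n;
}

/* Slot actually addressed by a pointer value. */
static uint32_t ring_slot(uint32_t ptr) {
    return ptr & IDX_MASK;
}
```

For example, head = 20 and tail = 4 (the tail has wrapped past 31) address the same slot, yet the differing wrap bit marks the ring as full with a count of 16.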
SYSTEMVERILOG — desc_ring.sv (core)

    module desc_ring
      import desc_pkg::*;
    #(
      parameter int DEPTH = 16,
      parameter int PTR_W = $clog2(DEPTH)
    )(
      input  logic           clk, rst_n,
      input  logic           push_valid,
      input  desc_t          push_desc,
      output logic           push_ready,  // 0 = ring full, stall producer
      input  logic           pop_ready,
      output logic           pop_valid,   // 0 = ring empty, stall consumer
      output desc_t          pop_desc,
      output logic [PTR_W:0] fill_level   // for status register / overflow counter
    );
      desc_t          mem [DEPTH];
      logic [PTR_W:0] head, tail;         // one extra bit for full/empty disambiguation

      // empty/full are declared before the assigns that use them
      logic empty, full;
      assign empty = (head == tail);
      assign full  = (tail[PTR_W-1:0] == head[PTR_W-1:0]) &&
                     (tail[PTR_W]     != head[PTR_W]);

      assign push_ready = !full;
      assign pop_valid  = !empty;
      assign pop_desc   = mem[head[PTR_W-1:0]];
      assign fill_level = tail - head;

      always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
          head <= '0;
          tail <= '0;
        end else begin
          if (push_valid && push_ready) begin   // handshake: both sides agree
            mem[tail[PTR_W-1:0]] <= push_desc;
            tail <= tail + 1'b1;
          end
          if (pop_valid && pop_ready) begin     // consumer signals readiness
            head <= head + 1'b1;
          end
        end
      end
    endmodule
push_valid && push_ready
must be true simultaneously for a transfer to occur. This is the AXI/TileLink
handshake pattern. Neither side can proceed unilaterally — the protocol is immune
to race conditions by construction.
The worker models rollout progression, not transformer math.
The worker FSM is the RTL model of what the GPU kernel does on the device side. In hardware simulation, it represents one decode step per clock. In Verilator co-simulation, it can be stretched to represent GPU latency by inserting wait states.
S_IDLE: wait for pop_valid from the descriptor ring. Assert pop_ready. Latch the descriptor into the cur register on acceptance.
S_DECODE: increment token_count each clock. Check max_tokens and the reward checkpoint interval. Emit a completion when either threshold is reached.
S_COMPLETE: drive comp_valid to the completion ring. Stall here until comp_ready is asserted (backpressure). Return to S_IDLE on a successful handshake.

SYSTEMVERILOG — rollout_worker_fsm.sv

    // Excerpt: declarations of state, cur, token_count, out_comp and
    // REWARD_INTERVAL_MASK are omitted.
    typedef enum logic [1:0] {
      S_IDLE     = 2'b00,
      S_DECODE   = 2'b01,
      S_COMPLETE = 2'b10
    } fsm_state_t;

    always_ff @(posedge clk or negedge rst_n) begin
      if (!rst_n) begin
        state       <= S_IDLE;
        token_count <= '0;
      end else case (state)
        S_IDLE: begin
          if (pop_valid) begin                 // new descriptor available
            cur         <= pop_desc;           // latch into working register
            token_count <= '0;
            if (pop_desc.opcode == DESC_OP_DECODE)
              state <= S_DECODE;
            else if (pop_desc.opcode == DESC_OP_STOP) begin
              // Fill the payload first so S_COMPLETE never emits stale data.
              // Reporting seq_len as the final length is an assumption.
              out_comp.rollout_id    <= pop_desc.rollout_id;
              out_comp.status        <= 8'h01;  // DONE
              out_comp.final_seq_len <= pop_desc.seq_len;
              out_comp.reward_id     <= pop_desc.reward_model_id;
              state <= S_COMPLETE;             // emit DONE immediately
            end
            // NOP: stay in IDLE, drain the slot
          end
        end

        S_DECODE: begin
          token_count <= token_count + 1'b1;
          if ((token_count + 1'b1) >= cur.max_tokens) begin
            // Budget exhausted — emit final DONE completion
            out_comp.rollout_id    <= cur.rollout_id;
            out_comp.status        <= 8'h01;   // DONE
            out_comp.final_seq_len <= token_count + 1'b1;
            out_comp.reward_id     <= cur.reward_model_id;
            state <= S_COMPLETE;
          end else if (((token_count + 1'b1) & REWARD_INTERVAL_MASK) == '0) begin
            // Reward checkpoint interval hit — emit REWARD_NEEDED
            out_comp.rollout_id    <= cur.rollout_id;
            out_comp.status        <= 8'h02;   // REWARD_NEEDED
            out_comp.final_seq_len <= token_count + 1'b1;
            out_comp.reward_id     <= cur.reward_model_id;
            state <= S_COMPLETE;
            // As written, S_COMPLETE always returns to S_IDLE. Resuming in
            // S_DECODE after the host ACKs the reward is not implemented here.
          end
        end

        S_COMPLETE: begin
          if (comp_ready) begin                // completion ring accepted the result
            state <= S_IDLE;
          end
          // If comp_ready=0: completion ring full — stall here. No dropped results.
        end

        default: state <= S_IDLE;              // recover from the unused 2'b11 encoding
      endcase
    end

    // Output drive: valid when in S_COMPLETE and completion not yet accepted
    assign comp_valid = (state == S_COMPLETE);
    assign pop_ready  = (state == S_IDLE) && pop_valid;
    assign comp_out   = out_comp;
Backpressure is not optional
If the host is slow to drain the completion ring and the worker simply overwrote completions or silently dropped them, the runtime would lose track of reward checkpoints. The worker stalls in S_COMPLETE until comp_ready is asserted — the completion ring's backpressure propagates cleanly to the worker.
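The stall behavior is easy to check in a cycle-level C model of the worker. The sketch below mirrors the three states and the two completion conditions; it assumes a fixed reward interval of 32 tokens and simplifies the descriptor to two fields. `worker_t`, `worker_tick`, and the `ST_*` constants are hypothetical names for this example.

```c
#include <assert.h>
#include <stdint.h>

#define REWARD_INTERVAL 32u   /* assumed checkpoint interval, power of two */

enum { ST_DONE = 0x01, ST_REWARD_NEEDED = 0x02 };
typedef enum { S_IDLE, S_DECODE, S_COMPLETE } fsm_state_t;

typedef struct {
    fsm_state_t state;
    uint16_t    rollout_id;
    uint16_t    max_tokens;
    uint16_t    token_count;
    uint8_t     status;       /* pending completion status */
} worker_t;

/* Advance the worker by one clock. Returns 1 on the cycle in which a
 * completion is handed to the completion ring, 0 otherwise. */
static int worker_tick(worker_t *w, int pop_valid, uint16_t rollout_id,
                       uint16_t max_tokens, int comp_ready) {
    assert(w);
    switch (w->state) {
    case S_IDLE:
        if (pop_valid) {                       /* latch the descriptor */
            w->rollout_id  = rollout_id;
            w->max_tokens  = max_tokens;
            w->token_count = 0;
            w->state       = S_DECODE;
        }
        return 0;
    case S_DECODE:
        w->token_count++;                      /* one token per clock */
        if (w->token_count >= w->max_tokens) {
            w->status = ST_DONE;
            w->state  = S_COMPLETE;
        } else if ((w->token_count & (REWARD_INTERVAL - 1u)) == 0u) {
            w->status = ST_REWARD_NEEDED;
            w->state  = S_COMPLETE;
        }
        return 0;
    case S_COMPLETE:
        if (!comp_ready) return 0;             /* ring full: hold the result */
        w->state = S_IDLE;
        return 1;
    }
    return 0;
}
```

Holding `comp_ready` low for any number of cycles leaves the worker parked in `S_COMPLETE` with its pending status intact, which is the no-drop property the RTL is meant to guarantee.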
Round-robin dispatch (v0.3)
In v0.3, a dispatcher module will sit between the desc_ring and N worker FSMs. It pops a descriptor only when a worker is in S_IDLE. Completions from all workers merge into a single completion ring through an arbitration tree. Per-worker token counters become independent.
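A round-robin pick over idle workers can be sketched in a few lines of C. The dispatcher design itself is future work in the article, so everything here is an assumption: `rr_pick` is a hypothetical helper, and the `idle` array stands in for the per-worker S_IDLE indications.

```c
#include <assert.h>
#include <stdint.h>

/* Pick the next idle worker, starting the search at *next so that no
 * worker is favored. Returns the worker index, or -1 if all are busy
 * (in which case the descriptor stays in the ring). */
static int rr_pick(const uint8_t *idle, int n_workers, int *next) {
    assert(n_workers > 0 && *next >= 0 && *next < n_workers);
    for (int i = 0; i < n_workers; i++) {
        int w = (*next + i) % n_workers;
        if (idle[w]) {
            *next = (w + 1) % n_workers;   /* resume after the chosen worker */
            return w;
        }
    }
    return -1;
}
```

Returning -1 without touching `*next` corresponds to leaving pop_ready deasserted: the dispatcher pops nothing until some worker returns to idle.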
Do not silently lose completions.
The completion ring is structurally identical to the descriptor ring but flows in the opposite direction. It is equally important — and equally easy to get wrong. A naive implementation might overwrite old completions when the host is slow. This RTL model refuses to do that.
No silent overwrites
When push_ready is low, the worker stalls in S_COMPLETE. Completions are never overwritten. The host is guaranteed to see every result in the order they were generated.
Ready / valid handshake
The worker drives comp_valid; the ring drives comp_ready. Transfer happens only when both are asserted. This is the same valid/ready handshake that AXI4-Stream uses (TVALID/TREADY), and it is the common pattern in production DMA engines.
Propagate, don't absorb
A full completion ring stalls the worker. A stalled worker stops popping from the descriptor ring. The descriptor ring fills up. The MMIO push path sees push_ready=0 and the C runtime's doorbell write is not accepted. End-to-end backpressure, by construction.
STATUS REGISTER — HOST POLLING

    // Host can read these status registers at any time:
    STATUS_REG[31:24] = worker_state;    // IDLE=0, DECODE=1, COMPLETE=2
    STATUS_REG[23:16] = desc_ring_fill;  // how many descriptors queued
    STATUS_REG[15:8]  = comp_ring_fill;  // how many completions waiting
    STATUS_REG[7:0]   = overflow_count;  // saturating counter; 0=no drops
In this model the host discovers completions by polling the status register until
comp_ring_fill > 0. A production v0.6 extension would add an
MSI-X interrupt generator: raise the interrupt when comp_valid goes
high and the host has set the interrupt-enable bit in the status register. This eliminates
busy-waiting and is how NVMe drives typically signal completion.
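On the host, the status word can be unpacked with a handful of shifts. The field positions follow the layout listed above; the `status_fields_t` struct and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t worker_state;    /* IDLE=0, DECODE=1, COMPLETE=2 */
    uint8_t desc_ring_fill;  /* descriptors queued */
    uint8_t comp_ring_fill;  /* completions waiting */
    uint8_t overflow_count;  /* saturating; 0 means no drops */
} status_fields_t;

/* Split the 32-bit status register into its four byte-wide fields. */
static status_fields_t status_decode(uint32_t reg) {
    status_fields_t f;
    f.worker_state   = (uint8_t)(reg >> 24);
    f.desc_ring_fill = (uint8_t)(reg >> 16);
    f.comp_ring_fill = (uint8_t)(reg >> 8);
    f.overflow_count = (uint8_t)reg;
    return f;
}

/* Polling predicate: is there at least one completion to pop? */
static int status_has_completion(uint32_t reg) {
    return status_decode(reg).comp_ring_fill > 0u;
}
```

A polling loop would read the register, call `status_has_completion`, and treat any nonzero `overflow_count` as a contract violation, since the design promises never to drop a completion.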
This is where it becomes a hardware/software co-design lab.
Verilator compiles the SystemVerilog RTL to C++. A thin DPI bridge exposes the
RTL port map as a C struct. The C runtime calls infer_submit_decode()
and observes completions through the same descriptor contract — but the "device" is
now the RTL model, not a GPU.
This is not just a test harness. It is a proof that the software contract and the hardware contract are the same. Any divergence is a bug, caught at the boundary rather than in production silicon.
CO-SIM DATA PATH

    C infer_submit_decode()
      → Verilated rl_runtime_top (C++ class)
      → RTL: mmio_regs (doorbell_pulse)
      → RTL: desc_ring (push_valid / push_ready)
      → RTL: rollout_worker_fsm (S_IDLE → S_DECODE → S_COMPLETE)
      → RTL: completion_ring (comp_valid / comp_ready)
      → C infer_poll_completion() (host polls pop_valid)
      → completion_t returned to caller
C HOST INTERFACE — infer_api.h

    #include <stdint.h>

    // Mirrors the RTL descriptor contract exactly
    typedef struct {
      uint8_t  opcode;
      uint8_t  flags;
      uint16_t rollout_id;
      uint16_t kv_arena_id;
      uint16_t prefix_id;
      uint32_t kv_offset;
      uint32_t delta_offset;
      uint16_t seq_len;
      uint16_t max_tokens;
      uint16_t reward_model_id;
      uint16_t reserved;
    } hw_desc_t;  // must match desc_pkg::desc_t bit-for-bit (24 bytes, no padding)

    typedef struct {
      uint16_t rollout_id;
      uint8_t  status;         // 0x01=done, 0x02=reward_needed, 0xFF=err
      uint16_t final_seq_len;  // compiler inserts 1 pad byte before this field:
      uint16_t reward_id;      // 8 bytes in C vs. 56 bits in completion_t
    } hw_completion_t;
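Matching field lists is not quite enough for a bit-for-bit match: a SystemVerilog packed struct places its first member in the most significant bits, while a C struct places its first member at the lowest address. One way to bridge the two is an explicit pack step, sketched below. It assumes the 192-bit descriptor crosses the Verilator boundary as six 32-bit words with word 0 holding bits [31:0]; `hw_desc_pack` and `DESC_WORDS` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t  opcode;
    uint8_t  flags;
    uint16_t rollout_id;
    uint16_t kv_arena_id;
    uint16_t prefix_id;
    uint32_t kv_offset;
    uint32_t delta_offset;
    uint16_t seq_len;
    uint16_t max_tokens;
    uint16_t reward_model_id;
    uint16_t reserved;
} hw_desc_t;

#define DESC_WORDS 6   /* 192 bits / 32 bits per word */

/* Pack into the desc_t bit layout: opcode lands in bits [191:184],
 * reserved in bits [15:0]. Word 0 is assumed to be least significant. */
static void hw_desc_pack(const hw_desc_t *d, uint32_t w[DESC_WORDS]) {
    assert(d && w);
    w[5] = ((uint32_t)d->opcode << 24) | ((uint32_t)d->flags << 16) | d->rollout_id;
    w[4] = ((uint32_t)d->kv_arena_id << 16) | d->prefix_id;
    w[3] = d->kv_offset;
    w[2] = d->delta_offset;
    w[1] = ((uint32_t)d->seq_len << 16) | d->max_tokens;
    w[0] = ((uint32_t)d->reward_model_id << 16) | d->reserved;
}
```

Packing field by field also makes the bridge independent of host endianness and compiler padding, which a raw `memcpy` of the struct would not be.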
C TEST DRIVER — run_sim.cpp (future v0.5)

    // Submit a decode descriptor
    infer_submit_decode(&ctx,
        /* rollout_id   */ rollout_id,
        /* kv_arena_id  */ 0,
        /* kv_offset    */ kv_offset,
        /* delta_offset */ delta_offset,
        /* prefix_id    */ prefix_id,
        /* seq_len      */ seq_len,
        /* max_tokens   */ max_tokens);

    // Poll for completion — tick the RTL model each iteration
    hw_completion_t completion;
    while (!infer_poll_completion(&ctx, &completion)) {
        verilator_tick(rtl_top);  // advance RTL model one clock
    }

    // Inspect result
    if (completion.status == 0x02) {         // reward needed
        schedule_reward_eval(completion.rollout_id, completion.reward_id);
    } else if (completion.status == 0x01) {  // done
        finalize_trajectory(completion.rollout_id, completion.final_seq_len);
    }
From RTL toy to co-simulated control-plane engine.
Each version proves one more contract. Taken together, v0.1–v0.5 deliver a complete hardware/software co-design lab for RL inference control planes. v0.6+ extends toward production AXI and interrupt semantics.
v0.1: Basic descriptor engine
Descriptor package, rings, worker FSM, top module, and testbench. Proves the core flow: descriptor in → worker FSM → completion out. The happy path only — no backpressure yet.
v0.2: Doorbell-visible queue semantics
Add MMIO register block with proper address decode. Descriptors become visible to hardware only after doorbell write. Published tail tracking, status registers, and overflow saturating counters. Matches NVMe Submission Queue semantics.
v0.3: Multi-worker dispatch
N worker FSMs, round-robin or credit-based dispatcher, completion merge arbiter, per-worker token counters. Proves the multi-trajectory case — M concurrent rollouts mapped to N hardware workers (M > N is valid, queue absorbs surplus).
v0.4: Reward and done split
Separate REWARD_NEEDED and DONE completion paths. Reward-needed routing to a secondary queue. Completion backpressure stress tests. Runtime-configurable reward checkpoint interval via MMIO config register.
v0.5: C + Verilator co-simulation
The C host submit API drives the RTL model and observes completions through the same hardware-style descriptor contract. The DPI bridge, the C struct layout, and the RTL packed struct are validated as identical. This is the hardware/software contract proof.
v0.6: AXI4-Lite MMIO + MSI-X interrupt
Future. Replace the bare MMIO register block with a proper AXI4-Lite slave interface. Add an MSI-X interrupt generator so the host doesn't need to poll. This brings the control plane to feature parity with PCIe-attached devices and enables integration with standard FPGA IP.
The runtime becomes a hardware contract.
After building this RTL engine, the software fast path looks different. Every atomic store, every doorbell write, every completion poll now has a hardware analog. The mental model shifts from "ring buffer helper" to "DMA queue protocol."
Submit work
How do I submit inference work without CPU micromanagement per token? Answer: write a descriptor, release with ATOMIC_RELEASE, pulse the doorbell. Three lines. Done.
Keep progress resident
How do I keep rollout progression on the device without round-tripping through the host for every decode step? Answer: let the worker FSM tick autonomously. Host only sees completions.
Define protocol
What hardware queue protocol makes that execution model natural and correct? Answer: this. Descriptor ring + doorbell + worker FSM + completion ring + ready/valid backpressure.