MAN\SH AI / Writings

RTL / RL Inference
Hardware / Software Co-Design | Systems Engineering | ~18 min read

From Runtime Rings to RTL

Building a hardware queue engine for RL inference — descriptor rings, doorbells, rollout worker FSMs, completion queues, backpressure, and Verilator co-simulation. A systems-level deep dive into why your RL inference runtime is already shaped like hardware.

SystemVerilog | Verilator | MMIO | Descriptors | RL Inference | FSM Design | Ready / Valid Handshake | C Co-Simulation | PCIe-style Queue
Scope: This is not a GPU in RTL. It is a hardware model of the inference control plane — the queue protocol layer that a close-to-metal runtime wants to drive. No matrix math. No tensor cores. Pure control-plane contract.
[Figure: pipeline signal trace for a single rollout: CPU Write → Doorbell → Desc Ring → Worker FSM → Completion]
Why RTL belongs here

RTL forces the runtime contract to become explicit.

Reinforcement learning inference has a different execution pattern than standard transformer serving. Rollouts are long, sequential, and reward-gated. The runtime cannot simply batch tokens once and return — it must manage per-rollout state across hundreds to thousands of decode steps, observe intermediate reward signals, and decide whether to continue, checkpoint, or terminate each trajectory.

The C/CUDA runtime already has opinions about how this should work: write a descriptor, publish a tail pointer, ring a doorbell, and avoid CPU micromanagement of every token step. But software conventions are fuzzy. The RTL model asks the harder question: what hardware queue protocol makes this execution model provably correct?

Ownership

Who owns the queue?

Head and tail ownership become hardware-visible state, not just informal conventions inside helper functions. The ring invariant is enforced by RTL, not by programmer discipline.

Ordering

When is work visible?

The doorbell is not magic. Descriptors must be fully written and released before the device observes the published tail. RTL makes the ordering contract concrete and testable.

Backpressure

What if completion is full?

RTL forces a real answer: stall the worker, route to overflow, or propagate ready/valid backpressure up the chain. Software can paper over this; hardware cannot.

The RTL block models the control-plane fabric around inference — not the matrix-math engine. Think DMA engine, not tensor core.

Why RL inference specifically?

Standard inference (one prompt → one response) is stateless from the queue's perspective. RL rollouts are fundamentally different:

  • Per-trajectory state — each rollout has a KV cache arena, a sequence length counter, and a reward model assignment that must persist across thousands of decode steps.
  • Reward checkpoints — the worker must emit intermediate REWARD_NEEDED completions at configurable intervals (e.g. every 32 tokens), not just a final DONE.
  • Heterogeneous work — a single descriptor ring may carry DECODE ops, REWARD scoring ops, and NOP barriers interleaved by the runtime scheduler.
  • Long-horizon backpressure — if the reward model lags, the rollout worker must stall cleanly without corrupting queue state or dropping completions.

These requirements map directly onto hardware queue semantics: typed descriptors, FSM-driven progress, completion rings with backpressure, and a shared descriptor contract between CPU and device.

The correct scope

A descriptor engine, not a GPU.

The goal is not to implement transformer attention, tensor cores, HBM scheduling, NVLink arbitration, or GB300 fabric internals. The goal is to model the hardware queue protocol that the software fast path wants to drive — and to prove it is correct by construction.

CONTROL-PLANE DATA PATH — END TO END
① CPU Runtime
Fills desc_t with rollout_id, kv_arena, seq_len, max_tokens, reward_model_id
② MMIO Doorbell
Atomic tail write + doorbell pulse makes new descriptors device-visible
③ Desc Ring
RTL FIFO enforces head/tail ownership, emits push_ready / pop_valid
④ Worker FSM
IDLE → DECODE → COMPLETE state machine; simulates rollout progress per descriptor
⑤ Completion Ring
Returns status, final_seq_len, reward_id to host; stalls worker if full
CONTROL-PLANE INVARIANT
// The queue contract in one diagram:
descriptor_in
  → ownership_transfer (tail doorbell)
  → worker_state_machine (progress simulation)
  → completion_out (done | reward_needed)
  → backpressure_if_full (stall, never drop)

This is exactly the contract that PCIe NVMe drives, GPUs, and DMA engines expose to their host software stacks. Building it in RTL — even as a simulation model — means the software contract is no longer a comment in a header file. It is a synthesizable specification.

Software to hardware bridge

The C fast path is already RTL-shaped.

Look at the runtime submit path and you will see that every software primitive maps directly onto a hardware concept. This is not coincidence — it is the consequence of optimizing software until the only overhead left is the minimum required by the hardware protocol.

C FAST PATH — SUBMIT
// Three lines. This is the entire "hot path".
q->entries[tail & RING_MASK] = desc;                   // write descriptor into ring slot
__atomic_store_n(&q->tail, tail + 1, __ATOMIC_RELEASE);// release ownership to device
mmio_write32(doorbell, tail + 1);                      // pulse the doorbell — work is visible

The atomic release store is the software equivalent of a write-barrier before the doorbell. Without it, the device may observe the doorbell increment before the descriptor bytes are coherent. In RTL, this ordering is enforced structurally: the doorbell_pulse signal is only asserted on the clock edge after the MMIO write, and the ring consumer only samples pop_valid after the pulse.
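
To make the three-step ordering concrete, here is a minimal host-side sketch of the submit path. It assumes a single producer, a 16-entry ring, and a stand-in doorbell that is just a variable; the names `ring_t`, `ring_submit`, and `fake_doorbell` are hypothetical and not part of the runtime. A real driver writes to a mapped device register and, depending on the platform, may need an additional barrier between the normal-memory stores and the MMIO write.

```c
#include <stdint.h>
#include <stdbool.h>

#define RING_DEPTH 16u                  /* assumed depth; must be a power of two */
#define RING_MASK  (RING_DEPTH - 1u)

typedef struct { uint8_t opcode; uint16_t rollout_id; uint16_t max_tokens; } desc_t;

typedef struct {
    desc_t   entries[RING_DEPTH];
    uint32_t head;                      /* advanced by the consumer (device) */
    uint32_t tail;                      /* advanced by the producer (host)   */
} ring_t;

static volatile uint32_t fake_doorbell; /* stand-in for a mapped MMIO register */

static void mmio_write32(volatile uint32_t *reg, uint32_t v) { *reg = v; }

/* Returns false when the ring is full so the caller can retry later. */
static bool ring_submit(ring_t *q, desc_t desc)
{
    uint32_t tail = q->tail;            /* only the producer writes tail */
    uint32_t head = __atomic_load_n(&q->head, __ATOMIC_ACQUIRE);
    if (tail - head == RING_DEPTH)
        return false;                   /* full: never overwrite unread slots */

    q->entries[tail & RING_MASK] = desc;                     /* 1. fill the slot */
    __atomic_store_n(&q->tail, tail + 1, __ATOMIC_RELEASE);  /* 2. publish tail  */
    mmio_write32(&fake_doorbell, tail + 1);                  /* 3. ring doorbell */
    return true;
}
```

The full check is the software counterpart of `push_ready`: the producer reads `head` only to decide whether it may write, never to compute the write address.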

C Runtime | RTL Module | Semantic contract | Why it matters
hw_desc_t | desc_pkg::desc_t | Fixed-width work order | One cache-line → one ring slot. Atomic visibility.
hw_ring_push() | desc_ring | Producer/consumer ownership | Full/empty flags prevent overwrite and starvation.
mmio_write32(doorbell) | doorbell_pulse | Device-visible notification | Work is not visible until tail is published.
infer_submit_decode() | rollout_worker_fsm | Descriptor-driven FSM | Device progresses without CPU involvement per token.
completion poll loop | completion_ring | Host observes device progress | Backpressure propagates; host never misses a completion.
ATOMIC_RELEASE | doorbell clocked after write | Memory ordering | Descriptor bytes coherent before device sees tail.
RTL module map

Small modules. Clear ownership. Composable interfaces.

Good RTL design favors narrow, well-defined modules with explicit port contracts over monolithic blocks. Each module below owns exactly one concept. Interfaces between modules are always ready/valid handshakes.

REPO LAYOUT
rtl/
  desc_pkg.sv          // descriptor types, opcodes, completion types
  mmio_regs.sv         // MMIO write decoder, doorbell pulse generation
  desc_ring.sv         // parameterized FIFO/ring with push_ready/pop_valid
  completion_ring.sv   // completion FIFO; stalls worker if full
  rollout_worker_fsm.sv // IDLE→DECODE→COMPLETE FSM; emits completions
  rl_runtime_top.sv    // top-level module; wires all submodules together
  tb_rl_runtime_top.sv // SystemVerilog testbench
sim/
  infer_api.h          // C host interface matching RTL descriptor contract
  cosim_bridge.cpp     // Verilator DPI bridge
  run_sim.cpp          // C test driver
Package

desc_pkg.sv

Defines the shared vocabulary: desc_opcode_t enum, the desc_t packed struct (work order), and the completion_t packed struct (result). Every module imports this package. Changing a field here forces agreement everywhere — the compiler enforces the contract.

MMIO

mmio_regs.sv

Decodes MMIO write strobe + address. When the CPU writes the doorbell address, asserts doorbell_pulse for exactly one clock cycle and latches doorbell_value (the new published tail). All other addresses go to status/config registers.

Queue

desc_ring.sv

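Parameterized power-of-two FIFO. The producer side uses a push_valid/push_ready handshake and the consumer side uses pop_valid/pop_ready; the ring itself drives push_ready and pop_valid. Head and tail pointers are fully registered and carry one extra MSB, which gives unambiguous full/empty detection even when the slot indices alias.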

Completion

completion_ring.sv

Mirror image of desc_ring, but in the reverse direction. The worker pushes completions; the host polls and pops. When full, push_ready goes low and the worker FSM stalls in S_COMPLETE. No silent drops, ever.

Worker

rollout_worker_fsm.sv

The heart of the control plane. Accepts one descriptor at a time, simulates token-by-token progress (one token per clock in RTL — mapped to real GPU latency in co-simulation), and emits DONE or REWARD_NEEDED completions.

Top

rl_runtime_top.sv

Structural module only — no logic, just port wiring. Connects MMIO doorbell to the descriptor ring push path, the ring pop path to the worker FSM, and the worker completion output to the completion ring. The architecture is visible at a glance.

Descriptor format

The descriptor is the hardware work order.

A descriptor is a fixed-width, packed struct that fits in one cache line. It is the contract between the CPU runtime and the device queue. Every field has a meaning; no field is optional; all must be present when the descriptor is pushed onto the ring.

SYSTEMVERILOG — desc_pkg.sv
package desc_pkg;

  // Opcodes identify the type of work in each descriptor
  typedef enum logic [7:0] {
    DESC_OP_NOP    = 8'd0,   // barrier / padding
    DESC_OP_DECODE = 8'd1,   // rollout decode step
    DESC_OP_REWARD = 8'd2,   // reward model evaluation
    DESC_OP_STOP   = 8'd255  // terminate rollout
  } desc_opcode_t;

  // Primary work descriptor — 192 bits (future: pad to 512b for cache-line alignment)
  typedef struct packed {
    logic [7:0]   opcode;         // operation type
    logic [7:0]   flags;          // stream bits, priority, stop-on-reward
    logic [15:0]  rollout_id;     // trajectory identifier (host-assigned)
    logic [15:0]  kv_arena_id;    // which KV cache slab owns this rollout
    logic [15:0]  prefix_id;      // shared prompt prefix (dedup index)
    logic [31:0]  kv_offset;      // byte offset into KV arena
    logic [31:0]  delta_offset;   // delta / new-token buffer offset
    logic [15:0]  seq_len;        // current sequence length at dispatch
    logic [15:0]  max_tokens;     // generation budget (EOS or budget exceeded)
    logic [15:0]  reward_model_id;// which reward head to evaluate against
    logic [15:0]  reserved;       // pad; future: checkpoint interval
  } desc_t;

  // Completion payload — what the device returns to the host
  typedef struct packed {
    logic [15:0]  rollout_id;     // echoed from descriptor
    logic [7:0]   status;         // 8'h01=done, 8'h02=reward_needed, 8'hFF=error
    logic [15:0]  final_seq_len;  // sequence length at completion
    logic [15:0]  reward_id;      // echoed reward_model_id
  } completion_t;

endpackage
Future direction: Expand desc_t to a true 512-bit (64-byte) descriptor so each ring slot occupies exactly one CPU cache line. This aligns RTL ring semantics with the C hardware descriptor, prevents false sharing, and makes the MMIO write path cache-friendly on the host side.
Field design

Why kv_arena_id + kv_offset?

RL inference may run hundreds of concurrent trajectories sharing a large KV cache pool. The arena ID selects a pre-allocated slab; the offset indexes within it. This two-level addressing avoids per-trajectory memory management in hardware and mirrors how CUDA stream memory arenas work.
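
A small host-side sketch of that two-level lookup, assuming a fixed table of four arenas and a bounds check on every access. The names `kv_arena_t`, `kv_resolve`, and `NUM_ARENAS` are hypothetical; the source does not specify how the runtime stores its arena table.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_ARENAS 4u                      /* assumed pool size */

typedef struct {
    uint8_t *base;                         /* start of the pre-allocated slab */
    uint32_t size;                         /* slab size in bytes              */
} kv_arena_t;

/* Resolve (arena_id, kv_offset) to a pointer, or NULL if the requested
 * range [kv_offset, kv_offset + len) does not fit inside the slab. */
static uint8_t *kv_resolve(const kv_arena_t *arenas, uint16_t arena_id,
                           uint32_t kv_offset, uint32_t len)
{
    if (arena_id >= NUM_ARENAS) return NULL;
    const kv_arena_t *a = &arenas[arena_id];
    if (kv_offset > a->size || len > a->size - kv_offset) return NULL;
    return a->base + kv_offset;
}
```

Because the descriptor carries only an ID and an offset, the device side never allocates or frees memory; it only indexes into slabs the host set up in advance.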

Field design

Why reward_model_id in the descriptor?

Different rollout phases may use different reward heads (process reward vs. outcome reward vs. verifier). Encoding the model ID in the descriptor lets a future multi-worker dispatch engine route reward evaluation to the correct accelerator without re-reading host-side metadata.

Doorbell semantics

A doorbell turns a write into work visibility.

The doorbell is one of the most important concepts in hardware queue design. It is not "pressing a button." It is a precise protocol: the host writes descriptors, releases them with a memory barrier, then writes the doorbell register. Only after the doorbell write does the device consider any of the new descriptors visible.

In NVMe, this is the Submission Queue y Tail Doorbell register at offset 0x1000 + (2y) * (4 << CAP.DSTRD). In GPU command channels, it is the put pointer write. In this RTL model, it is a single MMIO write to address 0x10 that asserts doorbell_pulse for exactly one clock cycle.
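
For reference, the NVMe doorbell layout can be written as two one-line helpers. This is a sketch of the register arithmetic only: `y` is the queue ID and `dstrd` is the doorbell stride field from the controller capabilities register, and the function names are hypothetical.

```c
#include <stdint.h>

/* Submission Queue y Tail Doorbell: 0x1000 + (2y) * (4 << CAP.DSTRD) */
static uint32_t nvme_sq_tail_db(uint32_t y, uint32_t dstrd)
{
    return 0x1000u + (2u * y) * (4u << dstrd);
}

/* Completion Queue y Head Doorbell: 0x1000 + (2y + 1) * (4 << CAP.DSTRD) */
static uint32_t nvme_cq_head_db(uint32_t y, uint32_t dstrd)
{
    return 0x1000u + (2u * y + 1u) * (4u << dstrd);
}
```

With a stride of zero, the doorbells are packed as consecutive 32-bit registers, alternating submission tail and completion head.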

SYSTEMVERILOG — mmio_regs.sv
module mmio_regs #(
  parameter int ADDR_W = 8
)(
  input  logic             clk,
  input  logic             rst_n,
  input  logic             wr_en,
  input  logic [ADDR_W-1:0] wr_addr,
  input  logic [31:0]       wr_data,
  output logic             doorbell_pulse,  // one cycle strobe
  output logic [31:0]       doorbell_value,  // new published tail
  output logic [31:0]       status_reg       // read-back for host polling
);
  localparam DOORBELL_ADDR = 8'h10;
  localparam STATUS_ADDR   = 8'h20;

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      doorbell_pulse <= 1'b0;
      doorbell_value <= 32'd0;
      status_reg     <= 32'd0;
    end else begin
      doorbell_pulse <= 1'b0;            // default: not asserted
      if (wr_en) begin
        case (wr_addr)
          DOORBELL_ADDR: begin
            doorbell_value <= wr_data;
            doorbell_pulse <= 1'b1;      // assert for exactly one cycle
          end
          STATUS_ADDR: status_reg <= wr_data;
        endcase
      end
    end
  end
endmodule
[Figure: doorbell timing over four clocks (T0 to T3), showing wr_en, wr_addr (0x10), db_pulse, and db_value (tail+1, then stable)]

The v0.2 RTL improvement makes this more faithful to real hardware: descriptors are written into a shared-memory region, and the ring consumer only advances past descriptors with index < doorbell_value. This mirrors the NVMe Submission Queue model and prevents the device from consuming partially written descriptors.
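
The published-tail rule can be modeled in a few lines of C. This is a behavioral sketch, not the v0.2 RTL: the queue depth, the `dev_queue_t` type, and both function names are assumptions made for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

#define DEPTH 8u                           /* assumed; power of two */

typedef struct {
    uint32_t slots[DEPTH];                 /* descriptor payloads (simplified) */
    uint32_t head;                         /* device consume index             */
    uint32_t published_tail;               /* last value written to doorbell   */
} dev_queue_t;

/* Host side: the doorbell write is the only thing that moves published_tail. */
static void doorbell_write(dev_queue_t *q, uint32_t new_tail)
{
    q->published_tail = new_tail;
}

/* Device side: pop one descriptor if any published work remains. */
static bool device_pop(dev_queue_t *q, uint32_t *out)
{
    if (q->head == q->published_tail)
        return false;                      /* slot may be written, but is not published */
    *out = q->slots[q->head % DEPTH];
    q->head++;
    return true;
}
```

A slot that has been filled but not yet covered by a doorbell write is invisible to `device_pop`, which is the property that keeps the device away from partially written descriptors.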

Descriptor ring

Queue ownership becomes hardware state.

The classic power-of-two FIFO ring is one of the most important data structures in hardware design. Its invariants are simple, its implementation is small, and its correctness properties are provable by inspection.

RING INVARIANT — ALWAYS TRUE
// Producer owns: [tail .. head + DEPTH) — free slots it may write
// Consumer owns: [head .. tail) — slots it may read
//
// empty : head == tail
// full  : (tail[PTR_W-1:0] == head[PTR_W-1:0]) && (tail[PTR_W] != head[PTR_W])
// count : tail - head  (modulo 2^(PTR_W+1), but difference is always in [0, DEPTH])

producer writes tail   // never reads head to compute write address
consumer writes head   // never reads tail to compute read address
producer reads head    // to determine free slots before push
consumer reads tail    // to determine available work before pop
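
The wrap-bit rule above can be exercised in software before touching RTL. The sketch below models only the pointers (no storage), assuming DEPTH = 16 so that pointers are 5 bits wide; the type and function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

#define PTR_W    4u                             /* log2(DEPTH)                 */
#define DEPTH    (1u << PTR_W)                  /* 16 slots                    */
#define PTR_MASK ((1u << (PTR_W + 1u)) - 1u)    /* pointers are PTR_W + 1 bits */
#define IDX_MASK (DEPTH - 1u)                   /* low bits select the slot    */

typedef struct { uint32_t head, tail; } ring_ptrs_t;

static bool ring_empty(const ring_ptrs_t *r) { return r->head == r->tail; }

static bool ring_full(const ring_ptrs_t *r)
{
    /* same slot index, different wrap bit */
    return ((r->tail & IDX_MASK) == (r->head & IDX_MASK)) &&
           ((((r->tail ^ r->head) >> PTR_W) & 1u) != 0u);
}

static uint32_t ring_count(const ring_ptrs_t *r)
{
    return (r->tail - r->head) & PTR_MASK;      /* always in [0, DEPTH] */
}

static bool ring_push(ring_ptrs_t *r)
{
    if (ring_full(r)) return false;
    r->tail = (r->tail + 1u) & PTR_MASK;
    return true;
}

static bool ring_pop(ring_ptrs_t *r)
{
    if (ring_empty(r)) return false;
    r->head = (r->head + 1u) & PTR_MASK;
    return true;
}
```

Without the extra bit, head == tail would be ambiguous between empty and full; with it, all DEPTH slots are usable.
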
SYSTEMVERILOG — desc_ring.sv (core)
module desc_ring import desc_pkg::*; #(
  parameter int DEPTH = 16,
  parameter int PTR_W = $clog2(DEPTH)
)(
  input  logic      clk, rst_n,
  input  logic      push_valid,
  input  desc_t    push_desc,
  output logic      push_ready,   // 0 = ring full, stall producer
  input  logic      pop_ready,
  output logic      pop_valid,    // 0 = ring empty, stall consumer
  output desc_t    pop_desc,
  output logic [PTR_W:0] fill_level // for status register / overflow counter
);
  desc_t              mem [DEPTH];
  logic [PTR_W:0]    head, tail;   // one extra bit for full/empty disambiguation

  // empty/full must be declared before the assigns that reference them
  wire empty = (head == tail);
  wire full  = ((tail[PTR_W-1:0] == head[PTR_W-1:0]) &&
                (tail[PTR_W]     != head[PTR_W]));

  assign push_ready  = !full;
  assign pop_valid   = !empty;
  assign pop_desc    = mem[head[PTR_W-1:0]];
  assign fill_level  = tail - head;

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      head <= '0; tail <= '0;
    end else begin
      if (push_valid && push_ready) begin   // handshake: both sides agree
        mem[tail[PTR_W-1:0]] <= push_desc;
        tail <= tail + 1'b1;
      end
      if (pop_valid && pop_ready) begin     // consumer signals readiness
        head <= head + 1'b1;
      end
    end
  end
endmodule
Ready/valid handshake: A transfer occurs only in a clock cycle where push_valid and push_ready are both high. This is the same valid/ready pattern used by AXI and TileLink. Neither side can proceed unilaterally, so a descriptor is never written into a full ring and never accepted twice.
Rollout worker FSM

The worker models rollout progression, not transformer math.

The worker FSM is the RTL model of what the GPU kernel does on the device side. In hardware simulation, it represents one decode step per clock. In Verilator co-simulation, it can be stretched to represent GPU latency by inserting wait states.

S_IDLE
Wait for pop_valid from descriptor ring. Assert pop_ready. Latch descriptor into cur register on acceptance.
S_DECODE
Increment token_count each clock. Check max_tokens and reward checkpoint interval. Emit completion when either threshold is reached.
S_COMPLETE
Assert comp_valid to completion ring. Stall here until comp_ready is asserted (backpressure). Return to S_IDLE on successful handshake.
SYSTEMVERILOG — rollout_worker_fsm.sv
typedef enum logic [1:0] {
  S_IDLE     = 2'b00,
  S_DECODE   = 2'b01,
  S_COMPLETE = 2'b10
} fsm_state_t;

always_ff @(posedge clk or negedge rst_n) begin
  if (!rst_n) begin
    state <= S_IDLE; token_count <= '0;
  end else case (state)

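    S_IDLE: begin
      if (pop_valid) begin          // new descriptor available (pop_ready is high in S_IDLE)
        cur         <= pop_desc;    // latch into working register
        token_count <= '0;
        if (pop_desc.opcode == DESC_OP_DECODE)
          state <= S_DECODE;
        else if (pop_desc.opcode == DESC_OP_STOP) begin
          // Fill the completion here; otherwise S_COMPLETE would emit stale out_comp
          out_comp.rollout_id    <= pop_desc.rollout_id;
          out_comp.status        <= 8'h01;           // DONE
          out_comp.final_seq_len <= pop_desc.seq_len;
          out_comp.reward_id     <= pop_desc.reward_model_id;
          state <= S_COMPLETE;      // emit DONE immediately
        end
        // NOP (and any opcode not handled above, including REWARD): stay in IDLE, drain the slot
      end
    end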

    S_DECODE: begin
      // REWARD_INTERVAL_MASK is (interval - 1), e.g. 31 for a checkpoint every
      // 32 tokens; the interval must be a power of two for the mask test to work.
      token_count <= token_count + 1'b1;

      if ((token_count + 1'b1) >= cur.max_tokens) begin
        // Budget exhausted — emit final DONE completion
        out_comp.rollout_id    <= cur.rollout_id;
        out_comp.status        <= 8'h01;           // DONE
        out_comp.final_seq_len <= cur.seq_len + token_count + 1'b1; // length at dispatch + tokens generated
        out_comp.reward_id     <= cur.reward_model_id;
        state <= S_COMPLETE;

      end else if (((token_count + 1'b1) & REWARD_INTERVAL_MASK) == '0) begin
        // Reward checkpoint interval hit — emit REWARD_NEEDED
        out_comp.rollout_id    <= cur.rollout_id;
        out_comp.status        <= 8'h02;           // REWARD_NEEDED
        out_comp.final_seq_len <= cur.seq_len + token_count + 1'b1;
        out_comp.reward_id     <= cur.reward_model_id;
        state <= S_COMPLETE;
        // As written, S_COMPLETE always returns to S_IDLE, so the rollout does not
        // resume on its own. Returning to S_DECODE after the host ACKs the reward
        // (or to S_IDLE if the scheduler terminates the rollout) is the intended
        // follow-up behavior.
      end
    end

    S_COMPLETE: begin
      if (comp_ready) begin         // completion ring accepted the result
        state <= S_IDLE;
      end
      // If comp_ready=0: completion ring full — stall here. No dropped results.
    end

    default: state <= S_IDLE;       // unused encoding 2'b11: recover to a known state

  endcase
end

// Output drive: valid when in S_COMPLETE and completion not yet accepted
assign comp_valid  = (state == S_COMPLETE);
assign pop_ready   = (state == S_IDLE) && pop_valid;
assign comp_out    = out_comp;
Why S_COMPLETE stalls

Backpressure is not optional

If the host is slow to drain the completion ring and the worker simply overwrote completions or silently dropped them, the runtime would lose track of reward checkpoints. The worker stalls in S_COMPLETE until comp_ready is asserted — the completion ring's backpressure propagates cleanly to the worker.
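
The stall behavior is easy to check with a cycle-level C model of the worker. This is a simplified sketch under stated assumptions: a fixed reward interval of 32 tokens, no descriptor ring (the hypothetical `worker_start` stands in for latching a DECODE descriptor), and `comp_ready` passed in directly.

```c
#include <stdint.h>
#include <stdbool.h>

enum { S_IDLE, S_DECODE, S_COMPLETE };
enum { ST_DONE = 0x01, ST_REWARD_NEEDED = 0x02 };

#define REWARD_INTERVAL 32u                 /* assumed; must be a power of two */

typedef struct {
    int      state;
    uint16_t max_tokens;
    uint16_t token_count;
    uint8_t  status;                        /* meaningful while state == S_COMPLETE */
} worker_t;

/* Stand-in for accepting a DECODE descriptor in S_IDLE. */
static void worker_start(worker_t *w, uint16_t max_tokens)
{
    w->state = S_DECODE;
    w->max_tokens = max_tokens;
    w->token_count = 0;
    w->status = 0;
}

/* Advance one clock. comp_ready models the completion ring's push_ready. */
static void worker_tick(worker_t *w, bool comp_ready)
{
    switch (w->state) {
    case S_DECODE:
        w->token_count++;
        if (w->token_count >= w->max_tokens) {
            w->status = ST_DONE;
            w->state  = S_COMPLETE;
        } else if ((w->token_count & (REWARD_INTERVAL - 1u)) == 0) {
            w->status = ST_REWARD_NEEDED;
            w->state  = S_COMPLETE;
        }
        break;
    case S_COMPLETE:
        if (comp_ready) w->state = S_IDLE;  /* handshake done; otherwise hold */
        break;
    default:
        break;                              /* S_IDLE: wait for a descriptor */
    }
}
```

Ticking the model with `comp_ready` held low leaves it parked in S_COMPLETE with its status intact, which is the no-drop property the RTL is meant to guarantee.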

Multi-worker extension

Round-robin dispatch (v0.3)

In v0.3, a dispatcher module will sit between the desc_ring and N worker FSMs. It pops a descriptor only when a worker is in S_IDLE. Completions from all workers merge into a single completion ring through an arbitration tree. Per-worker token counters become independent.
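
One possible shape for the round-robin choice, sketched in C. The v0.3 dispatcher does not exist yet, so everything here is an assumption: four workers, an `idle` flag per worker, and a rotating start position so no worker is favored.

```c
#include <stdbool.h>

#define NUM_WORKERS 4                        /* assumed worker count */

typedef struct {
    bool idle[NUM_WORKERS];                  /* true when worker i is in S_IDLE */
    int  next;                               /* where the next search starts    */
} dispatcher_t;

/* Returns the chosen worker index, or -1 if every worker is busy
 * (in which case the descriptor stays in the ring). */
static int dispatch_pick(dispatcher_t *d)
{
    for (int k = 0; k < NUM_WORKERS; k++) {
        int i = (d->next + k) % NUM_WORKERS;
        if (d->idle[i]) {
            d->idle[i] = false;              /* worker now owns a descriptor */
            d->next = (i + 1) % NUM_WORKERS; /* rotate priority              */
            return i;
        }
    }
    return -1;
}
```

Returning -1 instead of popping is the dispatcher's form of backpressure: the descriptor ring keeps the surplus until a worker returns to idle.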

Completion path

Do not silently lose completions.

The completion ring is structurally identical to the descriptor ring but flows in the opposite direction. It is equally important — and equally easy to get wrong. A naive implementation might overwrite old completions when the host is slow. This RTL model refuses to do that.

Correctness

No silent overwrites

When push_ready is low, the worker stalls in S_COMPLETE. Completions are never overwritten. The host is guaranteed to see every result in the order they were generated.

Protocol

Ready / valid handshake

The worker drives comp_valid; the ring drives comp_ready. Transfer happens only in a cycle where both are asserted. This is the same valid/ready handshake that AXI4-Stream uses (TVALID/TREADY), and it is common in production DMA engines.

Backpressure

Propagate, don't absorb

A full completion ring stalls the worker. A stalled worker stops popping from the descriptor ring. The descriptor ring fills up. The push path sees push_ready=0, and the C runtime must hold further submissions until a slot frees up. End-to-end backpressure, by construction.

STATUS REGISTER — HOST POLLING
// Host can read these status registers at any time:
STATUS_REG[31:24] = worker_state;       // IDLE=0, DECODE=1, COMPLETE=2
STATUS_REG[23:16] = desc_ring_fill;     // how many descriptors queued
STATUS_REG[15:8]  = comp_ring_fill;     // how many completions waiting
STATUS_REG[7:0]   = overflow_count;     // saturating counter; 0=no drops
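
On the host side, the field layout above unpacks with plain shifts. A minimal sketch follows; the `status_fields_t` type and `status_decode` function are hypothetical names, not part of infer_api.h.

```c
#include <stdint.h>

typedef struct {
    uint8_t worker_state;     /* bits 31:24, IDLE=0, DECODE=1, COMPLETE=2 */
    uint8_t desc_ring_fill;   /* bits 23:16 */
    uint8_t comp_ring_fill;   /* bits 15:8  */
    uint8_t overflow_count;   /* bits 7:0   */
} status_fields_t;

/* Split one 32-bit status read into its four byte-wide fields. */
static status_fields_t status_decode(uint32_t reg)
{
    status_fields_t s;
    s.worker_state   = (uint8_t)(reg >> 24);
    s.desc_ring_fill = (uint8_t)(reg >> 16);
    s.comp_ring_fill = (uint8_t)(reg >> 8);
    s.overflow_count = (uint8_t)reg;
    return s;
}
```

A polling loop would read the register once, decode it, and act on `comp_ring_fill > 0`, so all four fields come from the same snapshot.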
Interrupt vs. polling: This model uses polling for simplicity; the host loops on comp_ring_fill > 0. A production v0.6 extension would add an MSI-X interrupt generator: when a completion becomes valid and the host has set the interrupt-enable bit, the device sends an MSI-X message (a posted memory write to a host-programmed address) rather than asserting a dedicated wire. This eliminates busy-waiting and is how NVMe drives typically signal completion.
C + Verilator bridge

This is where it becomes a hardware/software co-design lab.

Verilator compiles the SystemVerilog RTL to C++. A thin DPI bridge exposes the RTL port map as a C struct. The C runtime calls infer_submit_decode() and observes completions through the same descriptor contract — but the "device" is now the RTL model, not a GPU.

This is not just a test harness. It is a proof that the software contract and the hardware contract are the same. Any divergence is a bug, caught at the boundary rather than in production silicon.

CO-SIM DATA PATH
C infer_submit_decode()
  → Verilated rl_runtime_top (C++ class)
    → RTL: mmio_regs     (doorbell_pulse)
    → RTL: desc_ring     (push_valid / push_ready)
    → RTL: rollout_worker_fsm  (S_IDLE → S_DECODE → S_COMPLETE)
    → RTL: completion_ring (comp_valid / comp_ready)
  → C infer_poll_completion() (host polls pop_valid)
  → completion_t returned to caller
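C HOST INTERFACE — infer_api.h
// Same fields, same order, same widths as the RTL descriptor contract
#include <stdint.h>

typedef struct {
  uint8_t   opcode;
  uint8_t   flags;
  uint16_t  rollout_id;
  uint16_t  kv_arena_id;
  uint16_t  prefix_id;
  uint32_t  kv_offset;
  uint32_t  delta_offset;
  uint16_t  seq_len;
  uint16_t  max_tokens;
  uint16_t  reward_model_id;
  uint16_t  reserved;
} hw_desc_t;                              // 24 bytes = 192 bits, the same total width as desc_pkg::desc_t.
                                          // A packed struct puts its first member in the most-significant
                                          // bits, so the bridge must copy field by field, not memcpy.

typedef struct {
  uint16_t  rollout_id;
  uint8_t   status;                           // 0x01=done, 0x02=reward_needed, 0xFF=err
  uint16_t  final_seq_len;
  uint16_t  reward_id;
} hw_completion_t;                        // C typically pads one byte after status (8 bytes total);
                                          // the RTL completion_t is 56 bits with no padding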
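
Because a SystemVerilog packed struct places its first member in the most-significant bits, a bridge cannot simply copy the C struct's bytes onto the 192-bit port. The sketch below packs field by field into six 32-bit words, assuming the wide port is presented as an array of 32-bit words with word 0 holding bits 31:0 (which is how Verilator is understood to expose wide signals). The function name `desc_pack` is hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint8_t  opcode, flags;
    uint16_t rollout_id, kv_arena_id, prefix_id;
    uint32_t kv_offset, delta_offset;
    uint16_t seq_len, max_tokens, reward_model_id, reserved;
} hw_desc_t;

_Static_assert(sizeof(hw_desc_t) == 24, "hw_desc_t must be 192 bits");

/* Pack MSB-first to match desc_pkg::desc_t.
 * w[5] holds bits 191:160 (opcode at the top), w[0] holds bits 31:0. */
static void desc_pack(const hw_desc_t *d, uint32_t w[6])
{
    w[5] = ((uint32_t)d->opcode << 24) | ((uint32_t)d->flags << 16) | d->rollout_id;
    w[4] = ((uint32_t)d->kv_arena_id << 16) | d->prefix_id;
    w[3] = d->kv_offset;
    w[2] = d->delta_offset;
    w[1] = ((uint32_t)d->seq_len << 16) | d->max_tokens;
    w[0] = ((uint32_t)d->reward_model_id << 16) | d->reserved;
}
```

The static assertion catches a size mismatch at compile time; field order still has to be verified by a round-trip test against the RTL.
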
C TEST DRIVER — run_sim.cpp (future v0.5)
// Submit a decode descriptor
infer_submit_decode(&ctx,
                   /* rollout_id     */ rollout_id,
                   /* kv_arena_id   */ 0,
                   /* kv_offset     */ kv_offset,
                   /* delta_offset  */ delta_offset,
                   /* prefix_id     */ prefix_id,
                   /* seq_len       */ seq_len,
                   /* max_tokens    */ max_tokens);

// Poll for completion — tick the RTL model each iteration
hw_completion_t completion;
while (!infer_poll_completion(&ctx, &completion)) {
  verilator_tick(rtl_top);              // advance RTL model one clock
}

// Inspect result
if (completion.status == 0x02) {       // reward needed
  schedule_reward_eval(completion.rollout_id, completion.reward_id);
} else if (completion.status == 0x01) { // done
  finalize_trajectory(completion.rollout_id, completion.final_seq_len);
}
Why this matters: The co-simulation path forces the C runtime and RTL engine to share one descriptor contract. If a descriptor field is added, both the C struct and the SystemVerilog packed struct must be updated. Neither compiler checks the other language's layout, so the bridge should verify the match explicitly, for example with a size assertion and a field-by-field round-trip test. If the RTL backpressures, the C side must handle it; there is no way to pretend it doesn't happen.
Roadmap

From RTL toy to co-simulated control-plane engine.

Each version proves one more contract. Taken together, v0.1–v0.5 deliver a complete hardware/software co-design lab for RL inference control planes. v0.6+ extends toward production AXI and interrupt semantics.

v0.1

Basic descriptor engine

Descriptor package, rings, worker FSM, top module, and testbench. Proves the core flow: descriptor in → worker FSM → completion out. The happy path only — no backpressure yet.

v0.2

Doorbell-visible queue semantics

Add MMIO register block with proper address decode. Descriptors become visible to hardware only after the doorbell write. Adds published-tail tracking, status registers, and saturating overflow counters. Follows NVMe Submission Queue semantics.

v0.3

Multi-worker dispatch

N worker FSMs, round-robin or credit-based dispatcher, completion merge arbiter, per-worker token counters. Proves the multi-trajectory case — M concurrent rollouts mapped to N hardware workers (M > N is valid, queue absorbs surplus).

v0.4

Reward and done split

Separate REWARD_NEEDED and DONE completion paths. Reward-needed routing to a secondary queue. Completion backpressure stress tests. Runtime-configurable reward checkpoint interval via MMIO config register.

v0.5

C + Verilator co-simulation

The C host submit API drives the RTL model and observes completions through the same hardware-style descriptor contract. The DPI bridge, the C struct layout, and the RTL packed struct are validated as identical. This is the hardware/software contract proof.

v0.6

AXI4-Lite MMIO + MSI-X interrupt

Future. Replace the bare MMIO register block with a proper AXI4-Lite slave interface. Add an MSI-X interrupt generator so the host doesn't need to poll. This brings the control plane to feature parity with PCIe-attached devices and enables integration with standard FPGA IP.

Final mental model

The runtime becomes a hardware contract.

After building this RTL engine, the software fast path looks different. Every atomic store, every doorbell write, every completion poll now has a hardware analog. The mental model shifts from "ring buffer helper" to "DMA queue protocol."

C runtime

Submit work

How do I submit inference work without CPU micromanagement per token? Answer: write a descriptor, publish the tail with an __ATOMIC_RELEASE store, pulse the doorbell. Three lines. Done.

CUDA worker

Keep progress resident

How do I keep rollout progression on the device without round-tripping through the host for every decode step? Answer: let the worker FSM tick autonomously. Host only sees completions.

RTL model

Define protocol

What hardware queue protocol makes that execution model natural and correct? Answer: this. Descriptor ring + doorbell + worker FSM + completion ring + ready/valid backpressure.

This RTL engine models the hardware queue protocol that a close-to-metal RL inference runtime wants to talk to. Build it in RTL and the software contract stops being a convention. It becomes a specification.