Kernel and Driver Samples for AI Storage and RDMA Locality
The last two essays argued that completion locality, bounded batching, and flow identity should become first-class kernel concerns. This follow-on post turns that argument into concrete sample structs, hooks, and tracing code.
Abstract. It is easy to say that Linux should preserve completion provenance across SSD, block, io_uring, and RDMA boundaries. The harder question is where the bits would live, how they would get attached to requests, which call sites would consult them, and how teams would measure whether the design actually reduced useless latency. This post is a code-heavy companion to the previous storage/RDMA essays. The samples here are not upstream-ready patches. They are design sketches meant to make the architecture concrete: request-side metadata in blk-mq, bounded completion deferral, io_uring hint propagation, RDMA flow identity reuse, and BPF tracing that measures completion usefulness instead of only completion existence.
1. What the Samples Need to Prove
A good sample is not a toy. It should answer where the state lives, when it is set, who consumes it, and what metric gets better if the mechanism works.
- State: Where does locality intent live after userspace submits the work? If that answer is “somewhere vague,” the design is not real yet.
- Timing: When do we decide between throughput mode and latency mode? Doing it after completion is too late for many of the wins.
- Reuse: How does a repeated flow avoid paying registration, CPU discovery, and queue selection costs every time?
- Measurement: Which tracepoints expose the delta? If the only chart is GB/s, the sample is missing the point.
My bias: I would rather see a slightly ugly but concrete per-request struct and a measurable tracepoint story than another abstract paragraph about “policy.”
2. Request-Side Provenance in blk-mq
The first problem is that the Linux storage path does not preserve enough submission-side context. So the most direct sample is to hang a lightweight provenance record off the request.
/* Ordered so that >= IO_CLASS_LATENCY means latency-sensitive;
 * the comparisons below rely on that ordering. */
enum io_latency_class {
	IO_CLASS_THROUGHPUT = 0,	/* bulk I/O; generic batching is fine */
	IO_CLASS_PREFETCH   = 1,	/* speculative; never worth steering for */
	IO_CLASS_LATENCY    = 2,	/* a consumer is actively waiting */
	IO_CLASS_STEP_CRIT  = 3,	/* on the critical path of a training step */
};
struct io_completion_prov {
	u32 flow_id;		/* stable identity for a repeated I/O flow */
	s16 home_cpu;		/* preferred completion CPU, -1 if unset */
	s16 home_node;		/* preferred NUMA node, -1 if unset */
	u16 latency_class;	/* enum io_latency_class */
	u16 flags;		/* IO_PROV_F_* below */
	u32 budget_us;		/* latency budget granted at submission */
	u64 expires_ns;		/* absolute deadline: submission time + budget */
};
#define IO_PROV_F_ANTI_MIGRATE	BIT(0)	/* resist migrating the consumer task */
#define IO_PROV_F_BOUNDED_BATCH	BIT(1)	/* batching allowed, but only within budget */
#define IO_PROV_F_REUSE_REG	BIT(2)	/* registration state may be cached */
struct request {
	...
	/* NULL for plain requests; set only on latency-aware submission paths. */
	struct io_completion_prov *prov;
	...
};
I would not default-allocate this for every request in the first prototype. A simpler design is to allocate or embed it only for requests marked from a latency-aware submission path. That keeps the sample honest about overhead.
2.1 Submission-Time Attachment
static inline void blk_rq_attach_prov(struct request *rq,
				      const struct io_completion_prov *src)
{
	if (!src)
		return;

	/* prov_cachep is a slab cache created at init (not shown).
	 * GFP_ATOMIC because submission may run in atomic context;
	 * on allocation failure the request simply stays generic. */
	rq->prov = kmem_cache_alloc(prov_cachep, GFP_ATOMIC);
	if (!rq->prov)
		return;
	*rq->prov = *src;
}
The key point here is not the allocator choice. It is that flow_id, home_cpu, home_node, and budget_us stop being “userspace knowledge only.”
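The matching teardown belongs at final completion, so a recycled request never carries stale provenance. A minimal sketch, assuming the same prov_cachep slab cache:

static inline void blk_rq_detach_prov(struct request *rq)
{
	struct io_completion_prov *prov = rq->prov;

	if (!prov)
		return;
	rq->prov = NULL;
	kmem_cache_free(prov_cachep, prov);
}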
3. Completion Steering in the Block Layer
The next sample is where the earlier argument becomes real: completion CPU selection should care about consumer locality, not just generic dispatch mechanics.
static int choose_completion_cpu(struct request *rq, int complete_cpu)
{
	const struct io_completion_prov *prov = READ_ONCE(rq->prov);

	if (!prov)
		return complete_cpu;
	if (prov->latency_class < IO_CLASS_LATENCY)
		return complete_cpu;
	if (prov->home_cpu < 0 || !cpu_online(prov->home_cpu))
		return complete_cpu;

	/* Already on the consumer's LLC: steering would cost an IPI
	 * for little gain, so leave the completion where it is. */
	if (cpus_share_cache(prov->home_cpu, complete_cpu))
		return complete_cpu;

	return prov->home_cpu;
}
This is intentionally simple. A real implementation would use LLC topology, queue ownership, and perhaps current per-CPU load. But the design objective is visible: do not throw away knowledge of who is likely to consume the completion.
3.1 Bounded Deferral Instead of Unlimited “Helpful” Deferral
static bool should_bypass_generic_defer(const struct request *rq, u64 now_ns)
{
	const struct io_completion_prov *prov = READ_ONCE(rq->prov);

	if (!prov)
		return false;
	if (!(prov->flags & IO_PROV_F_BOUNDED_BATCH))
		return false;

	/* Batching may hold this completion only while budget remains;
	 * once the deadline is reached, generic deferral loses its vote. */
	return now_ns >= prov->expires_ns;
}
This matters because many of the worst AI storage path failures are not media failures. They are “the kernel helpfully delayed a useful completion because the generic throughput heuristics said it was efficient.”
Design principle: if a flow carries a near-term latency budget, generic batching should need to justify itself rather than happening automatically.
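To make that decision boundary visible, here is one way a caller could wire the check in. blk_mq_add_to_defer_list() is a hypothetical stand-in for the generic batching path, and the deadline timer is described rather than shown:

static void blk_mq_complete_or_defer(struct request *rq)
{
	u64 now_ns = ktime_get_ns();

	if (should_bypass_generic_defer(rq, now_ns)) {
		/* Budget exhausted: the consumer is waiting, complete now. */
		blk_mq_complete_request(rq);
		return;
	}

	/*
	 * Batching may still hold this request, but only until
	 * prov->expires_ns; a timer armed at that deadline has to
	 * flush the deferred list in bounded time.
	 */
	blk_mq_add_to_defer_list(rq);	/* hypothetical helper */
}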
4. NVMe Driver Sketch: Queue and IRQ Affinity
The controller layer is where locality can get lost very early. If the queue mapping and MSI-X affinity spray completions arbitrarily, the higher layers are already working uphill.
static void nvme_map_latency_flow(struct nvme_queue *nvmeq,
				  const struct io_completion_prov *prov)
{
	if (!prov)
		return;
	if (prov->latency_class < IO_CLASS_LATENCY)
		return;
	if (prov->home_cpu < 0 || !cpu_online(prov->home_cpu))
		return;

	/* Nudge the queue's completion vector toward the consumer. */
	irq_set_affinity_hint(nvmeq->irq, cpumask_of(prov->home_cpu));
}
I would not call irq_set_affinity_hint() per request in a production path. That would be too expensive and too unstable. But as a sample, it communicates the correct control surface: latency-marked queue flows should influence how the queue and its completion vectors are placed.
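A quieter variant latches a one-shot latency mode on the queue instead of reprogramming affinity per request. The latency_mode and home_cpu fields on struct nvme_queue are assumptions of this sketch, not real driver state:

static void nvme_queue_enter_latency_mode(struct nvme_queue *nvmeq,
					  const struct io_completion_prov *prov)
{
	/* Latch once; later latency flows on this queue pay nothing. */
	if (READ_ONCE(nvmeq->latency_mode))
		return;
	if (prov->home_cpu < 0 || !cpu_online(prov->home_cpu))
		return;

	WRITE_ONCE(nvmeq->latency_mode, true);
	nvmeq->home_cpu = prov->home_cpu;
	irq_set_affinity_hint(nvmeq->irq, cpumask_of(prov->home_cpu));
}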
4.1 Completion Path Hook
static void nvme_complete_rq(struct request *rq)
{
	int cpu = choose_completion_cpu(rq, raw_smp_processor_id());

	if (cpu != raw_smp_processor_id())
		/* rq->csd.func is assumed pre-set to the completion handler. */
		smp_call_function_single_async(cpu, &rq->csd);
	else
		blk_mq_complete_request_local(rq);	/* hypothetical local-path helper */
}
This exact sketch is not proposing that every completion become an IPI. The point is to show the decision boundary explicitly: there is a branch where locality-aware steering can happen. If the kernel never admits that branch exists, the architecture remains generic by construction.
5. io_uring UAPI and Propagation Sample
The userspace-to-kernel handoff needs a compact way to express flow identity and latency mode. A hint structure is enough to make the design concrete.
/* 16 bytes, no implicit padding: safe as a UAPI type. */
struct io_uring_locality_hint {
	__u32 flow_id;		/* stable flow identity chosen by userspace */
	__s16 home_cpu;		/* -1: no preference */
	__s16 home_node;	/* -1: no preference */
	__u16 latency_class;	/* enum io_latency_class */
	__u16 flags;		/* IO_PROV_F_* */
	__u32 budget_us;	/* relative budget; kernel converts to absolute */
};
Then the submission path can copy that into the request-side provenance record.
static int io_req_set_locality(struct io_kiocb *req,
			       const struct io_uring_locality_hint *hint)
{
	struct io_completion_prov prov = {
		.flow_id	= hint->flow_id,
		.home_cpu	= hint->home_cpu,
		.home_node	= hint->home_node,
		.latency_class	= hint->latency_class,
		.flags		= hint->flags,
		.budget_us	= hint->budget_us,
		/* Convert the relative budget into an absolute deadline now,
		 * so every later layer can compare against a clock read. */
		.expires_ns	= ktime_get_ns() + (u64)hint->budget_us * NSEC_PER_USEC,
	};

	/* Reject obviously bogus CPU hints rather than trusting userspace. */
	if (hint->home_cpu >= 0 && hint->home_cpu >= nr_cpu_ids)
		return -EINVAL;

	/* This sketch assumes io_kiocb embeds the record by value. */
	req->prov = prov;
	return 0;
}
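On the userspace side, the hint has to ride on the SQE somehow. No field in today's UAPI carries it, so the sqe->addr3 usage below is a hypothetical transport chosen only to make the fragment concrete; ring, fd, buf, and len are assumed set up elsewhere:

struct io_uring_locality_hint hint = {
	.flow_id	= 42,
	.home_cpu	= (__s16)sched_getcpu(),	/* consumer runs here */
	.home_node	= -1,				/* no NUMA preference */
	.latency_class	= IO_CLASS_STEP_CRIT,
	.flags		= IO_PROV_F_BOUNDED_BATCH,
	.budget_us	= 150,
};
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

io_uring_prep_read(sqe, fd, buf, len, 0);
sqe->addr3 = (__u64)(uintptr_t)&hint;	/* hypothetical hint carrier */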
5.1 CQE Visibility Matters Too
static inline bool io_cqe_should_flush_now(struct io_ring_ctx *ctx,
					   const struct io_completion_prov *prov)
{
	/* ctx is unused here; a real hook would consult ring state too. */
	if (!prov)
		return false;

	/* Step-critical completions must reach the CQ ring immediately;
	 * everything else can ride the normal flush cadence. */
	return prov->latency_class == IO_CLASS_STEP_CRIT;
}
This is the missing bridge between “the device is done” and “the application can act now.” A lot of systems stop mentally at device completion. io_uring forces you to be precise about the second half of the path too.
6. RDMA Flow Identity and Registration Reuse
RDMA is where systems keep pretending every transfer is a fresh event. For stable AI pipelines, that is often false. The same tensor region, staging buffer, or tool-output buffer gets hit repeatedly.
struct rdma_flow_state {
	u32 flow_id;		/* same identity the block path carries */
	s16 home_cpu;
	s16 home_node;
	u32 reg_cookie;		/* generation counter for the cached MR */
	u64 last_use_ns;
	struct ib_mr *mr;	/* cached registration, NULL if none */
	void *addr;		/* registered region, needed for validity checks */
	size_t len;
};
static struct ib_mr *rdma_lookup_or_reg_mr(struct rdma_flow_state *flow,
					   struct ib_pd *pd,
					   void *addr, size_t len)
{
	/* region_still_valid() is a hypothetical helper comparing
	 * addr/len against flow->addr/flow->len. */
	if (flow->mr && region_still_valid(flow, addr, len)) {
		flow->last_use_ns = ktime_get_ns();
		return flow->mr;
	}

	/* Region changed: drop the stale registration before re-registering. */
	if (flow->mr)
		ib_dereg_mr(flow->mr);

	/* ib_reg_mr() stands in for the real registration path
	 * (ib_alloc_mr() + ib_map_mr_sg() in-kernel, ibv_reg_mr() in verbs). */
	flow->mr = ib_reg_mr(pd, addr, len, IB_ACCESS_LOCAL_WRITE |
					    IB_ACCESS_REMOTE_READ |
					    IB_ACCESS_REMOTE_WRITE);
	flow->addr = addr;
	flow->len = len;
	flow->last_use_ns = ktime_get_ns();
	return flow->mr;
}
This sample is not radical. It is just honest about the fact that reuse requires a persistent identity. If the stack never carries flow_id, it cannot build sane reuse policy above the verbs.
6.1 Completion Queue Locality
static int rdma_choose_poll_cpu(const struct rdma_flow_state *flow, int cq_cpu)
{
	if (!flow)
		return cq_cpu;
	if (flow->home_cpu >= 0 && cpu_online(flow->home_cpu))
		return flow->home_cpu;
	return cq_cpu;
}
Again, the point is not that this tiny helper is enough. The point is that transport code should explicitly admit the question “who should consume this completion?” instead of delegating the answer to accident.
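The more durable lever is completion vector choice at CQ creation time. A sketch against the in-kernel verbs API, where the modulo mapping stands in for a real vector-to-CPU topology lookup:

static struct ib_cq *rdma_create_flow_cq(struct ib_device *dev,
					 const struct rdma_flow_state *flow,
					 int depth)
{
	struct ib_cq_init_attr attr = {
		.cqe = depth,
		/* Crude assumption: comp_vector i tends to land near CPU i. */
		.comp_vector = (flow && flow->home_cpu >= 0) ?
			flow->home_cpu % dev->num_comp_vectors : 0,
	};

	return ib_create_cq(dev, NULL, NULL, NULL, &attr);
}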
7. Storage-to-GPU Bounce-Path Guard Sample
Even before true storage-to-GPU continuation is perfect, the kernel can at least make bounce costs visible and avoidable in the common case.
struct gpu_feed_desc {
	u32 flow_id;
	u32 gpu_id;
	dma_addr_t dma_addr;	/* device-visible address of the staged data */
	u64 bytes;
	bool used_bounce;	/* true if the transfer detoured through host memory */
};
static void trace_gpu_feed_handoff(const struct gpu_feed_desc *d)
{
trace_printk("flow=%u gpu=%u bytes=%llu bounce=%d\n",
d->flow_id, d->gpu_id,
(unsigned long long)d->bytes, d->used_bounce);
}
A surprising amount of infrastructure still has no crisp counter for “how often did this supposedly direct path bounce through host-owned memory?” That is exactly the sort of missing instrumentation that keeps teams overconfident.
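Turning the trace line into a standing counter takes only a few more lines. A sketch with plain atomics; gpu_feed_stats is an invented aggregation point that a real patch would expose through debugfs or sysfs:

struct gpu_feed_stats {
	atomic64_t direct_feeds;
	atomic64_t bounced_feeds;
	atomic64_t bounced_bytes;
};

static struct gpu_feed_stats feed_stats;	/* hypothetical global */

static void account_gpu_feed(const struct gpu_feed_desc *d)
{
	if (d->used_bounce) {
		atomic64_inc(&feed_stats.bounced_feeds);
		atomic64_add(d->bytes, &feed_stats.bounced_bytes);
	} else {
		atomic64_inc(&feed_stats.direct_feeds);
	}
}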
8. BPF Sample: Measure Completion Usefulness
If I only got to ship one part of this design quickly, it might be the tracing. Better policy gets easier once the measurements stop lying.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

struct flow_key {
	u32 flow_id;
	u32 pid;
};

struct flow_state {
	u64 issue_ns;		/* set by the issue-side probe (below) */
	u64 complete_ns;
	u64 wake_ns;
	u32 complete_cpu;
	u32 wake_cpu;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__type(key, struct flow_key);
	__type(value, struct flow_state);
	__uint(max_entries, 16384);
} flows SEC(".maps");

/* extract_flow_id() and lookup_flow_id_for_pid() are hypothetical helpers;
 * a real tool would recover the flow id from a request field or from a
 * pid-to-flow map populated at submission time. Raw tracepoint struct
 * names also vary by kernel version. */

SEC("tracepoint/block/block_rq_complete")
int tp_block_complete(struct trace_event_raw_block_rq_complete *ctx)
{
	struct flow_key key = {};
	struct flow_state *st;

	key.flow_id = extract_flow_id(ctx);
	key.pid = bpf_get_current_pid_tgid() >> 32;

	st = bpf_map_lookup_elem(&flows, &key);
	if (!st)
		return 0;
	st->complete_ns = bpf_ktime_get_ns();
	st->complete_cpu = bpf_get_smp_processor_id();
	return 0;
}

SEC("tracepoint/sched/sched_wakeup")
int tp_sched_wakeup(struct trace_event_raw_sched_wakeup *ctx)
{
	struct flow_key key = {};
	struct flow_state *st;

	key.flow_id = lookup_flow_id_for_pid(ctx->pid);
	key.pid = ctx->pid;

	st = bpf_map_lookup_elem(&flows, &key);
	if (!st)
		return 0;
	st->wake_ns = bpf_ktime_get_ns();
	st->wake_cpu = bpf_get_smp_processor_id();
	return 0;
}
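Nothing above creates map entries, so an issue-side probe has to populate them first. A sketch, reusing the same hypothetical extract_flow_id() helper:

SEC("tracepoint/block/block_rq_issue")
int tp_block_issue(struct trace_event_raw_block_rq_issue *ctx)
{
	struct flow_key key = {};
	struct flow_state st = {};

	key.flow_id = extract_flow_id(ctx);
	key.pid = bpf_get_current_pid_tgid() >> 32;

	st.issue_ns = bpf_ktime_get_ns();
	bpf_map_update_elem(&flows, &key, &st, BPF_ANY);
	return 0;
}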
From these timestamps, the derived metrics are:

- useful_completion_ns = wake_ns - complete_ns
- remote_completion = (complete_cpu != wake_cpu)
- locality_debt = oncpu_ns - wake_ns, where oncpu_ns would come from a sched_switch probe not shown above
That is the metric story the architecture needs. Not “average disk latency.” Not “CQ depth.” Measure how long it took for a completion to become actionable by the next consumer.
Warning: if your tracing can only tell you the media was done, it is still blind to a large fraction of the problem.
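Reading the map back is ordinary libbpf iteration. A userspace sketch, assuming flow_key and flow_state are shared with the BPF object through a common header:

#include <bpf/bpf.h>
#include <stdio.h>

static void dump_flows(int map_fd)
{
	struct flow_key key, next;
	struct flow_state st;
	void *prev = NULL;

	while (bpf_map_get_next_key(map_fd, prev, &next) == 0) {
		if (bpf_map_lookup_elem(map_fd, &next, &st) == 0 &&
		    st.complete_ns && st.wake_ns)
			printf("flow=%u useful_ns=%llu remote=%d\n",
			       next.flow_id,
			       (unsigned long long)(st.wake_ns - st.complete_ns),
			       st.complete_cpu != st.wake_cpu);
		key = next;
		prev = &key;
	}
}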
9. How I Would Stage a Real Patch Queue
| Phase | Change | Why it should land first |
|---|---|---|
| 1 | BPF tracing keyed by flow identity and consumer PID | Proves whether locality and bounded batching are actually a problem on the target fleet |
| 2 | io_uring hint propagation into block request metadata | Minimal path to attach submission-side intent to real I/O |
| 3 | blk-mq completion steering with bounded deferral bypass | Moves the critical-path policy decision into the place where it can matter |
| 4 | NVMe queue locality mode and optional IRQ affinity shaping | Prevents lower layers from destroying locality before the block layer can recover it |
| 5 | RDMA flow reuse state and CQ locality hints | Turns transport from a stateless mechanism into part of the same policy loop |
10. What I Deliberately Left Out
I left out a lot of complications on purpose: full locking rules, precise queue lifetime semantics, backpressure interaction, multipath, device reset behavior, IOMMU invalidation details, and NUMA page-cache steering. Those are all real. But the point of a sample essay is to pin down the skeleton before drowning in edge cases.
- I did not attempt an upstream-quality API.
- I did not assume one universal queue-selection rule.
- I did not hide the fact that some policies belong in userspace and some in the kernel.
- I did not claim these hooks solve storage-to-GPU feeding by themselves.
What the samples do show is that the previous two essays can be translated into real touch points in Linux, not just architecture rhetoric.
11. Final View
The biggest gap in AI storage and RDMA today is not the absence of mechanisms. It is the absence of a shared identity that can survive from submission through completion, wakeup, and reuse. Once that identity exists, the code paths that need to consult it become surprisingly obvious: block completion, queue affinity, CQE visibility, registration caches, and tracepoints.
The short version: if you cannot point to the struct, the call site, and the tracepoint, then the policy probably is not real yet.