Abstract. Linux storage and networking subsystems have become individually strong and collectively under-coordinated. In AI infrastructure, that lack of coordination now shows up as avoidable tail latency between device completion and application use. This note proposes a small cross-subsystem architecture based on completion provenance, home locality, and active latency windows. The idea is simple: if a request is likely to be consumed by a specific control thread, on a specific CPU or NUMA node, then SSD, NVMe-oF, blk-mq, io_uring, and RDMA completion handling should preserve that information instead of repeatedly discarding it.
1. The Problem
The media or transport is often finished before the application has meaningfully resumed.
Today, each layer optimizes something local. The NVMe device completes a command. The block layer records a completion. io_uring generates a CQE. The network stack services a completion queue. The scheduler wakes a task. The memory subsystem decides where the next pages or buffers live. None of these steps is unaware of the eventual consumer by accident; they are unaware by design. That design made sense when maximizing generic throughput was the dominant goal. It makes less sense when the same request is part of a tight control loop or a GPU-feeding path.
current path:
    request issued
    -> device or fabric completes
    -> completion handled by local subsystem policy
    -> wakeup routed by scheduler policy
    -> consumer runs wherever it can
    -> memory ownership and registration handled separately
missing:
    request intent
    expected consumer locality
    repeated-flow identity
    latency budget
2. The Core Abstraction: Completion Provenance
I would add a lightweight structure that can survive from submission through completion:
struct completion_provenance {
    u32 flow_id;
    s16 home_cpu;
    s16 home_node;
    u16 class;       /* throughput, latency, step-critical, prefetch */
    u16 flags;       /* anti_migrate, bounded_batch, reuse_registration */
    u64 expires_ns;  /* active latency window */
};
This is not meant to become a heavyweight policy object attached to every packet forever. It is meant to survive long enough to keep the important information alive: who is likely to consume this completion and what locality promises are worth preserving.
Design rule: if a request knows its consumer at submission time, the stack should not behave as if that information disappears at every layer boundary.
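As a concrete illustration, the submission path that already knows its consumer would fill the structure in place. This is a sketch only: prov_init() and PROV_F_ANTI_MIGRATE are names invented here, while raw_smp_processor_id(), numa_node_id(), and ktime_get_ns() are existing kernel helpers.

/* Sketch: fill provenance at submission time, while the consumer is known.
 * prov_init() and PROV_F_ANTI_MIGRATE are hypothetical names used only to
 * make the idea concrete. */
#define PROV_F_ANTI_MIGRATE (1U << 0)    /* hypothetical flag bit */

static void prov_init(struct completion_provenance *prov,
                      u32 flow_id, u16 class, u64 budget_ns)
{
    prov->flow_id    = flow_id;
    prov->home_cpu   = raw_smp_processor_id();     /* submitting control thread */
    prov->home_node  = numa_node_id();             /* its current memory home */
    prov->class      = class;
    prov->flags      = PROV_F_ANTI_MIGRATE;
    prov->expires_ns = ktime_get_ns() + budget_ns; /* active latency window */
}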
3. Why Existing Mechanisms Are Not Enough
Linux already has affinity knobs, queue mappings, NAPI locality, IRQ steering, and in some deployments carefully pinned pollers. Those mechanisms help, but they are still mostly orthogonal controls. They do not create a preserved identity for a flow across storage, transport, completion, and wakeup.
Affinity without provenance
You can pin queue pairs or interrupts, but the stack still lacks a durable explanation for why a given completion belongs near a given consumer.
Wakeup policy without flow continuity
The scheduler can prefer a previous CPU, but it does not inherently know that the previous CPU is also the right consumer locality for the finished storage or RDMA work.
4. Where It Hooks Into Linux
blk-mq
Carry provenance from request allocation through dispatch and completion. Completion CPU selection and deferred work behavior should consult locality hints before defaulting to generic completion policy.
io_uring
Expose latency-marked rings or per-request flags that bias CQE delivery, task_work timing, and userspace resumption toward the home CPU or home node.
NVMe-oF / RDMA
Preserve flow identity across network transport so completion handling on the target and initiator can align to the same consumer locality model.
Scheduler
Use short active windows to prefer wakeup locality and suppress low-value migrations right after relevant completions land.
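A hedged sketch of the wakeup-side check follows. task_provenance() and flow_window_active() are hypothetical helpers that do not exist in the scheduler today; cpu_online() does, and the real integration point would sit next to the existing wake-affine heuristics.

/* Sketch of a wakeup-time locality check. task_provenance() and
 * flow_window_active() are hypothetical; cpu_online() is real. */
static int pick_wakeup_cpu(struct task_struct *p, int prev_cpu)
{
    struct completion_provenance *prov = task_provenance(p);

    /* Inside an active latency window, prefer the declared home CPU and
     * suppress low-value migrations. */
    if (prov && flow_window_active(prov) && cpu_online(prov->home_cpu))
        return prov->home_cpu;

    /* Outside the window, defer to the existing wake-affine policy,
     * represented here by the previous CPU. */
    return prev_cpu;
}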
5. blk-mq Should Gain a Locality Mode
blk-mq already scales well, but its abstractions are still biased toward scalable queueing rather than consumer-aware completion delivery. A locality mode would add three things:
- request-side provenance storage
- completion steering rules that prefer the consumer locality island
- bounded deferral so latency-marked completions do not disappear into generic worker contexts unless necessary
if (rq->prov.class == STEP_CRITICAL) {
    cpu = choose_cpu(rq->prov.home_cpu, current_completion_cpu, same_llc);
    if (within_window(rq->prov.expires_ns))
        avoid_generic_defer();
}
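The two helpers referenced above could be as small as the sketch below. cpus_share_cache(), cpu_online(), and ktime_get_ns() are existing kernel helpers; the policy they express here is an assumption.

/* Sketch of the helpers used in the steering check above. */
static int choose_cpu(int home_cpu, int completion_cpu, bool same_llc)
{
    if (home_cpu < 0 || !cpu_online(home_cpu))
        return completion_cpu;

    /* If the completion already landed on the home CPU's LLC island,
     * staying put is cheap enough; otherwise redirect to the home CPU. */
    if (same_llc && cpus_share_cache(home_cpu, completion_cpu))
        return completion_cpu;

    return home_cpu;
}

static bool within_window(u64 expires_ns)
{
    return ktime_get_ns() < expires_ns;
}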
6. io_uring Needs a First-Class Latency Path
io_uring is already widely used for efficient async I/O, but its completion semantics are still mostly general-purpose. For AI systems, I would want:
| Need | Why it matters | Kernel-facing change |
|---|---|---|
| CQE locality | The consumer is often a stable control thread | Bias task_work or wakeup to home CPU |
| Latency mode | Completion batching can hurt step-critical paths | Per-ring or per-request bounded batching policy |
| Provenance carry-through | Submission knows intent earlier than completion does | Attach provenance to SQEs and propagate into CQE context |
| Observability | Tail latency is otherwise misattributed | Tracepoint joining request submit, device complete, CQE visible, wakeup, and on-CPU |
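From userspace, the least invasive shape is probably a per-SQE flag. The snippet below is a sketch against liburing: io_uring_get_sqe(), io_uring_prep_read(), and io_uring_submit() are real liburing calls, while IOSQE_LATENCY_CLASS is a hypothetical flag that does not exist in the UAPI today.

#include <errno.h>
#include <liburing.h>

/* Hypothetical flag value; not part of the io_uring UAPI today. */
#define IOSQE_LATENCY_CLASS (1U << 7)

static int submit_step_critical_read(struct io_uring *ring, int fd,
                                     void *buf, unsigned len, off_t off)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    if (!sqe)
        return -EBUSY;

    io_uring_prep_read(sqe, fd, buf, len, off);

    /* Hypothetical: mark this SQE as step-critical so CQE posting,
     * task_work timing, and the wakeup are biased toward the home CPU. */
    sqe->flags |= IOSQE_LATENCY_CLASS;

    return io_uring_submit(ring);
}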
7. RDMA Should Participate in the Same Policy Loop
RDMA is too important to be treated like an isolated transport box. For NVMe-oF, storage disaggregation, or storage-to-GPU paths, the transport should inherit and preserve the same provenance model instead of forcing everything above it to reconstruct locality again from scratch.
submission side:
    flow_id = hash(agent_or_gpu_pipeline, resource_id)
    home_cpu = current control thread cpu
    home_node = current memory home
    class = STEP_CRITICAL or PREFETCH
transport side:
    preserve flow_id and class into CQ handling
    prefer CQ polling / IRQ delivery near home locality
    reuse registration when flow identity is stable
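Until a richer interface exists, the closest carrier available today on the verbs side is the 64-bit wr_id. The encoding below is a sketch: wr_id is a real field of struct ibv_send_wr and struct ibv_wc, but the bit layout and the class constants are assumptions.

#include <infiniband/verbs.h>
#include <stdint.h>

/* Hypothetical class encoding carried in the low bits of wr_id. */
#define PROV_CLASS_STEP_CRITICAL 1ULL
#define PROV_CLASS_PREFETCH      2ULL

static inline uint64_t prov_encode_wr_id(uint32_t flow_id, uint64_t class)
{
    return ((uint64_t)flow_id << 16) | (class & 0xffff);
}

static inline uint32_t prov_flow_id(const struct ibv_wc *wc)
{
    return (uint32_t)(wc->wr_id >> 16);
}

static inline uint64_t prov_class(const struct ibv_wc *wc)
{
    return wc->wr_id & 0xffff;
}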
8. Failure Modes This Would Prevent
| Failure mode | What happens today | How provenance helps |
|---|---|---|
| Remote completion wakeup | Request completes on a CPU far from the consumer | Home CPU survives through completion steering and wakeup policy |
| Low-value batching | Step-critical completions get lumped into generic throughput behavior | Latency class allows bounded batching or bypass |
| Registration churn | Repeated RDMA or GPU-adjacent transfers keep paying setup cost | Stable flow identity makes reuse tractable |
| Observability blind spots | Tail latency gets blamed on the wrong subsystem | Flow identity ties submit, complete, wake, and run together |
9. The Deterministic Embodiment
For the most latency-sensitive clusters, I would add a deterministic mode where one or more CPUs are designated storage-orchestration CPUs with reserved micro-windows for completion-heavy control paths.
That mode would not be needed everywhere. But it gives the stack a way to bound jitter for the exact flows where generic fairness is the wrong objective.
10. Observability Must Join the Whole Path
Most teams can measure some part of this path, but almost no one measures all of it together. I would add a tracing model like this:
key = {flow_id, submit_cpu, home_cpu, home_node}
tracepoints:
    block_rq_issue
    block_rq_complete
    io_uring submit / cqe visible
    nvme or nvme_tcp / rdma cq handling
    sched_wakeup
    sched_switch
    cpu_idle
    cpu_frequency
derived:
    device_done_to_cqe_visible
    cqe_visible_to_wakeup
    wakeup_to_oncpu
    remote_completion_ratio
    migrations_per_flow
    registration_reuse_hit_rate
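A minimal per-flow record, filled from those tracepoints by a BPF program or a perf script, is enough to compute the derived numbers. The structure and field names below are assumptions; only the tracepoints listed above are real.

/* Sketch of a per-flow timestamp record; the derived deltas match the
 * list above. */
struct flow_latency_record {
    u64 submit_ns;       /* block_rq_issue / io_uring submit      */
    u64 device_done_ns;  /* block_rq_complete / CQ entry observed */
    u64 cqe_visible_ns;  /* CQE posted to the ring                */
    u64 wakeup_ns;       /* sched_wakeup of the consumer          */
    u64 oncpu_ns;        /* sched_switch onto the consumer        */
    u32 completion_cpu;
    u32 consumer_cpu;
};

static void flow_latency_derive(const struct flow_latency_record *r,
                                u64 *device_done_to_cqe_visible,
                                u64 *cqe_visible_to_wakeup,
                                u64 *wakeup_to_oncpu,
                                bool *remote_completion)
{
    *device_done_to_cqe_visible = r->cqe_visible_ns - r->device_done_ns;
    *cqe_visible_to_wakeup      = r->wakeup_ns - r->cqe_visible_ns;
    *wakeup_to_oncpu            = r->oncpu_ns - r->wakeup_ns;
    *remote_completion          = r->completion_cpu != r->consumer_cpu;
}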
11. A Minimal UAPI Sketch
The least disruptive user-facing entry point is probably a per-request or per-ring hint model rather than a brand-new scheduling class. Something as small as this is enough to make the architecture concrete:
struct io_locality_hint {
    __u32 flow_id;
    __s16 home_cpu;
    __s16 home_node;
    __u16 class;
    __u16 flags;
    __u32 budget_us;
};
The point is not that this exact structure is final. The point is that the kernel needs a way to carry locality intent across layers that currently only see generic work items.
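How the hint gets attached is secondary. One plausible shape is a ring-scoped registration, sketched below: io_uring_register() is the real syscall, but IORING_REGISTER_LOCALITY_HINT is a hypothetical opcode and the whole call is an assumption.

#include <linux/types.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical opcode; not part of the io_uring UAPI today. */
#define IORING_REGISTER_LOCALITY_HINT 64

static int register_locality_hint(int ring_fd, __u32 flow_id,
                                  int home_cpu, int home_node,
                                  __u32 budget_us)
{
    struct io_locality_hint hint = {
        .flow_id   = flow_id,
        .home_cpu  = home_cpu,
        .home_node = home_node,
        .class     = 0,        /* class and flags encodings left open above */
        .flags     = 0,
        .budget_us = budget_us,
    };

    return syscall(__NR_io_uring_register, ring_fd,
                   IORING_REGISTER_LOCALITY_HINT, &hint, 1);
}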
12. Patch Queue Priorities
| Phase | Subsystem | Deliverable |
|---|---|---|
| 1 | Tracing | Provenance-aware BPF tracing across block, io_uring, and scheduler |
| 2 | blk-mq | Request-side provenance storage and locality-aware completion steering |
| 3 | io_uring | Latency-marked rings or per-request bounded-batching flags |
| 4 | RDMA / NVMe-oF | Flow identity carry-through and locality-aware CQ handling |
| 5 | Scheduler | Short anti-migration and home-CPU windows for completion consumers |
| 6 | Memory path | Registration reuse and locality accounting for repeated flows |
13. Final View
The missing layer in Linux storage and RDMA for AI is not another queue, another ring, or another transport feature by itself. It is the preservation of intent and locality across boundaries that currently throw that information away. Once you start viewing storage completions as orchestration events, not just successful I/O, the architecture changes become much clearer.
The short version: keep provenance alive, keep completions local, bound batching, and stop forcing every layer to rediscover consumer locality from scratch.