
A Kernel Architecture Note for Storage and RDMA Locality

A proposal for treating blk-mq, io_uring, NVMe-oF, and RDMA completions as one locality-sensitive control path rather than four independent subsystems with accidental interactions.

DATE 2026-05-07
READ 20 MIN
FOCUS BLK-MQ / IO_URING / NVME-OF / RDMA
THESIS COMPLETION PROVENANCE

Abstract. Linux storage and networking subsystems have become individually strong and collectively under-coordinated. In AI infrastructure, that lack of coordination now shows up as avoidable tail latency between device completion and application use. This note proposes a small cross-subsystem architecture based on completion provenance, home locality, and active latency windows. The idea is simple: if a request is likely to be consumed by a specific control thread, on a specific CPU or NUMA node, then SSD, NVMe-oF, blk-mq, io_uring, and RDMA completion handling should preserve that information instead of repeatedly discarding it.

1. The Problem

The media or transport is often finished before the application has meaningfully resumed.

Today, each layer optimizes something local. The NVMe device completes a command. The block layer records a completion. io_uring generates a CQE. The network stack services a completion queue. The scheduler wakes a task. The memory subsystem decides where the next pages or buffers live. None of these steps is aware of the others, and that is no accident: they are unaware by design. That design made sense when maximizing generic throughput was the dominant goal. It makes less sense when the same request is part of a tight control loop or GPU-feeding path.

current path:
  request issued
    -> device or fabric completes
    -> completion handled by local subsystem policy
    -> wakeup routed by scheduler policy
    -> consumer runs wherever it can
    -> memory ownership and registration handled separately

missing:
  request intent
  expected consumer locality
  repeated-flow identity
  latency budget

2. The Core Abstraction: Completion Provenance

I would add a lightweight structure that can survive from submission through completion:

struct completion_provenance {
    u32 flow_id;    /* stable identity for a repeated flow */
    s16 home_cpu;   /* likely consumer CPU, -1 if unknown */
    s16 home_node;  /* likely consumer NUMA node, -1 if unknown */
    u16 class;      /* throughput, latency, step-critical, prefetch */
    u16 flags;      /* anti_migrate, bounded_batch, reuse_registration */
    u64 expires_ns; /* active latency window */
};

This is not meant to become a heavyweight policy object attached to every packet forever. It is meant to survive long enough to keep the important information alive: who is likely to consume this completion and what locality promises are worth preserving.

Design rule: if a request knows its consumer at submission time, the stack should not behave as if that information disappears at every layer boundary.
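To make that concrete, here is a minimal sketch of submission-time initialization. prov_init() and the window argument are hypothetical; the helpers it calls (raw_smp_processor_id(), numa_node_id(), ktime_get_ns()) are standard kernel APIs, and the sketch assumes the submitting CPU is the likely consumer:

static void prov_init(struct completion_provenance *prov,
                      u32 flow_id, u16 class, u64 window_ns)
{
    prov->flow_id    = flow_id;
    prov->home_cpu   = raw_smp_processor_id(); /* assume submitter ~= consumer */
    prov->home_node  = numa_node_id();
    prov->class      = class;
    prov->flags      = 0;
    prov->expires_ns = ktime_get_ns() + window_ns; /* active latency window */
}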

3. Why Existing Mechanisms Are Not Enough

Linux already has affinity knobs, queue mappings, NAPI locality, IRQ steering, and in some deployments carefully pinned pollers. Those mechanisms help, but they are still mostly orthogonal controls. They do not create a preserved identity for a flow across storage, transport, completion, and wakeup.

Affinity without provenance

You can pin queue pairs or interrupts, but the stack still lacks a durable explanation for why a given completion belongs near a given consumer.

Wakeup policy without flow continuity

The scheduler can prefer a previous CPU, but it does not inherently know that the previous CPU is also the right consumer locality for the finished storage or RDMA work.

4. Where It Hooks Into Linux

blk-mq

Carry provenance from request allocation through dispatch and completion. Completion CPU selection and deferred work behavior should consult locality hints before defaulting to generic completion policy.

io_uring

Expose latency-marked rings or per-request flags that bias CQE delivery, task_work timing, and userspace resumption toward the home CPU or home node.

NVMe-oF / RDMA

Preserve flow identity across network transport so completion handling on the target and initiator can align to the same consumer locality model.

Scheduler

Use short active windows to prefer wakeup locality and suppress low-value migrations right after relevant completions land.
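As a sketch of what the wakeup side could look like, assuming a per-task provenance lookup exists (task_prov() below is hypothetical; cpu_online() and ktime_get_ns() are real kernel APIs):

static int prov_pick_wake_cpu(struct task_struct *p, int prev_cpu)
{
    struct completion_provenance *prov = task_prov(p); /* hypothetical lookup */

    /* Prefer the home CPU only while the active window is open. */
    if (prov && prov->home_cpu >= 0 &&
        ktime_get_ns() < prov->expires_ns &&
        cpu_online(prov->home_cpu))
        return prov->home_cpu;

    return prev_cpu; /* otherwise keep existing wake-affine behavior */
}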

5. blk-mq Should Gain a Locality Mode

blk-mq already scales well, but its abstractions are still biased toward scalable queueing rather than consumer-aware completion usefulness. A locality mode would add three things: provenance storage on the request, completion-CPU selection that consults the home CPU and LLC topology, and bounded deferral inside the active latency window. In sketch form:

if (rq->prov.class == STEP_CRITICAL) {
    /* Complete on the home CPU, or at least within the same LLC. */
    cpu = choose_cpu(rq->prov.home_cpu, current_completion_cpu, same_llc);
    /* While the window is open, skip generic batching and deferral. */
    if (within_window(rq->prov.expires_ns))
        avoid_generic_defer();
}

6. io_uring Needs a First-Class Latency Path

io_uring today is often used as a mechanism for efficient async I/O, but the completion semantics are still mostly general-purpose. For AI systems, I would want:

Need                      | Why it matters                                        | Kernel-facing change
CQE locality              | The consumer is often a stable control thread         | Bias task_work or wakeup to the home CPU
Latency mode              | Completion batching can hurt step-critical paths      | Per-ring or per-request bounded batching policy
Provenance carry-through  | Submission knows intent earlier than completion does  | Attach provenance to SQEs and propagate into CQE context
Observability             | Tail latency is otherwise misattributed               | Tracepoints joining request submit, device complete, CQE visible, wakeup, and on-CPU
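To make the latency-mode row concrete, here is a userspace sketch using liburing for setup. IORING_SETUP_LATENCY_CLASS is invented for illustration and does not exist upstream; only io_uring_queue_init_params() is real liburing API:

#include <liburing.h>

#define IORING_SETUP_LATENCY_CLASS (1U << 31) /* hypothetical, not upstream */

static int setup_latency_ring(struct io_uring *ring)
{
    struct io_uring_params p = { 0 };

    /* Ask for home-CPU-biased CQE delivery and bounded batching. */
    p.flags = IORING_SETUP_LATENCY_CLASS;
    return io_uring_queue_init_params(256, ring, &p);
}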

7. RDMA Should Participate in the Same Policy Loop

RDMA is too important to be treated like an isolated transport box. For NVMe-oF, storage disaggregation, or storage-to-GPU paths, the transport should inherit and preserve the same provenance model instead of forcing everything above it to reconstruct locality again from scratch.

submission side:
  flow_id = hash(agent_or_gpu_pipeline, resource_id)
  home_cpu = current control thread cpu
  home_node = current memory home
  class = STEP_CRITICAL or PREFETCH

transport side:
  preserve flow_id and class into CQ handling
  prefer CQ polling / IRQ delivery near home locality
  reuse registration when flow identity is stable
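The flow_id derivation itself is trivial. A sketch assuming jhash as the mixing function (jhash_2words() is a real kernel helper; the convention around it is mine):

#include <linux/jhash.h>

static u32 prov_flow_id(u32 pipeline_id, u32 resource_id)
{
    /* Any stable mixing function works; what matters is that
     * repeated submissions of the same flow get the same id. */
    return jhash_2words(pipeline_id, resource_id, 0);
}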

8. Failure Modes This Would Prevent

Failure mode               | What happens today                                                      | How provenance helps
Remote completion wakeup   | Request completes on a CPU far from the consumer                       | Home CPU survives through completion steering and wakeup policy
Low-value batching         | Step-critical completions get lumped into generic throughput behavior  | Latency class allows bounded batching or bypass
Registration churn         | Repeated RDMA or GPU-adjacent transfers keep paying setup cost         | Stable flow identity makes reuse tractable
Observability blind spots  | Tail latency gets blamed on the wrong subsystem                        | Flow identity ties submit, complete, wake, and run together

9. The Deterministic Embodiment

For the most latency-sensitive clusters, I would add a deterministic mode where one or more CPUs are designated storage-orchestration CPUs with reserved micro-windows for completion-heavy control paths.

Storage / RDMA Locality Control Plane
  (completion provenance is preserved instead of repeatedly thrown away)

  Provenance Controller: flow_id, home_cpu, home_node,
                         class, latency window, reuse flags
    -> Userspace submitter  (knows intent earliest)
    -> blk-mq               (dispatch + complete, locality mode)
    -> io_uring             (CQE delivery, bounded batching)
    -> RDMA / NVMe-oF       (transport continuity, registration reuse)
    -> Scheduler + Memory   (home CPU, anti-migrate, local ownership,
                             active latency window)
Figure 1. The useful abstraction is not “faster I/O” alone. It is preserved locality across subsystem boundaries.

That mode would not be needed everywhere. But it gives the stack a way to bound jitter for the exact flows where generic fairness is the wrong objective.
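Today the closest approximation is manual: isolate CPUs at boot (isolcpus= or cpusets) and pin the completion-consuming control thread there. A sketch of the pinning half with standard pthreads; the reserved micro-window part has no upstream mechanism, so this covers placement only:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static int pin_to_orchestration_cpu(pthread_t thread, int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* Placement only; deterministic micro-windows would need the
     * scheduler-side support proposed above. */
    return pthread_setaffinity_np(thread, sizeof(set), &set);
}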

10. Observability Must Join the Whole Path

Most teams can measure some part of this path and almost no one measures all of it together. I would add a tracing model like this:

key = {flow_id, submit_cpu, home_cpu, home_node}

tracepoints:
  block_rq_issue
  block_rq_complete
  io_uring submit / cqe visible
  nvme or nvme_tcp / rdma cq handling
  sched_wakeup
  sched_switch
  cpu_idle
  cpu_frequency

derived:
  device_done_to_cqe_visible
  cqe_visible_to_wakeup
  wakeup_to_oncpu
  remote_completion_ratio
  migrations_per_flow
  registration_reuse_hit_rate
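A sketch of the join a tracer would maintain, for example in a BPF hash map keyed by the tuple above. struct flow_stamps and its field names are illustrative:

struct flow_stamps {
    u64 t_submit;      /* block_rq_issue / io_uring submit */
    u64 t_device_done; /* block_rq_complete / CQ handling */
    u64 t_cqe_visible; /* CQE observable by userspace */
    u64 t_wakeup;      /* sched_wakeup of the consumer */
    u64 t_oncpu;       /* sched_switch onto a CPU */
};

static void flow_derive(const struct flow_stamps *s, u64 out[3])
{
    out[0] = s->t_cqe_visible - s->t_device_done; /* device_done_to_cqe_visible */
    out[1] = s->t_wakeup - s->t_cqe_visible;      /* cqe_visible_to_wakeup */
    out[2] = s->t_oncpu - s->t_wakeup;            /* wakeup_to_oncpu */
}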

11. A Minimal UAPI Sketch

The least disruptive user-facing entry point is probably a per-request or per-ring hint model rather than a brand-new scheduling class. Something as small as this is enough to make the architecture concrete:

struct io_locality_hint {
    __u32 flow_id;
    __s16 home_cpu;
    __s16 home_node;
    __u16 class;
    __u16 flags;
    __u32 budget_us;
};

The point is not that this exact structure is final. The point is that the kernel needs a way to carry locality intent across layers that currently only see generic work items.
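Filling the hint from a control thread might look like this; sched_getcpu() is glibc and numa_node_of_cpu() is libnuma, while the class and budget values are placeholders:

#define _GNU_SOURCE
#include <sched.h>       /* sched_getcpu() */
#include <numa.h>        /* numa_node_of_cpu(); link with -lnuma */
#include <linux/types.h>

static struct io_locality_hint make_hint(__u32 flow_id)
{
    int cpu = sched_getcpu();

    struct io_locality_hint hint = {
        .flow_id   = flow_id,
        .home_cpu  = (__s16)cpu,
        .home_node = (__s16)numa_node_of_cpu(cpu),
        .class     = 2,   /* placeholder: step-critical */
        .flags     = 0,
        .budget_us = 50,  /* placeholder active window */
    };
    return hint;
}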

12. Patch Queue Priorities

Phase  | Subsystem       | Deliverable
1      | Tracing         | Provenance-aware BPF tracing across block, io_uring, and scheduler
2      | blk-mq          | Request-side provenance storage and locality-aware completion steering
3      | io_uring        | Latency-marked rings or per-request bounded-batching flags
4      | RDMA / NVMe-oF  | Flow identity carry-through and locality-aware CQ handling
5      | Scheduler       | Short anti-migration and home-CPU windows for completion consumers
6      | Memory path     | Registration reuse and locality accounting for repeated flows

13. Final View

The missing layer in Linux storage and RDMA for AI is not another queue, another ring, or another transport feature by itself. It is the preservation of intent and locality across boundaries that currently throw that information away. Once you start viewing storage completions as orchestration events, not just successful I/O, the architecture changes become much clearer.

The short version: keep provenance alive, keep completions local, bound batching, and stop forcing every layer to rediscover consumer locality from scratch.