Abstract. Linux storage and networking subsystems have become individually strong and collectively under-coordinated. In AI infrastructure, that lack of coordination now shows up as avoidable tail latency between device completion and application use. This note proposes a small cross-subsystem architecture based on completion provenance, home locality, and active latency windows. The idea is simple: if a request is likely to be consumed by a specific control thread, on a specific CPU or NUMA node, then SSD, NVMe-oF, blk-mq, io_uring, and RDMA completion handling should preserve that information instead of repeatedly discarding it.
1. The Problem
The media or transport is often finished before the application has meaningfully resumed.
Today, each layer optimizes something local. The NVMe device completes a command. The block layer records a completion. io_uring generates a CQE. The network stack services a completion queue. The scheduler wakes a task. The memory subsystem decides where the next pages or buffers live. None of these steps is unaware of the eventual consumer by accident; they are unaware by design. That design made sense when maximizing generic throughput was the dominant goal. It makes less sense when the same request is part of a tight control loop or a GPU-feeding path.
current path:
    request issued
    -> device or fabric completes
    -> completion handled by local subsystem policy
    -> wakeup routed by scheduler policy
    -> consumer runs wherever it can
    -> memory ownership and registration handled separately
missing:
    request intent
    expected consumer locality
    repeated-flow identity
    latency budget
2. The Core Abstraction: Completion Provenance
I would add a lightweight structure that can survive from submission through completion:
struct completion_provenance {
    u32 flow_id;
    s16 home_cpu;
    s16 home_node;
    u16 class;       /* throughput, latency, step-critical, prefetch */
    u16 flags;       /* anti_migrate, bounded_batch, reuse_registration */
    u64 expires_ns;  /* active latency window */
};
This is not meant to become a heavyweight policy object attached to every packet forever. It is meant to survive long enough to keep the important information alive: who is likely to consume this completion and what locality promises are worth preserving.
Design rule: if a request knows its consumer at submission time, the stack should not behave as if that information disappears at every layer boundary.
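As a concrete illustration, the submission path that already knows its consumer would fill the structure in place. This is a sketch only: prov_init() and PROV_F_ANTI_MIGRATE are names invented here, while raw_smp_processor_id(), numa_node_id(), and ktime_get_ns() are existing kernel helpers.

/* Sketch: fill provenance at submission time, while the consumer is known.
 * prov_init() and PROV_F_ANTI_MIGRATE are hypothetical names used only to
 * make the idea concrete. */
#define PROV_F_ANTI_MIGRATE (1U << 0)    /* hypothetical flag bit */

static void prov_init(struct completion_provenance *prov,
                      u32 flow_id, u16 class, u64 budget_ns)
{
    prov->flow_id    = flow_id;
    prov->home_cpu   = raw_smp_processor_id();     /* submitting control thread */
    prov->home_node  = numa_node_id();             /* its current memory home */
    prov->class      = class;
    prov->flags      = PROV_F_ANTI_MIGRATE;
    prov->expires_ns = ktime_get_ns() + budget_ns; /* active latency window */
}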
3. Why Existing Mechanisms Are Not Enough
Linux already has affinity knobs, queue mappings, NAPI locality, IRQ steering, and in some deployments carefully pinned pollers. Those mechanisms help, but they are still mostly orthogonal controls. They do not create a preserved identity for a flow across storage, transport, completion, and wakeup.
Affinity without provenance
You can pin queue pairs or interrupts, but the stack still lacks a durable explanation for why a given completion belongs near a given consumer.
Wakeup policy without flow continuity
The scheduler can prefer a previous CPU, but it does not inherently know that the previous CPU is also the right consumer locality for the finished storage or RDMA work.
4. Where It Hooks Into Linux
blk-mq
Carry provenance from request allocation through dispatch and completion. Completion CPU selection and deferred work behavior should consult locality hints before defaulting to generic completion policy.
io_uring
Expose latency-marked rings or per-request flags that bias CQE delivery, task_work timing, and userspace resumption toward the home CPU or home node.
NVMe-oF / RDMA
Preserve flow identity across network transport so completion handling on the target and initiator can align to the same consumer locality model.
Scheduler
Use short active windows to prefer wakeup locality and suppress low-value migrations right after relevant completions land.
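A hedged sketch of the wakeup-side check follows. task_provenance() and flow_window_active() are hypothetical helpers that do not exist in the scheduler today; cpu_online() does, and the real integration point would sit next to the existing wake-affine heuristics.

/* Sketch of a wakeup-time locality check. task_provenance() and
 * flow_window_active() are hypothetical; cpu_online() is real. */
static int pick_wakeup_cpu(struct task_struct *p, int prev_cpu)
{
    struct completion_provenance *prov = task_provenance(p);

    /* Inside an active latency window, prefer the declared home CPU and
     * suppress low-value migrations. */
    if (prov && flow_window_active(prov) && cpu_online(prov->home_cpu))
        return prov->home_cpu;

    /* Outside the window, defer to the existing wake-affine policy,
     * represented here by the previous CPU. */
    return prev_cpu;
}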
5. blk-mq Should Gain a Locality Mode
blk-mq already scales well, but its abstractions are still biased toward scalable queueing rather than consumer-aware completion delivery. A locality mode would add three things:
- request-side provenance storage
- completion steering rules that prefer the consumer locality island
- bounded deferral so latency-marked completions do not disappear into generic worker contexts unless necessary
if (rq->prov.class == STEP_CRITICAL) {
    cpu = choose_cpu(rq->prov.home_cpu, current_completion_cpu, same_llc);
    if (within_window(rq->prov.expires_ns))
        avoid_generic_defer();
}
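The two helpers referenced above could be as small as the sketch below. cpus_share_cache(), cpu_online(), and ktime_get_ns() are existing kernel helpers; the policy they express here is an assumption.

/* Sketch of the helpers used in the steering check above. */
static int choose_cpu(int home_cpu, int completion_cpu, bool same_llc)
{
    if (home_cpu < 0 || !cpu_online(home_cpu))
        return completion_cpu;

    /* If the completion already landed on the home CPU's LLC island,
     * staying put is cheap enough; otherwise redirect to the home CPU. */
    if (same_llc && cpus_share_cache(home_cpu, completion_cpu))
        return completion_cpu;

    return home_cpu;
}

static bool within_window(u64 expires_ns)
{
    return ktime_get_ns() < expires_ns;
}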
6. io_uring Needs a First-Class Latency Path
io_uring is already widely used for efficient async I/O, but its completion semantics are still mostly general-purpose. For AI systems, I would want:
| Need | Why it matters | Kernel-facing change |
|---|---|---|
| CQE locality | The consumer is often a stable control thread | Bias task_work or wakeup to home CPU |
| Latency mode | Completion batching can hurt step-critical paths | Per-ring or per-request bounded batching policy |
| Provenance carry-through | Submission knows intent earlier than completion does | Attach provenance to SQEs and propagate into CQE context |
| Observability | Tail latency is otherwise misattributed | Tracepoint joining request submit, device complete, CQE visible, wakeup, and on-CPU |
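From userspace, the least invasive shape is probably a per-SQE flag. The snippet below is a sketch against liburing: io_uring_get_sqe(), io_uring_prep_read(), and io_uring_submit() are real liburing calls, while IOSQE_LATENCY_CLASS is a hypothetical flag that does not exist in the UAPI today.

#include <errno.h>
#include <liburing.h>

/* Hypothetical flag value; not part of the io_uring UAPI today. */
#define IOSQE_LATENCY_CLASS (1U << 7)

static int submit_step_critical_read(struct io_uring *ring, int fd,
                                     void *buf, unsigned len, off_t off)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    if (!sqe)
        return -EBUSY;

    io_uring_prep_read(sqe, fd, buf, len, off);

    /* Hypothetical: mark this SQE as step-critical so CQE posting,
     * task_work timing, and the wakeup are biased toward the home CPU. */
    sqe->flags |= IOSQE_LATENCY_CLASS;

    return io_uring_submit(ring);
}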
7. RDMA Should Participate in the Same Policy Loop
RDMA is too important to be treated like an isolated transport box. For NVMe-oF, storage disaggregation, or storage-to-GPU paths, the transport should inherit and preserve the same provenance model instead of forcing everything above it to reconstruct locality again from scratch.
submission side:
    flow_id = hash(agent_or_gpu_pipeline, resource_id)
    home_cpu = current control thread cpu
    home_node = current memory home
    class = STEP_CRITICAL or PREFETCH
transport side:
    preserve flow_id and class into CQ handling
    prefer CQ polling / IRQ delivery near home locality
    reuse registration when flow identity is stable
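Until a richer interface exists, the closest carrier available today on the verbs side is the 64-bit wr_id. The encoding below is a sketch: wr_id is a real field of struct ibv_send_wr and struct ibv_wc, but the bit layout and the class constants are assumptions.

#include <infiniband/verbs.h>
#include <stdint.h>

/* Hypothetical class encoding carried in the low bits of wr_id. */
#define PROV_CLASS_STEP_CRITICAL 1ULL
#define PROV_CLASS_PREFETCH      2ULL

static inline uint64_t prov_encode_wr_id(uint32_t flow_id, uint64_t class)
{
    return ((uint64_t)flow_id << 16) | (class & 0xffff);
}

static inline uint32_t prov_flow_id(const struct ibv_wc *wc)
{
    return (uint32_t)(wc->wr_id >> 16);
}

static inline uint64_t prov_class(const struct ibv_wc *wc)
{
    return wc->wr_id & 0xffff;
}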
8. Failure Modes This Would Prevent
| Failure mode | What happens today | How provenance helps |
|---|---|---|
| Remote completion wakeup | Request completes on a CPU far from the consumer | Home CPU survives through completion steering and wakeup policy |
| Low-value batching | Step-critical completions get lumped into generic throughput behavior | Latency class allows bounded batching or bypass |
| Registration churn | Repeated RDMA or GPU-adjacent transfers keep paying setup cost | Stable flow identity makes reuse tractable |
| Observability blind spots | Tail latency gets blamed on the wrong subsystem | Flow identity ties submit, complete, wake, and run together |
9. The Deterministic Embodiment
For the most latency-sensitive clusters, I would add a deterministic mode where one or more CPUs are designated storage-orchestration CPUs with reserved micro-windows for completion-heavy control paths.
That mode would not be needed everywhere. But it gives the stack a way to bound jitter for the exact flows where generic fairness is the wrong objective.
10. Observability Must Join the Whole Path
Most teams can measure some part of this path, but almost no one measures all of it together. I would add a tracing model like this:
key = {flow_id, submit_cpu, home_cpu, home_node}
tracepoints:
    block_rq_issue
    block_rq_complete
    io_uring submit / cqe visible
    nvme or nvme_tcp / rdma cq handling
    sched_wakeup
    sched_switch
    cpu_idle
    cpu_frequency
derived:
    device_done_to_cqe_visible
    cqe_visible_to_wakeup
    wakeup_to_oncpu
    remote_completion_ratio
    migrations_per_flow
    registration_reuse_hit_rate
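A minimal per-flow record, filled from those tracepoints by a BPF program or a perf script, is enough to compute the derived numbers. The structure and field names below are assumptions; only the tracepoints listed above are real.

/* Sketch of a per-flow timestamp record; the derived deltas match the
 * list above. */
struct flow_latency_record {
    u64 submit_ns;       /* block_rq_issue / io_uring submit      */
    u64 device_done_ns;  /* block_rq_complete / CQ entry observed */
    u64 cqe_visible_ns;  /* CQE posted to the ring                */
    u64 wakeup_ns;       /* sched_wakeup of the consumer          */
    u64 oncpu_ns;        /* sched_switch onto the consumer        */
    u32 completion_cpu;
    u32 consumer_cpu;
};

static void flow_latency_derive(const struct flow_latency_record *r,
                                u64 *device_done_to_cqe_visible,
                                u64 *cqe_visible_to_wakeup,
                                u64 *wakeup_to_oncpu,
                                bool *remote_completion)
{
    *device_done_to_cqe_visible = r->cqe_visible_ns - r->device_done_ns;
    *cqe_visible_to_wakeup      = r->wakeup_ns - r->cqe_visible_ns;
    *wakeup_to_oncpu            = r->oncpu_ns - r->wakeup_ns;
    *remote_completion          = r->completion_cpu != r->consumer_cpu;
}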
11. A Minimal UAPI Sketch
The least disruptive user-facing entry point is probably a per-request or per-ring hint model rather than a brand-new scheduling class. Something as small as this is enough to make the architecture concrete:
struct io_locality_hint {
    __u32 flow_id;
    __s16 home_cpu;
    __s16 home_node;
    __u16 class;
    __u16 flags;
    __u32 budget_us;
};
The point is not that this exact structure is final. The point is that the kernel needs a way to carry locality intent across layers that currently only see generic work items.
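How the hint gets attached is secondary. One plausible shape is a ring-scoped registration, sketched below: io_uring_register() is the real syscall, but IORING_REGISTER_LOCALITY_HINT is a hypothetical opcode and the whole call is an assumption.

#include <linux/types.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical opcode; not part of the io_uring UAPI today. */
#define IORING_REGISTER_LOCALITY_HINT 64

static int register_locality_hint(int ring_fd, __u32 flow_id,
                                  int home_cpu, int home_node,
                                  __u32 budget_us)
{
    struct io_locality_hint hint = {
        .flow_id   = flow_id,
        .home_cpu  = home_cpu,
        .home_node = home_node,
        .class     = 0,        /* class and flags encodings left open above */
        .flags     = 0,
        .budget_us = budget_us,
    };

    return syscall(__NR_io_uring_register, ring_fd,
                   IORING_REGISTER_LOCALITY_HINT, &hint, 1);
}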
12. Patch Queue Priorities
| Phase | Subsystem | Deliverable |
|---|---|---|
| 1 | Tracing | Provenance-aware BPF tracing across block, io_uring, and scheduler |
| 2 | blk-mq | Request-side provenance storage and locality-aware completion steering |
| 3 | io_uring | Latency-marked rings or per-request bounded-batching flags |
| 4 | RDMA / NVMe-oF | Flow identity carry-through and locality-aware CQ handling |
| 5 | Scheduler | Short anti-migration and home-CPU windows for completion consumers |
| 6 | Memory path | Registration reuse and locality accounting for repeated flows |
13. Final View
The missing layer in Linux storage and RDMA for AI is not another queue, another ring, or another transport feature by itself. It is the preservation of intent and locality across boundaries that currently throw that information away. Once you start viewing storage completions as orchestration events, not just successful I/O, the architecture changes become much clearer.
The short version: keep provenance alive, keep completions local, bound batching, and stop forcing every layer to rediscover consumer locality from scratch.