The AI Storage Stack Is Still Wasting Latency
If a cluster can afford premium GPUs but still treats SSD controllers, blk-mq, io_uring, DMA, and RDMA as generic pipes, it is leaving real performance and cost efficiency on the table.
Abstract. Modern AI infrastructure devotes enormous attention to accelerators and surprisingly little to the path that feeds them. That is a mistake. Inference and agentic systems increasingly depend on storage-backed model artifacts, prefetching, retrieval corpora, checkpoint movement, tool outputs, and disaggregated data paths whose performance is shaped by SSD controller policy, Linux block completions, DMA placement, userspace wakeups, and RDMA memory semantics. The problem is not that SSDs are too slow in aggregate. The problem is that the stack is still optimized around generic queue depth, fairness, and throughput instead of predictable completion locality and low tail latency. My view is that the biggest opportunity now is not a single faster drive. It is redesigning the stack so controller hints, queue affinity, blk-mq, io_uring, memory registration, and RDMA completion handling behave like one coordinated system.
1. The Wrong Storage Story for AI
People still describe AI storage as a bandwidth problem. That is only half the story.
Training and inference both need throughput, but many of the highest-value failures in real systems are not limits of sustained media bandwidth. They are coordination failures. A read completes on the wrong CPU. A completion gets coalesced for throughput when the application needed it now. The page cache and buffer ownership drift across NUMA domains. The RDMA registration path burns cycles because the stack keeps acting as if each transfer is unrelated to the last. None of that shows up cleanly if you only plot GB/s.
```
real storage latency in AI =
      media completion
    + controller scheduling
    + interrupt delivery
    + blk-mq completion path
    + io_uring or page-cache handoff
    + userspace wakeup
    + DMA / copy / registration side effects
    + possible RDMA or GPU transfer continuation
```
Once systems become orchestration-heavy or retrieval-heavy, small repeated coordination costs matter more than people expect. A GPU can sit idle because of a storage path that is technically healthy by classic metrics and still wrong for the workload.
Key point: the stack is too often tuned for average bandwidth while the workload depends on the latency shape of thousands of small control decisions.
2. The Stack Is Too Generic
The path from storage media to application state crosses too many layers that currently know too little about intent.
- **SSD controller**: understands flash timing, background work, queue scheduling, and internal DRAM, but usually sees host requests as generic reads and writes with weak semantic hints.
- **Linux block layer**: optimizes for queue parallelism and fairness, but often lacks a first-class notion of step-critical or latency-sensitive storage work.
- **io_uring / page cache**: reduces syscall overhead, but does not automatically solve completion locality, batching mistakes, or interaction with page ownership and wakeups.
- **RDMA and GPU path**: provides transport efficiency, but still leaves policy above transport mostly unowned: placement, admission, locality, registration reuse, and budget enforcement.
That fragmentation is why people often say “storage is fine” while the application still experiences stalls. The system is missing an end-to-end latency policy.
Another way to say it: the stack has plenty of mechanisms and too little intent. What is missing is not “more I/O.” It is shared agreement on which completions matter most, where they should land, and how much latency variance is acceptable.
3. The Hidden Tax in SSD and Block I/O
Linux block I/O on modern NVMe hardware is fast enough that the remaining taxes become architectural rather than obvious. They include completion CPU mismatch, excessive coalescing, generic worker fallback, memory locality drift, and poor attribution of delays after the media has already finished work.
| Layer | Classic optimization target | AI-era problem | Better target |
|---|---|---|---|
| SSD controller | Parallelism, wear, garbage collection | No strong host-visible hint for step-critical reads | Deadline classes and confidence-aware prefetch hints |
| NVMe queues | High queue depth | Queue depth can hide latency variance instead of fixing it | Mode switch between throughput and bounded-latency behavior |
| blk-mq | Scalable dispatch and completion | Completion locality not preserved end-to-end | Completion-to-consumer affinity as a first-class policy |
| io_uring | Low syscall overhead | CQE visibility and task_work timing still vary | Latency-mode CQE delivery and wakeup shaping |
| RDMA | Low-copy transport | Registration and placement policy still expensive | Long-lived memory intent and transport-aware control plane |
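To make the hinting gap in the table concrete, here is roughly the entire hint vocabulary a host application has today: advisory readahead and a high-priority flag. This is a minimal sketch assuming Linux with `preadv2` support; `RWF_HIPRI` only changes behavior on queues configured for polled completion, and nothing here can express deadlines, step-criticality, or prefetch confidence.

```c
// Sketch: today's host-visible hints are roughly "prefetch this"
// and "poll for this" -- far short of the deadline classes and
// confidence-aware prefetch hints argued for above.
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

// Advise the kernel a range will be needed soon (best-effort readahead).
static void hint_prefetch(int fd, off_t off, off_t len)
{
    posix_fadvise(fd, off, len, POSIX_FADV_WILLNEED);
}

// Issue a read flagged for high-priority (polled) completion.
// Only has an effect if the underlying queue supports polling.
static ssize_t read_hipri(int fd, void *buf, size_t len, off_t off)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    return preadv2(fd, &iov, 1, off, RWF_HIPRI);
}
```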
4. Completion Locality Matters More Than Another Queue
The storage stack still acts as if finishing the I/O is the main goal. For AI systems, finishing the I/O on the right CPU with the right memory ownership is usually more important than shaving a tiny amount off raw dispatch overhead.
That is why queue affinity is not a low-level detail anymore. It is product behavior.
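The mechanisms to enforce that alignment already exist; what is missing is anything that coordinates them. A minimal sketch, assuming Linux and root for the IRQ write. The IRQ number for a given NVMe queue is something the operator must look up (for example in /proc/interrupts), and note that many NVMe queue interrupts are kernel-managed and will refuse the write, in which case the alignment has to go the other way: pin the consumer to the CPU the queue already maps to.

```c
// Sketch: aligning the CQE consumer thread with the CPU that
// services a given NVMe queue's interrupt. Mechanism only; nothing
// here discovers the mapping or enforces it as policy.
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <pthread.h>

// Pin the calling thread to `cpu` so the consumer runs where the
// completion lands.
static int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// Steer an IRQ to the same CPU via the procfs affinity interface.
// Fails (by design) for kernel-managed IRQs; requires root otherwise.
static int steer_irq_to_cpu(int irq, int cpu)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity_list", irq);
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%d\n", cpu);
    return fclose(f);
}
```

In practice the nvme driver's queue-to-CPU mapping also has to agree with this, which is exactly the kind of cross-layer agreement the stack currently lacks.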
5. io_uring Helped, But It Did Not Finish the Job
io_uring removed plenty of userspace/kernel friction, but many teams over-read what that means. It made async composition cheaper. It did not suddenly give the storage stack a policy brain.
- The CQE may still show up later than the application really wanted because of batching or task_work timing.
- The completion path may still land on a CPU that is wrong for the consumer.
- The memory touched next may still be remote, cold, or trapped behind a registration/copy decision that lives outside the ring.
In AI systems, io_uring should be thought of as a mechanism that needs a stronger latency mode above it, not as the final answer.
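What a "stronger latency mode" can mean within the mechanism io_uring already offers: constrain when task_work runs at ring setup. A minimal sketch, assuming liburing and a 6.1+ kernel for IORING_SETUP_DEFER_TASKRUN and IORING_SETUP_SINGLE_ISSUER; older kernels would fall back to IORING_SETUP_COOP_TASKRUN.

```c
// Sketch: shaping CQE/task_work timing at ring setup. These flags
// narrow *when* completions are processed; they do not fix where
// they land or what memory they touch next.
#include <liburing.h>

int setup_latency_ring(struct io_uring *ring)
{
    struct io_uring_params p = { 0 };

    // Only one thread submits, so completion-side task_work can be
    // deferred until that thread explicitly enters the ring, instead
    // of interrupting it at arbitrary points.
    p.flags = IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN;

    return io_uring_queue_init_params(64, ring, &p);
}

// The consumer then reaps with an explicit wait, so deferred
// task_work runs exactly at this boundary, not as background churn.
int reap_one(struct io_uring *ring, struct io_uring_cqe **cqe)
{
    return io_uring_wait_cqe(ring, cqe);
}
```

Even with this, the completion CPU and downstream memory placement remain unmanaged, which is the gap the rest of this piece is about.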
6. RDMA Is Powerful and Still Incomplete
RDMA solved a transport problem brilliantly. It did not solve the system problem around transport. That missing layer becomes more obvious as AI clusters start to care about memory placement, storage-backed prefetch, GPU-feeding latency, and multi-tenant cost control.
What RDMA already does well
Low-copy movement, queue pairs, completion queues, transport offload, and high-throughput low-CPU data transfer across fabrics.
What it still does poorly as a system
Intent expression, locality policy, long-lived registration reuse, storage alignment, and application-visible latency contracts.
The missing abstraction is not another verb. It is a control plane above the verbs that can say:
```
this transfer is:
    step-critical
    likely to be consumed by CPU X / GPU Y
    expected to recur
    worth pre-registration
    allowed to bypass generic batching
    subject to a latency budget rather than a throughput-only budget
```
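As an example of the "expected to recur" and "worth pre-registration" lines above, here is a minimal registration-reuse sketch using libibverbs. The cache structure and function names are hypothetical illustrations, not a verbs API; a real cache would also handle invalidation when buffers are freed or remapped.

```c
// Sketch: treat a recurring transfer buffer as a long-lived
// registered object instead of re-registering per transfer.
// Error handling elided.
#include <infiniband/verbs.h>
#include <stddef.h>

struct mr_cache_entry {
    void          *addr;
    size_t         len;
    struct ibv_mr *mr;
};

// Return a cached registration if this buffer was seen before;
// otherwise register once and keep the MR alive for reuse.
struct ibv_mr *get_or_register(struct ibv_pd *pd,
                               struct mr_cache_entry *cache, size_t n,
                               void *addr, size_t len)
{
    for (size_t i = 0; i < n; i++)
        if (cache[i].addr == addr && cache[i].len >= len)
            return cache[i].mr;          // reuse: no kernel round trip

    for (size_t i = 0; i < n; i++) {
        if (cache[i].mr == NULL) {
            cache[i].mr = ibv_reg_mr(pd, addr, len,
                                     IBV_ACCESS_LOCAL_WRITE |
                                     IBV_ACCESS_REMOTE_READ);
            cache[i].addr = addr;
            cache[i].len  = len;
            return cache[i].mr;
        }
    }
    return NULL;                         // cache full; caller decides
}
```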
7. The Measurement Model Is Wrong Too Often
One reason this area stays under-optimized is that the wrong counters still dominate reviews. Teams look at SSD bandwidth, average latency, queue depth, or network throughput and conclude the stack is healthy. For AI systems, I would care more about these:
| Metric | Why it matters | Likely root cause when bad |
|---|---|---|
| device_done_to_userspace_ready | Captures whether completion was actually useful when it arrived | IRQ mismatch, blk-mq deferral, task_work lag, wakeup noise |
| completion_cpu != consumer_cpu ratio | Shows whether locality is being preserved or thrown away | Queue affinity mistakes, generic interrupt spray, weak steering |
| registration_reuse hit rate | Makes RDMA and GPU-adjacent memory churn visible | Short-lived mappings, transport policy ignorance, poor buffer lifecycle |
| storage_to_gpu continuation gap | Measures how long it takes for bytes to become accelerator-usable | Bounce buffers, copy paths, IOMMU overhead, wakeup mismatch |
```
tracepoints I would join:
    block_rq_issue
    block_rq_complete
    nvme completion hooks
    io_uring CQE visibility
    sched_wakeup
    sched_switch
    dma_fence or GPU handoff markers when available
    RDMA CQ handling

derived signals:
    useful_completion_ns
    locality_miss_rate
    deferred_completion_rate
    registration_churn_per_flow
```
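Once those tracepoints are joined into per-I/O records, the derived signals are trivial arithmetic; the hard part is collecting them end-to-end, not computing them. A sketch with a hypothetical record layout standing in for the joined trace data:

```c
// Sketch: deriving the per-I/O signals above from joined trace
// records. The record layout is an assumption; in practice the
// timestamps come from block/nvme tracepoints and sched events.
#include <stdint.h>

struct io_trace_rec {
    uint64_t device_done_ns;     // e.g. block_rq_complete timestamp
    uint64_t userspace_ready_ns; // e.g. CQE seen / wakeup finished
    int      completion_cpu;     // CPU that ran the completion path
    int      consumer_cpu;       // CPU that consumed the data
};

struct derived {
    uint64_t useful_completion_ns_sum;
    uint64_t locality_misses;
    uint64_t total;
};

void accumulate(struct derived *d, const struct io_trace_rec *r)
{
    d->useful_completion_ns_sum +=
        r->userspace_ready_ns - r->device_done_ns;
    d->locality_misses += (r->completion_cpu != r->consumer_cpu);
    d->total++;
}

double locality_miss_rate(const struct derived *d)
{
    return d->total ? (double)d->locality_misses / (double)d->total : 0.0;
}
```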
8. The Storage-to-GPU Path Is Still Too Wasteful
Even in systems that advertise direct paths, the reality is often messier. There are still hidden bounce buffers, registration churn, IOMMU overhead, pinned-memory lifecycle costs, userspace wakeup mismatch, and queueing decisions that were made for old throughput assumptions instead of modern accelerator economics.
The expensive mistake: buying faster accelerators while leaving the storage-to-GPU feed path governed by generic server defaults.
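One of the cheapest taxes to remove is the page-cache bounce on data that is only passing through on its way to an accelerator. A minimal sketch, assuming Linux O_DIRECT semantics; the 4096-byte alignment is an assumption standing in for the device's actual logical block size, and this says nothing about the GPU side of the handoff.

```c
// Sketch: read with O_DIRECT into a long-lived, aligned buffer
// instead of routing through the page cache. The same buffer is
// what you would pin/register for a GPU or RDMA continuation.
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define ALIGN 4096  // assumption: device logical block size

// Allocate once, reuse for the life of the pipeline.
static void *alloc_dma_friendly(size_t len)
{
    void *buf = NULL;
    if (posix_memalign(&buf, ALIGN, len) != 0)
        return NULL;
    return buf;
}

static ssize_t read_direct(const char *path, void *buf, size_t len)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;
    // buf, len, and offset must all be aligned for O_DIRECT.
    ssize_t n = pread(fd, buf, len, 0);
    close(fd);
    return n;
}
```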
9. What I Would Change First
| Priority | Area | Change | Why it matters |
|---|---|---|---|
| 1 | Completion locality | Align SSD queue affinity, blk-mq completion CPU, and userspace consumer CPU | Reduces the repeated boundary tax after the media is already done |
| 2 | Latency mode | Add bounded-latency operating modes across controller, block layer, and io_uring | Lets the stack stop pretending every workload wants maximum queue depth |
| 3 | Hinting | Expose step-critical, hot metadata, and prefetch-confidence hints to the storage stack | Controllers and the host need better semantics than generic reads |
| 4 | Registration reuse | Treat repeated RDMA and GPU-adjacent memory regions as first-class, long-lived objects | Registration churn is wasted CPU time in stable pipelines |
| 5 | Observability | Trace from NVMe completion through wakeup and onward to GPU-feed continuation | You cannot fix what the stack refuses to attribute |
10. A Better Way to Think About the Stack
The storage path should not be decomposed into isolated teams each maximizing its own local metric. It should be treated as a coordinated latency surface with shared ownership.
```
desired policy loop:
    application declares intent
    -> controller and host queues choose latency mode
    -> completion steered to consumer locality
    -> io_uring or cache hands off without generic detours
    -> memory registration / DMA reuse honored
    -> possible RDMA or GPU continuation inherits same policy
    -> instrumentation records end-to-end usefulness
```
That is how you stop the system from wasting tens or hundreds of microseconds at every boundary and then hiding behind good aggregate bandwidth numbers.
11. Why This Matters More for Agents and Retrieval
Agentic systems, retrieval-heavy systems, and disaggregated inference systems all amplify storage path flaws because they chain many dependent operations together. They are not just “reading data.” They are reading data that gates the next decision. That means the kernel and controller path becomes part of the reasoning loop.
12. Final View
The next storage win for AI is probably not a new marketing number on the SSD box. It is a control-plane upgrade across SSD controller policy, Linux completion locality, io_uring latency behavior, memory registration, and RDMA semantics. The hardware is already fast enough that the coordination mistakes now dominate more than people want to admit.
The short version: the AI storage stack is still too generic. The systems that win will be the ones that treat storage completions as latency-sensitive orchestration events, not just successful I/O.