MANISH AI
Storage Systems Essay / Linux / RDMA

The AI Storage Stack Is Still Wasting Latency

If a cluster can afford premium GPUs but still treats SSD controllers, blk-mq, io_uring, DMA, and RDMA as generic pipes, it is leaving real performance and cost efficiency on the table.

23 min read · Focus: SSD / block / io_uring / RDMA · Thesis: latency mode beats generic throughput mode

Abstract. Modern AI infrastructure spends enormous attention on accelerators and surprisingly little on the path that feeds them. That is a mistake. Inference and agentic systems increasingly depend on storage-backed model artifacts, prefetching, retrieval corpora, checkpoint movement, tool outputs, and disaggregated data paths whose performance is shaped by SSD controller policy, Linux block completions, DMA placement, userspace wakeups, and RDMA memory semantics. The problem is not that SSDs are too slow in aggregate. The problem is that the stack is still optimized around generic queue depth, fairness, and throughput instead of predictable completion locality and low tail latency. My view is that the biggest opportunity now is not a single faster drive. It is redesigning the stack so controller hints, queue affinity, blk-mq, io_uring, memory registration, and RDMA completion handling behave like one coordinated system.

1. The Wrong Storage Story for AI

People still describe AI storage as a bandwidth problem. That is only half the story.

Training and inference both need throughput, but many of the highest-value failures in real systems are not sustained-media limits. They are coordination failures. A read completes on the wrong CPU. A completion gets coalesced for throughput when the application needed it now. The page cache and buffer ownership drift across NUMA domains. The RDMA registration path burns cycles because the stack keeps acting as if each transfer is unrelated to the last. None of that shows up cleanly if you only plot GB/s.

real storage latency in AI =
  media completion
+ controller scheduling
+ interrupt delivery
+ blk-mq completion path
+ io_uring or page-cache handoff
+ userspace wakeup
+ DMA / copy / registration side effects
+ possible RDMA or GPU transfer continuation
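The sum above can be made concrete with a toy model. All component numbers below are illustrative assumptions, not measurements from any real system; the point is that media time is only a fraction of what the application actually waits for.

```python
# Toy model of the end-to-end latency sum above. Every value is an
# invented, illustrative number in microseconds, not a measurement.
COMPONENTS_US = {
    "media_completion": 12.0,
    "controller_scheduling": 3.0,
    "interrupt_delivery": 2.0,
    "blk_mq_completion": 1.5,
    "io_uring_handoff": 1.0,
    "userspace_wakeup": 4.0,
    "dma_copy_registration": 5.0,
    "rdma_gpu_continuation": 6.0,
}

def real_storage_latency_us(components):
    """End-to-end latency as the sum of every boundary crossing."""
    return sum(components.values())

total = real_storage_latency_us(COMPONENTS_US)
media_share = COMPONENTS_US["media_completion"] / total
print(f"total: {total:.1f} us, media share: {media_share:.0%}")
```

Even with generous numbers for the media, the coordination components dominate the total, which is why plotting GB/s alone misses the problem.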

Once systems become orchestration-heavy or retrieval-heavy, small repeated coordination costs matter more than people expect. A GPU can sit idle because of a storage path that is technically healthy by classic metrics and still wrong for the workload.

Key point: the stack is too often tuned for average bandwidth while the workload depends on the latency shape of thousands of small control decisions.

2. The Stack Is Too Generic

The path from storage media to application state crosses too many layers that currently know too little about intent.

SSD controller

Understands flash timing, background work, queue scheduling, and internal DRAM, but usually sees host requests as generic reads and writes with weak semantic hints.

Linux block layer

Optimizes for queue parallelism and fairness, but often lacks a first-class notion of step-critical or latency-sensitive storage work.

io_uring / page cache

Reduces syscall overhead, but does not automatically solve completion locality, batching mistakes, or interaction with page ownership and wakeups.

RDMA and GPU path

Provides transport efficiency, but still leaves policy above transport mostly unowned: placement, admission, locality, registration reuse, and budget enforcement.

That fragmentation is why people often say “storage is fine” while the application still experiences stalls. The system is missing an end-to-end latency policy.

Another way to say it: the stack has plenty of mechanisms and too little intent. What is missing is not “more I/O.” It is shared agreement on which completions matter most, where they should land, and how much latency variance is acceptable.

3. The Hidden Tax in SSD and Block I/O

Linux block I/O on modern NVMe hardware is fast enough that the remaining taxes become architectural rather than obvious. They include completion CPU mismatch, excessive coalescing, generic worker fallback, memory locality drift, and poor attribution of delays after the media has already finished work.

Layer | Classic optimization target | AI-era problem | Better target
SSD controller | Parallelism, wear, garbage collection | No strong host-visible hint for step-critical reads | Deadline classes and confidence-aware prefetch hints
NVMe queues | High queue depth | Queue depth can hide latency variance instead of fixing it | Mode switch between throughput and bounded-latency behavior
blk-mq | Scalable dispatch and completion | Completion locality not preserved end-to-end | Completion-to-consumer affinity as a first-class policy
io_uring | Low syscall overhead | CQE visibility and task_work timing still vary | Latency-mode CQE delivery and wakeup shaping
RDMA | Low-copy transport | Registration and placement policy still expensive | Long-lived memory intent and transport-aware control plane

4. Completion Locality Matters More Than Another Queue

The storage stack still acts as if finishing the I/O is the main goal. For AI systems, finishing the I/O on the right CPU with the right memory ownership is usually more important than shaving a tiny amount off raw dispatch overhead.

[Figure: the storage path that actually matters. The media can be done while the application is still effectively waiting: SSD done (media complete) → controller + IRQ (MSI-X delivery) → blk-mq (completion path) → io_uring / cache (ownership handoff) → wakeup (right CPU?) → use (next step). Optimization target: keep completion, memory ownership, and userspace resumption on the same locality path.]
Figure 1. Storage performance should be measured by completion usefulness, not just completion existence.

That is why queue affinity is not a low-level detail anymore. It is product behavior.
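One way to make that product behavior measurable is to join completion events with wakeup events and count how often the completion CPU and the consumer CPU disagree. The event tuples below are hypothetical stand-ins; in practice they would come from tracepoints such as `block_rq_complete` and `sched_wakeup`.

```python
# Sketch: measuring how often a completion lands on a different CPU
# than the thread that consumes it. The trace data is invented for
# illustration; real pairs would come from joined tracepoints.
def locality_miss_rate(events):
    """events: iterable of (completion_cpu, consumer_cpu) pairs."""
    events = list(events)
    if not events:
        return 0.0
    misses = sum(1 for done_cpu, use_cpu in events if done_cpu != use_cpu)
    return misses / len(events)

trace = [(0, 0), (0, 4), (2, 2), (2, 6), (1, 1), (3, 7)]
print(locality_miss_rate(trace))  # half of these completions landed remotely
```

A rising miss rate under load is exactly the kind of signal that classic bandwidth counters never surface.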

5. io_uring Helped, But It Did Not Finish the Job

io_uring removed plenty of userspace/kernel friction, but many teams over-read what that means. It made async composition cheaper. It did not suddenly give the storage stack a policy brain.

  • The CQE may still show up later than the application really wanted because of batching or task_work timing.
  • The completion path may still land on a CPU that is wrong for the consumer.
  • The memory touched next may still be remote, cold, or trapped behind a registration/copy decision that lives outside the ring.

In AI systems, io_uring should be thought of as a mechanism that needs a stronger latency mode above it, not as the final answer.
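The batching point above can be illustrated with a toy simulation: when completions are held until a batch fills, early completions pay for late ones. The batch-flush rule and timestamps here are invented for illustration, not a model of any specific kernel policy.

```python
# Toy simulation of why batched completion delivery hurts latency-
# sensitive work: a completion is only visible when its whole batch
# is delivered, so its delay is delivery time minus its own finish time.
def delivery_delays(completion_times_us, batch_size):
    """Return the extra wait each completion sees under batching."""
    delays = []
    for i in range(0, len(completion_times_us), batch_size):
        batch = completion_times_us[i:i + batch_size]
        delivered_at = batch[-1]  # batch flushes on its last completion
        delays.extend(delivered_at - t for t in batch)
    return delays

times = [10.0, 12.0, 40.0, 41.0]             # two fast, two slow completions
print(delivery_delays(times, batch_size=4))  # first completion waits 31 us
print(delivery_delays(times, batch_size=1))  # immediate delivery: no wait
```

A latency mode is, in this framing, just a policy that picks the second behavior for step-critical work and reserves batching for throughput work.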

6. RDMA Is Powerful and Still Incomplete

RDMA solved a transport problem brilliantly. It did not solve the system problem around transport. That missing layer is becoming more obvious now that AI clusters care about memory placement, storage-backed prefetch, GPU-feeding latency, and multi-tenant cost control.

What RDMA already does well

Low-copy movement, queue pairs, completion queues, transport offload, and high-throughput low-CPU data transfer across fabrics.

What it still does poorly as a system

Intent expression, locality policy, long-lived registration reuse, storage alignment, and application-visible latency contracts.

The missing abstraction is not another verb. It is a control plane above the verbs that can say:

this transfer is:
  step-critical
  likely to be consumed by CPU X / GPU Y
  expected to recur
  worth pre-registration
  allowed to bypass generic batching
  subject to a latency budget rather than a throughput-only budget
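The "worth pre-registration" line in particular can be sketched as a cache of long-lived registrations keyed by buffer identity, so repeated transfers over the same memory skip re-registration. The handle format and cost model below are assumptions for illustration, not a real RDMA API.

```python
# Sketch of registration reuse: a cache of long-lived memory
# registrations keyed by (address, length). The "mr-N" handles are
# stand-ins for real memory-region handles, not an actual verbs API.
class RegistrationCache:
    def __init__(self):
        self._regions = {}      # (addr, length) -> opaque handle
        self.hits = 0
        self.misses = 0

    def register(self, addr, length):
        key = (addr, length)
        handle = self._regions.get(key)
        if handle is not None:
            self.hits += 1                    # reuse: no registration cost
            return handle
        self.misses += 1                      # pay registration cost once
        handle = f"mr-{len(self._regions)}"   # stand-in for a real handle
        self._regions[key] = handle
        return handle

    def reuse_hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

cache = RegistrationCache()
for _ in range(8):                 # a stable pipeline reusing one buffer
    cache.register(0x7f00_0000, 1 << 20)
print(cache.reuse_hit_rate())      # 7 reuses out of 8 lookups
```

In a stable pipeline the hit rate should approach 1.0; a low rate is the registration churn the next section proposes to measure.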

7. The Measurement Model Is Wrong Too Often

One reason this area stays under-optimized is that the wrong counters still dominate reviews. Teams look at SSD bandwidth, average latency, queue depth, or network throughput and conclude the stack is healthy. For AI systems, I would care more about these:

Metric | Why it matters | Likely root cause when bad
device_done_to_userspace_ready | Captures whether completion was actually useful when it arrived | IRQ mismatch, blk-mq deferral, task_work lag, wakeup noise
completion_cpu != consumer_cpu ratio | Shows whether locality is being preserved or thrown away | Queue affinity mistakes, generic interrupt spray, weak steering
registration_reuse hit rate | Makes RDMA and GPU-adjacent memory churn visible | Short-lived mappings, transport policy ignorance, poor buffer lifecycle
storage_to_gpu continuation gap | Measures how long it takes for bytes to become accelerator-usable | Bounce buffers, copy paths, IOMMU overhead, wakeup mismatch

tracepoints I would join:
  block_rq_issue
  block_rq_complete
  nvme completion hooks
  io_uring CQE visibility
  sched_wakeup
  sched_switch
  dma_fence or GPU handoff markers when available
  RDMA CQ handling

derived signals:
  useful_completion_ns
  locality_miss_rate
  deferred_completion_rate
  registration_churn_per_flow
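Joining those tracepoints into the derived signals can be sketched as a per-request merge of timestamps. The event names and values below are hypothetical stand-ins for what the listed tracepoints would emit; only the subtraction structure is the point.

```python
# Sketch of deriving per-request signals from joined trace events.
# Event names ("device_done", "cqe_visible", "userspace_ready") are
# invented labels for timestamps that would come from block_rq_complete,
# io_uring CQE visibility, and sched_wakeup respectively.
def derive_signals(events_by_request):
    """events_by_request: {req_id: {event_name: timestamp_ns}}."""
    out = []
    for req, ev in events_by_request.items():
        out.append({
            "req": req,
            # device finished -> data actually usable in userspace
            "useful_completion_ns": ev["userspace_ready"] - ev["device_done"],
            # portion spent before the completion was even visible
            "deferred_ns": ev["cqe_visible"] - ev["device_done"],
        })
    return out

trace = {
    "r1": {"device_done": 1_000, "cqe_visible": 9_000,
           "userspace_ready": 15_000},
}
print(derive_signals(trace))
```

The value of the join is attribution: a large `useful_completion_ns` with a small `deferred_ns` points at wakeup and locality, not at the kernel completion path.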

8. The Storage-to-GPU Path Is Still Too Wasteful

Even in systems that advertise direct paths, the reality is often messier. There are still hidden bounce buffers, registration churn, IOMMU overhead, pinned-memory lifecycle costs, userspace wakeup mismatch, and queueing decisions that were made for old throughput assumptions instead of modern accelerator economics.

The expensive mistake: buying faster accelerators while leaving the storage-to-GPU feed path governed by generic server defaults.

9. What I Would Change First

Priority | Area | Change | Why it matters
1 | Completion locality | Align SSD queue affinity, blk-mq completion CPU, and userspace consumer CPU | Reduces the repeated boundary tax after the media is already done
2 | Latency mode | Add bounded-latency operating modes across controller, block layer, and io_uring | Lets the stack stop pretending every workload wants maximum queue depth
3 | Hinting | Expose step-critical, hot metadata, and prefetch-confidence hints to the storage stack | Controllers and the host need better semantics than generic reads
4 | Registration reuse | Treat repeated RDMA and GPU-adjacent memory regions as first-class, long-lived objects | Registration churn is wasted CPU time in stable pipelines
5 | Observability | Trace from NVMe completion through wakeup and onward to GPU-feed continuation | You cannot fix what the stack refuses to attribute

10. A Better Way to Think About the Stack

The storage path should not be decomposed into isolated teams each maximizing its own local metric. It should be treated as a coordinated latency surface with shared ownership.

desired policy loop:
  application declares intent
    -> controller and host queues choose latency mode
    -> completion steered to consumer locality
    -> io_uring or cache hands off without generic detours
    -> memory registration / DMA reuse honored
    -> possible RDMA or GPU continuation inherits same policy
    -> instrumentation records end-to-end usefulness
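The loop above implies a single intent object that every stage reads instead of each layer guessing. The sketch below invents field names to show the shape of that contract; no kernel interface like this exists today.

```python
# Sketch of the policy loop as one declared-intent object carried
# through the stack. All field names are invented for illustration;
# this is a proposed shape, not an existing interface.
from dataclasses import dataclass

@dataclass(frozen=True)
class IoIntent:
    step_critical: bool          # gates the next decision in the loop
    consumer_cpu: int            # where the completion should land
    latency_budget_us: float     # a bound, not a throughput target
    reuse_registration: bool     # long-lived buffer, skip re-registration

def choose_queue_mode(intent: IoIntent) -> str:
    """Host-side queue policy derived from declared intent rather than
    from generic queue-depth heuristics."""
    if intent.step_critical and intent.latency_budget_us <= 100.0:
        return "bounded-latency"
    return "throughput"

intent = IoIntent(step_critical=True, consumer_cpu=4,
                  latency_budget_us=50.0, reuse_registration=True)
print(choose_queue_mode(intent))
```

The design choice is that policy is declared once, at the top, and inherited by every boundary the request crosses, which is what makes end-to-end attribution possible.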

That is how you stop the system from wasting tens or hundreds of microseconds at every boundary and then hiding behind good aggregate bandwidth numbers.

11. Why This Matters More for Agents and Retrieval

Agentic systems, retrieval-heavy systems, and disaggregated inference systems all amplify storage path flaws because they chain many dependent operations together. They are not just “reading data.” They are reading data that gates the next decision. That means the kernel and controller path becomes part of the reasoning loop.

12. Final View

The next storage win for AI is probably not a new marketing number on the SSD box. It is a control-plane upgrade across SSD controller policy, Linux completion locality, io_uring latency behavior, memory registration, and RDMA semantics. The hardware is already fast enough that the coordination mistakes now dominate more than people want to admit.

The short version: the AI storage stack is still too generic. The systems that win will be the ones that treat storage completions as latency-sensitive orchestration events, not just successful I/O.