Systems · Storage · Kernel · Controllers · AI Data Path

The Brains Behind AI Storage, v2: Smart Controllers, Linux Drivers, and the Full Path from read() to GPU

This version ties the whole story together. It explains smart storage controllers and host adapters, maps them to real Linux kernel drivers, and then follows the actual data path inside an AI server from a userspace read() all the way to GPU HBM. The core point is simple: storage performance is not just about media. It is about the chain of controllers, queues, DMA engines, and drivers that transform bytes into usable accelerator input.


The thesis

In an AI server, data does not “come from disk to GPU” in one step. It crosses a layered control plane: drive firmware, host-visible storage controllers, Linux kernel drivers, DMA and IOMMU machinery, CPU memory, and finally the GPU memory system. The more expensive the GPU, the more valuable it becomes to understand every layer in that chain.

Drive firmware: Manages flash blocks, servo, ECC, caches, and media-specific recovery inside the drive.
Host controller: Exposes AHCI, NVMe, SAS, RAID, PQI, or VMD-style control surfaces to the OS.
Kernel driver: Owns probe, queue setup, DMA, MSI/MSI-X, reset, and device exposure upward to block or SCSI layers.
AI path: Bytes move from storage media to DRAM and then to HBM, or directly to HBM with GPUDirect Storage.

1. The Layered Controller Model

The word “controller” means different things depending on where you stand in the stack. There is often a controller inside the drive, a controller or host adapter visible on PCIe or the motherboard, and then a Linux driver that binds to the host-visible device. These are not interchangeable layers.

Drive-internal controller: Hidden behind the protocol boundary. In SSDs it manages NAND channels, FTL, ECC, garbage collection, and wear leveling. In HDDs it manages servo, read-channel processing, caches, ECC, and retry behavior.
Host-visible controller: The PCIe or chipset-facing interface the kernel actually probes: AHCI, NVMe, SAS HBA, RAID adapter, PQI SmartRAID, or Intel VMD.
Kernel driver: The Linux code that sets up DMA, queues, interrupts, reset logic, and then presents block devices, namespaces, or SCSI targets upward.
Linux usually talks to the host-visible controller interface, not directly to the deepest firmware logic inside every SSD or HDD.

This is why the kernel paths you care about are things like drivers/ata/ahci.c, drivers/nvme/host/pci.c, drivers/scsi/smartpqi/, or drivers/pci/controller/vmd.c, not “controller inside vendor SSD X.”
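On a Linux machine you can see which of these drivers actually bound to each block device without guessing. The sketch below walks a device's resolved sysfs path upward until it finds a `driver` symlink; the path layout is assumed from standard sysfs conventions, and the function returns an empty map where sysfs is absent (non-Linux systems, some containers):

```python
import os

def block_device_drivers(sysfs="/sys/block"):
    """Map each block device to the kernel driver bound to its
    host-visible controller (nvme, ahci, smartpqi, ...), by walking the
    device's resolved sysfs path upward until a 'driver' symlink appears.
    Returns {} where sysfs is absent (non-Linux systems, containers)."""
    if not os.path.isdir(sysfs):
        return {}
    drivers = {}
    for dev in sorted(os.listdir(sysfs)):
        node = os.path.realpath(os.path.join(sysfs, dev, "device"))
        driver = None
        while True:
            link = os.path.join(node, "driver")
            if os.path.islink(link):
                driver = os.path.basename(os.readlink(link))
                break
            parent = os.path.dirname(node)
            if parent == node:            # reached filesystem root
                break
            node = parent
        drivers[dev] = driver
    return drivers
```

On an NVMe-only box this typically yields something like `{'nvme0n1': 'nvme'}`; virtual devices such as loop or dm map to None, since no host-controller driver sits behind them.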

2. Host Controllers: AHCI, NVMe, SAS HBAs, RAID, and VMD

AHCI / SATA

AHCI standardized SATA host behavior. In Linux the canonical low-level driver is drivers/ata/ahci.c, layered over libata. AHCI was a major unification step, but it reflects an older storage era and does not express the same queue-native, PCIe-native parallelism as NVMe.

NVMe

NVMe is not just “faster SSDs.” It is a different controller model: PCIe-native, queue-heavy, and designed for parallel flash. In Linux the host stack lives under drivers/nvme/host/, with PCI transport in drivers/nvme/host/pci.c. The NVM Express organization points Linux users directly to the in-kernel driver.

SAS HBAs

SAS HBAs matter in enterprise servers and storage arrays where expanders, dual paths, and large topologies come into play. Broadcom’s long-running ecosystems around mpt3sas and the newer mpi3mr are central here.

RAID / Smart Array

A real hardware RAID or SmartRAID controller is effectively a storage computer on a card: it has firmware, queueing logic, topology abstraction, configuration metadata, and sometimes cache or cache protection. The operating system may see virtual disks rather than raw drives.

Intel VMD

Intel’s Volume Management Device inserts a platform-level storage control layer around NVMe domains. It is useful for hot-plug, enterprise manageability, and platform partitioning of NVMe devices. Intel’s VMD white paper notes that the Linux VMD driver has been in-kernel since Linux 4.5, and the current source lives in drivers/pci/controller/vmd.c.

3. Vendor Landscape and Why the Names Matter

Broadcom / LSI: Major in enterprise HBAs and RAID. Linux-visible footprints include megaraid_sas, mpt3sas, and mpi3mr.
Microchip / Adaptec: SmartRAID / SmartHBA family, represented in Linux by smartpqi. Kernel docs explicitly identify smartpqi as the next-generation SCSI driver for Microchip PQI controllers.
Intel: Not the classic RAID-card story here; instead the interesting angle is VMD and platform-level NVMe domain control.
Drive makers: Seagate, Western Digital, SanDisk, Samsung, and others matter because their internal controller firmware shapes media behavior, but that layer is usually hidden from Linux behind NVMe/SATA/SAS.
For drives, the most important controller may be invisible to Linux. For HBAs and RAID adapters, the controller is exactly what Linux binds to.

4. Linux Kernel Driver Map

Storage function        Kernel path / driver            What it controls
SATA / AHCI host        drivers/ata/ahci.c              AHCI low-level SATA HBA behavior on top of libata
ATA/SATA framework      libata                          Shared ATA/SATA host infrastructure and helpers
NVMe PCI host           drivers/nvme/host/pci.c         PCI transport for the Linux NVMe host stack
NVMe core               drivers/nvme/host/              Core namespace, queue, transport, fabrics, and reset logic
MegaRAID                drivers/scsi/megaraid/          Broadcom/LSI hardware RAID family
SAS HBA                 drivers/scsi/mpt3sas/           Broadcom/LSI SAS/SATA HBA family
MPI3 controllers        drivers/scsi/mpi3mr/            Newer-generation tri-mode Broadcom controllers
SmartRAID / SmartHBA    drivers/scsi/smartpqi/          Microchip PQI-based smart storage controllers
Intel VMD               drivers/pci/controller/vmd.c    Platform-level NVMe domain/control layer

5. The Full Path Begins with read()

Now we switch from taxonomy to actual execution. Suppose an AI training or inference pipeline needs bytes from storage. In userspace, that often starts conceptually as a file read—even if the upper layers are a dataloader, framework runtime, or dataset library.

Userspace
  -> read(fd, buf, size)
  -> VFS path resolution
  -> page cache lookup
  -> if miss: create block I/O
  -> blk-mq request mapping
  -> storage driver submission
  -> controller + DMA path

VFS and page cache

The Linux Virtual Filesystem resolves the file descriptor, inode, and filesystem operations. Then the page cache is checked. If the data is already in memory, the storage device is never touched. If it is not, Linux creates block I/O operations.
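The hit-or-miss behavior is easy to exercise, even though userspace cannot directly observe which read was served from the page cache. A minimal buffered-read sketch:

```python
import os
import tempfile

# Normal buffered read path: write 4 KiB, then read it back twice.
# The first pread may miss the page cache and trigger block I/O; the
# second is normally served straight from the cache, never touching
# the device. (Userspace cannot see the hit/miss split directly.)
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"A" * 4096)
    path = f.name

fd = os.open(path, os.O_RDONLY)
first = os.pread(fd, 4096, 0)   # possible cache miss -> block I/O path
second = os.pread(fd, 4096, 0)  # same offset, normally a cache hit
os.close(fd)
os.unlink(path)
```

Bypassing the cache entirely requires O_DIRECT with aligned buffers, which is exactly the kind of path dataloaders and GPUDirect Storage builds on.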

Block layer and blk-mq

Modern Linux uses blk-mq, the multi-queue block layer. This matters because it reduces lock contention and maps much more naturally to modern controllers, especially NVMe devices that are themselves queue-rich. The logical shape is often:

bio -> request -> hardware queue -> driver -> device queue(s)

This is already one place where a smart controller matters. If the hardware queue model is shallow, the software can only do so much.
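The queue-mapping idea can be sketched as a toy model: per-CPU submissions hash onto a fixed set of hardware queues, roughly like a default blk-mq mq_map. The class and mapping policy here are illustrative, not kernel code:

```python
from collections import defaultdict

class ToyBlkMQ:
    """Toy model of blk-mq's per-CPU submissions feeding a fixed set of
    hardware queues (bio -> request -> hw queue). Real blk-mq lives in
    block/blk-mq.c; this only shows the mapping shape."""

    def __init__(self, nr_cpus, nr_hw_queues):
        # Static CPU -> hardware-queue map, like a simple default mq_map.
        self.mq_map = {cpu: cpu % nr_hw_queues for cpu in range(nr_cpus)}
        self.hw_queues = defaultdict(list)

    def submit(self, cpu, bio):
        hwq = self.mq_map[cpu]            # pick the hw queue for this CPU
        self.hw_queues[hwq].append(bio)   # request lands on that queue
        return hwq
```

When nr_hw_queues is 1 (a shallow controller), every CPU contends on the same queue; that is the software-side cost of a weak hardware queue model.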

6. NVMe Submission/Completion Queues and Doorbells

NVMe is worth understanding because it is the cleanest modern example of a storage controller interface that exposes parallelism rather than hiding it. The high-level flow is:

CPU / driver writes command entry -> Submission Queue (SQ)
CPU performs MMIO write -> doorbell register
Controller fetches command
Controller executes I/O
Controller writes result -> Completion Queue (CQ)
MSI-X interrupt or polling path signals completion
Driver reclaims request
Queues and doorbells are the language of the host-controller conversation.
NVMe exposes a much more explicit queueing model than older storage interfaces, which is one reason it fits modern SSDs so well.

The doorbell is important conceptually: it is the MMIO signal from host to controller saying “new work is available.” This is one of the places where protocol design directly shapes software efficiency.
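The SQ/doorbell/CQ handshake can be modeled in a few lines. In this toy version the "controller" executes synchronously when the doorbell is rung; real hardware fetches commands asynchronously after the MMIO doorbell write, and completions arrive via MSI-X:

```python
class ToyNVMeQueuePair:
    """Minimal sketch of the NVMe SQ / doorbell / CQ handshake described
    above. Index arithmetic mirrors a circular queue; everything else
    about a real controller (async fetch, MSI-X, phase bits) is elided."""

    def __init__(self, depth=16):
        self.depth = depth
        self.sq = [None] * depth   # submission queue ring
        self.cq = []               # completion entries, in arrival order
        self.sq_tail = 0           # host-owned tail index
        self.db_tail = 0           # value last written to the doorbell

    def submit(self, cmd):
        self.sq[self.sq_tail] = cmd
        self.sq_tail = (self.sq_tail + 1) % self.depth

    def ring_doorbell(self):
        # In hardware this is one MMIO write of sq_tail; here the toy
        # controller consumes every newly visible entry immediately.
        while self.db_tail != self.sq_tail:
            cmd = self.sq[self.db_tail]
            self.db_tail = (self.db_tail + 1) % self.depth
            self.cq.append({"cid": cmd["cid"], "status": 0})
```

Note that submit() alone does nothing observable to the controller; work only becomes visible at the doorbell write, which is why batching several submissions per doorbell is a common efficiency trick.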

7. DMA, IOMMU, and the Role of System DRAM

Once the request is accepted, data movement usually happens through DMA: the controller moves data directly into system memory without the CPU copying every byte itself.

DMA engine: The controller or device uses DMA engines to move data efficiently into memory buffers prepared by the kernel.
IOMMU: Translates device-visible addresses into safe host memory mappings, especially important in virtualization and secure isolation.
Pinned / prepared memory: The kernel needs memory mappings the device can safely target; pageable memory semantics matter a lot here.
Interrupt or poll completion: Once DMA completes, the CPU learns about it via MSI/MSI-X interrupt or polling logic.

At this stage, the conventional path in an AI server is:

Storage media -> drive controller -> host controller -> PCIe -> DMA -> system DRAM

That DRAM step is not just a passive buffer. It becomes the staging ground for preprocessing, page cache reuse, framework handoff, and eventually the CPU-to-GPU transfer.
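The IOMMU's role above can be sketched as a translation table that rejects unmapped or read-only targets. ToyIOMMU is a made-up illustration of the isolation property, not a model of any real IOMMU's page-table format:

```python
class ToyIOMMU:
    """Sketch of the IOMMU role: device-visible I/O virtual addresses
    (IOVAs) translate only through explicit mappings, so a stray or
    malicious DMA write cannot land in arbitrary host memory."""

    def __init__(self):
        self.maps = {}   # iova -> (host_page, writable)

    def map(self, iova, host_page, writable):
        # The kernel's DMA API sets up mappings like this before
        # handing an address to the device.
        self.maps[iova] = (host_page, writable)

    def dma_write(self, iova):
        if iova not in self.maps or not self.maps[iova][1]:
            raise PermissionError(f"blocked DMA write to IOVA {iova:#x}")
        return self.maps[iova][0]
```

Without an IOMMU (or with passthrough mode), the device sees physical addresses directly and that blocked-write case simply corrupts memory, which is why isolation matters so much in virtualized AI fleets.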

8. CPU Memory to GPU HBM

Once the bytes are in system memory, AI frameworks still have to get them into GPU memory. Traditionally that means another DMA-driven step:

DRAM -> PCIe or NVLink path -> GPU memory controller -> HBM

On the host side, runtime libraries such as CUDA manage memory registration, copies, and asynchronous transfer orchestration. On the device side, the GPU memory subsystem ultimately needs the bytes staged into HBM before compute kernels can consume them at full speed.

The traditional AI storage path is really two staged transfers: storage to DRAM, then DRAM to HBM.

And that is exactly why GPUDirect Storage exists: if the CPU/DRAM staging path is too expensive, the system tries to remove it.
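The cost of those two staged transfers can be put in rough numbers with a deliberately crude model: without overlap the hops serialize; with perfect double-buffering the slower hop dominates. The bandwidth figures in the usage note below are placeholders, not measurements:

```python
def staged_transfer_time(nbytes, storage_bw, pcie_bw, overlap=False):
    """Time for the two-hop path (storage -> DRAM, then DRAM -> HBM).
    Bandwidths in bytes/s. With no overlap the hops serialize; with
    perfect double-buffering only the slower hop is visible. Ignores
    latency, queueing, and CPU preprocessing on purpose."""
    t_storage = nbytes / storage_bw   # storage -> DRAM hop
    t_pcie = nbytes / pcie_bw         # DRAM -> HBM hop
    return max(t_storage, t_pcie) if overlap else t_storage + t_pcie
```

For example, with a hypothetical 5 GB/s SSD tier and 25 GB/s to the GPU, moving 1 GB takes 0.24 s serialized but 0.20 s fully overlapped. The storage hop dominates either way, which is the point: removing the DRAM bounce helps, but only as much as the slowest link allows.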

9. GPUDirect Storage: Removing the CPU as Middleman

NVIDIA describes GPUDirect Storage as a direct data path between storage and GPU memory that avoids unnecessary bounces through CPU memory. The core motivation is simple: if datasets and checkpoints are large enough, routing everything through the CPU becomes a tax on latency and bandwidth.

Traditional path:        Storage -> CPU DRAM -> GPU
GPUDirect Storage path:  Storage -> GPU
The point of GPUDirect Storage is not magic. It is simply to remove an expensive and often unnecessary middle step.

GPUDirect Storage does not make the storage hierarchy vanish. It makes weak points easier to see. If the SSD tier is slow or the controller path is poor, bypassing the CPU just reveals that more clearly.

10. NVMe-over-Fabrics and the Cluster Version of the Same Story

In clusters, the path may extend beyond the local server. NVMe-over-Fabrics lets a remote NVMe target be accessed over transports such as RDMA or TCP. Conceptually, the data path becomes:

Remote NVMe target -> network fabric -> NIC / transport stack -> host memory or GPU path

This matters in AI because the larger the working set and the larger the training/inference clusters, the more attractive remote and disaggregated storage become. But the same lesson holds: every added controller or transport layer adds new queueing, DMA, error handling, and congestion behavior.

11. Timeouts, Resets, and What Breaks

Real systems are defined by failure paths as much as fast paths. A storage request can time out. A queue can wedge. A device may need a reset or re-enumeration. This is one reason smart controllers and robust drivers matter so much.

Request timeout: A command takes too long; the driver or upper layer decides the controller/device may be unhealthy.
Queue freeze: The driver may stop accepting new I/O while recovery is attempted.
Controller reset: Admin queues, interrupts, and internal state may need to be rebuilt.
Re-scan / recovery: Devices, namespaces, or volumes may need to be rediscovered and handed back upward.
Failure path sketch
-------------------
I/O timeout
  -> mark request as failed / suspect
  -> freeze queues
  -> reset controller or transport
  -> rebuild admin / I/O queues
  -> rediscover devices / namespaces / targets
  -> re-enable I/O
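The same sketch can be expressed as a small recovery loop: timeout, freeze, reset, retry once, then escalate. FlakyDevice is a hypothetical stand-in for illustration, not a real driver interface:

```python
class FlakyDevice:
    """Hypothetical device: times out until reset, then succeeds."""
    def __init__(self):
        self.healthy = False
        self.events = []

    def submit(self, cmd):
        if not self.healthy:
            raise TimeoutError(f"command {cmd!r} timed out")
        return ("ok", cmd)

    def freeze(self):
        self.events.append("freeze")   # stop accepting new I/O

    def reset(self):
        self.events.append("reset")    # rebuild admin/I/O queues
        self.healthy = True


def submit_with_recovery(device, cmd, max_resets=1):
    """Recovery loop from the sketch above: on timeout, freeze the
    queues, reset the controller, retry; after max_resets, escalate
    the failure to the upper layers instead of retrying forever."""
    resets = 0
    while True:
        try:
            return device.submit(cmd)
        except TimeoutError:
            if resets >= max_resets:
                raise                  # hand the failure upward
            device.freeze()
            device.reset()
            resets += 1
```

The bounded retry count is the important design choice: a driver that resets forever hides a dying device from the fleet-management layer that should be draining it.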

This is where enterprise storage-controller quality becomes visible. The fast path gets benchmarks. The reset path determines whether a fleet stays up.

12. Why This Matters in AI Servers

AI servers make storage-controller behavior more visible because accelerators are expensive and hungry. A poorly designed or badly matched storage path can leave GPUs idle. That is not just a software bug. It is a capital-efficiency failure.

Training

Needs sustained ingest and checkpoint behavior. Queueing, DMA, and controller throughput matter.

Inference

Needs reliable hot data access and often retrieval-heavy behavior. Latency and mixed I/O behavior matter.

Cluster storage

Needs topology, manageability, and failure recovery. HBA, RAID, VMD, and NVMe-oF layers become visible.

GPU efficiency

The slower the storage-control plane, the easier it is to waste accelerator dollars on starvation.

The GPU is only as fast as the slowest controller chain feeding it.

That is the unifying takeaway. The media stores the bits. The internal controller makes the media usable. The host controller makes the device manageable. The Linux driver makes it real. And the AI stack only benefits if all of those layers work together efficiently.