1. The Layered Controller Model
The word “controller” means different things depending on where you stand in the stack. There is often a controller inside the drive, a controller or host adapter visible on PCIe or the motherboard, and then a Linux driver that binds to the host-visible device. These are not interchangeable layers.
This is why the kernel paths you care about are things like drivers/ata/ahci.c,
drivers/nvme/host/pci.c, drivers/scsi/smartpqi/, or
drivers/pci/controller/vmd.c, not “controller inside vendor SSD X.”
2. Host Controllers: AHCI, NVMe, SAS HBAs, RAID, and VMD
AHCI / SATA
AHCI standardized SATA host behavior. In Linux the canonical low-level driver is drivers/ata/ahci.c,
layered over libata. AHCI was a major unification step, but it reflects an older storage era and does not express the same queue-native,
PCIe-native parallelism as NVMe.
NVMe
NVMe is not just “faster SSDs.” It is a different controller model: PCIe-native, queue-heavy, and designed for parallel flash.
In Linux the host stack lives under drivers/nvme/host/, with PCI transport in
drivers/nvme/host/pci.c. The NVM Express organization points Linux users directly to the in-kernel driver.
SAS HBAs
SAS HBA controllers matter in enterprise servers and storage arrays built around expanders, dual paths, and large topologies.
Broadcom’s long-running ecosystems around mpt3sas and newer mpi3mr
are central here.
RAID / Smart Array
A real hardware RAID or SmartRAID controller is effectively a storage computer on a card: it has firmware, queueing logic, topology abstraction, configuration metadata, and sometimes cache or cache protection. The operating system may see virtual disks rather than raw drives.
Intel VMD
Intel’s Volume Management Device inserts a platform-level storage control layer around NVMe domains. It is useful for hot-plug, enterprise manageability,
and platform partitioning of NVMe devices. Intel’s VMD white paper notes that the Linux VMD driver has been in-kernel since Linux 4.5, and the current source
lives in drivers/pci/controller/vmd.c.
3. Vendor Landscape and Why the Names Matter
megaraid_sas, mpt3sas, and mpi3mr. These are the Broadcom/LSI driver families for hardware RAID, SAS/SATA HBAs, and the newer tri-mode MPI3 controllers, respectively.
smartpqi. Kernel docs explicitly identify smartpqi as the next-generation SCSI driver for Microchip PQI controllers.
4. Linux Kernel Driver Map
| Storage function | Kernel path / driver | What it controls |
|---|---|---|
| SATA / AHCI host | drivers/ata/ahci.c | AHCI low-level SATA HBA behavior on top of libata |
| ATA/SATA framework | libata | Shared ATA/SATA host infrastructure and helpers |
| NVMe PCI host | drivers/nvme/host/pci.c | PCI transport for the Linux NVMe host stack |
| NVMe core | drivers/nvme/host/ | Core namespace, queue, transport, fabrics, and reset logic |
| MegaRAID | drivers/scsi/megaraid/ | Broadcom/LSI hardware RAID family |
| SAS HBA | drivers/scsi/mpt3sas/ | Broadcom/LSI SAS/SATA HBA family |
| MPI3 controllers | drivers/scsi/mpi3mr/ | Newer-generation tri-mode Broadcom controllers |
| SmartRAID / SmartHBA | drivers/scsi/smartpqi/ | Microchip PQI-based smart storage controllers |
| Intel VMD | drivers/pci/controller/vmd.c | Platform-level NVMe domain/control layer |
5. The Full Path Begins with read()
Now we switch from taxonomy to actual execution. Suppose an AI training or inference pipeline needs bytes from storage. In userspace, that often starts conceptually as a file read—even if the upper layers are a dataloader, framework runtime, or dataset library.
```
Userspace
  -> read(fd, buf, size)
  -> VFS path resolution
  -> page cache lookup
  -> if miss: create block I/O
  -> blk-mq request mapping
  -> storage driver submission
  -> controller + DMA path
```
VFS and page cache
The Linux Virtual Filesystem resolves the file descriptor, inode, and filesystem operations. Then the page cache is checked. If the data is already in memory, the storage device is never touched. If it is not, Linux creates block I/O operations.
Block layer and blk-mq
Modern Linux uses blk-mq, the multi-queue block layer. This matters because it reduces lock contention and maps much more naturally to modern controllers, especially NVMe devices that are themselves queue-rich. The logical shape is often:
bio -> request -> hardware queue -> driver -> device queue(s)
This is already one place where a smart controller matters. If the hardware queue model is shallow, the software can only do so much.
6. NVMe Submission/Completion Queues and Doorbells
NVMe is worth understanding because it is the cleanest modern example of a storage controller interface that exposes parallelism rather than hiding it. The high-level flow is:
```
CPU / driver writes command entry -> Submission Queue (SQ)
CPU performs MMIO write           -> doorbell register
Controller fetches command
Controller executes I/O
Controller writes result          -> Completion Queue (CQ)
MSI-X interrupt or polling path signals completion
Driver reclaims request
```
The doorbell is important conceptually: it is the MMIO signal from host to controller saying “new work is available.” This is one of the places where protocol design directly shapes software efficiency.
7. DMA, IOMMU, and the Role of System DRAM
Once the request is accepted, data movement usually happens through DMA: the controller moves data directly into system memory without the CPU copying every byte itself.
At this stage, the conventional path in an AI server is:
Storage media -> drive controller -> host controller -> PCIe -> DMA -> system DRAM
That DRAM step is not just a passive buffer. It becomes the staging ground for preprocessing, page cache reuse, framework handoff, and eventually the CPU-to-GPU transfer.
8. CPU Memory to GPU HBM
Once the bytes are in system memory, AI frameworks still have to get them into GPU memory. Traditionally that means another DMA-driven step:
DRAM -> PCIe or NVLink path -> GPU memory controller -> HBM
On the host side, runtime libraries such as CUDA manage memory registration, copies, and asynchronous transfer orchestration. On the device side, the GPU memory subsystem ultimately needs the bytes staged into HBM before compute kernels can consume them at full speed.
And that is exactly why GPUDirect Storage exists: if the CPU/DRAM staging path is too expensive, the system tries to remove it.
9. GPUDirect Storage: Removing the CPU as Middleman
NVIDIA describes GPUDirect Storage as a direct data path between storage and GPU memory that avoids unnecessary bounces through CPU memory. The core motivation is simple: if datasets and checkpoints are large enough, routing everything through the CPU becomes a tax on latency and bandwidth.
GPUDirect Storage does not make the storage hierarchy vanish. It makes weak points easier to see. If the SSD tier is slow or the controller path is poor, bypassing the CPU just reveals that more clearly.
10. NVMe-over-Fabrics and the Cluster Version of the Same Story
In clusters, the path may extend beyond the local server. NVMe-over-Fabrics lets a remote NVMe target be accessed over transports such as RDMA or TCP. Conceptually, the data path becomes:
Remote NVMe target -> network fabric -> NIC / transport stack -> host memory or GPU path
This matters in AI because the larger the working set and the larger the training/inference clusters, the more attractive remote and disaggregated storage become. But the same lesson holds: every added controller or transport layer adds new queueing, DMA, error handling, and congestion behavior.
11. Timeouts, Resets, and What Breaks
Real systems are defined by failure paths as much as fast paths. A storage request can timeout. A queue can wedge. A device may need reset or re-enumeration. This is one reason smart controllers and robust drivers matter so much.
```
Failure path sketch
-------------------
I/O timeout
  -> mark request as failed / suspect
  -> freeze queues
  -> reset controller or transport
  -> rebuild admin / I/O queues
  -> rediscover devices / namespaces / targets
  -> re-enable I/O
```
This is where enterprise storage-controller quality becomes visible. The fast path gets benchmarks. The reset path determines whether a fleet stays up.
12. Why This Matters in AI Servers
AI servers make storage-controller behavior more visible because accelerators are expensive and hungry. A poorly designed or badly matched storage path can leave GPUs idle. That is not just a software bug. It is a capital-efficiency failure.
Training
Needs sustained ingest and checkpoint behavior. Queueing, DMA, and controller throughput matter.
Inference
Needs reliable hot data access and often retrieval-heavy behavior. Latency and mixed I/O behavior matter.
Cluster storage
Needs topology, manageability, and failure recovery. HBA, RAID, VMD, and NVMe-oF layers become visible.
GPU efficiency
The slower the storage-control plane, the easier it is to waste accelerator dollars on starvation.
That is the unifying takeaway. The media stores the bits. The internal controller makes the media usable. The host controller makes the device manageable. The Linux driver makes it real. And the AI stack only benefits if all of those layers work together efficiently.
Selected references
- Linux kernel: libata developer guide
- Linux kernel source: AHCI
- Linux kernel: smartpqi documentation
- Linux kernel source: smartpqi
- Intel VMD white paper
- Linux kernel source: Intel VMD
- Linux kernel source: mpt3sas
- Linux kernel source: mpi3mr
- NVM Express: Linux driver information
- Linux kernel source: NVMe PCI host
- NVIDIA GPUDirect Storage
- NVIDIA GPUDirect Storage overview
This article references kernel file paths and public documentation directly and focuses on architecture and flow rather than reproducing large code excerpts.