The invisible bottleneck
When an AI training run underperforms, the instinct is to blame the model architecture, the optimizer, or the GPU count. But the real culprit is often the operating system sitting between your code and the hardware. The Linux kernel — specifically seven of its subsystems — can silently consume 10–40% of your theoretical compute budget. This post walks through each one, from a kernel architect's perspective, explaining exactly what breaks and how to fix it.
This analysis targets dense GPU clusters (8–512 GPUs) running PyTorch / JAX training or high-throughput inference. The kernel version baseline is Linux 6.6+ (EEVDF era). Hardware assumed: AMD EPYC or Intel Xeon, NVIDIA A100/H100 or AMD MI300X, InfiniBand / RoCE fabric.
Memory Management: HMM, SVM & ATS
This is where a kernel architect would start. The root cause of most GPU performance waste is address space fragmentation between the CPU and GPU. Every time PyTorch calls tensor.cuda(), data is physically copied through the PCIe bus using bounce buffers managed by the kernel's DMA layer. On an A100 with 80GB of HBM, filling that memory from the host requires traversing 16 lanes of PCIe Gen4 — roughly 32 GB/s in each direction. For a 70B parameter model in fp16, that's over 4 seconds just for the initial load, before a single FLOP is computed.
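To make that arithmetic explicit, here is a back-of-the-envelope sketch (assuming fp16 weights and the ~32 GB/s effective rate quoted above; real transfers add pinning and staging overhead on top):

```c
#include <stdio.h>

/* Back-of-the-envelope PCIe load time for a dense model.
 * Assumes fp16 (2 bytes/param) and ~32 GB/s of effective
 * host-to-device bandwidth on PCIe Gen4 x16. */
int main(void)
{
    const double params        = 70e9;  /* 70B parameters */
    const double bytes_per_par = 2.0;   /* fp16 */
    const double pcie_bytes_s  = 32e9;  /* effective bandwidth */

    double total_bytes = params * bytes_per_par;     /* 140 GB */
    double seconds     = total_bytes / pcie_bytes_s; /* ~4.4 s */

    printf("weights: %.0f GB, initial load: %.1f s\n",
           total_bytes / 1e9, seconds);
    return 0;
}
```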
The kernel's answer is HMM (Heterogeneous Memory Management), found in
mm/hmm.c. HMM allows a device's MMU to mirror the CPU's page tables, so that the
GPU can take page faults on CPU memory and vice versa — the kernel's migrate_to_ram()
callback handles migration transparently. Combined with three additional primitives, HMM becomes
transformative:
- **SVM (Shared Virtual Memory)** — GPU and CPU share a single virtual address space. A pointer valid on the CPU is valid on the GPU. No explicit copies, no pinning, no staging buffers. The kernel's MMU notifier infrastructure (include/linux/mmu_notifier.h) propagates page invalidations to the GPU driver in real time (see the registration sketch at the end of this section).
- **ATS (Address Translation Services)** — the GPU uses the CPU's IOMMU page tables directly, over PCIe. This eliminates the shadow page table the GPU driver maintains in VRAM. On EPYC systems with AMD GPUs, ATS over PCIe Gen5 can eliminate 80%+ of IOMMU translation overhead for large contiguous mappings.
- **PASID (Process Address Space IDs)** — each GPU compute context gets its own address space tag, allowing multiple processes to share GPU hardware without TLB flushes on context switches. Critical for multi-tenant inference serving (vLLM, TGI).
mm/hmm.c · drivers/iommu/iommu-sva.c · include/linux/hmm.h ·
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c (SVM implementation reference)
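For a feel of the machinery involved, here is a minimal kernel-side sketch of the MMU notifier registration an SVM-capable GPU driver performs. The gpu_mirror struct and gpu_tlb_invalidate_range() hook are hypothetical stand-ins for driver internals, not real amdgpu/nouveau symbols:

```c
/* Kernel-side sketch: registering an MMU notifier so the GPU can
 * mirror CPU page tables (the core of SVM). */
#include <linux/mmu_notifier.h>

struct gpu_mirror {
    struct mmu_notifier mn;
    /* ... GPU page-table state for this process ... */
};

/* Hypothetical driver hook that zaps GPU-side PTEs for a range. */
void gpu_tlb_invalidate_range(struct gpu_mirror *m,
                              unsigned long start, unsigned long end);

static int gpu_invalidate_range_start(struct mmu_notifier *mn,
                                      const struct mmu_notifier_range *range)
{
    struct gpu_mirror *m = container_of(mn, struct gpu_mirror, mn);

    /* Drop GPU mappings for [start, end) before the CPU page tables
     * change; the GPU refaults through HMM and remaps lazily. */
    gpu_tlb_invalidate_range(m, range->start, range->end);
    return 0;
}

static const struct mmu_notifier_ops gpu_mn_ops = {
    .invalidate_range_start = gpu_invalidate_range_start,
};

int gpu_mirror_register(struct gpu_mirror *m, struct mm_struct *mm)
{
    m->mn.ops = &gpu_mn_ops;
    return mmu_notifier_register(&m->mn, mm);
}
```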
DRM GPU Scheduler — drm_sched
The GPU driver's internal scheduler lives in drivers/gpu/drm/scheduler/ and is
one of the most underappreciated bottlenecks in AI workloads. It has three fundamental problems
in the context of LLM training and inference.
Preemption granularity. Most GPU schedulers can only preempt at kernel
(shader dispatch) boundaries — not mid-wavefront. For LLM inference serving, a long prefill
job (processing a 128k-token prompt) holds the GPU for hundreds of milliseconds, blocking all
decode jobs behind it. This destroys P99 latency. The fix requires implementing
mid-wavefront preemption at the hardware level (available on H100 and MI300X) and
surfacing it through drm_sched priority classes.
Fence / timeline ordering. When NCCL launches an AllReduce across 8 GPUs,
the kernel's dma_fence / syncobj mechanism serializes what should be a parallel
DAG of operations. The current timeline fence model is inherently linear: fence B waits on
fence A. Replacing this with proper DAG dependency tracking in the fence layer allows
AllReduce ring steps to overlap with compute, recovering 15–20% of collective communication time.
Gang scheduling. In distributed training, all ranks of a job must progress together. If rank 3's GPU context gets preempted while the other ranks are in the middle of an AllReduce, the barrier stall propagates across the entire job. A drm_sched policy that understands gang membership — scheduling all members of a gang together or not at all — eliminates these stalls entirely.
drivers/gpu/drm/scheduler/sched_main.c · drivers/gpu/drm/scheduler/sched_fence.c ·
include/drm/gpu_scheduler.h — look for drm_gpu_scheduler struct and
drm_sched_entity priority fields.
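Mid-wavefront preemption is hardware-dependent, but the priority plumbing already exists in drm_sched. A minimal kernel-side sketch, assuming the driver already owns a drm_gpu_scheduler for its ring, of splitting latency-critical decode submissions onto their own high-priority entity:

```c
/* Sketch: a dedicated high-priority drm_sched entity for decode
 * jobs, separate from bulk prefill work. 'sched' is assumed to be
 * the driver's existing scheduler for the compute ring. */
#include <drm/gpu_scheduler.h>

int init_decode_entity(struct drm_sched_entity *entity,
                       struct drm_gpu_scheduler *sched)
{
    struct drm_gpu_scheduler *list[] = { sched };

    /* HIGH-priority entities are picked ahead of NORMAL ones at
     * dispatch boundaries; true mid-wavefront preemption still
     * requires hardware support (H100 / MI300X). */
    return drm_sched_entity_init(entity, DRM_SCHED_PRIORITY_HIGH,
                                 list, 1, NULL);
}
```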
IOMMU / DMA — P2PDMA & IOVA Caching
In multi-GPU nodes and GPU clusters — whether linked by NVLink, PCIe switches, or InfiniBand — a recurring bottleneck is PCIe peer-to-peer DMA. When GPU 0 wants to write to GPU 1's memory, the kernel currently routes the transfer through the CPU's memory controller on many platforms, because the IOMMU's IOVA allocator doesn't understand P2P mappings between devices that aren't memory controllers.
The second, less obvious problem is IOVA allocator scalability. The per-CPU IOVA magazine
allocator in drivers/iommu/iova.c was designed for 4K-page workloads (web servers,
databases). AI workloads map 2MB to 1GB regions — hundreds of times larger. At 256 CPU cores
and 8 GPUs, the magazine allocator becomes a spinlock contention point that adds microseconds
of setup overhead to every DMA operation.
- **P2PDMA** (drivers/pci/p2pdma.c) — extend to support multi-hop P2P across PCIe switches without bouncing through host memory. AMD's XGMI fabric and NVIDIA's NVLink already bypass the CPU; P2PDMA needs to express this topology to the kernel's DMA layer (a driver-side sketch follows this list).
- **IOVA cache rework** — replace the magazine allocator with per-NUMA radix trees sized for large-region mappings. Benchmark target: DMA setup overhead for a 512MB region should be under 1µs on a 256-core system (currently 8–40µs with lock contention).
- **VFIO mediated devices** — for bare-metal inference servers, bypassing the kernel DRM layer entirely via VFIO with mediated device assignment reduces CUDA kernel launch latency from ~8µs to ~2µs by eliminating the ioctl dispatch chain.
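To ground the P2PDMA item, here is a hedged kernel-side sketch of how a device driver exposes part of a BAR to the P2PDMA allocator so peers can DMA into it directly. The choice of BAR 2 and the 2MB carve-out are illustrative, not taken from any real GPU driver:

```c
/* Kernel-side sketch: publishing device memory through the P2PDMA
 * allocator (drivers/pci/p2pdma.c) so a peer (e.g. an RDMA NIC)
 * can target it without bouncing through host DRAM.
 * Error handling trimmed for brevity. */
#include <linux/pci.h>
#include <linux/pci-p2pdma.h>
#include <linux/sizes.h>

static void *gpu_expose_p2p(struct pci_dev *pdev, size_t bar_size)
{
    /* Hand BAR 2 (hypothetically, the device's memory aperture)
     * to the kernel's P2PDMA pool. */
    if (pci_p2pdma_add_resource(pdev, 2, bar_size, 0))
        return NULL;

    /* Make the pool discoverable by other drivers. */
    pci_p2pmem_publish(pdev, true);

    /* Carve out a buffer a peer can target directly over PCIe. */
    return pci_alloc_p2pmem(pdev, SZ_2M);
}
```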
CPU Scheduler — EEVDF, isolcpus & SCHED_DEADLINE
PyTorch DataLoader workers, NCCL communication threads, and CUDA stream callbacks all compete for CPU time. The standard CFS scheduler has millisecond-scale wakeup latency — catastrophic when a GPU kernel is sitting idle, waiting for a CPU-side fence signal to launch the next wave.
- **isolcpus= + nohz_full=** — dedicate a set of CPU cores entirely to GPU communication threads. With no OS tick and no competing tasks, wakeup latency drops from 1–3ms to under 200µs. NCCL proxy threads benefit most, since they sit in a tight polling loop waiting for DMA completion events.
- **EEVDF (Earliest Eligible Virtual Deadline First)** — merged in Linux 6.6, EEVDF is strictly better than CFS for mixed I/O + compute workloads. Its latency-nice tuning knob, exposed via sched_attr, allows marking GPU callback threads as latency-sensitive without full real-time priority.
- **SCHED_DEADLINE for NCCL proxy threads** — using the deadline scheduler class with runtime=500µs, period=1ms gives the collective communication thread deterministic CPU time without the system-wide starvation risk of SCHED_FIFO (a userspace sketch follows this list; note that deadline admission does require CAP_SYS_NICE or root).
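The SCHED_DEADLINE recipe translates directly to code. A minimal userspace sketch with the runtime=500µs / period=1ms reservation from the list above — glibc has no wrapper, so the struct and raw syscall follow the pattern documented in sched_setattr(2):

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

/* Manual definition, as in sched_setattr(2): glibc has no wrapper. */
struct sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;   /* ns */
    uint64_t sched_deadline;  /* ns */
    uint64_t sched_period;    /* ns */
};

int main(void)
{
    struct sched_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size           = sizeof(attr);
    attr.sched_policy   = SCHED_DEADLINE;
    attr.sched_runtime  =  500 * 1000;   /* 500us of CPU...    */
    attr.sched_deadline = 1000 * 1000;   /* ...due within 1ms  */
    attr.sched_period   = 1000 * 1000;   /* every 1ms          */

    if (syscall(SYS_sched_setattr, 0, &attr, 0) != 0) {
        perror("sched_setattr");  /* needs CAP_SYS_NICE or root */
        return 1;
    }
    /* ... NCCL-proxy-style polling loop would run here ... */
    return 0;
}
```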
Network Stack — XDP, RDMA & Kernel Bypass
For multi-node training (anything beyond a single machine), the network subsystem becomes the bottleneck. NCCL over TCP is wasteful; even RDMA verbs carry unnecessary kernel overhead. The network path has three layers of optimization available.
XDP (eXpress Data Path) keeps the data path entirely in the NIC driver, bypassing the socket layer completely. Combined with AF_XDP sockets, you get performance within 10–15% of DPDK — but with full kernel integration, no separate process, and compatibility with standard networking tools. Moving NCCL's transport to XDP-based packet processing on Ethernet-based RoCE fabrics eliminates every sk_buff allocation in the hot path.
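The easiest way to see the mechanism is the canonical steering pattern: a tiny BPF program that shunts every packet on a queue straight to an AF_XDP socket. This is a sketch of the standard XSKMAP redirect, not NCCL's actual transport:

```c
/* Minimal XDP program (BPF C) that steers packets to AF_XDP
 * sockets, one per RX queue. Compile with clang -target bpf. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);
    __uint(max_entries, 64);          /* one slot per RX queue */
    __type(key, __u32);
    __type(value, __u32);
} xsks_map SEC(".maps");

SEC("xdp")
int steer_to_afxdp(struct xdp_md *ctx)
{
    /* Redirect to the AF_XDP socket bound to this queue; fall
     * back to the normal stack (XDP_PASS) if none is attached. */
    return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS);
}

char LICENSE[] SEC("license") = "GPL";
```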
io_uring + fixed buffers for RDMA — using io_uring's registered
buffers (pinned and mapped once at setup time) for gradient exchange eliminates the per-operation
get_user_pages() overhead that dominates small-message RDMA latency. For NCCL's
ring-AllReduce, where millions of small ibv_post_send calls are made per training
step, this is measurable.
GRO (Generic Receive Offload) tuning — large receive coalescing is tuned for
web traffic (small, irregular packets). AI gradient traffic is large, bursty, and periodic —
the GRO engine needs gro_flush_timeout tuning and per-flow coalescing hints to
handle the predictable burst pattern efficiently.
VFS / Block Layer — io_uring & Dataset I/O
During training, the data pipeline (reading tokenized sequences, TFRecords, Parquet files) runs on the same machine as the GPU workload. The kernel's block layer and VFS matter more than they appear: a stalled DataLoader can drop GPU utilization from 95% to 60% in seconds.
- **io_uring with fixed files + registered buffers** — eliminates every copy_to_user(), every get_user_pages(), and every context switch in the hot path. A well-tuned io_uring reader with a submission queue of 4096 entries can saturate 4× NVMe drives at 25 GB/s using a single thread (a minimal reader sketch follows this list).
- **mq-deadline vs none I/O scheduler** — for NVMe, the none scheduler (pass-through to multiqueue) with blk-wbt writeback throttling disabled gives the lowest tail latency for random-access reads. mq-deadline makes sense only when mixing checkpoint writes (sequential) with sample reads (random).
- **DAX (Direct Access) for persistent memory** — for CXL-attached DRAM or Optane, DAX bypasses the page cache entirely. The kernel maps PMEM directly into the process VA space, making checkpoint saves to PMEM as fast as memcpy() — transformative for frequent checkpoint intervals in large training runs.
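As a concrete anchor for the io_uring item above, here is a minimal reader using registered buffers, so the kernel pins and maps memory once at setup instead of per read. It links against liburing (-luring); the shard filename is hypothetical, error handling is abbreviated, and a production reader would also register fixed files via io_uring_register_files():

```c
/* Userspace sketch: io_uring dataset reader with a registered
 * (pre-pinned) buffer, avoiding per-op get_user_pages(). */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>

#define BUF_SZ (2 * 1024 * 1024)  /* one 2MB sample block */

int main(void)
{
    struct io_uring ring;
    void *buf;

    if (posix_memalign(&buf, 4096, BUF_SZ))
        return 1;
    struct iovec iov = { .iov_base = buf, .iov_len = BUF_SZ };

    io_uring_queue_init(4096, &ring, 0);        /* deep SQ, as above */
    io_uring_register_buffers(&ring, &iov, 1);  /* pin + map once */

    int fd = open("shard-00000.bin", O_RDONLY); /* hypothetical shard */
    if (fd < 0)
        return 1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read_fixed(sqe, fd, buf, BUF_SZ, 0, 0 /* buf idx */);
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("read %d bytes, no per-op get_user_pages()\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return 0;
}
```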
Deep cut: TLB Shootdown Storms
This is the issue that rarely appears in tuning guides but absolutely destroys performance on large-core systems: TLB shootdown storms. When the GPU driver calls munmap() on a large pinned buffer — say, model weights being swapped between two training jobs — the kernel must send inter-processor interrupts (IPIs) to all other CPUs to invalidate their TLB entries for that address range.
On a 256-core system, munmap() on a 40GB buffer triggers 255 simultaneous TLB invalidation IPIs (one per remote CPU). Every CPU stalls for 2–5µs. Multiply by thousands of mappings per training step.
The fix is architectural and lives in arch/x86/mm/tlb.c. Three mechanisms work
together:
- **PCID (Process Context IDs) with lazy TLB flushing** — instead of flushing the entire TLB on context switch, PCID tags each entry with a process ID. CPUs that don't hold stale entries for a given PCID can skip the IPI entirely. Already partially implemented but not fully exploited by GPU drivers.
- **MMU notifier batching** — mmu_notifier callbacks in the GPU driver currently trigger an immediate IPI broadcast on every page invalidation. Batching invalidations at the GPU side — collecting them over a 50–100µs window and sending a single broadcast — reduces IPI frequency by 100–1000× without visible impact on correctness.
- **Huge page pinning** — pinning model weights as 2MB huge pages instead of 4K pages reduces TLB entry count by 512× for the same buffer. Fewer entries means fewer invalidations and a smaller shootdown blast radius. Achievable via madvise(ptr, size, MADV_HUGEPAGE) before the first GPU mapping (a userspace sketch follows this list).
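The huge-page item is a two-line change in userspace. A minimal sketch, reusing the 40GB weights buffer from the storm example above:

```c
/* Sketch: back a weight buffer with 2MB transparent huge pages
 * before the first GPU mapping, shrinking TLB entry count ~512x. */
#include <sys/mman.h>
#include <stdio.h>

#define WEIGHTS_BYTES (40UL << 30)   /* the 40GB buffer from above */

int main(void)
{
    void *w = mmap(NULL, WEIGHTS_BYTES, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (w == MAP_FAILED)
        return 1;

    /* Ask khugepaged to collapse this range into 2MB pages.
     * Must happen before the buffer is pinned for DMA -- pinned
     * 4K pages cannot be collapsed later. */
    if (madvise(w, WEIGHTS_BYTES, MADV_HUGEPAGE) != 0)
        perror("madvise");

    /* ... register with the GPU driver / pin for DMA here ... */
    return 0;
}
```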
Summary & prioritization
If you can only change one thing: complete HMM + ATS + PASID support across
the IOMMU layer and your GPU driver. Everything else is incremental. After that, the
drm_sched gang scheduling policy and IOVA allocator scalability are the next
highest-leverage changes.
| Subsystem | Key change | Source location | Impact |
|---|---|---|---|
| HMM / SVM / ATS | Unified VA space, zero-copy GPU↔CPU | mm/hmm.c · drivers/iommu/ | ★★ Highest |
| drm_sched | Gang scheduling, DAG fences, preemption | drivers/gpu/drm/scheduler/ | ★★ Highest |
| IOMMU / P2PDMA | Multi-hop P2P, IOVA large-region allocator | drivers/pci/p2pdma.c · drivers/iommu/iova.c | ★★ High |
| CPU scheduler | isolcpus, EEVDF latency-nice, SCHED_DEADLINE | kernel/sched/fair.c · kernel/sched/deadline.c | ★ Significant |
| Network / RDMA | XDP transport, io_uring fixed buffers, GRO tuning | net/core/dev.c · drivers/infiniband/ | ★ Significant |
| VFS / Block | io_uring fixed files, NVMe-mq none, DAX | io_uring/ · block/blk-mq.c | ★ Important |
| TLB / MMU notifier | PCID lazy flush, IPI batching, huge page pinning | arch/x86/mm/tlb.c · mm/mmu_notifier.c | ★ Important |
The Linux kernel is not a neutral bystander in AI/GPU performance — it actively participates in every tensor operation, every gradient exchange, every checkpoint write. Understanding these seven subsystems is the difference between a cluster running at 40% MFU and one running at 70%.