Kernel Architecture Deep Dive

Tuning the Linux Kernel for AI & GPU Servers

A kernel engineer's guide to the subsystems that silently throttle your training clusters — and the low-level changes that can unlock real performance.

⬡ AI Infrastructure ⬡ Linux Kernel Internals ⬡ GPU Systems ⬡ Deep Dive
00

The invisible bottleneck

When an AI training run underperforms, the instinct is to blame the model architecture, the optimizer, or the GPU count. But the real culprit is often the operating system sitting between your code and the hardware. The Linux kernel — specifically six core subsystems, plus one often-overlooked deep cut — can silently consume 10–40% of your theoretical compute budget. This post walks through each one, from a kernel architect's perspective, explaining exactly what breaks and what to fix.

Scope

This analysis targets dense GPU clusters (8–512 GPUs) running PyTorch / JAX training or high-throughput inference. The kernel version baseline is Linux 6.6+ (EEVDF era). Hardware assumed: AMD EPYC or Intel Xeon, NVIDIA A100/H100 or AMD MI300X, InfiniBand / RoCE fabric.

[Diagram: the kernel stack between the hardware (PCIe · NVLink · DRAM · NUMA · NICs · NVMe) and the userspace frameworks (CUDA · ROCm · PyTorch · JAX · NCCL · UCX · libfabric · RDMA verbs · vLLM · Triton · TensorRT-LLM · Megatron-LM). Layers shown, each rated from "important" to "highest impact": IOMMU / DMA (P2PDMA, IOVA cache, ATS, PASID); memory subsystem (huge pages, NUMA, madvise, compaction); IRQ handling (MSI-X, IRQ affinity, NAPI, threaded IRQs); PCIe / CXL (MPS, ASPM, BAR, CXL memory tiering); network / RDMA (RoCE, InfiniBand, XDP, AF_XDP, io_uring); CPU scheduler (EEVDF, isolcpus, SCHED_DEADLINE); DRM / GPU driver (GEM/TTM memory, fence/timeline, drm_sched, preemption, compute contexts, gang scheduling); VFS / block / io_uring (fixed buffers, NVMe-mq, mq-deadline, DAX, huge page backing, blk-wbt off). Memory management with HMM / SVM / ATS / PASID (unified GPU-CPU virtual address space, zero-copy host↔device, eliminating PCIe bounce buffers) is marked highest impact.]
Compute lost to kernel overhead (worst case): 40%
Kernel subsystems with significant tuning leverage: 6
Single biggest fix: HMM + ATS + PASID unified memory
01

Memory Management: HMM, SVM & ATS

★★ Highest Impact

This is where a kernel architect would start. The root cause of most GPU performance waste is address space fragmentation between the CPU and GPU. Every time PyTorch calls tensor.cuda(), data is physically copied across the PCIe bus using bounce buffers managed by the kernel's DMA layer. On an A100 with 80GB of HBM, filling that memory from the host means traversing 16 lanes of PCIe Gen4 — roughly 32 GB/s in each direction. For a 70B-parameter model in FP16 that is about 140 GB of weights, and 140 GB ÷ 32 GB/s ≈ 4.4 s: over four seconds just for the initial load, before a single FLOP is computed.

The bounce buffer is not an implementation detail. It is a fundamental architectural flaw that HMM was designed to eliminate.

The kernel's answer is HMM (Heterogeneous Memory Management), found in mm/hmm.c. HMM lets a device's MMU mirror the CPU's page tables, so the GPU can fault on CPU memory and, through the migrate_to_ram() callback on device-private pages, the CPU can fault on memory that currently lives in VRAM; migration happens transparently in both directions. Combined with three additional primitives (SVM, ATS, and PASID), HMM becomes transformative:

[Diagram: the two data paths compared. Without HMM (bounce-buffer path): CPU DRAM → DMA-coherent bounce buffer → PCIe DMA → GPU VRAM; memcpy, pin, DMA, copy = three passes over the data. With HMM + ATS (zero-copy path): the GPU VA equals the CPU VA; the GPU faults, the IOMMU translates through the shared page tables, and the GPU reads CPU DRAM directly over PCIe. Zero copies.]
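To make the mirroring concrete, here is a minimal, hedged sketch of the loop a GPU driver runs with the in-tree HMM API: hmm_range_fault() fills an array of PFNs for a CPU VA range under an mmu_interval_notifier, and the driver then programs its own MMU with the same translation. gpu_mmu_map_pfns() is a hypothetical driver hook standing in for the hardware-specific part; everything else is the API from include/linux/hmm.h.

```c
/*
 * Hedged sketch: mirroring a range of CPU page tables into a GPU MMU
 * with HMM.  gpu_mmu_map_pfns() is a hypothetical driver hook; the
 * rest is the in-tree API from include/linux/hmm.h.
 */
#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/slab.h>

int gpu_mmu_map_pfns(unsigned long va, unsigned long *pfns, unsigned long npages); /* hypothetical */

static int mirror_range_into_gpu(struct mmu_interval_notifier *notifier,
				 unsigned long start, unsigned long npages)
{
	unsigned long *pfns;
	struct hmm_range range = {
		.notifier      = notifier,
		.start         = start,
		.end           = start + (npages << PAGE_SHIFT),
		.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
	};
	int ret;

	pfns = kvcalloc(npages, sizeof(*pfns), GFP_KERNEL);
	if (!pfns)
		return -ENOMEM;
	range.hmm_pfns = pfns;

again:
	range.notifier_seq = mmu_interval_read_begin(notifier);
	mmap_read_lock(notifier->mm);
	ret = hmm_range_fault(&range);	/* faults CPU pages in, fills pfns[] */
	mmap_read_unlock(notifier->mm);
	if (ret == -EBUSY)
		goto again;
	if (ret)
		goto out;

	/* CPU page tables changed while we were faulting: start over. */
	if (mmu_interval_read_retry(notifier, range.notifier_seq))
		goto again;

	/* Program the GPU MMU with the identical VA -> PFN mapping. */
	ret = gpu_mmu_map_pfns(start, pfns, npages);
out:
	kvfree(pfns);
	return ret;
}
```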
Key files to study

mm/hmm.c · drivers/iommu/iommu-sva.c · include/linux/hmm.h · drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c (SVM implementation reference)

02

DRM GPU Scheduler — drm_sched

★★ High Impact

The GPU driver's internal scheduler lives in drivers/gpu/drm/scheduler/ and is one of the most underappreciated bottlenecks in AI workloads. It has three fundamental problems in the context of LLM training and inference.

Preemption granularity. Most GPU schedulers can only preempt at kernel (shader dispatch) boundaries — not mid-wavefront. For LLM inference serving, a long prefill job (processing a 128k-token prompt) holds the GPU for hundreds of milliseconds, blocking all decode jobs behind it. This destroys P99 latency. The fix requires implementing mid-wavefront preemption at the hardware level (available on H100 and MI300X) and surfacing it through drm_sched priority classes.
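As a concrete illustration of the priority-class half of that fix, here is a hedged sketch of giving latency-critical decode submissions their own high-priority drm_sched entity, so the scheduler picks them ahead of long prefill work queued at normal priority. drm_sched_entity_init() and DRM_SCHED_PRIORITY_HIGH are the in-tree API from include/drm/gpu_scheduler.h; where the scheduler pointer comes from is driver-specific and assumed here.

```c
/*
 * Hedged sketch: a dedicated high-priority entity for inference decode
 * jobs.  API per include/drm/gpu_scheduler.h; the caller supplies the
 * driver's drm_gpu_scheduler instance.
 */
#include <drm/gpu_scheduler.h>
#include <linux/kernel.h>

static int init_decode_entity(struct drm_gpu_scheduler *sched,
			      struct drm_sched_entity *entity)
{
	struct drm_gpu_scheduler *sched_list[] = { sched };

	/* Higher-priority entities are selected first each time the
	 * scheduler pushes the next job to the hardware ring. */
	return drm_sched_entity_init(entity, DRM_SCHED_PRIORITY_HIGH,
				     sched_list, ARRAY_SIZE(sched_list), NULL);
}
```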

Fence / timeline ordering. When NCCL launches an AllReduce across 8 GPUs, the kernel's dma_fence / syncobj mechanism serializes what should be a parallel DAG of operations. The current timeline fence model is inherently linear: fence B waits on fence A. Replacing this with proper DAG dependency tracking in the fence layer allows AllReduce ring steps to overlap with compute, recovering 15–20% of collective communication time.

Gang scheduling. In distributed training, all ranks of a job must progress together. If rank 3's GPU context gets preempted while the other ranks are in the middle of an AllReduce, the barrier stall propagates across the entire job. A drm_sched policy that understands gang membership — scheduling all members of a gang together or not at all — eliminates these stalls entirely.

[Diagram: an 8-GPU training gang at an AllReduce barrier. Without gang scheduling, GPU 3 is preempted mid-kernel and the other ranks all stall at the barrier waiting for it. With gang scheduling, all 8 GPU contexts are scheduled atomically, no rank is preempted mid-gang, and all ranks reach the barrier in sync.]
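Gang scheduling is not in the tree today, so the following is purely a hypothetical sketch of the policy rather than kernel API: struct gang and both helper functions are invented for illustration. The only point is the admission rule, dispatch a gang's jobs only when every member is runnable.

```c
/*
 * Hypothetical sketch (not in-tree): a gang-aware job-pick policy for
 * drm_sched.  struct gang, job_to_gang() and peek_head_job() are
 * invented for illustration; only the admission rule matters.
 */
#include <drm/gpu_scheduler.h>
#include <linux/atomic.h>

struct gang {
	atomic_t ready_members;	/* members whose next job is runnable */
	int	 num_members;	/* e.g. 8 for an 8-GPU data-parallel job */
};

struct gang *job_to_gang(struct drm_sched_job *job);			/* hypothetical */
struct drm_sched_job *peek_head_job(struct drm_sched_entity *entity);	/* hypothetical */

static struct drm_sched_job *pick_next_job(struct drm_sched_entity *entity)
{
	struct drm_sched_job *job = peek_head_job(entity);
	struct gang *gang = job ? job_to_gang(job) : NULL;

	/* Either every rank in the gang dispatches, or none do: a rank
	 * preempted mid-gang would stall the others at the next barrier. */
	if (gang && atomic_read(&gang->ready_members) != gang->num_members)
		return NULL;

	return job;
}
```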
Key files

drivers/gpu/drm/scheduler/sched_main.c · drivers/gpu/drm/scheduler/sched_fence.c · include/drm/gpu_scheduler.h — look for drm_gpu_scheduler struct and drm_sched_entity priority fields.

03

IOMMU / DMA — P2PDMA & IOVA Caching

★★ High Impact

In clusters that pair GPUs with NVLink or InfiniBand fabrics, the next bottleneck is PCIe peer-to-peer DMA. When GPU 0 wants to write to GPU 1's memory, many platforms still route the transfer up through the root complex and the CPU's memory controller — because the kernel's DMA/IOVA layer does not understand peer-to-peer mappings between endpoint devices.
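For the P2P path specifically, the kernel already exposes a helper API in drivers/pci/p2pdma.c. A hedged sketch of the usual pattern: ask whether the platform can route peer-to-peer TLPs between the two devices, and only then carve a buffer out of the provider's P2P-capable BAR (which the provider must have published earlier with pci_p2pdma_add_resource()). Which device plays provider and which plays client is an assumption of the example.

```c
/*
 * Hedged sketch: probing and allocating a peer-to-peer DMA buffer with
 * the in-tree pci_p2pdma API.  The provider (e.g. a GPU) must already
 * have published a BAR via pci_p2pdma_add_resource().
 */
#include <linux/pci.h>
#include <linux/pci-p2pdma.h>

static void *try_p2p_buffer(struct pci_dev *provider, struct device *client,
			    size_t size)
{
	struct device *clients[] = { client };

	/* Negative distance: the root complex cannot route P2P TLPs
	 * between these two devices, so fall back to bounce buffers. */
	if (pci_p2pdma_distance_many(provider, clients, 1, true) < 0)
		return NULL;

	/* Carve the buffer out of the provider's P2P-capable BAR. */
	return pci_alloc_p2pmem(provider, size);
}
```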

The second, less obvious problem is IOVA allocator scalability. The per-CPU IOVA magazine allocator in drivers/iommu/iova.c was designed for 4K-page workloads (web servers, databases). AI workloads map 2MB to 1GB regions — hundreds of times larger. At 256 CPU cores and 8 GPUs, the magazine allocator becomes a spinlock contention point that adds microseconds of setup overhead to every DMA operation.

04

CPU Scheduler — EEVDF, isolcpus & SCHED_DEADLINE

★ Significant

PyTorch DataLoader workers, NCCL communication threads, and CUDA stream callbacks all compete for CPU time. The stock fair scheduler (CFS, replaced by EEVDF in 6.6) can add millisecond-scale wakeup latency on a loaded machine — catastrophic when a GPU is sitting idle, waiting for a CPU-side fence signal to launch the next wave.

[Diagram: timeline of GPU idle time caused by CPU scheduling jitter. With the fence-signalling thread on a shared CFS core, the signal between kernel A and kernel B arrives 1–3 ms late and the GPU stream sits idle; with the NCCL proxy thread on a dedicated isolated core, the gap is under 200 µs. isolcpus + nohz_full reduces the GPU idle gap by roughly 15× on fence-heavy workloads.]
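A minimal userspace sketch of the isolation half of this: pin the fence-signalling thread (an NCCL-proxy-style thread) to a core reserved with isolcpus=/nohz_full= and lift it out of the fair class. The core number and priority are illustrative assumptions. SCHED_DEADLINE via the sched_setattr() syscall is the stricter option the section names, but its admission control expects the task's affinity to span a root domain, which usually means extra cpuset setup beyond this sketch, so SCHED_FIFO is shown instead.

```c
/*
 * Hedged sketch (userspace): pin the fence-signalling thread to an
 * isolated core and move it to the real-time class.  Core id and
 * priority are illustrative assumptions; requires CAP_SYS_NICE.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <string.h>

static int move_to_isolated_core(int isolated_cpu)
{
	cpu_set_t set;
	struct sched_param sp;

	/* A core carved out of the fair scheduler with isolcpus=/nohz_full=. */
	CPU_ZERO(&set);
	CPU_SET(isolated_cpu, &set);
	if (sched_setaffinity(0, sizeof(set), &set))
		return -1;

	/* Above the threaded-IRQ default priority of 50. */
	memset(&sp, 0, sizeof(sp));
	sp.sched_priority = 80;
	return sched_setscheduler(0, SCHED_FIFO, &sp);
}
```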
05

Network Stack — XDP, RDMA & Kernel Bypass

★ Significant

For multi-node training (anything beyond a single machine), the network subsystem becomes the bottleneck. NCCL over TCP is wasteful; even RDMA verbs carry kernel overhead on the memory-registration and setup path. The network path has three layers of optimization available.

XDP (eXpress Data Path) keeps the data path entirely in the NIC driver, bypassing the socket layer completely. Combined with AF_XDP sockets, you get performance within 10–15% of DPDK — but with full kernel integration, no separate process, and compatibility with standard networking tools. Moving NCCL's transport to XDP-based packet processing on RoCE / Ethernet fabrics eliminates every sk_buff allocation in the hot path.
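A hedged skeleton of the XDP end of that idea, not a working transport: the smallest XDP program that could sit at the driver hook. A real NCCL-over-AF_XDP path would redirect fabric traffic into an XSKMAP with bpf_redirect_map(); this version just passes every frame to the stack unchanged.

```c
/*
 * Hedged skeleton: a pass-through XDP program at the driver hook.
 * A real AF_XDP transport would bpf_redirect_map() into an XSKMAP
 * here; this one changes nothing.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_ai_fabric(struct xdp_md *ctx)
{
	return XDP_PASS;	/* placeholder policy: hand the frame to the stack */
}

char _license[] SEC("license") = "GPL";
```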

io_uring + fixed buffers for RDMA — using io_uring's registered buffers (pinned and mapped once at setup time) for gradient exchange eliminates the per-operation get_user_pages() overhead that dominates small-message RDMA latency. For NCCL's ring-AllReduce, where millions of small ibv_post_send calls are made per training step, this is measurable.
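A hedged userspace sketch of the fixed-buffer idea using liburing (buffer size, queue depth, and fd are illustrative; error handling is trimmed): the staging buffer is pinned and mapped exactly once at io_uring_register_buffers() time, and every later *_fixed operation references it by index instead of re-running get_user_pages().

```c
/*
 * Hedged sketch (liburing): register a staging buffer once, then issue
 * fixed-buffer reads against it by index.  Sizes and fd are
 * illustrative; error handling trimmed for brevity.
 */
#include <liburing.h>
#include <stdlib.h>

static int read_into_registered_buf(int fd, size_t len)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct iovec iov;

	io_uring_queue_init(256, &ring, 0);

	/* Pin + map exactly once, at setup time. */
	iov.iov_base = aligned_alloc(4096, len);
	iov.iov_len  = len;
	io_uring_register_buffers(&ring, &iov, 1);

	/* Hot path: no per-operation get_user_pages(), buffer index 0. */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read_fixed(sqe, fd, iov.iov_base, len, 0 /* off */, 0 /* buf idx */);
	io_uring_submit(&ring);

	io_uring_wait_cqe(&ring, &cqe);
	io_uring_cqe_seen(&ring, cqe);

	io_uring_queue_exit(&ring);
	return 0;
}
```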

GRO (Generic Receive Offload) tuning — large receive coalescing is tuned for web traffic (small, irregular packets). AI gradient traffic is large, bursty, and periodic — the GRO engine needs gro_flush_timeout tuning and per-flow coalescing hints to handle the predictable burst pattern efficiently.
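For the GRO side, the two relevant knobs are standard per-netdev sysfs attributes; a hedged sketch of setting them follows. The interface name eth0 and the 50 µs / 2-poll values are assumptions, not recommendations.

```c
/*
 * Hedged sketch: hold GRO flushes briefly so a whole gradient burst
 * coalesces before delivery.  Interface name and values are
 * illustrative assumptions.
 */
#include <stdio.h>

static int write_sysfs(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	/* Flush GRO at most 50us after the last packet of a burst... */
	write_sysfs("/sys/class/net/eth0/gro_flush_timeout", "50000");
	/* ...and let NAPI poll twice more before re-arming hard IRQs. */
	write_sysfs("/sys/class/net/eth0/napi_defer_hard_irqs", "2");
	return 0;
}
```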

06

VFS / Block Layer — io_uring & Dataset I/O

★ Important

During training, the data pipeline (reading tokenized sequences, TFRecords, Parquet files) runs on the same machine as the GPU workload. The kernel's block layer and VFS matter more than they appear: a stalled DataLoader can drop GPU utilization from 95% to 60% in seconds.
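One of the block-layer levers for this subsystem (listed in the summary table below) is io_uring with registered files plus O_DIRECT reads. A hedged sketch: the shard descriptors are registered once, and the DataLoader's hot loop then issues reads by registered index with IOSQE_FIXED_FILE, never touching the fdtable. The file path, sizes, and queue depth are illustrative; error handling is trimmed.

```c
/*
 * Hedged sketch (liburing): dataset reads with O_DIRECT and a
 * registered file.  Path, sizes and queue depth are illustrative;
 * error handling trimmed for brevity.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdlib.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	void *buf = aligned_alloc(4096, 1 << 20);	/* one 1 MiB record */
	int fds[1];

	io_uring_queue_init(64, &ring, 0);

	/* O_DIRECT keeps large sequential shard reads out of the page cache. */
	fds[0] = open("/data/shard-00000.bin", O_RDONLY | O_DIRECT);
	io_uring_register_files(&ring, fds, 1);

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, 0 /* registered index, not an fd */, buf, 1 << 20, 0);
	io_uring_sqe_set_flags(sqe, IOSQE_FIXED_FILE);
	io_uring_submit(&ring);

	io_uring_wait_cqe(&ring, &cqe);
	io_uring_cqe_seen(&ring, cqe);
	io_uring_queue_exit(&ring);
	return 0;
}
```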

07

Deep cut: TLB Shootdown Storms

★ Important (often overlooked)

This issue rarely appears in tuning guides, yet it absolutely destroys performance on high-core-count systems: TLB shootdown storms. When the GPU runtime calls munmap() on a large pinned buffer — say, model weights being swapped between two training jobs — the kernel must send inter-processor interrupts (IPIs) to all other CPUs to invalidate their TLB entries for that address range.

On a 256-core EPYC system, a single munmap() on a 40GB buffer triggers 255 TLB invalidation IPIs simultaneously. Every CPU stalls for 2–5µs. Multiply by thousands of mappings per training step.

The fix is architectural and lives in arch/x86/mm/tlb.c and mm/mmu_notifier.c. Three mechanisms work together: PCID-assisted lazy flushing, so CPUs that have not recently run the process skip the IPI entirely; batched invalidation IPIs, so one broadcast covers a whole set of mappings instead of one per unmap; and huge-page-backed pinning, so the same 40GB buffer is covered by roughly 20 thousand 2MB TLB entries instead of roughly 10 million 4K entries. A sketch of the last mechanism follows.
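A hedged userspace sketch of the huge-page-backed pinning piece: MAP_HUGETLB assumes hugetlbfs pages were reserved via vm.nr_hugepages, and the fallback path relies on transparent huge pages. The buffer size and fallback policy are illustrative assumptions.

```c
/*
 * Hedged sketch (userspace): back a large pinned staging buffer with
 * huge pages so each map/unmap touches ~20k 2MB TLB entries instead
 * of ~10M 4K entries.  MAP_HUGETLB assumes vm.nr_hugepages was
 * provisioned; otherwise fall back to THP via MADV_HUGEPAGE.
 */
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

static void *alloc_pinned_hugepage_buffer(size_t bytes)
{
	void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED) {
		p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return NULL;
		madvise(p, bytes, MADV_HUGEPAGE);	/* let khugepaged collapse it */
	}
	mlock(p, bytes);	/* keep it resident for DMA pinning */
	return p;
}
```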

08

Summary & prioritization

If you can only change one thing: complete HMM + ATS + PASID support across the IOMMU layer and your GPU driver. Everything else is incremental. After that, the drm_sched gang scheduling policy and IOVA allocator scalability are the next highest-leverage changes.

Subsystem | Key change | Source location | Impact
HMM / SVM / ATS | Unified VA space, zero-copy GPU↔CPU | mm/hmm.c · drivers/iommu/ | ★★ Highest
drm_sched | Gang scheduling, DAG fences, preemption | drivers/gpu/drm/scheduler/ | ★★ Highest
IOMMU / P2PDMA | Multi-hop P2P, IOVA large-region allocator | drivers/pci/p2pdma.c · drivers/iommu/iova.c | ★★ High
CPU scheduler | isolcpus, EEVDF latency-nice, SCHED_DEADLINE | kernel/sched/fair.c · kernel/sched/deadline.c | ★ Significant
Network / RDMA | XDP transport, io_uring fixed buffers, GRO tuning | net/core/dev.c · drivers/infiniband/ | ★ Significant
VFS / Block | io_uring fixed files, NVMe-mq none, DAX | io_uring/ · block/blk-mq.c | ★ Important
TLB / MMU notifier | PCID lazy flush, IPI batching, huge page pinning | arch/x86/mm/tlb.c · mm/mmu_notifier.c | ★ Important
Bottom line

The Linux kernel is not passively neutral in AI/GPU performance — it actively participates in every tensor operation, every gradient exchange, every checkpoint write. Understanding these seven subsystems is the difference between a cluster running at 40% MFU and one running at 70%.