The invisible bottleneck
When an AI training run underperforms, the instinct is to blame the model architecture, the optimizer, or the GPU count. But the real culprit is often the operating system sitting between your code and the hardware. The Linux kernel — specifically seven of its subsystems — can silently consume 10–40% of your theoretical compute budget. This post walks through each one, from a kernel architect's perspective, explaining exactly what breaks and how to fix it.
This analysis targets dense GPU clusters (8–512 GPUs) running PyTorch / JAX training or high-throughput inference. The kernel version baseline is Linux 6.6+ (EEVDF era). Hardware assumed: AMD EPYC or Intel Xeon, NVIDIA A100/H100 or AMD MI300X, InfiniBand / RoCE fabric.
Memory Management: HMM, SVM & ATS
This is where a kernel architect would start. The root cause of most GPU performance waste is address space fragmentation between the CPU and GPU. Every time PyTorch calls tensor.cuda(), data is physically copied through the PCIe bus using bounce buffers managed by the kernel's DMA layer. On an A100 with 80GB of HBM, filling that memory from the host requires traversing 16 lanes of PCIe Gen4 — roughly 32 GB/s in each direction. For a 70B parameter model in fp16, that's over 4 seconds just for the initial load, before a single FLOP is computed.
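To make that arithmetic explicit, here is a back-of-the-envelope sketch (assuming fp16 weights and the ~32 GB/s effective rate quoted above; real transfers add pinning and staging overhead on top):

```c
#include <stdio.h>

/* Back-of-the-envelope PCIe load time for a dense model.
 * Assumes fp16 (2 bytes/param) and ~32 GB/s of effective
 * host-to-device bandwidth on PCIe Gen4 x16. */
int main(void)
{
    const double params        = 70e9;  /* 70B parameters */
    const double bytes_per_par = 2.0;   /* fp16 */
    const double pcie_bytes_s  = 32e9;  /* effective bandwidth */

    double total_bytes = params * bytes_per_par;     /* 140 GB */
    double seconds     = total_bytes / pcie_bytes_s; /* ~4.4 s */

    printf("weights: %.0f GB, initial load: %.1f s\n",
           total_bytes / 1e9, seconds);
    return 0;
}
```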
The kernel's answer is HMM (Heterogeneous Memory Management), found in
mm/hmm.c. HMM allows a device's MMU to mirror the CPU's page tables, so that the
GPU can take page faults on CPU memory and vice versa — the kernel's migrate_to_ram()
callback handles migration transparently. Combined with three additional primitives, HMM becomes
transformative:
- **SVM (Shared Virtual Memory)** — GPU and CPU share a single virtual address space. A pointer valid on the CPU is valid on the GPU. No explicit copies, no pinning, no staging buffers. The kernel's MMU notifier infrastructure (include/linux/mmu_notifier.h) propagates page invalidations to the GPU driver in real time (see the registration sketch at the end of this section).
- **ATS (Address Translation Services)** — the GPU uses the CPU's IOMMU page tables directly, over PCIe. This eliminates the shadow page table the GPU driver maintains in VRAM. On EPYC systems with AMD GPUs, ATS over PCIe Gen5 can eliminate 80%+ of IOMMU translation overhead for large contiguous mappings.
- **PASID (Process Address Space IDs)** — each GPU compute context gets its own address space tag, allowing multiple processes to share GPU hardware without TLB flushes on context switches. Critical for multi-tenant inference serving (vLLM, TGI).
mm/hmm.c · drivers/iommu/iommu-sva.c · include/linux/hmm.h ·
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c (SVM implementation reference)
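For a feel of the machinery involved, here is a minimal kernel-side sketch of the MMU notifier registration an SVM-capable GPU driver performs. The gpu_mirror struct and gpu_tlb_invalidate_range() hook are hypothetical stand-ins for driver internals, not real amdgpu/nouveau symbols:

```c
/* Kernel-side sketch: registering an MMU notifier so the GPU can
 * mirror CPU page tables (the core of SVM). */
#include <linux/mmu_notifier.h>

struct gpu_mirror {
    struct mmu_notifier mn;
    /* ... GPU page-table state for this process ... */
};

/* Hypothetical driver hook that zaps GPU-side PTEs for a range. */
void gpu_tlb_invalidate_range(struct gpu_mirror *m,
                              unsigned long start, unsigned long end);

static int gpu_invalidate_range_start(struct mmu_notifier *mn,
                                      const struct mmu_notifier_range *range)
{
    struct gpu_mirror *m = container_of(mn, struct gpu_mirror, mn);

    /* Drop GPU mappings for [start, end) before the CPU page tables
     * change; the GPU refaults through HMM and remaps lazily. */
    gpu_tlb_invalidate_range(m, range->start, range->end);
    return 0;
}

static const struct mmu_notifier_ops gpu_mn_ops = {
    .invalidate_range_start = gpu_invalidate_range_start,
};

int gpu_mirror_register(struct gpu_mirror *m, struct mm_struct *mm)
{
    m->mn.ops = &gpu_mn_ops;
    return mmu_notifier_register(&m->mn, mm);
}
```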
DRM GPU Scheduler — drm_sched
The GPU driver's internal scheduler lives in drivers/gpu/drm/scheduler/ and is
one of the most underappreciated bottlenecks in AI workloads. It has three fundamental problems
in the context of LLM training and inference.
Preemption granularity. Most GPU schedulers can only preempt at kernel
(shader dispatch) boundaries — not mid-wavefront. For LLM inference serving, a long prefill
job (processing a 128k-token prompt) holds the GPU for hundreds of milliseconds, blocking all
decode jobs behind it. This destroys P99 latency. The fix requires implementing
mid-wavefront preemption at the hardware level (available on H100 and MI300X) and
surfacing it through drm_sched priority classes.
Fence / timeline ordering. When NCCL launches an AllReduce across 8 GPUs,
the kernel's dma_fence / syncobj mechanism serializes what should be a parallel
DAG of operations. The current timeline fence model is inherently linear: fence B waits on
fence A. Replacing this with proper DAG dependency tracking in the fence layer allows
AllReduce ring steps to overlap with compute, recovering 15–20% of collective communication time.
Gang scheduling. In distributed training, all ranks of a job must progress together. If rank 3's GPU context gets preempted while the other ranks are in the middle of an AllReduce, the barrier stall propagates across the entire job. A drm_sched policy that understands gang membership — scheduling all members of a gang together or not at all — eliminates these stalls entirely.
drivers/gpu/drm/scheduler/sched_main.c · drivers/gpu/drm/scheduler/sched_fence.c ·
include/drm/gpu_scheduler.h — look for drm_gpu_scheduler struct and
drm_sched_entity priority fields.
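Mid-wavefront preemption is hardware-dependent, but the priority plumbing already exists in drm_sched. A minimal kernel-side sketch, assuming the driver already owns a drm_gpu_scheduler for its ring, of splitting latency-critical decode submissions onto their own high-priority entity:

```c
/* Sketch: a dedicated high-priority drm_sched entity for decode
 * jobs, separate from bulk prefill work. 'sched' is assumed to be
 * the driver's existing scheduler for the compute ring. */
#include <drm/gpu_scheduler.h>

int init_decode_entity(struct drm_sched_entity *entity,
                       struct drm_gpu_scheduler *sched)
{
    struct drm_gpu_scheduler *list[] = { sched };

    /* HIGH-priority entities are picked ahead of NORMAL ones at
     * dispatch boundaries; true mid-wavefront preemption still
     * requires hardware support (H100 / MI300X). */
    return drm_sched_entity_init(entity, DRM_SCHED_PRIORITY_HIGH,
                                 list, 1, NULL);
}
```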
IOMMU / DMA — P2PDMA & IOVA Caching
In multi-GPU nodes and GPU clusters — whether linked by NVLink, PCIe switches, or InfiniBand — a recurring bottleneck is PCIe peer-to-peer DMA. When GPU 0 wants to write to GPU 1's memory, the kernel currently routes the transfer through the CPU's memory controller on many platforms, because the IOMMU's IOVA allocator doesn't understand P2P mappings between devices that aren't memory controllers.
The second, less obvious problem is IOVA allocator scalability. The per-CPU IOVA magazine
allocator in drivers/iommu/iova.c was designed for 4K-page workloads (web servers,
databases). AI workloads map 2MB to 1GB regions — hundreds of times larger. At 256 CPU cores
and 8 GPUs, the magazine allocator becomes a spinlock contention point that adds microseconds
of setup overhead to every DMA operation.
- **P2PDMA** (drivers/pci/p2pdma.c) — extend to support multi-hop P2P across PCIe switches without bouncing through host memory. AMD's XGMI fabric and NVIDIA's NVLink already bypass the CPU; P2PDMA needs to express this topology to the kernel's DMA layer (a driver-side sketch follows this list).
- **IOVA cache rework** — replace the magazine allocator with per-NUMA radix trees sized for large-region mappings. Benchmark target: DMA setup overhead for a 512MB region should be under 1µs on a 256-core system (currently 8–40µs with lock contention).
- **VFIO mediated devices** — for bare-metal inference servers, bypassing the kernel DRM layer entirely via VFIO with mediated device assignment reduces CUDA kernel launch latency from ~8µs to ~2µs by eliminating the ioctl dispatch chain.
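To ground the P2PDMA item, here is a hedged kernel-side sketch of how a device driver exposes part of a BAR to the P2PDMA allocator so peers can DMA into it directly. The choice of BAR 2 and the 2MB carve-out are illustrative, not taken from any real GPU driver:

```c
/* Kernel-side sketch: publishing device memory through the P2PDMA
 * allocator (drivers/pci/p2pdma.c) so a peer (e.g. an RDMA NIC)
 * can target it without bouncing through host DRAM.
 * Error handling trimmed for brevity. */
#include <linux/pci.h>
#include <linux/pci-p2pdma.h>
#include <linux/sizes.h>

static void *gpu_expose_p2p(struct pci_dev *pdev, size_t bar_size)
{
    /* Hand BAR 2 (hypothetically, the device's memory aperture)
     * to the kernel's P2PDMA pool. */
    if (pci_p2pdma_add_resource(pdev, 2, bar_size, 0))
        return NULL;

    /* Make the pool discoverable by other drivers. */
    pci_p2pmem_publish(pdev, true);

    /* Carve out a buffer a peer can target directly over PCIe. */
    return pci_alloc_p2pmem(pdev, SZ_2M);
}
```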
CPU Scheduler — EEVDF, isolcpus & SCHED_DEADLINE
PyTorch DataLoader workers, NCCL communication threads, and CUDA stream callbacks all compete for CPU time. The standard CFS scheduler has millisecond-scale wakeup latency — catastrophic when a GPU kernel is sitting idle, waiting for a CPU-side fence signal to launch the next wave.
- **isolcpus= + nohz_full=** — dedicate a set of CPU cores entirely to GPU communication threads. With no OS tick and no competing tasks, wakeup latency drops from 1–3ms to under 200µs. NCCL proxy threads benefit most, since they sit in a tight polling loop waiting for DMA completion events.
- **EEVDF (Earliest Eligible Virtual Deadline First)** — merged in Linux 6.6, EEVDF is strictly better than CFS for mixed I/O + compute workloads. Its latency-nice tuning knob, exposed via sched_attr, allows marking GPU callback threads as latency-sensitive without full real-time priority.
- **SCHED_DEADLINE for NCCL proxy threads** — using the deadline scheduler class with runtime=500µs, period=1ms gives the collective communication thread deterministic CPU time without the system-wide starvation risk of SCHED_FIFO (a userspace sketch follows this list; note that deadline admission does require CAP_SYS_NICE or root).
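The SCHED_DEADLINE recipe translates directly to code. A minimal userspace sketch with the runtime=500µs / period=1ms reservation from the list above — glibc has no wrapper, so the struct and raw syscall follow the pattern documented in sched_setattr(2):

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

/* Manual definition, as in sched_setattr(2): glibc has no wrapper. */
struct sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;   /* ns */
    uint64_t sched_deadline;  /* ns */
    uint64_t sched_period;    /* ns */
};

int main(void)
{
    struct sched_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size           = sizeof(attr);
    attr.sched_policy   = SCHED_DEADLINE;
    attr.sched_runtime  =  500 * 1000;   /* 500us of CPU...    */
    attr.sched_deadline = 1000 * 1000;   /* ...due within 1ms  */
    attr.sched_period   = 1000 * 1000;   /* every 1ms          */

    if (syscall(SYS_sched_setattr, 0, &attr, 0) != 0) {
        perror("sched_setattr");  /* needs CAP_SYS_NICE or root */
        return 1;
    }
    /* ... NCCL-proxy-style polling loop would run here ... */
    return 0;
}
```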
Network Stack — XDP, RDMA & Kernel Bypass
For multi-node training (anything beyond a single machine), the network subsystem becomes the bottleneck. NCCL over TCP is wasteful; even RDMA verbs carry unnecessary kernel overhead. The network path has three layers of optimization available.
XDP (eXpress Data Path) keeps the data path entirely in the NIC driver, bypassing the socket layer completely. Combined with AF_XDP sockets, you get performance within 10–15% of DPDK — but with full kernel integration, no separate process, and compatibility with standard networking tools. Moving NCCL's transport to XDP-based packet processing on Ethernet-based RoCE fabrics eliminates every sk_buff allocation in the hot path.
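The easiest way to see the mechanism is the canonical steering pattern: a tiny BPF program that shunts every packet on a queue straight to an AF_XDP socket. This is a sketch of the standard XSKMAP redirect, not NCCL's actual transport:

```c
/* Minimal XDP program (BPF C) that steers packets to AF_XDP
 * sockets, one per RX queue. Compile with clang -target bpf. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);
    __uint(max_entries, 64);          /* one slot per RX queue */
    __type(key, __u32);
    __type(value, __u32);
} xsks_map SEC(".maps");

SEC("xdp")
int steer_to_afxdp(struct xdp_md *ctx)
{
    /* Redirect to the AF_XDP socket bound to this queue; fall
     * back to the normal stack (XDP_PASS) if none is attached. */
    return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS);
}

char LICENSE[] SEC("license") = "GPL";
```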
io_uring + fixed buffers for RDMA — using io_uring's registered
buffers (pinned and mapped once at setup time) for gradient exchange eliminates the per-operation
get_user_pages() overhead that dominates small-message RDMA latency. For NCCL's
ring-AllReduce, where millions of small ibv_post_send calls are made per training
step, this is measurable.
GRO (Generic Receive Offload) tuning — large receive coalescing is tuned for
web traffic (small, irregular packets). AI gradient traffic is large, bursty, and periodic —
the GRO engine needs gro_flush_timeout tuning and per-flow coalescing hints to
handle the predictable burst pattern efficiently.
VFS / Block Layer — io_uring & Dataset I/O
During training, the data pipeline (reading tokenized sequences, TFRecords, Parquet files) runs on the same machine as the GPU workload. The kernel's block layer and VFS matter more than they appear: a stalled DataLoader can drop GPU utilization from 95% to 60% in seconds.
- **io_uring with fixed files + registered buffers** — eliminates every copy_to_user(), every get_user_pages(), and every context switch in the hot path. A well-tuned io_uring reader with a submission queue of 4096 entries can saturate 4× NVMe drives at 25 GB/s using a single thread (a minimal reader sketch follows this list).
- **mq-deadline vs none I/O scheduler** — for NVMe, the none scheduler (pass-through to multiqueue) with blk-wbt writeback throttling disabled gives the lowest tail latency for random-access reads. mq-deadline makes sense only when mixing checkpoint writes (sequential) with sample reads (random).
- **DAX (Direct Access) for persistent memory** — for CXL-attached DRAM or Optane, DAX bypasses the page cache entirely. The kernel maps PMEM directly into the process VA space, making checkpoint saves to PMEM as fast as memcpy() — transformative for frequent checkpoint intervals in large training runs.
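As a concrete anchor for the io_uring item above, here is a minimal reader using registered buffers, so the kernel pins and maps memory once at setup instead of per read. It links against liburing (-luring); the shard filename is hypothetical, error handling is abbreviated, and a production reader would also register fixed files via io_uring_register_files():

```c
/* Userspace sketch: io_uring dataset reader with a registered
 * (pre-pinned) buffer, avoiding per-op get_user_pages(). */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>

#define BUF_SZ (2 * 1024 * 1024)  /* one 2MB sample block */

int main(void)
{
    struct io_uring ring;
    void *buf;

    if (posix_memalign(&buf, 4096, BUF_SZ))
        return 1;
    struct iovec iov = { .iov_base = buf, .iov_len = BUF_SZ };

    io_uring_queue_init(4096, &ring, 0);        /* deep SQ, as above */
    io_uring_register_buffers(&ring, &iov, 1);  /* pin + map once */

    int fd = open("shard-00000.bin", O_RDONLY); /* hypothetical shard */
    if (fd < 0)
        return 1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read_fixed(sqe, fd, buf, BUF_SZ, 0, 0 /* buf idx */);
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("read %d bytes, no per-op get_user_pages()\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return 0;
}
```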
Deep cut: TLB Shootdown Storms
This is the issue that rarely appears in tuning guides but absolutely destroys performance on large-core systems: TLB shootdown storms. When the GPU driver calls munmap() on a large pinned buffer — say, model weights being swapped between two training jobs — the kernel must send inter-processor interrupts (IPIs) to all other CPUs to invalidate their TLB entries for that address range.
On a 256-core system, munmap() on a 40GB buffer triggers 255 simultaneous TLB invalidation IPIs (one per remote CPU). Every CPU stalls for 2–5µs. Multiply by thousands of mappings per training step.
The fix is architectural and lives in arch/x86/mm/tlb.c. Three mechanisms work
together:
- **PCID (Process Context IDs) with lazy TLB flushing** — instead of flushing the entire TLB on context switch, PCID tags each entry with a process ID. CPUs that don't hold stale entries for a given PCID can skip the IPI entirely. Already partially implemented but not fully exploited by GPU drivers.
- **MMU notifier batching** — mmu_notifier callbacks in the GPU driver currently trigger an immediate IPI broadcast on every page invalidation. Batching invalidations at the GPU side — collecting them over a 50–100µs window and sending a single broadcast — reduces IPI frequency by 100–1000× without visible impact on correctness.
- **Huge page pinning** — pinning model weights as 2MB huge pages instead of 4K pages reduces TLB entry count by 512× for the same buffer. Fewer entries means fewer invalidations and a smaller shootdown blast radius. Achievable via madvise(ptr, size, MADV_HUGEPAGE) before the first GPU mapping (a userspace sketch follows this list).
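The huge-page item is a two-line change in userspace. A minimal sketch, reusing the 40GB weights buffer from the storm example above:

```c
/* Sketch: back a weight buffer with 2MB transparent huge pages
 * before the first GPU mapping, shrinking TLB entry count ~512x. */
#include <sys/mman.h>
#include <stdio.h>

#define WEIGHTS_BYTES (40UL << 30)   /* the 40GB buffer from above */

int main(void)
{
    void *w = mmap(NULL, WEIGHTS_BYTES, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (w == MAP_FAILED)
        return 1;

    /* Ask khugepaged to collapse this range into 2MB pages.
     * Must happen before the buffer is pinned for DMA -- pinned
     * 4K pages cannot be collapsed later. */
    if (madvise(w, WEIGHTS_BYTES, MADV_HUGEPAGE) != 0)
        perror("madvise");

    /* ... register with the GPU driver / pin for DMA here ... */
    return 0;
}
```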
Summary & prioritization
If you can only change one thing: complete HMM + ATS + PASID support across
the IOMMU layer and your GPU driver. Everything else is incremental. After that, the
drm_sched gang scheduling policy and IOVA allocator scalability are the next
highest-leverage changes.
| Subsystem | Key change | Source location | Impact |
|---|---|---|---|
| HMM / SVM / ATS | Unified VA space, zero-copy GPU↔CPU | mm/hmm.c · drivers/iommu/ | ★★ Highest |
| drm_sched | Gang scheduling, DAG fences, preemption | drivers/gpu/drm/scheduler/ | ★★ Highest |
| IOMMU / P2PDMA | Multi-hop P2P, IOVA large-region allocator | drivers/pci/p2pdma.c · drivers/iommu/iova.c | ★★ High |
| CPU scheduler | isolcpus, EEVDF latency-nice, SCHED_DEADLINE | kernel/sched/fair.c · kernel/sched/deadline.c | ★ Significant |
| Network / RDMA | XDP transport, io_uring fixed buffers, GRO tuning | net/core/dev.c · drivers/infiniband/ | ★ Significant |
| VFS / Block | io_uring fixed files, NVMe-mq none, DAX | io_uring/ · block/blk-mq.c | ★ Important |
| TLB / MMU notifier | PCID lazy flush, IPI batching, huge page pinning | arch/x86/mm/tlb.c · mm/mmu_notifier.c | ★ Important |
The Linux kernel is not a neutral bystander in AI/GPU performance — it actively participates in every tensor operation, every gradient exchange, every checkpoint write. Understanding these seven subsystems is the difference between a cluster running at 40% MFU and one running at 70%.