A systems engineer's guide to understanding why the kernel is the most critical — and most overlooked — layer in modern AI infrastructure, and how to write targeted patches to expose its hidden failure modes.
When an AI training run stalls, the first places engineers look are the model architecture, the optimizer configuration, or the networking stack. Rarely do they look at the operating system kernel. Yet every GPU kernel launch, every tensor transferred across an NVLink fabric, every checkpoint flushed to NVMe, and every NCCL all-reduce across an InfiniBand network — all of it transits the Linux kernel at some point.
The kernel does not merely pass data through. It makes active decisions about memory placement, interrupt routing, CPU affinity, DMA mapping, IOMMU translation, and process scheduling. These decisions can silently shave 40% off effective GPU utilization or produce intermittent gradient corruption under specific error conditions that go unreported for hours.
A 1ms scheduling jitter in a single GPU worker thread, multiplied across 1,024 GPUs performing a synchronized all-reduce, produces a collective stall that looks exactly like a slow network. It isn't. The network is fine. The kernel scheduler is the culprit — and without kernel-level instrumentation, you will spend days chasing the wrong thing.
The data path for a single forward pass on a distributed model touches an extraordinary number of kernel subsystems in sequence: the DRM GPU scheduler, DMA fence signaling, the IOMMU and PCIe DMA path, the RDMA stack that carries gradient exchange, the CPU scheduler that keeps driver and framework threads running, and the page cache, io_uring, and block layers that feed data loading and checkpointing.
Each stage has its own queue depth, its own error reporting path, and its own latency budget. When any one degrades silently — no crash, no log entry, just slower throughput — the only path to diagnosis is instrumentation. That instrumentation begins with kernel tracepoints, and tracepoints are written by people who contribute to the kernel. That is why this work matters at a fundamental level.
Let's walk through each subsystem in detail, explain the specific failure modes, and describe the patches that would expose them.
The Direct Rendering Manager subsystem is where GPU commands are born. Every tensor operation, matrix multiplication, and attention kernel begins life as a job submitted to the DRM scheduler — drm_sched. This scheduler manages per-ring, per-priority run queues of GPU jobs, feeds them to the GPU hardware, and handles completion signaling. Understanding this path is the foundation for any GPU observability work.
Each job is submitted from userspace through an ioctl, placed into the scheduler's priority-ordered ring queue, and dispatched when the GPU signals readiness via a DMA fence. The critical measurement is the gap between job enqueue and actual GPU dispatch — the submission latency. Currently, this gap is invisible without explicit tracepoints.
When a GPU is saturated, the DRM scheduler does not immediately signal userspace. The drm_sched_job_arm() call can block waiting for an available scheduler slot. The training framework, far above this machinery, sees only that its CUDA or ROCm stream hasn't completed yet. There is no counter, no log line, and no histogram of how long jobs waited in the scheduler's ring buffer. This is the gap observability patches close.
The current tracepoints in drm_sched cover job submission and completion, but they do not record the timestamp at which a job was enqueued versus when it was actually dispatched. A patch adding a submission-timestamp field to the job structure (call it submit_ts, recorded when the job is armed) and emitting it in a new tracepoint would allow exact measurement of scheduler wait time under load — a metric that no existing tool can provide.
Memory pressure causes TTM (Translation Table Manager) buffer object eviction — ttm_bo_evict — in the middle of active training. The GPU stalls while tensors are paged back in from system memory. This event fires no userspace signal. It appears only as a dropped GPU utilization spike with no associated CPU event in standard profilers. Without a tracepoint in the eviction path that includes the affected job context, it is invisible.
A kernel tracepoint is a lightweight, statically-compiled hook that fires at zero overhead when not enabled, and emits structured events to the tracefs ring buffer when activated via echo 1 > /sys/kernel/debug/tracing/events/drm_sched/enable. Here is what a scheduler wait-time tracepoint looks like:
/* In include/trace/events/drm_sched.h */
TRACE_EVENT(drm_sched_job_dispatch,
    TP_PROTO(struct drm_sched_job *job),
    TP_ARGS(job),

    TP_STRUCT__entry(
        __field(u64, id)
        __field(ktime_t, submit_ts)
        __field(ktime_t, dispatch_ts)
        __string(sched_name, job->sched->name)
    ),

    TP_fast_assign(
        __entry->id = job->id;
        __entry->submit_ts = job->submit_ts;   /* set at arm() */
        __entry->dispatch_ts = ktime_get();
        __assign_str(sched_name, job->sched->name);
    ),

    TP_printk("sched=%s job=%llu wait_us=%lld",
        __get_str(sched_name),
        __entry->id,
        ktime_us_delta(__entry->dispatch_ts, __entry->submit_ts))
);
With this tracepoint, a bpftrace one-liner can build a real-time latency histogram with no reboot required:
$ bpftrace -e '
tracepoint:drm_sched:drm_sched_job_dispatch {
    /* both timestamps are ktime_t nanoseconds; convert the delta to µs */
    @wait_us = hist((args->dispatch_ts - args->submit_ts) / 1000);
}'
That histogram is the missing telemetry that distinguishes "GPU is compute-bound" from "GPU is scheduler-bound" — two situations that look identical from userspace.
Jobs in the DRM scheduler can declare dependencies on other jobs via DMA fences — in-kernel synchronization primitives. A job will not be dispatched until all its fence dependencies have signaled. In complex training graphs with many concurrent streams, long dependency chains can form, causing later jobs to wait for earlier ones even when the GPU is nominally idle. Tracing the depth and wait time of these chains requires instrumentation at dma_fence_wait() call sites in the DRM layer — another high-value patch that doesn't yet exist in mainline.
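Until such instrumentation lands, the exported fence-wait entry point can be probed directly. A minimal bpftrace sketch, assuming the dma_fence_wait_timeout symbol is available for kprobes on your kernel, that histograms how long each thread blocks waiting on fences:

$ bpftrace -e '
kprobe:dma_fence_wait_timeout { @start[tid] = nsecs; }
kretprobe:dma_fence_wait_timeout /@start[tid]/ {
    /* time this thread spent blocked waiting for a fence to signal */
    @fence_wait_us[comm] = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'

This captures wait duration but not dependency-chain depth; depth still requires the in-kernel instrumentation described above.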
Distributed training is, at its core, a synchronization problem. Every GPU in a cluster must exchange gradients at the end of each backward pass. For a 70B parameter model trained on 512 GPUs, this all-reduce operation moves roughly 140GB of data across the network in every step. This happens via RDMA — Remote Direct Memory Access — implemented through InfiniBand hardware and the Linux kernel's ib_core stack. NCCL, the primary collective communications library, runs directly on top of this kernel infrastructure.
RDMA errors come in two flavors: fatal (the QP transitions to an error state and training crashes) and recoverable (completions with non-success work completion status codes, transparently retried by NCCL). The recoverable ones are often more dangerous in practice, because they inflate the tail latency of all-reduce operations without producing any signal that makes it into training logs.
A single QP experiencing frequent IB_WC_RETRY_EXC_ERR completions can delay an entire 1,024-GPU all-reduce because the collective barrier is synchronous — every rank must complete before any rank can proceed. The slowest QP in the cluster sets the floor for the entire training throughput.
A single intermittently-flapping InfiniBand port on one switch causes IB_WC_REM_INV_REQ_ERR completions on 64 QPs across 8 nodes. NCCL retries internally and eventually succeeds. The all-reduce for affected steps takes 3× longer than normal. The training log shows nothing unusual. GPU utilization shows a periodic dip every ~90 steps. Without a tracepoint on non-success CQ completions — one that captures the QP GID, error type, and associated training process — this failure mode takes days to localize to a specific switch port.
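A sketch of what such a tracepoint could look like, following the same pattern as the DRM example above. The event name, placement, and the dev_name argument are assumptions rather than existing kernel API, and it records the QP number instead of the full GID for brevity; the struct ib_wc fields it reads (status, qp) do exist:

/* Hypothetical sketch: a tracepoint on non-success work completions */
TRACE_EVENT(ib_wc_error,
    TP_PROTO(const struct ib_wc *wc, const char *dev_name),
    TP_ARGS(wc, dev_name),

    TP_STRUCT__entry(
        __string(dev, dev_name)
        __field(u32, qp_num)
        __field(int, status)
        __array(char, comm, TASK_COMM_LEN)
    ),

    TP_fast_assign(
        __assign_str(dev, dev_name);
        __entry->qp_num = wc->qp ? wc->qp->qp_num : 0;
        __entry->status = wc->status;
        memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
    ),

    TP_printk("dev=%s qp=%u wc_status=%d comm=%s",
        __get_str(dev), __entry->qp_num, __entry->status, __entry->comm)
);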
A Completion Queue has a fixed depth, set at creation time by the application. If the application doesn't poll the CQ fast enough — for example, because a GPU worker thread was descheduled at a critical moment, or because a cgroup CPU quota was hit — the CQ fills to capacity. When new completions arrive for a full CQ, InfiniBand hardware handles this by transitioning the associated QP to an error state and dropping subsequent completions silently.
This is not a crash. The operation simply never completes from the application's perspective. The QP must be destroyed and re-established, which in NCCL triggers a full communicator teardown and restart — an extremely expensive operation that appears as a multi-minute training pause.
Currently, there is no mainline tracepoint for CQ overflow events. Adding one to drivers/infiniband/core/cq.c would be immediately valuable to every team running GPU clusters on InfiniBand fabric. The patch would be modest in size and scope:
/* drivers/infiniband/core/cq.c — proposed tracepoint */
static void ib_cq_notify_overflow(struct ib_cq *cq)
{
    /*
     * CQ has reached capacity. The QP will transition to error
     * state. Emit a tracepoint before this happens so operators
     * can correlate with application-level timeouts.
     */
    trace_ib_cq_overflow(
        cq->device->name,   /* which RDMA device */
        cq->cqe,            /* configured CQ depth */
        cq->cqe_used,       /* entries at overflow */
        current->comm       /* process name */
    );
}
Beyond error events, there is a need for normal-path latency measurement at the QP level. The ib_post_send call places a work request on the send queue, and completion arrives later via ib_poll_cq. The round-trip time between these two calls — the RDMA operation latency — is the metric NCCL tuning ultimately depends on. Currently, measuring this requires userspace instrumentation or hardware-level InfiniBand diagnostics tools that are not always available. A tracepoint pair at ib_post_send and ib_poll_cq that share a work request identifier would allow exact per-operation latency measurement from bpftrace without any application-level changes.
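If that tracepoint pair existed, the per-operation latency join would be a few lines of bpftrace. The event and field names below (ib_verbs:ib_post_send, ib_verbs:ib_poll_cq, wr_id) are hypothetical, mirroring the proposal rather than any current kernel API:

$ bpftrace -e '
tracepoint:ib_verbs:ib_post_send { @posted[args->wr_id] = nsecs; }
tracepoint:ib_verbs:ib_poll_cq /@posted[args->wr_id]/ {
    /* per-work-request RDMA round-trip time in microseconds */
    @rdma_op_us = hist((nsecs - @posted[args->wr_id]) / 1000);
    delete(@posted[args->wr_id]);
}'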
Modern GPU servers are NUMA (Non-Uniform Memory Access) machines. A typical 8-GPU server has 2 CPU sockets, 2 NUMA nodes, and each group of 4 GPUs is physically closer — lower latency PCIe path — to one NUMA node than the other. The kernel's memory allocator and NUMA balancer make continuous decisions about where tensors, driver buffers, and kernel data structures live in physical memory. These decisions have direct, measurable effects on H2D/D2H bandwidth and GPU driver latency.
GPU drivers — particularly amdgpu, xe, and the open NVIDIA kernel modules — aggressively request 2MB transparent huge pages (THP) for GPU memory mappings. Huge pages reduce TLB pressure during DMA and meaningfully improve H2D/D2H throughput by reducing the number of IOMMU page table entries that must be walked per DMA operation.
The problem is that huge page allocation can fail silently when physical memory is fragmented. When a huge page allocation fails, the driver falls back to composing the mapping from base 4KB pages. This fallback is not logged anywhere. There is no counter in /proc/meminfo that tracks driver-specific huge page allocation failures separately from general allocation failures. GPU memory bandwidth for subsequent operations may drop 15–30% due to increased TLB miss rates, and the engineer watching GPU utilization sees only a slight, unexplained throughput decrease.
Memory compaction — the kernel mechanism that defragments physical memory to satisfy huge page requests — runs synchronously at allocation time and can hold the zone lock for tens of milliseconds. If this happens while a GPU driver thread is trying to map memory for an in-flight DMA operation, the DMA stalls waiting for the lock. The mm_compaction_begin and mm_compaction_end tracepoints already exist, but they carry no caller context. Adding the requesting process name and allocation flags to these tracepoints is a minimal, high-value patch.
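Even without that patch, the existing pair can already be used to measure how long compaction runs and which task paid for it, since direct compaction executes in the context of the allocating process. A minimal sketch:

$ bpftrace -e '
tracepoint:compaction:mm_compaction_begin { @t0[tid] = nsecs; }
tracepoint:compaction:mm_compaction_end /@t0[tid]/ {
    /* wall time of one compaction pass, keyed by the task that triggered it */
    @compaction_ms[comm] = hist((nsecs - @t0[tid]) / 1000000);
    delete(@t0[tid]);
}'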
The most dangerous interaction in this subsystem involves the NUMA balancer's page migration and active DMA operations. When the balancer decides to migrate a page that the GPU driver has registered for DMA, the IOMMU mapping pointing to that page's physical address becomes stale the moment migration completes. Modern kernels handle this through mmu_notifiers — callbacks that drivers register to be informed of page table changes before they happen.
The sequence is: balancer selects page for migration → calls mmu_notifier_invalidate_range_start() → GPU driver receives callback → driver must quiesce any in-flight DMA to that page → migration proceeds. If the driver's notifier handler is slow, if it races with an interrupt, or if there is any window between notification delivery and DMA quiescence, a GPU read or write can target the old physical address after the page has moved. The result is silent data corruption — wrong gradient values with no error signal from either hardware or software.
Adding a tracepoint to mm/migrate.c that fires when page migration is attempted on a page with active mmu_notifiers — and records the time spent blocking in mmu_notifier_invalidate_range_start() — would expose the frequency and duration of this race in production environments.
/* mm/migrate.c — proposed mmu_notifier wait tracepoint */
static int migrate_page_move_mapping(struct address_space *mapping,
                                     struct page *newpage,
                                     struct page *page, ...)
{
    ktime_t t0 = ktime_get();

    mmu_notifier_invalidate_range_start(&range);
    trace_mm_migrate_mmu_notifier_wait(
        page_to_pfn(page),
        ktime_us_delta(ktime_get(), t0)
    );
    /* ... rest of migration ... */
}
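With that tracepoint in place, surfacing the notifier wait time follows the same bpftrace pattern as earlier. The event name and the wait_us field are assumptions carried over from the sketch above (they correspond to the second argument of the proposed trace call):

$ bpftrace -e '
tracepoint:migrate:mm_migrate_mmu_notifier_wait {
    /* how long migration blocked waiting for the GPU driver to quiesce DMA */
    @notifier_wait_us = hist(args->wait_us);
}'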
Every H2D (host-to-device) and D2H (device-to-host) memory transfer traverses the PCIe bus. For large language model training, this includes loading mini-batches from CPU pinned memory, moving optimizer state during gradient accumulation, and transferring activations during pipeline-parallel training across nodes. The IOMMU sits between the GPU and system memory, translating GPU virtual addresses to physical addresses for every DMA transaction. Neither PCIe errors nor IOMMU performance are adequately exposed by current mainline tracepoints.
A PCIe link generating correctable errors is not failing today — it is demonstrating that it will fail. Correctable errors (typically bad TLPs caught by the link-level CRC, or receiver errors) are automatically retried by PCIe hardware and produce no data corruption. They do, however, add latency to every affected DMA operation, because the PCIe transaction must be retransmitted.
A GPU slot generating 10,000 correctable errors per hour will appear to work perfectly in all application-level metrics while silently throttling H2D bandwidth by 5–15%. In a 1,000-GPU cluster, statistically several GPUs will be in exactly this state at any given time, degrading overall training throughput in a way that is impossible to attribute without hardware-level error counters correlated with DMA performance metrics.
The pci_aer_stats counters exist in sysfs, but they are polled — not pushed as events. The closest event-driven signal is the ras:aer_event tracepoint, which fires on reported AER errors (when CONFIG_RAS is enabled) with the device name, severity, and status bits. What is still missing is correlation: there is no automatic link between those error events and DMA latency spikes, no alert capability without a polling or tracing daemon, and no low-overhead way to attribute a slow transfer to the correctable errors that delayed it. Enriching the reporting path in drivers/pci/pcie/aer.c, carrying device BDF (Bus:Device.Function), error type, and an identifier usable for joining against DMA-side latency events, would be an immediately impactful addition to mainline.
PCIe correctable error rates follow a characteristic pattern before a GPU or link failure: rates gradually increase over days or weeks before a non-correctable error causes an outage. With kernel tracepoints, that escalation is automatically visible in any tracing system. Without them, operators only find out when the training job crashes.
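Where CONFIG_RAS is enabled, that escalation can already be watched with the existing ras:aer_event tracepoint; the exact field names are visible under /sys/kernel/tracing/events/ras/aer_event/format. A minimal sketch that counts reported errors per device and severity every minute:

$ bpftrace -e '
tracepoint:ras:aer_event {
    /* severity distinguishes correctable from non-fatal/fatal uncorrectable */
    @aer[str(args->dev_name), args->severity] = count();
}
interval:s:60 { print(@aer); clear(@aer); }'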
The IOMMU translates GPU virtual addresses to physical addresses using a multi-level page table, with a hardware TLB cache to speed up repeated translations. When GPU memory mappings are created and destroyed rapidly — as happens during dynamic computation graph execution in frameworks like PyTorch eager mode — the IOMMU TLB is frequently invalidated. Each invalidation is a serializing operation that briefly pauses all DMA from the affected device.
The iommu_iotlb_gather mechanism batches TLB invalidations to amortize this cost, but the batching parameters — batch size limits, flush triggering thresholds — are opaque. There are currently no tracepoints that record batch sizes, the number of TLB entries flushed per batch, or the duration of individual flush operations. Adding this instrumentation to drivers/iommu/iommu.c would allow workload-specific tuning of IOMMU parameters and direct attribution of DMA throughput losses to IOMMU contention.
/* drivers/iommu/iommu.c — proposed iotlb gather tracepoint */
void iommu_iotlb_sync(struct iommu_domain *domain,
                      struct iommu_iotlb_gather *iotlb_gather)
{
    ktime_t t0 = ktime_get();

    if (domain->ops->iotlb_sync)
        domain->ops->iotlb_sync(domain, iotlb_gather);

    trace_iommu_iotlb_sync(
        domain->type,
        iotlb_gather->pgsize,   /* page granularity flushed */
        iotlb_gather->start,    /* VA range start */
        iotlb_gather->end,      /* VA range end */
        ktime_us_delta(ktime_get(), t0)
    );
}
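Until batch-level visibility exists, the existing iommu:map and iommu:unmap tracepoints at least expose how often mappings churn, which is a rough proxy for invalidation pressure. A sketch:

$ bpftrace -e '
tracepoint:iommu:unmap {
    /* distribution of unmapped range sizes, plus unmap rate per second */
    @unmap_bytes = hist(args->unmapped_size);
    @unmaps = count();
}
interval:s:1 { print(@unmaps); clear(@unmaps); }'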
GPU driver code runs in a mix of kernel threads and userspace threads that the Linux CPU scheduler manages identically to any other workload. The problem is that these threads have very specific latency requirements: they must respond to GPU interrupts, poll completion queues, and submit new work within tight timing budgets to keep the GPU pipeline full. The Completely Fair Scheduler has no knowledge of these constraints, and its generic decisions — load balancing, NUMA placement, preemption — can devastate GPU utilization in ways that are extremely difficult to diagnose.
When the CFS load balancer decides to migrate a GPU driver kernel thread from NUMA node 0 to NUMA node 1, that thread now executes on CPUs physically farther from the GPU it is servicing. GPU MSI interrupts are typically pinned to a specific CPU affinity set at driver initialization — they do not follow the thread. The result is that interrupt delivery and interrupt processing happen on different NUMA nodes, adding inter-socket latency (typically 50–150ns on modern Intel and AMD platforms) to every single GPU command completion event.
At the scale of a busy GPU processing hundreds of thousands of small operations per second, this latency accumulation is significant. The sched_migrate_task tracepoint fires when migration occurs, but it carries no information about whether the migrated task is a GPU driver thread. A patch enriching this tracepoint with the task's CPU affinity mask and whether it is an interrupt-driven thread — or a complementary drm_sched_entity_migrate tracepoint in the DRM layer — would allow direct correlation of scheduler migrations with GPU performance regressions.
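Even without that enrichment, the existing tracepoint shows how often latency-sensitive threads are moved and between which CPUs; mapping CPU numbers back to NUMA nodes is a post-processing step using /sys/devices/system/node/. A sketch:

$ bpftrace -e '
tracepoint:sched:sched_migrate_task {
    /* which task is being moved, and from/to which CPU */
    @migrations[str(args->comm), args->orig_cpu, args->dest_cpu] = count();
}'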
In Kubernetes-managed GPU clusters, GPU training pods run inside cgroups with CPU quota limits set by the resource scheduler. These limits are reasonable from a cluster utilization perspective, but they interact catastrophically with distributed training synchronization. If an NCCL communication thread hits its CPU quota mid-all-reduce, it is suspended until the next quota period begins (typically 100ms later).
Every other GPU in the synchronized collective is now waiting at the all-reduce barrier for that throttled thread to resume. A 100ms CPU quota suspension becomes a 100ms all-reduce stall across potentially hundreds of GPUs. The affected training job's step time is dominated by this stall rather than by GPU compute, but the operator sees only "slow training" with no obvious cause in the application logs.
There is no mainline tracepoint for CFS bandwidth throttling today (that gap is exactly what the patch described next would close), but the throttling functions themselves can be probed. A rough sketch, assuming a kernel with BTF and un-inlined throttle_cfs_rq/unthrottle_cfs_rq symbols:

/* Detecting cgroup CPU throttling of training pods with bpftrace */
$ bpftrace -e '
kprobe:throttle_cfs_rq {
    $cfs = (struct cfs_rq *)arg0;
    /* count throttle events per cgroup (the pod running NCCL/PyTorch) */
    @throttles[str($cfs->tg->css.cgroup->kn->name)] = count();
    @since[arg0] = nsecs;
}
kprobe:unthrottle_cfs_rq /@since[arg0]/ {
    /* how long each runqueue stayed throttled */
    @throttled_ms = hist((nsecs - @since[arg0]) / 1000000);
    delete(@since[arg0]);
}'
A kernel patch adding a tracepoint to kernel/sched/fair.c at the cgroup throttling decision point — one that carries the thread's NUMA node, CPU affinity, and whether it holds any DRM or RDMA mutex — would allow automated detection of this failure mode in production monitoring systems. This is a self-contained, clearly-motivated patch with an obvious userspace use case, and is the kind of change the sched maintainers respond well to.
Model checkpointing is the process of periodically saving model weights and optimizer state to persistent storage so that training can resume from a known point after failure. For a 70B parameter model in BF16 precision, a single checkpoint is approximately 140GB. In a large training cluster performing checkpoints every 30 minutes across 64 nodes simultaneously, this is a 9TB synchronous write burst that competes with ongoing data loading, validation, and — if checkpoints write to shared NFS — other clusters on the same storage network.
io_uring is the modern Linux asynchronous I/O interface used by high-performance ML frameworks for checkpoint I/O. It uses submission and completion ring buffers to batch I/O operations with minimal syscall overhead. When the submission ring overflows — because I/O bandwidth is saturated — operations fail with -EBUSY and must be retried. These retries introduce stochastic latency that stalls the checkpoint writer thread and can delay training resumption after a failure by minutes.
Linux's page cache writeback mechanism — the kworker/flush kernel thread — periodically writes dirty pages to disk according to a set of heuristics based on dirty page ratio, elapsed time since last flush, and per-device I/O queue depth. When a large checkpoint write begins, it marks tens of gigabytes of pages dirty simultaneously. The writeback mechanism responds by aggressively flushing, competing with the checkpoint writer for block device queue slots.
Simultaneously, the balance_dirty_pages mechanism throttles the process dirtying pages if the dirty page ratio exceeds a threshold. For a checkpoint writer running inside a training process, this throttling appears as a stall in the checkpoint write loop — the process is suspended in the kernel waiting for dirty pages to be flushed before it can dirty more. From the training engineer's perspective, the GPU sits idle waiting for the checkpoint to complete.
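The throttling itself is observable today: the writeback:balance_dirty_pages tracepoint reports, among many fields, the pause imposed on the dirtying task (verify the exact field names in /sys/kernel/tracing/events/writeback/balance_dirty_pages/format for your kernel). A sketch that histograms checkpoint-writer stalls by process:

$ bpftrace -e '
tracepoint:writeback:balance_dirty_pages /args->pause > 0/ {
    /* milliseconds this process was paused for exceeding dirty limits */
    @dirty_throttle_ms[comm] = hist(args->pause);
}'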
The writeback_dirty_inode tracepoint fires when an inode is marked dirty but does not record the triggering process, cgroup, or allocation context. For GPU training jobs, knowing that a checkpoint write from a specific process caused a cascade of writeback pressure across a shared NFS mount would allow the distributed training framework to stagger checkpoint timing across nodes. A patch adding process and cgroup context to key writeback tracepoints would enable this analysis.
The io_uring submission queue ring has a fixed size, typically 128 or 256 entries, configured at ring creation. When a checkpoint writer submits more I/O operations than the ring can hold, operations fail with -EBUSY. The framework must retry, introducing latency. More critically, this overflow event is currently not exposed as a tracepoint — it is only detectable by polling the io_uring statistics file in /proc/<pid>/fdinfo/, which almost no monitoring system does.
The implication is that a training cluster could be experiencing thousands of io_uring overflows per checkpoint — each adding milliseconds of retry delay — with no signal in any standard monitoring dashboard. Adding a tracepoint at the overflow site in io_uring/io_uring.c would surface this failure mode in any tracing infrastructure:
/* io_uring/io_uring.c — proposed SQ overflow tracepoint */
static inline int io_get_sqe(struct io_ring_ctx *ctx,
                             const struct io_uring_sqe **sqe)
{
    unsigned head, mask = ctx->sq_entries - 1;

    if (unlikely(io_sqring_entries(ctx) == 0)) {
        trace_io_uring_sq_overflow(
            ctx,
            ctx->sq_entries,              /* ring capacity */
            atomic_read(&ctx->sq_queued)  /* current depth */
        );
        return -EBUSY;
    }
    /* ... normal submission path ... */
}
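With such a tracepoint merged, counting overflows per process would be a one-liner (the event name is the hypothetical one from the sketch above):

$ bpftrace -e 'tracepoint:io_uring:io_uring_sq_overflow { @overflows[comm] = count(); }'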
Beneath io_uring sits the multi-queue block layer (blk-mq), which manages hardware dispatch queues for NVMe devices. When all hardware queue slots are occupied, new requests stall in the software staging queue. The block_rq_insert and block_rq_issue tracepoints exist and capture this, but they do not carry the io_uring context that caused the insertion. A join between io_uring tracepoints and block layer tracepoints — achievable in eBPF without any kernel patch, but requiring the io_uring tracepoints to carry a request identifier — would close this observability gap completely.
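The device-side half of that join is already measurable. The block_rq_issue and block_rq_complete tracepoints share the device and starting sector, which is enough to histogram NVMe request latency during a checkpoint burst (pairing by sector is approximate when requests are merged or split). A sketch:

$ bpftrace -e '
tracepoint:block:block_rq_issue { @issue[args->dev, args->sector] = nsecs; }
tracepoint:block:block_rq_complete /@issue[args->dev, args->sector]/ {
    /* per-request block layer latency in microseconds */
    @blk_lat_us = hist((nsecs - @issue[args->dev, args->sector]) / 1000);
    delete(@issue[args->dev, args->sector]);
}'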
The Linux kernel community is genuinely welcoming of observability patches — new tracepoints, enriched existing events, new /proc or debugfs counters — when they are well-motivated and clearly scoped. But the path from a useful idea to a merged patch has specific, non-negotiable requirements that differ significantly from application-level development. Understanding them before you write your first patch will save months of frustration.
printk() calls on production paths are almost always rejected. They add permanent noise to kernel logs and are non-zero cost. Tracepoints are idiomatic: zero cost when disabled, structured, type-safe, and composable with eBPF and ftrace. Every observability patch should use tracepoints.

A patch confined to drivers/gpu/drm/scheduler/ needs review only from DRM maintainers. A patch that touches mm/, drivers/gpu/, and block/ simultaneously needs consensus from three separate maintainer trees and will sit in review limbo for months. Write patches that fit inside one subsystem's directory tree.

A tracepoint on a hot path such as ib_post_send or drm_sched_job_arm must have verified overhead below ~5ns when disabled. The community expects perf bench numbers in the cover letter for any patch touching a path that executes more than ~1M times per second. In review, the recurring expectations are simple: no printk on fast paths, and structured event fields (__field and __string, not sprintf).

The mailing lists that matter for this work:
- dri-devel@lists.freedesktop.org — DRM/GPU, all open driver work
- linux-rdma@vger.kernel.org — InfiniBand, RoCE, RDMA
- linux-mm@kvack.org — all memory management changes
- linux-pci@vger.kernel.org — PCIe, AER, IOMMU
- io-uring@vger.kernel.org — io_uring
- linux-kernel@vger.kernel.org — CC for anything touching kernel/sched/

Subscribe to dri-devel and read three months of review threads. Find a tracepoint in drivers/gpu/drm/scheduler/ that is missing a field you need. Submit a patch adding that field. It will be small, obviously correct, easy to review, and directly beneficial to the GPU computing community. That is the right first contribution — small scope, clear motivation, one subsystem.
The Linux kernel is not plumbing that can be ignored when diagnosing AI cluster performance problems. It is the terrain on which every GPU job, every gradient synchronization, and every checkpoint write executes. The bottlenecks are real, they affect training throughput at scale every day, and the tools to measure them are already built into the kernel — they just need to be extended to cover the specific failure modes that matter to AI workloads.
In the next post, we will go hands-on: writing a complete tracepoint patch for the DRM scheduler, building a kernel with the patch applied, testing it with bpftrace, and formatting it for submission to the dri-devel mailing list. If you want to follow along, you'll need a kernel build environment with GCC or Clang, and ideally access to a machine with an AMD or Intel GPU with open-source driver support.
The patches described in this post are real opportunities. If you work on AI infrastructure and have access to GPU cluster hardware, you are in a position to contribute observability improvements that will benefit the entire community. The maintainers are looking for motivated contributors who can reproduce real problems and write clean, targeted fixes. That combination is rarer than it should be.