1. The problem: the GPU is idle, but the root cause is invisible
A training loop does not fail only when a GPU kernel fails. It also fails when a checkpoint write stalls, a dataset shard read hits an NVMe reset, a DMA map falls back to a bounce buffer, or an IOMMU fault forces recovery.
From the model’s perspective, all of these look like the same thing: GPU utilization drops. From the kernel’s perspective, they are spread across separate subsystems: block, NVMe, filesystem, DMA mapping, IOMMU, PCIe, and sometimes GPU memory management.
2. Failure modes that matter in AI clusters
NVMe timeout/reset
One stuck command can trigger aborts and controller reset handling. Training sees this as a multi-hundred-millisecond or second-scale starvation event.
p999 read latency
Average I/O looks fine while p999 latency destroys input pipeline smoothness. GPU compute bubbles appear periodically.
Checkpoint write stalls
Large periodic writes interact with writeback, journaling, queue depth, and firmware garbage collection. The next training stage blocks.
DMA/IOMMU faults
Large pinned buffers, GPUDirect paths, NVMe DMA, and IOVA pressure can create mapping latency, bounce buffering, or hard faults.
nvme nvme0: I/O 4096 QID 7 timeout, aborting
nvme nvme0: resetting controller
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read NO_PASID] Request device [5e:00.0] fault addr 0x...
blk_update_request: I/O error, dev nvme0n1, sector 18874368 op 0x0:(READ)
These logs are useful, but not enough. They identify a symptom at one layer. They do not preserve workload identity across filesystem, block, device, and DMA subsystems.
3. Proposed architecture: request lineage for AI I/O
The fix is not to make the kernel “understand PyTorch.” The fix is smaller and more upstream-friendly: let user space attach a durable request lineage tag to I/O, then propagate that tag through blk-mq, NVMe, DMA mapping, and filesystem integrity events.
| Layer | Patch | New answer enabled |
|---|---|---|
| io_uring / VFS | Accept and attach AI request tag | Which checkpoint/shard/batch issued this I/O? |
| blk-mq | Propagate tag through bio/request completion | Where did latency accumulate? |
| NVMe | Timeout/reset attribution with tag | Which workload was hit by controller recovery? |
| DMA/IOMMU | Map latency, bounce, and fault tracepoints | Was the fast path actually used? |
| Filesystem | Checkpoint integrity hook | Was the checkpoint complete and correct? |
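The table treats the tag as an opaque u64. One hypothetical userspace convention (not part of any proposed kernel ABI) packs run, step, and shard into that value, so any kernel event carrying the tag decodes back to workload lineage without a lookup table:

```python
# Hypothetical ai_tag layout: 16-bit run id, 32-bit step, 16-bit shard.
# This is a userspace convention sketch; the kernel only sees a u64.
RUN_BITS, STEP_BITS, SHARD_BITS = 16, 32, 16

def pack_ai_tag(run_id: int, step: int, shard: int) -> int:
    assert run_id < (1 << RUN_BITS)
    assert step < (1 << STEP_BITS)
    assert shard < (1 << SHARD_BITS)
    return (run_id << (STEP_BITS + SHARD_BITS)) | (step << SHARD_BITS) | shard

def unpack_ai_tag(tag: int) -> tuple[int, int, int]:
    shard = tag & ((1 << SHARD_BITS) - 1)
    step = (tag >> SHARD_BITS) & ((1 << STEP_BITS) - 1)
    run_id = tag >> (STEP_BITS + SHARD_BITS)
    return run_id, step, shard

# Round-trip check for the lineage used in the examples below.
tag = pack_ai_tag(run_id=42, step=80000, shard=17)
assert unpack_ai_tag(tag) == (42, 80000, 17)
```

A structured tag keeps the hot path cheap: the kernel propagates one integer, and all decoding happens in the collector.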
4. Patch 1 — AI-aware I/O tags in blk-mq
Linux already carries rich metadata through the block layer, but it does not carry a stable application-level lineage identifier. We add a compact u64 ai_tag to bios and block requests and expose it through a controlled io_uring path.
Kernel patch sketch
// include/linux/blk_types.h
struct bio {
...
u64 bi_ai_tag;
};
// include/linux/blk-mq.h or include/linux/blkdev.h
struct request {
...
u64 ai_tag;
};
// block/blk-mq.c
static void blk_mq_bio_to_request(struct request *rq, struct bio *bio,
		unsigned int nr_segs)
{
	...
	rq->ai_tag = bio->bi_ai_tag;
}
Tracepoint
// include/trace/events/block.h
TRACE_EVENT(block_ai_rq_complete,
TP_PROTO(struct request *rq, u64 latency_ns),
TP_ARGS(rq, latency_ns),
TP_STRUCT__entry(
__field(u64, ai_tag)
__field(dev_t, dev)
__field(sector_t, sector)
__field(unsigned int, nr_bytes)
__field(u64, latency_ns)
),
TP_fast_assign(
__entry->ai_tag = rq->ai_tag;
__entry->dev = disk_devt(rq->q->disk);
__entry->sector = blk_rq_pos(rq);
__entry->nr_bytes = blk_rq_bytes(rq);
__entry->latency_ns = latency_ns;
),
TP_printk("ai_tag=%llu dev=%d:%d sector=%llu bytes=%u latency_ns=%llu",
__entry->ai_tag, MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long long)__entry->sector, __entry->nr_bytes,
__entry->latency_ns)
);
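To show the tag surviving into tooling, here is a small parser for lines in the TP_printk format above. A sketch assuming a printk-style trace consumer; the sample line is illustrative, not a real capture:

```python
import re

# Field order and names follow the block_ai_rq_complete TP_printk format.
LINE_RE = re.compile(
    r"ai_tag=(?P<ai_tag>\d+) dev=(?P<major>\d+):(?P<minor>\d+) "
    r"sector=(?P<sector>\d+) bytes=(?P<bytes>\d+) latency_ns=(?P<latency_ns>\d+)"
)

def parse_block_complete(line: str) -> dict:
    m = LINE_RE.search(line)
    if not m:
        raise ValueError(f"unrecognized line: {line!r}")
    return {k: int(v) for k, v in m.groupdict().items()}

sample = "ai_tag=80042 dev=259:0 sector=18874368 bytes=1048576 latency_ns=1240300000"
ev = parse_block_complete(sample)
assert ev["ai_tag"] == 80042 and ev["latency_ns"] == 1240300000
```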
5. Patch 2 — NVMe timeout and reset attribution
NVMe timeouts are one of the worst cluster incidents because they can create long GPU starvation intervals while still looking like “storage noise” in logs. The fix is to attach the propagated tag to timeout, abort, and reset paths.
// drivers/nvme/host/pci.c — sketch; nvme_timeout() lives in the PCIe
// driver, and the queue is recovered from the request's hctx
static enum blk_eh_timer_return nvme_timeout(struct request *req)
{
	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
	struct nvme_ctrl *ctrl = nvme_req(req)->ctrl;

	trace_nvme_ai_timeout(ctrl->instance,
			      nvmeq->qid,
			      req->tag,
			      req->ai_tag,
			      blk_rq_bytes(req));
	dev_warn(ctrl->device,
		 "I/O timeout: qid=%u tag=%d ai_tag=%llu bytes=%u\n",
		 nvmeq->qid, req->tag,
		 req->ai_tag, blk_rq_bytes(req));

	/* Sketch only: the real handler escalates to abort and
	 * controller reset rather than always re-arming the timer. */
	return BLK_EH_RESET_TIMER;
}
// include/trace/events/nvme.h
TRACE_EVENT(nvme_ai_timeout,
TP_PROTO(int ctrl, int qid, int rq_tag, u64 ai_tag, unsigned int bytes),
TP_ARGS(ctrl, qid, rq_tag, ai_tag, bytes),
TP_STRUCT__entry(
__field(int, ctrl)
__field(int, qid)
__field(int, rq_tag)
__field(u64, ai_tag)
__field(unsigned int, bytes)
),
TP_fast_assign(
__entry->ctrl = ctrl;
__entry->qid = qid;
__entry->rq_tag = rq_tag;
__entry->ai_tag = ai_tag;
__entry->bytes = bytes;
),
TP_printk("nvme%d qid=%d rq_tag=%d ai_tag=%llu bytes=%u",
__entry->ctrl, __entry->qid, __entry->rq_tag,
__entry->ai_tag, __entry->bytes)
);
This does not magically prevent resets. It makes resets attributable. That is the difference between “training was slow last night” and “checkpoint shard 17 on nvme2n1 repeatedly triggered controller recovery during step 80,042.”
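A minimal sketch of what "attributable" buys operationally: grouping decoded timeout events (illustrative records, not real captures) by device and tag surfaces the repeated-recovery pattern directly:

```python
from collections import Counter

# Each record is a decoded nvme_ai_timeout event; values are illustrative.
events = [
    {"dev": "nvme2n1", "ai_tag": 80042},
    {"dev": "nvme2n1", "ai_tag": 80042},
    {"dev": "nvme0n1", "ai_tag": 80041},
    {"dev": "nvme2n1", "ai_tag": 80042},
]

def repeated_recovery(events, threshold=2):
    """Return (device, ai_tag) pairs that timed out at least `threshold` times."""
    counts = Counter((e["dev"], e["ai_tag"]) for e in events)
    return {pair: n for pair, n in counts.items() if n >= threshold}

assert repeated_recovery(events) == {("nvme2n1", 80042): 3}
```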
6. Patch 3 — DMA/IOMMU visibility for the hidden fast path
GPU clusters rely on the assumption that DMA is direct and efficient. But when IOVA allocation is slow, mappings fragment, or traffic falls back to a bounce buffer, user space rarely sees the reason. This patch adds tracepoints around DMA map/unmap latency and fallback behavior.
// kernel/dma/mapping.c — sketch; the real function dispatches to
// dma_direct_map_page() or ops->map_page()
dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page,
			      size_t offset, size_t size,
			      enum dma_data_direction dir,
			      unsigned long attrs)
{
	u64 start_ns = ktime_get_ns();
	dma_addr_t dma_addr;

	/* __dma_map_page_attrs() stands in for the existing dispatch
	 * to dma_direct_map_page() or ops->map_page(). */
	dma_addr = __dma_map_page_attrs(dev, page, offset, size, dir, attrs);

	/* is_swiotlb_buffer() takes a physical address; on the
	 * direct-map path it can be recovered with dma_to_phys(). */
	trace_dma_ai_map_latency(dev_name(dev), size, dir,
				 ktime_get_ns() - start_ns,
				 is_swiotlb_buffer(dev, dma_to_phys(dev, dma_addr)));
	return dma_addr;
}
// include/trace/events/dma.h
TRACE_EVENT(dma_ai_map_latency,
TP_PROTO(const char *dev, size_t size, int dir, u64 latency_ns, bool bounced),
TP_ARGS(dev, size, dir, latency_ns, bounced),
TP_STRUCT__entry(
__string(dev, dev)
__field(size_t, size)
__field(int, dir)
__field(u64, latency_ns)
__field(bool, bounced)
),
TP_fast_assign(
__assign_str(dev, dev);
__entry->size = size;
__entry->dir = dir;
__entry->latency_ns = latency_ns;
__entry->bounced = bounced;
),
TP_printk("dev=%s size=%zu dir=%d latency_ns=%llu bounced=%d",
__get_str(dev), __entry->size, __entry->dir,
__entry->latency_ns, __entry->bounced)
);
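Once these events exist, the two signals that answer "was the fast path actually used?" are the bounce rate and the map-latency tail. A sketch over synthetic event records:

```python
# Decoded dma_ai_map_latency events; latency values are synthetic.
events = [
    {"latency_ns": 800, "bounced": False},
    {"latency_ns": 950, "bounced": False},
    {"latency_ns": 40_000, "bounced": True},
    {"latency_ns": 1_100, "bounced": False},
]

def dma_health(events):
    """Return (bounce_rate, p99 map latency) for a batch of map events."""
    lat = sorted(e["latency_ns"] for e in events)
    bounce_rate = sum(e["bounced"] for e in events) / len(events)
    # Index-based p99: robust for small batches, no interpolation.
    p99 = lat[min(len(lat) - 1, int(0.99 * len(lat)))]
    return bounce_rate, p99

bounce_rate, p99 = dma_health(events)
assert bounce_rate == 0.25
assert p99 == 40_000  # tail dominated by the bounced mapping
```

A nonzero bounce rate on a GPUDirect path is exactly the kind of silent fast-path regression this patch is meant to expose.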
7. Patch 4 — checkpoint integrity hooks
Checkpoint files deserve first-class observability because they are both large and correctness-critical. A partial, stale, or corrupted checkpoint can waste hours of cluster time or produce mysterious divergence after resume.
The kernel should not understand model formats, but it can provide a generic integrity hook that user space opts into with an xattr or open flag.
# userspace convention
setfattr -n user.ai.checkpoint -v "sha256:enabled" /checkpoints/run42/step80000.pt
// fs/read_write.c — sketch only; S_AI_CHECKPOINT is a new inode flag
static inline bool file_ai_checkpoint_enabled(struct file *file)
{
	return file_inode(file)->i_flags & S_AI_CHECKPOINT;
}

ssize_t vfs_write(struct file *file, const char __user *buf,
		  size_t count, loff_t *pos)
{
	ssize_t ret;
	u64 start = ktime_get_ns();

	/* __vfs_write() stands in for the existing write path;
	 * current kernels inline this logic in vfs_write(). */
	ret = __vfs_write(file, buf, count, pos);
	if (ret > 0 && file_ai_checkpoint_enabled(file))
		trace_ai_checkpoint_write(file, *pos, ret,
					  ktime_get_ns() - start);
	return ret;
}
A production version would avoid hashing in hot VFS paths unless explicitly enabled. The first patch should trace checkpoint write extents and completion ordering. User space can compute cryptographic hashes asynchronously.
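The userspace half of that split could look like the following sketch: hash the checkpoint file in a background thread, off the write path (`sha256_file` and `hash_async` are illustrative helpers, not an existing tool):

```python
import hashlib
import threading

def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def hash_async(path, done):
    """Compute the digest off the training thread; call done(path, digest)."""
    t = threading.Thread(target=lambda: done(path, sha256_file(path)),
                         daemon=True)
    t.start()
    return t

# Usage sketch:
# hash_async("/checkpoints/run42/step80000.pt",
#            lambda p, d: print(p, d))
```

The kernel's job stays narrow: record which extents were written and when; the cryptographic work never touches the hot VFS path.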
8. Matching eBPF tooling: ai-io-trace
The kernel patches expose tracepoints. The user-space collector makes them valuable by joining them into one timeline.
BPF program sketch
// bpf/ai_io_trace.bpf.c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
char LICENSE[] SEC("license") = "GPL";
struct ai_event {
u64 ts_ns;
u64 ai_tag;
u32 type;
u32 pid;
u64 latency_ns;
u64 bytes;
char dev[32];
};
struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, 1 << 24);
} events SEC(".maps");
enum {
EV_BLOCK_COMPLETE = 1,
EV_NVME_TIMEOUT = 2,
EV_DMA_MAP = 3,
EV_DMA_BOUNCE = 4,
EV_CKPT_WRITE = 5,
};
SEC("tracepoint/block/block_ai_rq_complete")
int handle_block_complete(void *ctx)
{
struct ai_event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
if (!e) return 0;
e->ts_ns = bpf_ktime_get_ns();
e->type = EV_BLOCK_COMPLETE;
e->pid = bpf_get_current_pid_tgid() >> 32;
// Sketch: real code reads ai_tag, latency_ns, and bytes from the
// tracepoint's raw context and zeroes fields it does not fill
// (ring buffer memory is not zeroed on reserve).
bpf_ringbuf_submit(e, 0);
return 0;
}
SEC("tracepoint/nvme/nvme_ai_timeout")
int handle_nvme_timeout(void *ctx)
{
struct ai_event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
if (!e) return 0;
e->ts_ns = bpf_ktime_get_ns();
e->type = EV_NVME_TIMEOUT;
e->pid = bpf_get_current_pid_tgid() >> 32;
bpf_ringbuf_submit(e, 0);
return 0;
}
SEC("tracepoint/dma/dma_ai_map_latency")
int handle_dma_map(void *ctx)
{
struct ai_event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
if (!e) return 0;
e->ts_ns = bpf_ktime_get_ns();
e->type = EV_DMA_MAP;
e->pid = bpf_get_current_pid_tgid() >> 32;
bpf_ringbuf_submit(e, 0);
return 0;
}
User-space collector sketch
# tools/ai_io_trace.py
from bcc import BPF
from collections import defaultdict, deque
import json, time, subprocess
b = BPF(src_file="bpf/ai_io_trace.bpf.c")  # assumes the source is adapted to BCC conventions; the SEC()/vmlinux.h style above is libbpf-flavored
recent = defaultdict(lambda: deque(maxlen=2048))
def gpu_util_snapshot():
try:
out = subprocess.check_output([
"nvidia-smi",
"--query-gpu=timestamp,index,utilization.gpu,utilization.memory",
"--format=csv,noheader,nounits"
], text=True, timeout=1)
return out.strip()
except Exception:
return "gpu_metrics_unavailable"
def on_event(cpu, data, size):
e = b["events"].event(data)
item = {
"ts_ns": e.ts_ns,
"ai_tag": e.ai_tag,
"type": e.type,
"pid": e.pid,
"latency_ns": e.latency_ns,
"bytes": e.bytes,
"gpu": gpu_util_snapshot(),
}
recent[e.ai_tag].append(item)
print(json.dumps(item))
b["events"].open_ring_buffer(on_event)
while True:
b.ring_buffer_poll(timeout=100)
Example output
{"ai_tag": 80042, "type": "NVME_TIMEOUT", "dev": "nvme2n1", "latency_ms": 1240.3,
"gpu": "2026/04/28 22:15:04, 0, 12, 18"}
[ai-io-trace] GPU stall attribution
step: 80042
likely cause: NVMe controller timeout
device: nvme2n1
affected path: checkpoint write
correlated DMA symptom: no bounce; map latency normal
action: inspect smart-log/error-log, firmware, thermal throttling, PCIe AER
9. What the final report should look like
The win is not a prettier trace log. The win is converting a distributed failure into a root-cause narrative.
AI I/O Stall Report
===================
Window: 2026-04-28T22:14:40Z → 2026-04-28T22:15:10Z
GPU: 0, 1, 2, 3
Symptom: SM utilization fell from 94% to 17% for 3.8s
Primary cause:
NVMe timeout on nvme2n1, queue 7, request tag 4096
Workload lineage:
ai_tag: 80042
semantic label: checkpoint_write/run42/step80000/shard17
Kernel path:
VFS write → XFS extent writeback → blk-mq dispatch → NVMe timeout → reset recovery
Secondary signals:
DMA map latency normal
No SWIOTLB bounce detected
No IOMMU fault detected
Recommended next actions:
1. Check nvme smart-log and error-log for nvme2n1
2. Inspect PCIe AER for the device/root port
3. Compare p999 latency during checkpoint windows
4. Consider separating checkpoint device from dataset read device
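Recommended action 3 can be automated with a simple tail comparison. A sketch with synthetic latencies (the index-based percentile is an illustrative choice):

```python
def p999(latencies_ns):
    """Index-based p999: the value 99.9% of samples fall at or below."""
    lat = sorted(latencies_ns)
    return lat[min(len(lat) - 1, int(0.999 * len(lat)))]

# Synthetic distributions: steady reads vs. a checkpoint-window tail blowup.
baseline = [100_000] * 999 + [400_000]
checkpoint_window = [100_000] * 990 + [5_000_000] * 10

assert p999(baseline) == 400_000
assert p999(checkpoint_window) == 5_000_000
```

Comparing the two windows per ai_tag turns "storage was noisy" into "checkpoint writes specifically inflate the read tail by >10x".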
10. How to make this upstream-plausible
Do not pitch this as “the kernel should know about AI.” Pitch it as a general, opt-in request attribution mechanism motivated by AI clusters.
Good framing
Generic I/O lineage tags, tracepoints, and opt-in integrity metadata for high-value files.
Bad framing
Hard-code PyTorch, NCCL, checkpoint formats, model names, or GPU vendor assumptions into core kernel paths.
The first mergeable slice is likely not the full patchset. A realistic sequence is:
- Add tracepoints only, without new behavior.
- Add generic request correlation tags behind feature flags.
- Add DMA/IOMMU latency and bounce observability.
- Build user-space tooling that proves real operational value.
Conclusion: AI reliability needs kernel-level causality
The next frontier in AI infrastructure is not only faster GPUs. It is making the non-GPU path explainable when the GPU is waiting.
Storage errors, controller resets, DMA fallbacks, and checkpoint integrity problems are not edge cases. They are production realities. Today, they are difficult to attribute because every subsystem logs its local truth while the full request lineage is lost.
Propagating a durable lineage tag through those subsystems turns “GPU utilization dropped” into a precise engineering answer: which request, which device, which queue, which mapping path, which checkpoint, and which kernel subsystem caused the stall.