Linux kernel · NVMe · DMA/IOMMU · eBPF · AI clusters

When GPUs Stall but Nothing Looks Wrong

A kernel-level blueprint for debugging SSD, disk, DMA, and controller failures in GPU clusters — with concrete Linux patch sketches and matching eBPF user-space tooling.

Thesis

AI clusters treat storage as a throughput problem. In practice, the failures that hurt training are attribution problems: which read, write, DMA map, controller reset, or checkpoint operation made the GPU wait?

p999 tail latency kills MFU
NVMe timeouts reset queues
DMA faults hide below apps

Patchset map

This post proposes AI-aware request tags in blk-mq, NVMe timeout attribution, DMA/IOMMU tracepoints, checkpoint integrity hooks, and an eBPF collector that correlates storage events with GPU stalls.

blk-mq · drivers/nvme/host · kernel/dma · iommu · fs/writeback · BPF ringbuf

1. The problem: the GPU is idle, but the root cause is invisible

A training loop does not fail only when a GPU kernel fails. It also fails when a checkpoint write stalls, a dataset shard read hits an NVMe reset, a DMA map falls back to a bounce buffer, or an IOMMU fault forces recovery.

From the model’s perspective, all of these look like the same thing: GPU utilization drops. From the kernel’s perspective, they are spread across separate subsystems: block, NVMe, filesystem, DMA mapping, IOMMU, PCIe, and sometimes GPU memory management.

Figure: where the stall hides. Dataset (TFRecord/Parquet) → Filesystem (XFS/ext4) → blk-mq (dispatch queues) → NVMe (controller) → DMA/IOMMU (IOVA/ATS) → GPU (waits). The stall can hide as queue delay, a timeout/reset, or a bounce buffer; the GPU sees only idle time.
Core bug: the Linux kernel has excellent local observability, but weak cross-layer lineage. We can see that an NVMe timeout happened. We cannot easily say which training step, checkpoint, dataset shard, CUDA copy, or GPU idle interval it caused.

2. Failure modes that matter in AI clusters

NVMe timeout/reset

One stuck command can trigger aborts and controller reset handling. Training sees this as a multi-hundred-millisecond or second-scale starvation event.

p999 read latency

Average I/O looks fine while p999 latency destroys input pipeline smoothness. GPU compute bubbles appear periodically.

Checkpoint write stalls

Large periodic writes interact with writeback, journaling, queue depth, and firmware garbage collection. The next training stage blocks.

DMA/IOMMU faults

Large pinned buffers, GPUDirect paths, NVMe DMA, and IOVA pressure can create mapping latency, bounce buffering, or hard faults.

nvme nvme0: I/O 4096 QID 7 timeout, aborting
nvme nvme0: resetting controller
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read NO_PASID] Request device [5e:00.0] fault addr 0x...
blk_update_request: I/O error, dev nvme0n1, sector 18874368 op 0x0:(READ)

These logs are useful, but not enough. They identify a symptom at one layer. They do not preserve workload identity across filesystem, block, device, and DMA subsystems.

3. Proposed architecture: request lineage for AI I/O

The fix is not to make the kernel “understand PyTorch.” The fix is smaller and more upstream-friendly: let user space attach a durable request lineage tag to I/O, then propagate that tag through blk-mq, NVMe, DMA mapping, and filesystem integrity events.

The AI-I/O observability loop, end to end:

1. User space tags I/O (ai_tag = step/shard/checkpoint).
2. The kernel propagates the tag (bio → request → NVMe command).
3. Tracepoints emit latency, timeout, and DMA fault events.
4. An eBPF collector joins GPU idle intervals with I/O events.
5. A root-cause report maps stall → request → device.

Together these close the attribution loop.
Layer          | Patch                                            | New answer enabled
io_uring / VFS | Accept and attach an AI request tag              | Which checkpoint/shard/batch issued this I/O?
blk-mq         | Propagate the tag through bio/request completion | Where did latency accumulate?
NVMe           | Timeout/reset attribution with the tag           | Which workload was hit by controller recovery?
DMA/IOMMU      | Map latency, bounce, and fault tracepoints       | Was the fast path actually used?
Filesystem     | Checkpoint integrity hook                        | Was the checkpoint complete and correct?

4. Patch 1 — AI-aware I/O tags in blk-mq

Linux already carries rich metadata through the block layer, but it does not carry a stable application-level lineage identifier. We add a compact u64 ai_tag to block requests and expose it through a controlled io_uring path.

Kernel patch sketch

// include/linux/blk_types.h
struct bio {
        ...
        u64 bi_ai_tag;
};

// include/linux/blk-mq.h or include/linux/blkdev.h
struct request {
        ...
        u64 ai_tag;
};

// block/blk-mq.c
static void blk_mq_bio_to_request(struct request *rq, struct bio *bio)
{
        ...
        rq->ai_tag = bio->bi_ai_tag;
}
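
On the submission side, user space needs a sanctioned way to set the tag. A minimal liburing sketch, assuming a hypothetical reuse of the spare addr3 SQE field for tagged reads (this is not an existing io_uring ABI; the real interface would be negotiated on the list):

// userspace sketch: hypothetical ABI, not current io_uring
#include <errno.h>
#include <liburing.h>

static int submit_tagged_read(struct io_uring *ring, int fd,
                              void *buf, unsigned len, __u64 off,
                              __u64 ai_tag)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe)
                return -EBUSY;

        io_uring_prep_read(sqe, fd, buf, len, off);
        sqe->addr3 = ai_tag;    /* hypothetical: kernel copies into bio->bi_ai_tag */
        return io_uring_submit(ring);
}

A per-ring tag registered once per training step via io_uring_register() would avoid touching every SQE; either shape hands the block layer the same lineage information.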

Tracepoint

// include/trace/events/block.h
TRACE_EVENT(block_ai_rq_complete,
        TP_PROTO(struct request *rq, u64 latency_ns),
        TP_ARGS(rq, latency_ns),

        TP_STRUCT__entry(
                __field(u64, ai_tag)
                __field(dev_t, dev)
                __field(sector_t, sector)
                __field(unsigned int, nr_bytes)
                __field(u64, latency_ns)
        ),

        TP_fast_assign(
                __entry->ai_tag = rq->ai_tag;
                __entry->dev = disk_devt(rq->q->disk);
                __entry->sector = blk_rq_pos(rq);
                __entry->nr_bytes = blk_rq_bytes(rq);
                __entry->latency_ns = latency_ns;
        ),

        TP_printk("ai_tag=%llu dev=%d:%d sector=%llu bytes=%u latency_ns=%llu",
                __entry->ai_tag, MAJOR(__entry->dev), MINOR(__entry->dev),
                (unsigned long long)__entry->sector, __entry->nr_bytes,
                __entry->latency_ns)
);
Upstream discipline: the first version should be framed as a generic request-correlation tag, not an AI-only semantic field. The AI use case is the motivating example; the kernel API should remain general.

5. Patch 2 — NVMe timeout and reset attribution

NVMe timeouts are one of the worst cluster incidents because they can create long GPU starvation intervals while still looking like “storage noise” in logs. The fix is to attach the propagated tag to timeout, abort, and reset paths.

// drivers/nvme/host/pci.c (nvme_timeout lives here, not core.c)
static enum blk_eh_timer_return nvme_timeout(struct request *req)
{
        struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
        struct nvme_ctrl *ctrl = &nvmeq->dev->ctrl;

        trace_nvme_ai_timeout(ctrl->instance,
                              nvmeq->qid,
                              req->tag,
                              req->ai_tag,
                              blk_rq_bytes(req));

        dev_warn(ctrl->device,
                 "I/O timeout: qid=%u tag=%d ai_tag=%llu bytes=%u\n",
                 nvmeq->qid, req->tag,
                 req->ai_tag, blk_rq_bytes(req));

        /* existing abort/reset escalation elided from this sketch */
        return BLK_EH_RESET_TIMER;
}
// include/trace/events/nvme.h
TRACE_EVENT(nvme_ai_timeout,
        TP_PROTO(int ctrl, int qid, int rq_tag, u64 ai_tag, unsigned int bytes),
        TP_ARGS(ctrl, qid, rq_tag, ai_tag, bytes),
        TP_STRUCT__entry(
                __field(int, ctrl)
                __field(int, qid)
                __field(int, rq_tag)
                __field(u64, ai_tag)
                __field(unsigned int, bytes)
        ),
        TP_fast_assign(
                __entry->ctrl = ctrl;
                __entry->qid = qid;
                __entry->rq_tag = rq_tag;
                __entry->ai_tag = ai_tag;
                __entry->bytes = bytes;
        ),
        TP_printk("nvme%d qid=%d rq_tag=%d ai_tag=%llu bytes=%u",
                __entry->ctrl, __entry->qid, __entry->rq_tag,
                __entry->ai_tag, __entry->bytes)
);

This does not magically prevent resets. It makes resets attributable. That is the difference between “training was slow last night” and “checkpoint shard 17 on nvme2n1 repeatedly triggered controller recovery during step 80,042.”

6. Patch 3 — DMA/IOMMU visibility for the hidden fast path

GPU clusters rely on the assumption that DMA is direct and efficient. But when IOVA allocation is slow, mappings fragment, or traffic falls back to a bounce buffer, user space rarely sees the reason. This patch adds tracepoints around DMA map/unmap latency and fallback behavior.

// kernel/dma/mapping.c (sketch: assumes the existing body of
// dma_map_page_attrs() is refactored into __dma_map_page_attrs())
dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page,
                              size_t offset, size_t size,
                              enum dma_data_direction dir,
                              unsigned long attrs)
{
        u64 start_ns = ktime_get_ns();
        dma_addr_t dma_addr;

        dma_addr = __dma_map_page_attrs(dev, page, offset, size, dir, attrs);

        /*
         * Note: is_swiotlb_buffer() expects a physical address; a real
         * patch would record the bounce decision inside the swiotlb path.
         */
        trace_dma_ai_map_latency(dev_name(dev), size, dir,
                                 ktime_get_ns() - start_ns,
                                 is_swiotlb_buffer(dev, dma_addr));

        return dma_addr;
}
// include/trace/events/dma.h
TRACE_EVENT(dma_ai_map_latency,
        TP_PROTO(const char *dev, size_t size, int dir, u64 latency_ns, bool bounced),
        TP_ARGS(dev, size, dir, latency_ns, bounced),
        TP_STRUCT__entry(
                __string(dev, dev)
                __field(size_t, size)
                __field(int, dir)
                __field(u64, latency_ns)
                __field(bool, bounced)
        ),
        TP_fast_assign(
                __assign_str(dev, dev);
                __entry->size = size;
                __entry->dir = dir;
                __entry->latency_ns = latency_ns;
                __entry->bounced = bounced;
        ),
        TP_printk("dev=%s size=%zu dir=%d latency_ns=%llu bounced=%d",
                __get_str(dev), __entry->size, __entry->dir,
                __entry->latency_ns, __entry->bounced)
);
Figure: direct DMA vs the invisible bounce path. Desired: NVMe DMAs directly into GPU memory (the fast path). Actual fallback: NVMe → host bounce buffer → GPU memory, which may be invisible without DMA/IOMMU tracepoints.
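
One practical objection: ktime_get_ns() on every map is not free on hot DMA paths. A mergeable version would likely gate the measurement behind a static key, so the disabled case costs a single patched-out branch. A minimal sketch using the kernel's jump-label machinery:

// kernel/dma/mapping.c (sketch: zero-cost timing when tracing is off)
DEFINE_STATIC_KEY_FALSE(dma_ai_trace_key);

static inline u64 dma_ai_ts(void)
{
        /* patched-out branch when the key is disabled */
        return static_branch_unlikely(&dma_ai_trace_key) ? ktime_get_ns() : 0;
}

The key would be flipped from a debugfs or sysfs knob when the collector attaches, mirroring how other kernel tracing features keep production fast paths untaxed.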

7. Patch 4 — checkpoint integrity hooks

Checkpoint files deserve first-class observability because they are both large and correctness-critical. A partial, stale, or corrupted checkpoint can waste hours of cluster time or produce mysterious divergence after resume.

The kernel should not understand model formats, but it can provide a generic integrity hook that user space opts into with an xattr or open flag.

# userspace convention
setfattr -n user.ai.checkpoint -v "sha256:enabled" /checkpoints/run42/step80000.pt
// fs/read_write.c — sketch only; S_AI_CHECKPOINT is a new inode flag
// that the filesystem sets from the user.ai.checkpoint xattr above.
static inline bool file_ai_checkpoint_enabled(struct file *file)
{
        return file_inode(file)->i_flags & S_AI_CHECKPOINT;
}

ssize_t vfs_write(struct file *file, const char __user *buf,
                  size_t count, loff_t *pos)
{
        ssize_t ret;
        u64 start = ktime_get_ns();

        ret = __vfs_write(file, buf, count, pos);

        if (ret > 0 && file_ai_checkpoint_enabled(file))
                trace_ai_checkpoint_write(file, *pos, ret,
                                          ktime_get_ns() - start);

        return ret;
}

A production version would avoid hashing in hot VFS paths unless explicitly enabled. The first patch should trace checkpoint write extents and completion ordering. User space can compute cryptographic hashes asynchronously.
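
For reference, the event used in the sketch above could look like the following; the field selection is illustrative, not a settled ABI:

// include/trace/events/ai_checkpoint.h — sketch
TRACE_EVENT(ai_checkpoint_write,
        TP_PROTO(struct file *file, loff_t pos, ssize_t bytes, u64 latency_ns),
        TP_ARGS(file, pos, bytes, latency_ns),
        TP_STRUCT__entry(
                __field(unsigned long, ino)
                __field(loff_t, pos)
                __field(ssize_t, bytes)
                __field(u64, latency_ns)
        ),
        TP_fast_assign(
                __entry->ino = file_inode(file)->i_ino;
                __entry->pos = pos;
                __entry->bytes = bytes;
                __entry->latency_ns = latency_ns;
        ),
        TP_printk("ino=%lu pos=%lld bytes=%zd latency_ns=%llu",
                __entry->ino, (long long)__entry->pos, __entry->bytes,
                __entry->latency_ns)
);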

8. Matching eBPF tooling: ai-io-trace

The kernel patches expose tracepoints. The user-space collector makes them valuable by joining them into one timeline.

Goal: run one tool during training and get a plain-English root cause report for GPU stalls: NVMe timeout, block tail latency, DMA bounce, IOMMU fault, or checkpoint writeback stall.

BPF program sketch

// bpf/ai_io_trace.bpf.c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

struct ai_event {
        u64 ts_ns;
        u64 ai_tag;
        u32 type;
        u32 pid;
        u64 latency_ns;
        u64 bytes;
        char dev[32];
};

struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 1 << 24);
} events SEC(".maps");

enum {
        EV_BLOCK_COMPLETE = 1,
        EV_NVME_TIMEOUT   = 2,
        EV_DMA_MAP        = 3,
        EV_DMA_BOUNCE     = 4,
        EV_CKPT_WRITE     = 5,
};

SEC("tracepoint/block/block_ai_rq_complete")
int handle_block_complete(void *ctx)
{
        struct ai_event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
        if (!e) return 0;

        e->ts_ns = bpf_ktime_get_ns();
        e->type = EV_BLOCK_COMPLETE;
        e->pid = bpf_get_current_pid_tgid() >> 32;
        // ai_tag, latency_ns, and bytes come from the tracepoint context;
        // see the fuller handler sketch after this program.
        bpf_ringbuf_submit(e, 0);
        return 0;
}

SEC("tracepoint/nvme/nvme_ai_timeout")
int handle_nvme_timeout(void *ctx)
{
        struct ai_event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
        if (!e) return 0;

        e->ts_ns = bpf_ktime_get_ns();
        e->type = EV_NVME_TIMEOUT;
        e->pid = bpf_get_current_pid_tgid() >> 32;
        bpf_ringbuf_submit(e, 0);
        return 0;
}

SEC("tracepoint/dma/dma_ai_map_latency")
int handle_dma_map(void *ctx)
{
        struct ai_event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
        if (!e) return 0;

        e->ts_ns = bpf_ktime_get_ns();
        e->type = EV_DMA_MAP;
        e->pid = bpf_get_current_pid_tgid() >> 32;
        bpf_ringbuf_submit(e, 0);
        return 0;
}
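
The handlers above elide field access. Once a tracepoint is merged, its context layout becomes available via BTF/vmlinux.h under the trace_event_raw_<name> convention; until then, the struct below is a hand-written stand-in whose layout mirrors the TP_STRUCT__entry from Patch 1, shown for the block completion event:

// bpf/ai_io_trace.bpf.c — fuller sketch for one handler
// (hand-written layout; the generated vmlinux.h definition replaces
// this once the tracepoint exists in the running kernel)
struct trace_event_raw_block_ai_rq_complete {
        struct trace_entry ent;
        u64 ai_tag;
        u32 dev;
        u64 sector;
        u32 nr_bytes;
        u64 latency_ns;
};

SEC("tracepoint/block/block_ai_rq_complete")
int handle_block_complete_full(struct trace_event_raw_block_ai_rq_complete *ctx)
{
        struct ai_event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
        if (!e)
                return 0;

        e->ts_ns = bpf_ktime_get_ns();
        e->type = EV_BLOCK_COMPLETE;
        e->pid = bpf_get_current_pid_tgid() >> 32;
        e->ai_tag = ctx->ai_tag;          /* fields from TP_STRUCT__entry */
        e->latency_ns = ctx->latency_ns;
        e->bytes = ctx->nr_bytes;
        bpf_ringbuf_submit(e, 0);
        return 0;
}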

User-space collector sketch

# tools/ai_io_trace.py
# Sketch: assumes a bcc build able to compile the libbpf-style source above;
# a production tool would more likely use a libbpf skeleton (bpftool gen).
from bcc import BPF
from collections import defaultdict, deque
import json, subprocess

b = BPF(src_file="bpf/ai_io_trace.bpf.c")

recent = defaultdict(lambda: deque(maxlen=2048))

def gpu_util_snapshot():
    try:
        out = subprocess.check_output([
            "nvidia-smi",
            "--query-gpu=timestamp,index,utilization.gpu,utilization.memory",
            "--format=csv,noheader,nounits"
        ], text=True, timeout=1)
        return out.strip()
    except Exception:
        return "gpu_metrics_unavailable"

def on_event(cpu, data, size):
    e = b["events"].event(data)
    item = {
        "ts_ns": e.ts_ns,
        "ai_tag": e.ai_tag,
        "type": e.type,
        "pid": e.pid,
        "latency_ns": e.latency_ns,
        "bytes": e.bytes,
        # dev is a fixed-size char array; strip NUL padding
        "dev": e.dev.decode("utf-8", "replace").rstrip("\x00"),
        "gpu": gpu_util_snapshot(),
    }
    recent[e.ai_tag].append(item)
    print(json.dumps(item))

b["events"].open_ring_buffer(on_event)

while True:
    b.ring_buffer_poll(timeout=100)

Example output

{"ai_tag": 80042, "type": "NVME_TIMEOUT", "dev": "nvme2n1", "latency_ms": 1240.3,
 "gpu": "2026/04/28 22:15:04, 0, 12, 18"}

[ai-io-trace] GPU stall attribution
  step: 80042
  likely cause: NVMe controller timeout
  device: nvme2n1
  affected path: checkpoint write
  correlated DMA symptom: no bounce; map latency normal
  action: inspect smart-log/error-log, firmware, thermal throttling, PCIe AER

9. What the final report should look like

The win is not a prettier trace log. The win is converting a distributed failure into a root-cause narrative.

AI I/O Stall Report
===================
Window: 2026-04-28T22:14:40Z → 2026-04-28T22:15:10Z
GPU: 0, 1, 2, 3
Symptom: SM utilization fell from 94% to 17% for 3.8s

Primary cause:
  NVMe timeout on nvme2n1, queue 7, request tag 4096

Workload lineage:
  ai_tag: 80042
  semantic label: checkpoint_write/run42/step80000/shard17

Kernel path:
  VFS write → XFS extent writeback → blk-mq dispatch → NVMe timeout → reset recovery

Secondary signals:
  DMA map latency normal
  No SWIOTLB bounce detected
  No IOMMU fault detected

Recommended next actions:
  1. Check nvme smart-log and error-log for nvme2n1
  2. Inspect PCIe AER for the device/root port
  3. Compare p999 latency during checkpoint windows
  4. Consider separating checkpoint device from dataset read device

10. How to make this upstream-plausible

Do not pitch this as “the kernel should know about AI.” Pitch it as a general, opt-in request attribution mechanism motivated by AI clusters.

Good framing

Generic I/O lineage tags, tracepoints, and opt-in integrity metadata for high-value files.

Bad framing

Hard-code PyTorch, NCCL, checkpoint formats, model names, or GPU vendor assumptions into core kernel paths.

The first mergeable slice is likely not the full patchset. A realistic sequence is:

  1. Add tracepoints only, without new behavior.
  2. Add generic request correlation tags behind feature flags (sketched after this list).
  3. Add DMA/IOMMU latency and bounce observability.
  4. Build user-space tooling that proves real operational value.
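
Step 2 has a natural concrete shape: compile the tag plumbing out entirely unless an opt-in config is set, so kernels built without it pay nothing. A minimal sketch, assuming a hypothetical CONFIG_BLK_AI_TAG Kconfig option:

// include/linux/blk-mq.h (sketch, gated by a hypothetical Kconfig option)
#ifdef CONFIG_BLK_AI_TAG
static inline void blk_rq_copy_ai_tag(struct request *rq, struct bio *bio)
{
        rq->ai_tag = bio->bi_ai_tag;
}
#else
static inline void blk_rq_copy_ai_tag(struct request *rq, struct bio *bio)
{
        /* tag support compiled out: no field access, no overhead */
}
#endif

blk_mq_bio_to_request() from Patch 1 would then call blk_rq_copy_ai_tag() instead of touching the field directly.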

Conclusion: AI reliability needs kernel-level causality

The next frontier in AI infrastructure is not only faster GPUs. It is making the non-GPU path explainable when the GPU is waiting.

Storage errors, controller resets, DMA fallbacks, and checkpoint integrity problems are not edge cases. They are production realities. Today, they are difficult to attribute because every subsystem logs its local truth while the full request lineage is lost.

The patchset in one sentence: carry a workload lineage tag through the Linux I/O path, emit tracepoints at the failure boundaries, and use eBPF to correlate kernel events with GPU stalls.

That turns “GPU utilization dropped” into a precise engineering answer: which request, which device, which queue, which mapping path, which checkpoint, and which kernel subsystem caused the stall.