AI infra · PCIe AER · Linux kernel observability

PCIe Errors: The Hidden Tax on AI Inference Clusters

Correctable PCIe errors rarely crash a node. That is exactly why they are dangerous: they quietly degrade data movement, reduce effective GPU capacity, and hide under “mysterious inference slowdown.”

10–30%: potential effective throughput loss on degraded nodes under sustained load
3–5×: higher hard-failure likelihood after sustained correctable-error activity
1st-class: PCIe telemetry should feed scheduling, health scoring, and capacity models

In modern inference systems, the expensive thing is not only compute. It is reliable movement of data: CPU↔GPU transfers, GPU↔NIC DMA, GPU↔NVMe paths, and sometimes PCIe switch fabrics sitting under multiple accelerators. When that layer becomes noisy, a node can remain alive while becoming economically inefficient.

PCIe AER · GPU inference · Correctable errors · Kernel tracepoints · Prometheus-ready

PCIe Is Not Just “I/O” Anymore

At a superficial level, PCI Express connects I/O devices: GPUs, NICs, NVMe drives, accelerator cards, and chipset devices. In an AI server, however, PCIe is better understood as part of the data movement fabric.

That distinction matters. If inference is memory- and movement-bound, then a degraded PCIe path is not a peripheral issue. It directly reduces the useful capacity of the node.

[Diagram: CPU/root complex connected through a PCIe switch/fabric to GPUs, a NIC, and NVMe. A noisy link can create retries, backpressure, and shared-fabric jitter.]

Correctable Errors Are the Dangerous Ones

The most interesting PCIe errors are often correctable. They are handled by hardware and firmware. They do not panic the kernel. They do not necessarily kill a device. The box stays up.

But “corrected” does not mean “free.” Correctable errors can cause replay, retransmission, flow-control stalls, link retraining, and short bursts of backpressure. That becomes visible as lower effective throughput, higher queue depth, and wider latency tails.

The trap: high-level monitoring may show “GPU underutilized” or “network fine,” while the real cause is below both layers: the PCIe path is no longer clean.

Why Gen5 Makes This More Real

This matters even more in PCIe Gen5-era systems, which are common in H100-class and newer accelerator platforms. Higher signaling rates reduce margin: signal-integrity issues that were tolerable at Gen4 can surface as retries, link instability, or negotiated-speed compromises at Gen5.

The causes are often boring but expensive: marginal risers, cable seating, thermal expansion after long sustained load, vibration in dense racks, airflow-induced mechanical stress, aging connectors, or a switch/root-port path that is only barely within margin. The workload may be “AI inference,” but the failure mode can be physical.

What Linux Already Gives You

Linux already exposes useful PCIe error information through AER, sysfs, kernel logs, and PCI config-space tooling. The problem is not absence of signal. The problem is that the signal is scattered, noisy, and rarely correlated with inference performance.

1. Kernel logs

dmesg -T | grep -i aer
dmesg -T | grep -i pcie
dmesg -T | grep -Ei 'aer|pcie|corrected|uncorrected'
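
If you want those log lines in structured form, a small parser can pull the reporting device and severity out of them. A minimal sketch, assuming the common "AER: Corrected/Uncorrected error" message shape; exact formats vary across kernel versions, so treat the regex as a starting point.

#!/usr/bin/env python3
# aer_dmesg_parse.py: count AER log lines per device BDF and severity.
# Sketch only: AER message wording differs between kernel versions.
import re
import subprocess
from collections import Counter

AER_LINE = re.compile(
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]).*AER.*?"
    r"(?P<sev>Corrected|Uncorrected|Fatal|Non-Fatal)",
    re.IGNORECASE,
)

def main():
    out = subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout
    counts = Counter()
    for line in out.splitlines():
        m = AER_LINE.search(line)
        if m:
            counts[(m.group("bdf"), m.group("sev").lower())] += 1
    for (bdf, sev), n in counts.most_common():
        print(f"{bdf} {sev}: {n} log lines")

if __name__ == "__main__":
    main()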

2. Per-device sysfs counters

cd /sys/bus/pci/devices/
ls

# Example BDF
cd /sys/bus/pci/devices/0000:65:00.0/
ls

# If exposed by the kernel/platform:
cat aer_dev_correctable
cat aer_dev_nonfatal
cat aer_dev_fatal

3. PCIe link inspection

lspci -vvv -s 65:00.0

# Quick scan for negotiated link speed/width
lspci -vv | grep -i "LnkSta"

Watch for downtraining: a device that should be Gen5 x16 but is running Gen4 x8 is not a minor issue. That directly changes the data movement envelope of the node.

Important: some BIOS/firmware stacks may silently train a marginal link at a lower width or generation to preserve stability. A standard monitor may still report the link as “up” or “healthy,” while effective bandwidth has been cut dramatically. In a GPU node, Gen5 x16 becoming x8 is not a log curiosity; it is a capacity event.
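
Downtraining is also easy to detect programmatically: the kernel exposes negotiated and maximum link parameters per device in sysfs. A minimal sketch; the current_link_* and max_link_* attributes are standard, but some devices and virtual functions do not expose them.

#!/usr/bin/env python3
# detect_downtrain.py: compare negotiated vs. maximum PCIe link speed/width.
import os

SYSFS = "/sys/bus/pci/devices"

def read_attr(bdf, name):
    try:
        with open(os.path.join(SYSFS, bdf, name)) as f:
            return f.read().strip()
    except OSError:
        return None

def check_link(bdf):
    cur_speed = read_attr(bdf, "current_link_speed")   # e.g. "16.0 GT/s PCIe"
    max_speed = read_attr(bdf, "max_link_speed")
    cur_width = read_attr(bdf, "current_link_width")   # e.g. "16"
    max_width = read_attr(bdf, "max_link_width")
    if None in (cur_speed, max_speed, cur_width, max_width):
        return None
    downtrained = (cur_speed != max_speed) or (cur_width != max_width)
    return downtrained, cur_speed, max_speed, cur_width, max_width

if __name__ == "__main__":
    for bdf in sorted(os.listdir(SYSFS)):
        result = check_link(bdf)
        if result and result[0]:
            _, cs, ms, cw, mw = result
            print(f"{bdf}: downtrained, speed {cs} (max {ms}), width x{cw} (max x{mw})")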

4. Topology mapping

lspci -t

# Map device IDs to vendors/classes
lspci -nn

# Show kernel driver bound to each device
lspci -k -s 65:00.0
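
To correlate error counters with actual devices, the same sysfs tree can resolve each BDF to vendor/device IDs, PCI class, and the bound driver without shelling out to lspci. A minimal sketch; all attributes read here are standard sysfs entries, though not every device has a driver bound.

#!/usr/bin/env python3
# pci_topology_map.py: map each PCI BDF to vendor/device IDs, class, and driver.
import os

SYSFS = "/sys/bus/pci/devices"

def read(path):
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return ""

def describe(bdf):
    base = os.path.join(SYSFS, bdf)
    vendor = read(os.path.join(base, "vendor"))    # e.g. "0x10de" for NVIDIA
    device = read(os.path.join(base, "device"))
    pci_class = read(os.path.join(base, "class"))  # e.g. "0x030200" (3D controller)
    driver_link = os.path.join(base, "driver")
    driver = os.path.basename(os.readlink(driver_link)) if os.path.islink(driver_link) else "none"
    return vendor, device, pci_class, driver

if __name__ == "__main__":
    for bdf in sorted(os.listdir(SYSFS)):
        vendor, device, pci_class, driver = describe(bdf)
        print(f"{bdf} vendor={vendor} device={device} class={pci_class} driver={driver}")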

5. NVIDIA-specific checks

nvidia-smi -q | grep -i pcie -A 20
nvidia-smi dmon -s u

If GPU utilization is low during heavy inference load, do not assume scheduler inefficiency first. Check whether PCIe replay/link behavior is degrading the data path.
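
If NVML is available, the same checks can be scripted through the pynvml bindings. A minimal sketch, assuming the nvidia-ml-py package is installed and the driver exposes the PCIe replay and link queries; it flags GPUs whose negotiated link is below the device maximum.

#!/usr/bin/env python3
# gpu_pcie_check.py: PCIe replay counter and link state per GPU via NVML.
# Sketch only: assumes pynvml (pip install nvidia-ml-py) and a working driver.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        pci = pynvml.nvmlDeviceGetPciInfo(handle)
        bus_id = pci.busId.decode() if isinstance(pci.busId, bytes) else pci.busId
        replays = pynvml.nvmlDeviceGetPcieReplayCounter(handle)
        cur_gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(handle)
        max_gen = pynvml.nvmlDeviceGetMaxPcieLinkGeneration(handle)
        cur_width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)
        max_width = pynvml.nvmlDeviceGetMaxPcieLinkWidth(handle)
        flag = " DOWNTRAINED" if (cur_gen, cur_width) != (max_gen, max_width) else ""
        print(f"GPU{i} {bus_id}: replays={replays} "
              f"link=Gen{cur_gen} x{cur_width} (max Gen{max_gen} x{max_width}){flag}")
finally:
    pynvml.nvmlShutdown()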

Temporary Debug Mode: More Verbose PCIe/AER Logging

For a lab node, you can temporarily increase logging and enable dynamic debug for PCIe/AER paths. Do this carefully; correctable errors can be high volume.

# Increase kernel console verbosity
sudo sh -c 'echo 8 > /proc/sys/kernel/printk'

# Check whether dynamic debug exists
sudo cat /sys/kernel/debug/dynamic_debug/control | head

# Enable debug prints for PCIe port and AER modules where available
sudo sh -c "echo 'module pcieport +p' > /sys/kernel/debug/dynamic_debug/control"
sudo sh -c "echo 'module aer +p' > /sys/kernel/debug/dynamic_debug/control"

# Watch logs
sudo dmesg -wT | grep -Ei 'aer|pcie|corrected|uncorrected'

Production caution: do not leave noisy printk-style debug enabled across a fleet. Prefer tracepoints or counters for continuous observability.

A Simple Userspace Collector

A fast first step is to sample the sysfs AER counters on an interval and derive a per-BDF error rate from the deltas. Rate matters more than raw cumulative counters.

#!/usr/bin/env bash
# pcie_aer_sample.sh
# Sample PCIe AER counters from sysfs.

while true; do
  ts=$(date +%s)
  for dev in /sys/bus/pci/devices/*; do
    bdf=$(basename "$dev")
    for f in aer_dev_correctable aer_dev_nonfatal aer_dev_fatal; do
      if [ -f "$dev/$f" ]; then
        # The sysfs AER files are multi-line ("ErrName count" rows plus a
        # TOTAL_* summary); take the TOTAL count so each CSV row stays one line.
        val=$(awk '/^TOTAL/ {print $2}' "$dev/$f" 2>/dev/null)
        echo "${ts},${bdf},${f},${val:-0}"
      fi
    done
  done
  sleep 5
done

From there, join BDFs to GPUs/NICs/NVMe devices using lspci -nn, nvidia-smi topo -m, and driver bindings.
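
To turn the cumulative counters into the per-BDF rates that actually matter, a second pass can take two snapshots and report the deltas. A minimal sketch, independent of the shell collector above; the 60-second window and the output format are arbitrary choices, and the TOTAL_* parsing assumes the usual sysfs AER file layout.

#!/usr/bin/env python3
# pcie_aer_rate.py: turn cumulative sysfs AER counters into per-BDF rates.
# Sketch only: two snapshots over a fixed window; window length is arbitrary.
import glob
import os
import time

AER_FILES = ("aer_dev_correctable", "aer_dev_nonfatal", "aer_dev_fatal")
WINDOW_S = 60

def read_total(path):
    # The sysfs AER files hold "ErrName count" rows plus a TOTAL_* summary line;
    # prefer the TOTAL line and fall back to summing the individual rows.
    total, rows = None, 0
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2 and parts[1].isdigit():
                if parts[0].startswith("TOTAL"):
                    total = int(parts[1])
                else:
                    rows += int(parts[1])
    return total if total is not None else rows

def snapshot():
    counts = {}
    for dev in glob.glob("/sys/bus/pci/devices/*"):
        bdf = os.path.basename(dev)
        for name in AER_FILES:
            path = os.path.join(dev, name)
            if os.path.isfile(path):
                counts[(bdf, name)] = read_total(path)
    return counts

if __name__ == "__main__":
    before = snapshot()
    time.sleep(WINDOW_S)
    after = snapshot()
    for key in sorted(after):
        delta = after[key] - before.get(key, after[key])
        if delta > 0:
            bdf, name = key
            print(f"{bdf} {name}: {delta * 60 / WINDOW_S:.1f} errors/min")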

Kernel Patch 1: Ratelimited AER Logging

The easiest kernel change is to add richer ratelimited logging in the AER error print path. This is useful for development and lab validation.

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index XXXXXXX..YYYYYYY 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -xxx,6 +xxx,26 @@ static void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 {
        int layer, agent;
 
+       /*
+        * Extra PCIe observability for AI/inference nodes.
+        *
+        * Correctable errors are often hardware-recovered, but not free.
+        * They may correlate with replay, link jitter, throughput loss,
+        * and eventual node failure. Keep this ratelimited because high
+        * error-rate systems can produce large volumes of events.
+        */
+       if (info->severity == AER_CORRECTABLE) {
+               pci_info_ratelimited(dev,
+                       "AER_DEBUG corrected: domain=%04x bus=%02x devfn=%02x "
+                       "vendor=%04x device=%04x status=0x%08x mask=0x%08x "
+                       "first_error=%d\n",
+                       pci_domain_nr(dev->bus), dev->bus->number, dev->devfn,
+                       dev->vendor, dev->device,
+                       info->status, info->mask, info->first_error);
+       }
+
        if (info->severity == AER_CORRECTABLE)
                pci_warn(dev, "PCIe Bus Error: severity=Corrected, ...");

This is intentionally simple. It prints enough context to correlate a device BDF against GPU/NIC/NVMe topology and workload-level degradation.

Patch 1 is for lab validation. Even ratelimited logging can be dropped, sampled away, or missed by standard log aggregation during bursts. It also pays the cost of formatting strings in kernel context. For anything beyond a handful of nodes, Patch 2 is the preferred direction.

Kernel Patch 2: Tracepoint-Based AER Telemetry

For a production-grade path, a tracepoint is better than printk. It is structured, filterable, and easier to consume with trace-cmd, perf, eBPF, or a small collector. It avoids making observability depend on kernel string logs, which is exactly what you do not want when thousands of correctable events may arrive during a degraded interval.

Add a trace event header

diff --git a/include/trace/events/pci_aer_extra.h b/include/trace/events/pci_aer_extra.h
new file mode 100644
index 0000000..1111111
--- /dev/null
+++ b/include/trace/events/pci_aer_extra.h
@@ -0,0 +1,58 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM pci_aer_extra
+
+#if !defined(_TRACE_PCI_AER_EXTRA_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_PCI_AER_EXTRA_H
+
+#include <linux/tracepoint.h>
+#include <linux/pci.h>
+
+TRACE_EVENT(pci_aer_correctable_extra,
+       TP_PROTO(struct pci_dev *dev, u32 status, u32 mask, int first_error),
+
+       TP_ARGS(dev, status, mask, first_error),
+
+       TP_STRUCT__entry(
+               __field(u16, domain)
+               __field(u8, bus)
+               __field(u8, devfn)
+               __field(u16, vendor)
+               __field(u16, device)
+               __field(u32, status)
+               __field(u32, mask)
+               __field(int, first_error)
+       ),
+
+       TP_fast_assign(
+               __entry->domain = pci_domain_nr(dev->bus);
+               __entry->bus = dev->bus->number;
+               __entry->devfn = dev->devfn;
+               __entry->vendor = dev->vendor;
+               __entry->device = dev->device;
+               __entry->status = status;
+               __entry->mask = mask;
+               __entry->first_error = first_error;
+       ),
+
+       TP_printk("domain=%04x bus=%02x devfn=%02x vendor=%04x device=%04x "
+                 "status=0x%08x mask=0x%08x first_error=%d",
+                 __entry->domain, __entry->bus, __entry->devfn,
+                 __entry->vendor, __entry->device,
+                 __entry->status, __entry->mask, __entry->first_error)
+);
+
+#endif
+
+#include <trace/define_trace.h>

Call the tracepoint from AER

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index XXXXXXX..YYYYYYY 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -xx,6 +xx,9 @@
 #include <linux/interrupt.h>
 #include <linux/kfifo.h>
 #include <linux/slab.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/pci_aer_extra.h>
 
@@ -xxx,4 +xxx,8 @@ static void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 static void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 {
+       if (info->severity == AER_CORRECTABLE)
+               trace_pci_aer_correctable_extra(dev,
+                       info->status, info->mask, info->first_error);
+
        /* existing AER print/recovery logic continues here */
 }

Consume the tracepoint

# List events
sudo ls /sys/kernel/debug/tracing/events/pci_aer_extra/

# Enable event live
sudo sh -c 'echo 1 > /sys/kernel/debug/tracing/events/pci_aer_extra/pci_aer_correctable_extra/enable'
sudo cat /sys/kernel/debug/tracing/trace_pipe

# Or record with trace-cmd
sudo trace-cmd record -e pci_aer_extra:pci_aer_correctable_extra
sudo trace-cmd report
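
For continuous collection, a small userspace consumer can tail trace_pipe and aggregate events per device instead of dumping raw text. A minimal sketch against the event format defined above; the field regex follows the TP_printk string, and the 30-second aggregation window is arbitrary.

#!/usr/bin/env python3
# aer_trace_consumer.py: tail trace_pipe and count correctable events per device.
# Sketch only: assumes the pci_aer_extra tracepoint above is built in and enabled.
import re
import time
from collections import Counter

TRACE_PIPE = "/sys/kernel/debug/tracing/trace_pipe"
FIELDS = re.compile(r"domain=(\w+) bus=(\w+) devfn=(\w+)")
REPORT_EVERY_S = 30  # aggregation window; arbitrary choice

def main():
    counts = Counter()
    last_report = time.monotonic()
    with open(TRACE_PIPE) as pipe:  # blocks until events arrive; needs root
        for line in pipe:
            if "pci_aer_correctable_extra" not in line:
                continue
            m = FIELDS.search(line)
            if m:
                domain, bus, devfn = m.groups()
                counts[f"{domain}:{bus}.{devfn}"] += 1
            now = time.monotonic()
            # Reporting only runs when new events arrive; acceptable for a sketch.
            if now - last_report >= REPORT_EVERY_S:
                for dev, n in counts.most_common():
                    print(f"{dev}: {n} correctable events in last {REPORT_EVERY_S}s")
                counts.clear()
                last_report = now

if __name__ == "__main__":
    main()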

From Counters to a PCIe Health Score

The real product is not the kernel patch. The product is the control loop.

PCIe AER counters / tracepoints
        ↓
per-BDF error rate
        ↓
BDF → GPU / NIC / NVMe / root-port mapping
        ↓
node-level PCIe health score
        ↓
scheduler avoids degraded nodes or lowers their effective capacity

Signal                        | Interpretation                   | Action
Correctable error rate rising | Early degradation / noisy link   | Watch, correlate with tokens/sec
Nonfatal errors increasing    | Potential recovery events        | Drain or quarantine node
Link downtraining             | Reduced effective PCIe capacity  | Mark degraded immediately
Root-port errors              | Shared fabric instability        | Prefer node-level quarantine
Endpoint-only burst           | Device-specific issue            | Map to GPU/NIC/NVMe and route around

Example scoring sketch

def pcie_health_score(correctable_errors_per_minute, replay_or_retrain_events_per_minute,
                      link_downtrained, nonfatal_errors_recent, root_port_errors_sustained):
    # Weights are illustrative starting points, not calibrated constants.
    score = 100
    score -= min(40, correctable_errors_per_minute * 0.5)
    score -= min(25, replay_or_retrain_events_per_minute * 5)
    score -= 30 if link_downtrained else 0
    score -= 35 if nonfatal_errors_recent else 0
    score -= 50 if root_port_errors_sustained else 0
    return score

health_score = pcie_health_score(correctable_errors_per_minute, replay_or_retrain_events_per_minute,
                                 link_downtrained, nonfatal_errors_recent, root_port_errors_sustained)
if health_score < 60:
    mark_node_degraded()
if health_score < 40:
    drain_node()
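
The "Prometheus-ready" part is straightforward: expose the per-device counters (and eventually the health score) as scrapeable metrics and let the monitoring side compute rates and alerts. A minimal sketch using the prometheus_client library; the metric name and scrape port are arbitrary choices, and it reads the same sysfs files as the collector above.

#!/usr/bin/env python3
# pcie_aer_exporter.py: expose per-BDF AER totals as Prometheus gauges.
# Sketch only: Prometheus can derive rates server-side from these raw totals;
# metric name and port are arbitrary.
import glob
import os
import time
from prometheus_client import Gauge, start_http_server

AER_TOTAL = Gauge("pcie_aer_errors_total_current",
                  "Cumulative AER error count read from sysfs",
                  ["bdf", "severity"])

def read_total(path):
    # sysfs AER files contain "ErrName count" rows plus a TOTAL_* summary line.
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2 and parts[0].startswith("TOTAL"):
                return int(parts[1])
    return 0

if __name__ == "__main__":
    start_http_server(9101)  # scrape port; arbitrary choice
    while True:
        for dev in glob.glob("/sys/bus/pci/devices/*"):
            bdf = os.path.basename(dev)
            for sev in ("correctable", "nonfatal", "fatal"):
                path = os.path.join(dev, f"aer_dev_{sev}")
                if os.path.isfile(path):
                    AER_TOTAL.labels(bdf=bdf, severity=sev).set(read_total(path))
        time.sleep(15)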

The Economic Framing

The point is not that every PCIe error matters. The point is that sustained PCIe error activity changes effective GPU inventory.

A fleet may nominally have 10,000 GPUs. But if a subset of nodes is silently degraded, the usable capacity is lower. That means higher queueing, worse latency, more overprovisioning, and lower return on deployed accelerators.

The most painful version is tail latency. In multi-GPU inference, one weak PCIe path can behave like a straggler. If a request depends on a collective operation, a GPU↔NIC DMA path, or a shared switch path, the slow participant can hold back the whole batch. The average may look acceptable while P99 gets ugly, tenants see inconsistent service, and the scheduler keeps feeding work to a node that is technically alive but economically impaired.

Better mental model: capacity is not “number of GPUs.” Capacity is “number of healthy, schedulable, data-movement-clean GPUs.”
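
The arithmetic is simple but worth making explicit. A minimal sketch; the 10,000-GPU fleet and the 10–30% loss range come from this post, while the degraded-node fraction is an assumed, illustrative number.

# effective_capacity.py: back-of-envelope effective GPU capacity.
# Illustrative inputs only; degraded_fraction is an assumption, not a measurement.
nominal_gpus = 10_000
degraded_fraction = 0.08          # assumed share of silently degraded nodes
throughput_loss_low, throughput_loss_high = 0.10, 0.30  # range cited above

for loss in (throughput_loss_low, throughput_loss_high):
    effective = nominal_gpus * (1 - degraded_fraction * loss)
    print(f"loss={loss:.0%}: effective capacity ~ {effective:,.0f} GPU-equivalents")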

Closing: Is the Node Alive, or Is It Efficient?

Infrastructure teams already ask whether a node is alive. AI inference economics requires a stricter question:

Is the node still efficient?

PCIe errors are one of the cleanest low-level signals for answering that question. Treating them as mere hardware noise misses the bigger systems story: degraded data movement is degraded compute.

Appendix: Command Cheat Sheet

# Logs
dmesg -T | grep -Ei 'aer|pcie|corrected|uncorrected'

# PCI topology
lspci -t
lspci -nn
lspci -k -s <BDF>

# Deep PCIe capability dump
lspci -vvv -s <BDF>

# Link speed/width scan
lspci -vv | grep -i 'LnkSta'

# sysfs AER counters
cat /sys/bus/pci/devices/<BDF>/aer_dev_correctable
cat /sys/bus/pci/devices/<BDF>/aer_dev_nonfatal
cat /sys/bus/pci/devices/<BDF>/aer_dev_fatal

# Dynamic debug, lab only
sudo sh -c 'echo 8 > /proc/sys/kernel/printk'
sudo sh -c "echo 'module pcieport +p' > /sys/kernel/debug/dynamic_debug/control"
sudo sh -c "echo 'module aer +p' > /sys/kernel/debug/dynamic_debug/control"

# NVIDIA checks
nvidia-smi -q | grep -i pcie -A 20
nvidia-smi topo -m
nvidia-smi dmon -s u

# Tracepoint consumption after kernel patch
sudo trace-cmd record -e pci_aer_extra:pci_aer_correctable_extra
sudo trace-cmd report