In modern inference systems, the expensive thing is not only compute. It is reliable movement of data: CPU↔GPU transfers, GPU↔NIC DMA, GPU↔NVMe paths, and sometimes PCIe switch fabrics sitting under multiple accelerators. When that layer becomes noisy, a node can remain alive while becoming economically inefficient.
PCIe Is Not Just “I/O” Anymore
At a superficial level, PCI Express connects I/O devices: GPUs, NICs, NVMe drives, accelerator cards, and chipset devices. In an AI server, however, PCIe is better understood as part of the data movement fabric.
That distinction matters. If inference is memory- and movement-bound, then a degraded PCIe path is not a peripheral issue. It directly reduces the useful capacity of the node.
Correctable Errors Are the Dangerous Ones
The most interesting PCIe errors are often correctable. They are handled by hardware and firmware. They do not panic the kernel. They do not necessarily kill a device. The box stays up.
But “corrected” does not mean “free.” Correctable errors can cause replay, retransmission, flow-control stalls, link retraining, and short bursts of backpressure. That becomes visible as lower effective throughput, higher queue depth, and wider latency tails.
Why Gen5 Makes This More Real
This matters even more in PCIe Gen5-era systems, common in H100-class and newer accelerator platforms. Higher signaling rates reduce margin: signal-integrity issues that were tolerable at Gen4 can surface as retries, link instability, or negotiated-speed compromises at Gen5.
The causes are often boring but expensive: marginal risers, cable seating, thermal expansion after long sustained load, vibration in dense racks, airflow-induced mechanical stress, aging connectors, or a switch/root-port path that is only barely within margin. The workload may be “AI inference,” but the failure mode can be physical.
What Linux Already Gives You
Linux already exposes useful PCIe error information through AER, sysfs, kernel logs, and PCI config-space tooling. The problem is not absence of signal. The problem is that the signal is scattered, noisy, and rarely correlated with inference performance.
1. Kernel logs
dmesg -T | grep -i aer
dmesg -T | grep -i pcie
dmesg -T | grep -Ei 'aer|pcie|corrected|uncorrected'
2. Per-device sysfs counters
cd /sys/bus/pci/devices/
ls
# Example BDF
cd /sys/bus/pci/devices/0000:65:00.0/
ls
# If exposed by the kernel/platform:
cat aer_dev_correctable
cat aer_dev_nonfatal
cat aer_dev_fatal
3. PCIe link inspection
lspci -vvv -s 65:00.0
# Quick scan for negotiated link speed/width
lspci -vv | grep -i "LnkSta"
Watch for downtraining: a device that should be Gen5 x16 but is running Gen4 x8 is not a minor issue. That directly changes the data movement envelope of the node.
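To make the cost of downtraining concrete, the negotiated speed and width from LnkSta can be converted into a rough theoretical bandwidth. The per-lane numbers below are the standard PCIe rates after encoding overhead; the helper names are invented here for illustration, not from any library.

```python
# Rough one-direction PCIe bandwidth from negotiated link state.
# Per-lane rates in GB/s after encoding overhead (8b/10b for Gen1/2,
# 128b/130b for Gen3+).
LANE_GBPS = {
    "2.5GT/s": 0.25,   # Gen1
    "5GT/s":   0.50,   # Gen2
    "8GT/s":   0.985,  # Gen3
    "16GT/s":  1.969,  # Gen4
    "32GT/s":  3.938,  # Gen5
}

def link_bandwidth_gbps(speed: str, width: int) -> float:
    """Theoretical one-direction bandwidth for a negotiated link."""
    return LANE_GBPS[speed] * width

def downtrain_factor(expected, actual):
    """Fraction of the expected data-movement envelope that remains.

    expected/actual are (speed, width) tuples as reported in LnkCap/LnkSta."""
    return link_bandwidth_gbps(*actual) / link_bandwidth_gbps(*expected)
```

A device expected at Gen5 x16 (about 63 GB/s per direction) that negotiates Gen4 x8 retains only a quarter of its envelope, which is exactly the kind of silent capacity loss this section is about.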
4. Topology mapping
lspci -t
# Map device IDs to vendors/classes
lspci -nn
# Show kernel driver bound to each device
lspci -k -s 65:00.0
5. NVIDIA-specific checks
nvidia-smi -q | grep -i pcie -A 20
nvidia-smi dmon -s u
If GPU utilization is low during heavy inference load, do not assume scheduler inefficiency first. Check whether PCIe replay/link behavior is degrading the data path.
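As a sketch of automating that check, the PCIe replay counter can be scraped from `nvidia-smi -q` output. The `Replay Number` field name and the sample fragment below are assumptions based on typical driver output; the exact layout varies by driver version, so treat the regex as a starting point.

```python
import re

# Illustrative fragment of the PCIe section of `nvidia-smi -q` output.
# The field name "Replay Number" is an assumption; verify against your
# driver version before relying on it.
SAMPLE = """\
    PCI
        Bus                               : 0x65
        Replay Number                     : 42
"""

def parse_replay_count(text):
    """Extract the PCIe replay counter from nvidia-smi -q text, if present."""
    m = re.search(r"Replay Number\s*:\s*(\d+)", text)
    return int(m.group(1)) if m else None
```

A rising replay count alongside low `dmon` utilization points at the data path rather than the scheduler.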
Temporary Debug Mode: More Verbose PCIe/AER Logging
For a lab node, you can temporarily increase logging and enable dynamic debug for PCIe/AER paths. Do this carefully; correctable errors can be high volume.
# Increase kernel console verbosity
sudo sh -c 'echo 8 > /proc/sys/kernel/printk'
# Check whether dynamic debug exists
sudo cat /sys/kernel/debug/dynamic_debug/control | head
# Enable debug prints for PCIe port and AER modules where available
sudo sh -c "echo 'module pcieport +p' > /sys/kernel/debug/dynamic_debug/control"
sudo sh -c "echo 'module aer +p' > /sys/kernel/debug/dynamic_debug/control"
# Watch logs
sudo dmesg -wT | grep -Ei 'aer|pcie|corrected|uncorrected'
A Simple Userspace Collector
A fast first step is to sample AER counters and emit a per-BDF rate. Rate matters more than raw cumulative counters.
#!/usr/bin/env bash
# pcie_aer_sample.sh
# Sample PCIe AER counters from sysfs every 5 seconds and emit CSV:
#   timestamp,BDF,counter_file,total
# The aer_dev_* files list per-error-type counts plus a TOTAL_ERR_* line,
# so extract the total rather than cat-ing the whole multi-line file.
while true; do
    ts=$(date +%s)
    for dev in /sys/bus/pci/devices/*; do
        bdf=$(basename "$dev")
        for f in aer_dev_correctable aer_dev_nonfatal aer_dev_fatal; do
            if [ -f "$dev/$f" ]; then
                val=$(awk '/^TOTAL_ERR/ {print $2}' "$dev/$f" 2>/dev/null)
                echo "${ts},${bdf},${f},${val}"
            fi
        done
    done
    sleep 5
done
From there, join BDFs to GPUs/NICs/NVMe devices using lspci -nn, nvidia-smi topo -m, and driver bindings.
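Since rate matters more than cumulative counts, a small post-processing step can turn the collector's CSV samples into per-minute rates. The helper name below is invented for illustration, and it assumes the counters only increase (counter resets across reboots are not handled).

```python
def per_minute_rates(samples):
    """Turn cumulative AER counter samples into per-minute rates.

    samples: iterable of (ts_seconds, bdf, counter_name, cumulative_value)
    tuples, matching the CSV emitted by the sysfs collector.
    Returns {(bdf, counter_name): errors_per_minute} computed from the
    first and last sample of each series."""
    first, last = {}, {}
    for ts, bdf, name, val in samples:
        key = (bdf, name)
        if key not in first:
            first[key] = (ts, val)
        last[key] = (ts, val)
    rates = {}
    for key, (t0, v0) in first.items():
        t1, v1 = last[key]
        if t1 > t0:
            rates[key] = (v1 - v0) * 60.0 / (t1 - t0)
    return rates
```

The per-(BDF, counter) rates are what the health score downstream consumes.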
Kernel Patch 1: Ratelimited AER Logging
The easiest kernel change is to add richer ratelimited logging in the AER error print path. This is useful for development and lab validation.
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index XXXXXXX..YYYYYYY 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -xxx,6 +xxx,26 @@ static void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
{
int layer, agent;
+ /*
+ * Extra PCIe observability for AI/inference nodes.
+ *
+ * Correctable errors are often hardware-recovered, but not free.
+ * They may correlate with replay, link jitter, throughput loss,
+ * and eventual node failure. Keep this ratelimited because high
+ * error-rate systems can produce large volumes of events.
+ */
+ if (info->severity == AER_CORRECTABLE) {
+ pci_info_ratelimited(dev,
+ "AER_DEBUG corrected: domain=%04x bus=%02x devfn=%02x "
+ "vendor=%04x device=%04x status=0x%08x mask=0x%08x "
+ "first_error=%d\n",
+ pci_domain_nr(dev->bus), dev->bus->number, dev->devfn,
+ dev->vendor, dev->device,
+ info->status, info->mask, info->first_error);
+ }
+
if (info->severity == AER_CORRECTABLE)
pci_warn(dev, "PCIe Bus Error: severity=Corrected, ...");
This is intentionally simple. It prints enough context to correlate a device BDF against GPU/NIC/NVMe topology and workload-level degradation.
Kernel Patch 2: Tracepoint-Based AER Telemetry
For a production-grade path, a tracepoint is better than printk. It is structured, filterable, and easier to consume with trace-cmd, perf, eBPF, or a small collector. It avoids making observability depend on kernel string logs, which is exactly what you do not want when thousands of correctable events may arrive during a degraded interval.
Add a trace event header
diff --git a/include/trace/events/pci_aer_extra.h b/include/trace/events/pci_aer_extra.h
new file mode 100644
index 0000000..1111111
--- /dev/null
+++ b/include/trace/events/pci_aer_extra.h
@@ -0,0 +1,58 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM pci_aer_extra
+
+#if !defined(_TRACE_PCI_AER_EXTRA_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_PCI_AER_EXTRA_H
+
+#include <linux/tracepoint.h>
+#include <linux/pci.h>
+
+TRACE_EVENT(pci_aer_correctable_extra,
+ TP_PROTO(struct pci_dev *dev, u32 status, u32 mask, int first_error),
+
+ TP_ARGS(dev, status, mask, first_error),
+
+ TP_STRUCT__entry(
+ __field(u16, domain)
+ __field(u8, bus)
+ __field(u8, devfn)
+ __field(u16, vendor)
+ __field(u16, device)
+ __field(u32, status)
+ __field(u32, mask)
+ __field(int, first_error)
+ ),
+
+ TP_fast_assign(
+ __entry->domain = pci_domain_nr(dev->bus);
+ __entry->bus = dev->bus->number;
+ __entry->devfn = dev->devfn;
+ __entry->vendor = dev->vendor;
+ __entry->device = dev->device;
+ __entry->status = status;
+ __entry->mask = mask;
+ __entry->first_error = first_error;
+ ),
+
+ TP_printk("domain=%04x bus=%02x devfn=%02x vendor=%04x device=%04x "
+ "status=0x%08x mask=0x%08x first_error=%d",
+ __entry->domain, __entry->bus, __entry->devfn,
+ __entry->vendor, __entry->device,
+ __entry->status, __entry->mask, __entry->first_error)
+);
+
+#endif
+
+#include <trace/define_trace.h>
Call the tracepoint from AER
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index XXXXXXX..YYYYYYY 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -xx,6 +xx,9 @@
#include <linux/interrupt.h>
#include <linux/kfifo.h>
#include <linux/slab.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/pci_aer_extra.h>
@@ -xxx,2 +xxx,6 @@ static void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 {
+	if (info->severity == AER_CORRECTABLE)
+		trace_pci_aer_correctable_extra(dev,
+			info->status, info->mask, info->first_error);
+
 	/* existing AER print/recovery logic continues here */
 }
Consume the tracepoint
# List events
sudo ls /sys/kernel/debug/tracing/events/pci_aer_extra/
# Enable event live
sudo sh -c 'echo 1 > /sys/kernel/debug/tracing/events/pci_aer_extra/pci_aer_correctable_extra/enable'
sudo cat /sys/kernel/debug/tracing/trace_pipe
# Or record with trace-cmd
sudo trace-cmd record -e pci_aer_extra:pci_aer_correctable_extra
sudo trace-cmd report
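A minimal userspace consumer can reconstruct the BDF from each trace line. The regex below is keyed to the TP_printk format defined in the trace event header, so it would need adjusting if that format changes; the function name is illustrative.

```python
import re

# Matches the TP_printk format of the pci_aer_correctable_extra event:
#   domain=%04x bus=%02x devfn=%02x ...
EVENT_RE = re.compile(
    r"pci_aer_correctable_extra:\s*"
    r"domain=([0-9a-f]{4})\s+bus=([0-9a-f]{2})\s+devfn=([0-9a-f]{2})"
)

def bdf_from_trace_line(line):
    """Return the BDF string for one trace_pipe line, or None if no match."""
    m = EVENT_RE.search(line)
    if not m:
        return None
    domain, bus, devfn = (int(x, 16) for x in m.groups())
    # devfn packs device number (upper 5 bits) and function (lower 3 bits).
    return f"{domain:04x}:{bus:02x}:{devfn >> 3:02x}.{devfn & 7}"
```

Feeding `trace_pipe` through this per line, then counting events per BDF per interval, yields the same per-BDF rate signal as the sysfs collector but with event-level timing.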
From Counters to a PCIe Health Score
The real product is not the kernel patch. The product is the control loop.
PCIe AER counters / tracepoints
↓
per-BDF error rate
↓
BDF → GPU / NIC / NVMe / root-port mapping
↓
node-level PCIe health score
↓
scheduler avoids degraded nodes or lowers their effective capacity
| Signal | Interpretation | Action |
|---|---|---|
| Correctable error rate rising | Early degradation / noisy link | Watch, correlate with tokens/sec |
| Nonfatal errors increasing | Potential recovery events | Drain or quarantine node |
| Link downtraining | Reduced effective PCIe capacity | Mark degraded immediately |
| Root-port errors | Shared fabric instability | Prefer node-level quarantine |
| Endpoint-only burst | Device-specific issue | Map to GPU/NIC/NVMe and route around |
Example scoring sketch
health_score = 100
health_score -= min(40, correctable_errors_per_minute * 0.5)
health_score -= min(25, replay_or_retrain_events_per_minute * 5)
health_score -= 30 if link_downtrained else 0
health_score -= 35 if nonfatal_errors_recent else 0
health_score -= 50 if root_port_errors_sustained else 0
if health_score < 60:
    mark_node_degraded()
if health_score < 40:
    drain_node()
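Wrapped as a self-contained function, the same sketch becomes testable. The weights and thresholds are the illustrative numbers from above, not tuned production values.

```python
def pcie_health_score(correctable_per_min=0.0,
                      replay_or_retrain_per_min=0.0,
                      link_downtrained=False,
                      nonfatal_recent=False,
                      root_port_sustained=False):
    """Illustrative PCIe health score; weights are examples, not tuned values."""
    score = 100.0
    score -= min(40, correctable_per_min * 0.5)
    score -= min(25, replay_or_retrain_per_min * 5)
    if link_downtrained:
        score -= 30
    if nonfatal_recent:
        score -= 35
    if root_port_sustained:
        score -= 50
    return max(score, 0.0)

def node_action(score):
    """Map a score to the scheduler decision from the sketch above."""
    if score < 40:
        return "drain"
    if score < 60:
        return "degraded"
    return "healthy"
```

The key property is monotonicity: every degradation signal only lowers the score, so the scheduler's view of a node can only get more conservative as evidence accumulates.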
The Economic Framing
The point is not that every PCIe error matters. The point is that sustained PCIe error activity changes effective GPU inventory.
A fleet may nominally have 10,000 GPUs. But if a subset of nodes is silently degraded, the usable capacity is lower. That means higher queueing, worse latency, more overprovisioning, and lower return on deployed accelerators.
The most painful version is tail latency. In multi-GPU inference, one weak PCIe path can behave like a straggler. If a request depends on a collective operation, a GPU↔NIC DMA path, or a shared switch path, the slow participant can hold back the whole batch. The average may look acceptable while P99 gets ugly, tenants see inconsistent service, and the scheduler keeps feeding work to a node that is technically alive but economically impaired.
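The straggler effect is easy to see numerically: a collective-dependent batch finishes when its slowest participant does, so one degraded path dominates tail latency even when the mean barely moves. A toy sketch:

```python
def batch_latency_ms(per_gpu_ms):
    """A batch gated on a collective finishes with its slowest participant."""
    return max(per_gpu_ms)

# Seven healthy GPUs at 10 ms; one GPU behind a degraded PCIe path at 25 ms.
with_straggler = [10.0] * 7 + [25.0]

mean_ms = sum(with_straggler) / len(with_straggler)   # 11.875 ms: looks fine
batch_ms = batch_latency_ms(with_straggler)           # 25.0 ms: the real SLO hit
```

The per-GPU average moves less than 20 percent while the batch latency, which is what the tenant sees, jumps 2.5x.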
Closing: Is the Node Alive, or Is It Efficient?
Infrastructure teams already ask whether a node is alive. AI inference economics requires a stricter question: is the node still efficient?
PCIe errors are one of the cleanest low-level signals for answering that question. Treating them as mere hardware noise misses the bigger systems story: degraded data movement is degraded compute.
Appendix: Command Cheat Sheet
# Logs
dmesg -T | grep -Ei 'aer|pcie|corrected|uncorrected'
# PCI topology
lspci -t
lspci -nn
lspci -k -s <BDF>
# Deep PCIe capability dump
lspci -vvv -s <BDF>
# Link speed/width scan
lspci -vv | grep -i 'LnkSta'
# sysfs AER counters
cat /sys/bus/pci/devices/<BDF>/aer_dev_correctable
cat /sys/bus/pci/devices/<BDF>/aer_dev_nonfatal
cat /sys/bus/pci/devices/<BDF>/aer_dev_fatal
# Dynamic debug, lab only
sudo sh -c 'echo 8 > /proc/sys/kernel/printk'
sudo sh -c "echo 'module pcieport +p' > /sys/kernel/debug/dynamic_debug/control"
sudo sh -c "echo 'module aer +p' > /sys/kernel/debug/dynamic_debug/control"
# NVIDIA checks
nvidia-smi -q | grep -i pcie -A 20
nvidia-smi topo -m
nvidia-smi dmon -s u
# Tracepoint consumption after kernel patch
sudo trace-cmd record -e pci_aer_extra:pci_aer_correctable_extra
sudo trace-cmd report