Linux Internals — Series II

Beyond Memory Counters

The hidden subsystems that fail first under RDMA and high-performance networking pressure — PCIe health, IOMMU stalls, NIC transport stats, NUMA mismatches, and the full picture you need before the kernel OOM killer fires.

📅 2025 ⏱ ~25 min read 🔌 RDMA / HPC / Infra 🔧 Advanced

Introduction § 00

The previous installment in this series covered the most visible Linux failure modes — kernel panics, OOM kills, and lockups. Those are the loud catastrophic events: the kernel screaming, the process dying, the watchdog firing. But a cluster node under RDMA-intensive workloads can degrade badly and silently, long before any of those events trigger.

Consider a host running high-bandwidth GPU-to-GPU transfers over InfiniBand or RoCEv2. The kernel OOM killer never fires. No kernel panic in dmesg. Throughput simply collapses. Latency spikes. Jobs stall. Why? Because the pathology isn't in main memory — it's in a cascade: a PCIe AER correctable error storm causing a link retrain, which briefly pulls down DMA throughput, which backs up IOMMU mapping queues, which causes NIC CQ overflows, which triggers RNR NAKs, which causes exponential backoff in the RDMA transport. Each subsystem looks healthy in isolation. The cascade is only visible when you look at all of them together.

The most dangerous failures are the ones that don't look like failures — they look like performance.

This guide is the expanded observability map for those situations. It covers eleven subsystems, from the PCIe bus up through the software networking stack, with concrete commands, real log examples, and the causal chains that connect them.

Architecture Map — How These Subsystems Connect § 01

Before diving into each subsystem, it helps to see the full stack and where failures propagate:

Host subsystem stack — failure propagation paths

  ┌─────────────────────────────────────────────────────────────────────┐
  │  GPU / Accelerator                                                  │
  │  ├─ Device memory (VRAM, BAR1)  ├─ NVLink / PCIe throughput        │
  │  └─ Transfer stalls / XID errors                                    │
  └────────────────────────┬────────────────────────────────────────────┘
                           │ PCIe slot / Gen4/Gen5 lanes
  ┌────────────────────────▼────────────────────────────────────────────┐
  │  PCIe / AER layer                                                   │
  │  ├─ Corrected errors (RxErr, replay timeouts) → link retrain risk   │
  │  ├─ Uncorrected fatal errors → device reset / panic                 │
  │  └─ Link retrains → DMA stalls, NIC blip, driver recovery           │
  └──────────────┬──────────────────────────────┬───────────────────────┘
                 │                              │
  ┌──────────────▼───────────┐   ┌─────────────▼───────────────────────┐
  │  IOMMU / DMA layer       │   │  NIC (mlx5 / Mellanox / RoCEv2)     │
  │  ├─ DMAR faults          │   │  ├─ ethtool -S: rx/tx drops          │
  │  ├─ mapping invalidation │   │  ├─ RDMA: RNR NAKs, CQ overflows     │
  │  └─ VFIO passthrough     │   │  ├─ mlx5 health reporter             │
  └──────────────────────────┘   │  └─ firmware reset syndrome          │
                                 └─────────────┬───────────────────────┘
                                               │
  ┌────────────────────────────────────────────▼───────────────────────┐
  │  Linux Interrupt / Softirq layer                                   │
  │  ├─ /proc/interrupts: mlx5 IRQ affinity, storm detection           │
  │  ├─ /proc/softirqs: NET_RX, NET_TX, SCHED, RCU                     │
  │  └─ /proc/net/softnet_stat: processed, drops, time_squeeze         │
  └────────────────────────────────────────────┬───────────────────────┘
                                               │
  ┌────────────────────────────────────────────▼───────────────────────┐
  │  CPU Scheduler / Pressure                                          │
  │  ├─ /proc/pressure/{cpu,memory,io}: PSI metrics                    │
  │  ├─ Run queue depth, load average, softirq %                       │
  │  └─ NUMA: local vs remote memory access, CPU/NIC/GPU affinity       │
  └────────────────────────────────────────────┬───────────────────────┘
                                               │
  ┌────────────────────────────────────────────▼───────────────────────┐
  │  cgroup / systemd-oomd                                             │
  │  ├─ MemoryHigh / MemoryMax events                                  │
  │  └─ systemd-oomd kills (before kernel OOM)                         │
  └────────────────────────────────────────────────────────────────────┘

Key insight: faults originate in the hardware layers at the top of this diagram, while symptoms surface in the software layers further down. A PCIe link retrain (hardware layer) causes a DMA stall (kernel driver layer), which causes NIC queue backup (transport layer), which appears as CPU softirq pressure and high run queue depth (scheduler layer). By the time you notice the CPU pressure, the root cause is several layers removed from where you are looking.

PCIe / AER — Bus Health § 02

PCIe Advanced Error Reporting (AER) is the hardware error reporting mechanism found in most modern server-class PCIe devices and root ports (it is an optional extended capability, so confirm it with lspci -vv). It is one of the most underused diagnostic tools in high-performance Linux environments. AER errors arrive in dmesg and, when the kernel AER driver is enabled, are also exposed as per-device counters under /sys/bus/pci/devices/. They divide into three categories: corrected, uncorrected non-fatal, and uncorrected fatal.

Why PCIe Errors Matter for RDMA Workloads

RDMA NICs (Mellanox/NVIDIA ConnectX, Intel Omni-Path) generate enormous PCIe traffic. A single ConnectX-6 in HDR InfiniBand mode can sustain 200 Gb/s, which is close to the roughly 252 Gb/s per-direction ceiling of a ×16 PCIe Gen4 slot (16 GT/s × 16 lanes with 128b/130b encoding, before TLP overhead). At those bandwidths, even occasional corrected errors (which the hardware transparently replays) can add enough latency to trigger RDMA timeout events or RNR NAKs at the transport layer.

Reading AER Events in dmesg

dmesg — PCIe AER corrected error

# Corrected error — hardware retried, but worth monitoring for frequency
[ 4521.003412] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:04:00.0
[ 4521.003428] 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 4521.003431]   device [15b3:101b] error status/mask=00000001/00002000
[ 4521.003434]     [ 0] RxErr           (First)

# Uncorrected Fatal — device will reset and driver will reload
[ 7823.441209] pcieport 0000:00:1c.0: AER: Uncorrected (Fatal) error received: 0000:04:00.0
[ 7823.441215] 0000:04:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer
[ 7823.441220]   device [15b3:101b] error status/mask=00040000/00000000
[ 7823.441223]     [18] MalTLP          (First)     ← Malformed TLP: hardware bug or signal integrity
[ 7823.441230] 0000:04:00.0: Device recovery failed
[ 7823.441235] mlx5_core 0000:04:00.0: Unrecoverable error: removing device

Link Retrains — The Hidden DMA Stall

A link retrain occurs when the PCIe physical layer renegotiates the link, typically after signal integrity problems, thermal events, or corrected error accumulation. During a retrain the link carries no traffic. For a NIC, this means all in-flight DMA operations stall until the link comes back. For RDMA, a long enough stall can push active QPs (Queue Pairs) past their transport timeouts and into error recovery.

dmesg — PCIe link retrain

[ 4521.445001] pcieport 0000:00:1c.0: PCIe Link retrained: speed=32GT/s width=x16
[ 4521.445088] mlx5_core 0000:04:00.0: communication channel timeout, will try PCIe reset
[ 4521.510023] mlx5_core 0000:04:00.0: device recovered after PCIe link retrain
# Gap of ~65ms — during which all DMA on this NIC was stalled
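
The stall window in a trace like this can be measured rather than eyeballed. The Python sketch below assumes the default dmesg prefix of the form [seconds.microseconds]; the function names are illustrative and not part of any existing tool.

```python
import re

# Matches the default dmesg prefix, e.g. "[ 4521.445001] ..."
_TS = re.compile(r"^\[\s*(\d+\.\d+)\]")


def dmesg_seconds(line: str) -> float:
    """Return the timestamp (seconds since boot) of one dmesg line."""
    m = _TS.match(line)
    if m is None:
        raise ValueError(f"no dmesg timestamp in: {line!r}")
    return float(m.group(1))


def gap_ms(first: str, last: str) -> float:
    """Milliseconds elapsed between two dmesg lines."""
    return (dmesg_seconds(last) - dmesg_seconds(first)) * 1000.0


if __name__ == "__main__":
    start = "[ 4521.445001] pcieport 0000:00:1c.0: PCIe Link retrained: speed=32GT/s width=x16"
    end = "[ 4521.510023] mlx5_core 0000:04:00.0: device recovered after PCIe link retrain"
    print(f"stall window: {gap_ms(start, end):.1f} ms")
```

Feeding it the first and last line of each retrain episode gives a per-event stall duration that can be compared against the QP timeout settings.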

Monitoring and Tooling

PCIe AER monitoring commands

# List non-zero AER counters exposed by the AER driver in sysfs
# (files: aer_dev_correctable, aer_dev_nonfatal, aer_dev_fatal; one "NAME count" pair per line)
grep -H -v ' 0$' /sys/bus/pci/devices/*/aer_dev_* 2>/dev/null

# Use rasdaemon for structured AER logging to SQLite
apt install rasdaemon
systemctl enable --now rasdaemon
ras-mc-ctl --summary        # summary of all hardware errors
ras-mc-ctl --errors          # full error table with timestamps

# Check link speed and width: LnkSta below LnkCap means the link trained down
lspci -vv -s 0000:04:00.0 | grep -E "LnkSta:|LnkCap:"
# LnkCap: Port #0, Speed 32GT/s, Width x16, ...
# LnkSta: Speed 16GT/s (downgraded), Width x8 (downgraded)  ← signal integrity issue

# Monitor link state changes in real time
watch -n1 'lspci -vv -s 0000:04:00.0 | grep LnkSta'

# Trigger AER error injection for testing (kernel must have CONFIG_PCIEAER_INJECT)
modprobe aer-inject
# Not recommended on production — causes device reset

⚠ thermal correlation

PCIe corrected error rates often spike during thermal events. If you see periodic AER corrected error bursts, cross-reference with sensors or IPMI thermal data. A ConnectX NIC running at 85°C or above is more likely to produce signal integrity errors. Always check whether AER error timestamps correlate with fan speed changes or compute workload peaks.
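
For periodic collection, the per-device AER counters can be read programmatically. The sketch below assumes the sysfs layout of the kernel AER driver (files named aer_dev_correctable, aer_dev_nonfatal and aer_dev_fatal, each holding "NAME COUNT" lines); the function names are hypothetical.

```python
from pathlib import Path


def parse_aer_stats(text: str) -> dict[str, int]:
    """Parse 'NAME COUNT' lines as found in aer_dev_correctable and friends."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counts[parts[0]] = int(parts[1])
    return counts


def nonzero_aer(sysfs_root: str = "/sys/bus/pci/devices") -> dict[str, dict[str, int]]:
    """Map 'BDF/file' to its non-zero counters for every device exposing AER stats."""
    result = {}
    for path in sorted(Path(sysfs_root).glob("*/aer_dev_*")):
        try:
            stats = parse_aer_stats(path.read_text())
        except OSError:
            continue  # device vanished or file unreadable
        hot = {name: value for name, value in stats.items() if value}
        if hot:
            result[f"{path.parent.name}/{path.name}"] = hot
    return result


if __name__ == "__main__":
    for source, counters in nonzero_aer().items():
        print(source, counters)
```

Sampling this on an interval and storing the deltas alongside temperature readings makes the thermal correlation described above straightforward to check.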

IOMMU / DMA Mapping Pressure § 03

The IOMMU (Input-Output Memory Management Unit) — Intel VT-d or AMD-Vi — sits between PCIe devices and physical memory. Every DMA operation by a NIC, GPU, or NVMe device goes through the IOMMU, which translates device-visible addresses (IOVAs) to physical addresses. Under sustained DMA load, IOMMU resources can become a bottleneck that's invisible from both the device and the CPU perspective.

IOMMU Failure Modes

DMAR faults occur when a device attempts DMA to an address that has no valid IOMMU mapping. This is usually a driver bug (a mapping freed before the DMA completes) but can also indicate memory corruption.

IOVA exhaustion occurs when the pool of available I/O virtual addresses runs out. It is most common on systems where many devices issue many small DMA operations simultaneously.

Invalidation stalls occur when IOTLB invalidations issued by the CPU to the IOMMU take longer than expected. This blocks new DMA mappings and can serialize DMA operations that would normally run in parallel.

dmesg — IOMMU/DMA fault messages

# DMAR fault — device tried to DMA to unmapped address
[ 8901.234567] DMAR: DRHD: handling fault status reg 2
[ 8901.234570] DMAR: [DMA Read] Request device [04:00.0] fault addr 0xffff800000000000
               IOMMU perfmon counter overflow  ← Too many DMAR events
               reason: Present bit in pasid table entry is clear

# IOMMU mapping failure under load
[ 9012.445801] mlx5_core 0000:04:00.0: DMA map failed for iova 0x100000000, size 4096
[ 9012.445810] mlx5_core 0000:04:00.0: IOVA space exhausted, consider increasing iommu_iova_rcache_size

# On AMD IOMMU:
[ 5432.112233] AMD-Vi: Event logged [IO_PAGE_FAULT device=04:00.0 domain=0x0019
               address=0x00000000beef1234 flags=0x0000]

IOMMU Tuning for High-DMA Workloads

IOMMU performance tuning

# Check if IOMMU is active and in what mode
dmesg | grep -i iommu | head -20
# DMAR: IOMMU enabled
# DMAR: Intel(R) Virtualization Technology for Directed I/O

# Check current IOMMU groups (each group shares IOMMU domain)
find /sys/kernel/iommu_groups -type l | sort -V | head -30

# For RDMA workloads without VMs: consider iommu=pt (passthrough mode)
# Devices driven by host kernel drivers get an identity mapping, removing translation overhead
# Add to kernel cmdline in GRUB:
# intel_iommu=on iommu=pt

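# Monitor IOMMU latency statistics (Intel only; needs debugfs and CONFIG_INTEL_IOMMU_DEBUGFS)
cat /sys/kernel/debug/iommu/intel/dmar_perf_latency

# Check each group's default domain type (Linux 5.11+):
# identity = passthrough, DMA = strict invalidation, DMA-FQ = lazy (flush-queue) invalidation
grep -H . /sys/kernel/iommu_groups/*/type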

# Force strict IOMMU (no lazy invalidation) — use for debugging only
# iommu.strict=1 in kernel cmdline

ℹ passthrough vs strict

For bare-metal HPC nodes that don't use VMs or VFIO, iommu=pt (passthrough) is the setting commonly recommended in Mellanox/NVIDIA tuning guidance for ConnectX deployments. It removes IOMMU translation overhead for devices owned by host kernel drivers. Devices bound to vfio-pci, including SR-IOV VFs passed through to VMs, still need and still get full IOMMU translation, so the IOMMU itself must stay enabled on such hosts. Disabling the IOMMU outright on a host that uses VFIO, or assuming that iommu=pt relaxes isolation for passed-through devices, is a common misconfiguration.
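
A fleet-wide audit of which mode each node was booted with can start from the kernel command line. This is a minimal sketch that only classifies what was requested on the command line (the compiled-in default may differ); the function name and the returned labels are assumptions made for illustration.

```python
def iommu_mode(cmdline: str) -> str:
    """Classify the IOMMU mode requested on a kernel command line."""
    params = cmdline.split()
    # An explicit "off" wins over everything else.
    if any(p in params for p in ("iommu=off", "intel_iommu=off", "amd_iommu=off")):
        return "disabled"
    # Identity mapping for host-owned devices.
    if "iommu=pt" in params or "iommu.passthrough=1" in params:
        return "passthrough"
    # Translated mode with synchronous or deferred IOTLB invalidation.
    if "iommu.strict=1" in params:
        return "translated-strict"
    if "iommu.strict=0" in params:
        return "translated-lazy"
    return "default"


if __name__ == "__main__":
    with open("/proc/cmdline") as fh:
        print(iommu_mode(fh.read()))
```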

VFIO / Virtualization Path § 04

VFIO (Virtual Function I/O) is the kernel framework that enables PCIe device passthrough to virtual machines, and also underlies userspace DMA in DPDK and SPDK. When a NIC or GPU is passed through to a VM via VFIO, the IOMMU enforces isolation between guest DMA and host memory — but every VM exit, balloon event, or guest memory map change creates work for the host kernel.

Key VFIO Failure Modes

VM exits on the interrupt path: In older configurations without posted interrupts, every device interrupt delivered to the guest can cause a VM exit, and without nested page table (EPT/NPT) support, guest memory map changes are far more expensive. Both are catastrophic for RDMA, where DMA completions are the critical path.

Memory ballooning interference: If the guest balloon driver reclaims pages that were pinned for DMA, it creates IOMMU mapping invalidation pressure on the host.

VFIO group resets: Any error in the VFIO group (e.g., a fatal PCIe error on one VF) can trigger a reset. When only a bus-level or hot reset is available rather than a function-level reset (FLR), the reset affects every device in the group.

monitoring VFIO events

# Watch for VFIO-related events in dmesg
dmesg -w | grep -iE "vfio|vm exit|balloon|iommu|passthrough"

# Check VFIO device state
ls -la /dev/vfio/
cat /sys/bus/pci/drivers/vfio-pci/*/reset_method  # FLR, bus, or hot-reset

# Monitor VM exit rates via KVM perf counters
perf kvm stat live -p <qemu_pid>
# Output shows VM exit reasons and frequency:
# Reason                    Count        Frequency
# EXTERNAL_INTERRUPT         18291234     45.2%
# HLT                         8123456     20.1%
# EPT_VIOLATION               1024511      2.5%   ← watch this under DMA load

# Check if SR-IOV VFs are configured correctly
cat /sys/bus/pci/devices/0000:04:00.0/sriov_numvfs
cat /sys/bus/pci/devices/0000:04:00.0/sriov_totalvfs

# VFIO IOMMU group membership (all devices in same group share fate)
readlink -f /sys/bus/pci/devices/0000:04:00.*/iommu_group
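
Because devices in one IOMMU group share fate, it helps to list group membership explicitly. The sketch below walks the standard /sys/kernel/iommu_groups/<id>/devices/ layout; the helper names are hypothetical.

```python
from pathlib import Path


def iommu_groups(root: str = "/sys/kernel/iommu_groups") -> dict[str, list[str]]:
    """Return {group_id: [PCI addresses]} from the iommu_groups sysfs tree."""
    groups = {}
    for group in Path(root).iterdir():
        devdir = group / "devices"
        if devdir.is_dir():
            groups[group.name] = sorted(p.name for p in devdir.iterdir())
    return groups


def shared_fate(groups: dict[str, list[str]], bdf: str) -> list[str]:
    """Other devices in the same IOMMU group as bdf (empty if alone or unknown)."""
    for members in groups.values():
        if bdf in members:
            return [m for m in members if m != bdf]
    return []


if __name__ == "__main__":
    groups = iommu_groups()
    print(shared_fate(groups, "0000:04:00.0"))
```

A non-empty result for a device you intend to pass through means a reset or fault on that device can also take down its group neighbours.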

NIC Queue and RDMA Transport Health § 05

This is the subsystem closest to the actual RDMA workload. Mellanox/NVIDIA ConnectX NICs expose an enormous set of statistics through both ethtool and the rdma / perfquery toolchain. The trick is knowing which counters matter and what the causal chains are between them.

ethtool Statistics — The NIC's Own Counters

ethtool -S — key counters explained

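# Note: ethtool expects the netdev name. If your interface is not literally named mlx5_0,
# substitute it here and below (rdma link show maps RDMA devices to netdevs).
ethtool -S mlx5_0 | grep -v ": 0$"  # only non-zero stats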

# ── Receive path ──────────────────────────────────────────────
rx_out_of_buffer: 1423      # ← CRITICAL: NIC had no receive buffers, dropped pkts
rx_buff_alloc_err: 42       # RX buffer allocation failed (memory pressure on host)
rx_cqe_err: 7               # Completion Queue Entry error — hardware/protocol error
rx_wqe_err: 3               # Work Queue Entry error (WQE posted incorrectly)

# ── Transmit path ─────────────────────────────────────────────
tx_queue_stopped: 891       # TX queue was paused (backpressure — flow control fired)
tx_timeout: 2               # ← CRITICAL: TX timeout, driver reset the queue
tx_cqe_err: 0               # Completion errors on transmit (0 is good)

# ── PFC (Priority Flow Control) — RoCEv2 specific ─────────────
rx_prio0_pause: 0
rx_prio3_pause: 88421       # ← Received pause frames on priority 3 (RDMA traffic class)
                             # High count = downstream congestion is pausing our TX
tx_prio3_pause: 12033       # ← We are sending pause frames — we're congested!

# ── PCIe / DMA ────────────────────────────────────────────────
rx_pci_signal_integrity: 3  # PCIe signal integrity errors (correlates with AER)
tx_pci_signal_integrity: 1
rx_pci_transport_nonfatal_err: 2

# ── Health ────────────────────────────────────────────────────
rx_discards_phy: 0
tx_discards_phy: 0
link_down_events_phy: 1     # Physical link went down once (correlates with link retrain)
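
Absolute counter values matter less than how fast they grow. A minimal sketch for diffing two captures of ethtool -S output follows; it assumes the usual "name: value" line format, and the function names are illustrative.

```python
def parse_ethtool_stats(text: str) -> dict[str, int]:
    """Parse `ethtool -S` output into {counter_name: value}."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # Skips the "NIC statistics:" header and any non-numeric lines.
        if sep and value.lstrip("-").isdigit():
            stats[name.strip()] = int(value)
    return stats


def counter_deltas(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Counters that changed between two snapshots, with their increase."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }
```

Taking one snapshot before a job and one after, then printing counter_deltas, shows which of the counters above actually moved during the run.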

RDMA/InfiniBand Counters — perfquery and rdmatool

RDMA transport counters

# InfiniBand port counters via perfquery (from infiniband-diags)
perfquery -x       # local port; -x selects the extended (64-bit) counters

# Or via sysfs — same data, scriptable
PORT=/sys/class/infiniband/mlx5_0/ports/1/counters
echo "=== RDMA Port Counters ==="
for f in $PORT/*; do
  val=$(cat $f)
  [ "$val" != "0" ] && echo "$(basename $f): $val"
done

# Key counters to watch:
#
# VL15Dropped       — congestion on VL15 (management traffic)
# ExcessiveBufferOverrunErrors — RX buffer overrun (too slow to consume)
# PortXmitDiscards  — ← CRITICAL: packets discarded on TX (congestion)
# PortRcvErrors     — Physical/link errors on receive
# PortRcvRemotePhysicalErrors — signal integrity on the remote end
# SymbolErrorCounter — bit errors on the wire
# LinkErrorRecoveryCounter — link recovery events
# LinkDownedCounter — full link down events

# RoCEv2 / RDMA transport counters exposed by mlx5 (exact set varies by driver and firmware)
HW=/sys/class/infiniband/mlx5_0/ports/1/hw_counters
cat $HW/local_ack_timeout_err     # ACK timer expiries (timeout-based retransmits)
cat $HW/roce_slow_restart         # RoCE slow restart events
cat $HW/rnr_nak_retry_err         # ← CRITICAL: receiver-not-ready NAKs received

# mlx5 diagnostic counters via rdma tool (iproute2)
rdma statistic show link mlx5_0/1        # port-level hardware counters
rdma statistic qp show                   # per-QP stats (for QPs with bound counters)

Understanding RNR NAKs — The RDMA Cascade Trigger

An RNR (Receiver Not Ready) NAK is sent when a message that consumes a receive (a SEND, or an RDMA WRITE with immediate data) arrives at a QP but the receiver hasn't posted a receive Work Queue Entry (WQE) to accept it. The sender must wait for the RNR timer before retrying; the timer value is advertised by the receiver, and the IB spec encodings range from 0.01 ms to 655.36 ms. Under load, this creates a cascade: slow application posting WQEs → RNR NAKs → sender backs off → throughput drops → application falls further behind on WQE posting.

RNR NAK cascade — causal chain

  Application under CPU pressure
      │
      └─ Slow to post Receive WQEs to QP
             │
             └─ Incoming RDMA message arrives; no WQE available
                    │
                    └─ NIC sends RNR NAK to sender
                           │
                           ├─ Sender backs off (RNR timer: 0.01 to 655.36 ms)
                           │
                           ├─ In-flight messages queue up in sender NIC
                           │
                           ├─ NIC TX queue backpressure → tx_queue_stopped↑
                           │
                           └─ PCIe DMA backs up → DMA latency↑
                                  │
                                  └─ CPU softirq budget exceeded
                                         │
                                         └─ /proc/net/softnet_stat time_squeeze↑

The fix depends on the root cause: if RNR NAKs are due to CPU pressure, fix the scheduler affinity or increase the RNR timer on the QP. If they're due to application behavior, increase the pre-posted WQE depth or use SRQ (Shared Receive Queues).
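
When tuning the RNR timer it helps to see what a given setting costs. The sketch below assumes the 5-bit min_rnr_timer encoding table from the InfiniBand specification (as documented for ibv_modify_qp) and treats the worst-case stall as timer × rnr_retry, ignoring wire time; the function names are hypothetical.

```python
# Delay in milliseconds for each min_rnr_timer encoding 0..31.
# Encoding 0 is the longest delay; 1..31 increase from 0.01 ms to 491.52 ms.
RNR_TIMER_MS = [
    655.36, 0.01, 0.02, 0.03, 0.04, 0.06, 0.08, 0.12,
    0.16, 0.24, 0.32, 0.48, 0.64, 0.96, 1.28, 1.92,
    2.56, 3.84, 5.12, 7.68, 10.24, 15.36, 20.48, 30.72,
    40.96, 61.44, 81.92, 122.88, 163.84, 245.76, 327.68, 491.52,
]


def encode_min_rnr_timer(delay_ms: float) -> int:
    """Smallest encoding whose delay is at least delay_ms."""
    candidates = [(ms, code) for code, ms in enumerate(RNR_TIMER_MS) if ms >= delay_ms]
    if not candidates:
        raise ValueError(f"{delay_ms} ms exceeds the largest RNR timer value")
    return min(candidates)[1]


def worst_case_stall_ms(code: int, rnr_retry: int) -> float:
    """Approximate time a sender spends backing off before the QP errors out."""
    if rnr_retry == 7:  # 7 means "retry forever"
        return float("inf")
    return RNR_TIMER_MS[code] * rnr_retry
```

For example, asking for at least 1 ms selects encoding 14 (1.28 ms), and with rnr_retry set to 3 the sender gives up after roughly 3.84 ms of backoff.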

Interrupt Behavior — IRQ Affinity and Storms § 06

Without interrupt moderation, a NIC can raise an interrupt for every receive completion, and a high-bandwidth NIC under load can reach very high interrupt rates across its queues. How those interrupts are distributed across CPUs, and whether the distribution is sane, directly determines networking latency and throughput.

Reading /proc/interrupts

/proc/interrupts — IRQ distribution analysis

# Show all mlx5 interrupts and their per-CPU counts
grep mlx5 /proc/interrupts
#           CPU0     CPU1     CPU2     CPU3  ... CPU31
#  124:  18234120   892341   871234   912456 ...  891023  IR-PCI-MSI mlx5_core@pci:0000:04:00.0
#  125:     12345 17891234  1023456   982341 ...  123456  IR-PCI-MSI mlx5_core@pci:0000:04:00.0
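#            ↑ IRQ 124 lands almost entirely on CPU0, IRQ 125 on CPU1
#         One hot CPU per IRQ is expected when queues are pinned; it is BAD SKEW
#         when several busy IRQs pile onto the same CPU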

# Total mlx5 interrupt count per CPU (shows imbalance)
awk 'NR==1{ncpu=NF; next} /mlx5/{for(i=2;i<=ncpu+1;i++) sum[i]+=$i} END{for(i in sum) print "CPU"(i-2)": "sum[i]}' \
  /proc/interrupts | sort -t: -k2 -rn | head -8

# Check current IRQ affinity for mlx5 queues
for irq in $(grep mlx5 /proc/interrupts | awk '{print $1}' | tr -d ':'); do
  echo "IRQ $irq affinity: $(cat /proc/irq/$irq/smp_affinity_list 2>/dev/null)"
done
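
The same per-CPU totals can be turned into a single skew number for alerting. This is a minimal sketch that parses /proc/interrupts text; the max-over-mean skew metric and the function names are assumptions chosen for illustration, not an established standard.

```python
def per_cpu_irq_totals(text: str, match: str = "mlx5") -> list[int]:
    """Sum interrupt counts per CPU over all /proc/interrupts rows containing match."""
    lines = text.splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = [0] * ncpu
    for line in lines[1:]:
        if match not in line:
            continue
        fields = line.split()
        # fields[0] is "124:", followed by one count per CPU.
        for cpu, value in enumerate(fields[1:1 + ncpu]):
            if value.isdigit():
                totals[cpu] += int(value)
    return totals


def skew(totals: list[int]) -> float:
    """Busiest CPU's share relative to the mean (1.0 = perfectly even)."""
    mean = sum(totals) / len(totals)
    return max(totals) / mean if mean else 0.0


if __name__ == "__main__":
    with open("/proc/interrupts") as fh:
        totals = per_cpu_irq_totals(fh.read())
    print(f"skew = {skew(totals):.2f}")
```

Restricting the totals to the CPUs of the NIC's NUMA node before computing the skew gives a fairer picture when affinity is pinned on purpose.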

IRQ Affinity Tuning

The goal for RDMA/high-bandwidth NICs is to pin each NIC receive queue's interrupt to a dedicated CPU core, ideally one on the same NUMA node as the NIC. This avoids cross-NUMA interrupt delivery (which incurs remote memory latency) and prevents any single CPU from becoming the interrupt bottleneck.

setting IRQ affinity for mlx5 NIC queues

# Use the mlnx-tools helper scripts (e.g. set_irq_affinity_bynode.sh) or the manual approach:

# Find which NUMA node the NIC is on
cat /sys/bus/pci/devices/0000:04:00.0/numa_node
# 0  → NUMA node 0

# Find CPUs on NUMA node 0
cat /sys/devices/system/node/node0/cpulist
# 0-31  → CPUs 0–31 are on NUMA node 0

# Assign each mlx5 IRQ to successive CPUs on node 0
CPU=0
grep mlx5 /proc/interrupts | awk -F: '{print $1}' | while read IRQ; do
  echo $CPU > /proc/irq/$IRQ/smp_affinity_list
  CPU=$(( (CPU + 1) % 32 ))
done

# Stop irqbalance (it will undo your settings)
systemctl disable --now irqbalance
# OR keep it running but tell it to leave your mlx5 IRQs alone,
# e.g. IRQBALANCE_ARGS="--banirq=124 --banirq=125" in /etc/default/irqbalance

# Verify with watch
watch -n1 'grep mlx5 /proc/interrupts | awk "{print \$1, \$2, \$NF}"'

Interrupt Storm Detection

detecting and diagnosing interrupt storms

# Check total interrupt rate per second (the "in" column; a sudden jump suggests a storm)
vmstat 1 5

# Use sar to track interrupt rate over time
sar -I ALL 1 10 | grep -v "^$" | grep -v "^Average" | sort -k3 -rn | head -5

# perf top for IRQ context — see what's running in interrupt handlers
perf top -e irq:irq_handler_entry --sort comm,sym

# Check softirq counts — NET_RX is the primary indicator of NIC storm
watch -n1 'grep -E "NET_RX|NET_TX|SCHED" /proc/softirqs'

Softirq / Networking Stack § 07

Linux uses softirqs (software interrupts) to defer non-urgent interrupt work from hard interrupt context to a slightly lower-priority processing context. The networking stack's receive and transmit paths are both softirq-based: NET_RX handles incoming packet processing, and NET_TX handles transmit queue wakeups. Under sustained load, these softirqs compete with application threads for CPU time.

The Three Canonical Softnet Stats

/proc/net/softnet_stat — annotated

# Each line = one CPU. Columns are hex values.
cat /proc/net/softnet_stat
# processed dropped  time_sq  0        0        0        0        0        cpu_coll rx_rps   flow_lim
# 01234567  00000000 0000001a  ...
# (newer kernels append further columns such as backlog length and CPU index)

# Human-readable parsing script (strtonum requires GNU awk):
awk '{
  printf "CPU%d: processed=%d dropped=%d time_squeeze=%d received_rps=%d\n",
    NR-1,
    strtonum("0x"$1),
    strtonum("0x"$2),
    strtonum("0x"$3),
    strtonum("0x"$10)
}' /proc/net/softnet_stat

# What each counter means:
#
# processed    : total frames processed by this CPU's NET_RX softirq
# dropped      : ← CRITICAL: frames dropped because the per-CPU input backlog
#                queue (net.core.netdev_max_backlog) was full. Each count = packet loss.
# time_squeeze : ← IMPORTANT: softirq budget exhausted before queue empty.
#                Means more work arrived than the NAPI budget allowed.
#                High values mean you need bigger net.core.netdev_budget
#                or more CPUs/queues.
# received_rps : times this CPU was woken by an RPS (Receive Packet Steering) IPI
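
For hosts without GNU awk, the same file can be parsed in a few lines of Python. The column positions below follow the kernel's softnet_seq_show layout (columns 1 to 3, then cpu_collision, received_rps and flow_limit_count in columns 9 to 11); using the row index as the CPU number is an assumption that only holds when every CPU is online.

```python
# Zero-based column index -> counter name in /proc/net/softnet_stat.
FIELDS = {
    0: "processed",
    1: "dropped",
    2: "time_squeeze",
    8: "cpu_collision",
    9: "received_rps",
    10: "flow_limit_count",
}


def parse_softnet_stat(text: str) -> list[dict[str, int]]:
    """Decode the hex columns of /proc/net/softnet_stat, one dict per row."""
    rows = []
    for cpu, line in enumerate(l for l in text.splitlines() if l.strip()):
        cols = [int(col, 16) for col in line.split()]
        row = {"cpu": cpu}
        for idx, name in FIELDS.items():
            if idx < len(cols):  # older kernels print fewer columns
                row[name] = cols[idx]
        rows.append(row)
    return rows


if __name__ == "__main__":
    with open("/proc/net/softnet_stat") as fh:
        for row in parse_softnet_stat(fh.read()):
            if row["dropped"] or row["time_squeeze"]:
                print(row)
```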

NAPI Budget Tuning

tuning NAPI and softirq budget

# Current NAPI budget (packets processed per NET_RX softirq poll cycle)
sysctl net.core.netdev_budget
# net.core.netdev_budget = 300  ← default

# For high-throughput RDMA hosts, increase to 600-1200
sysctl -w net.core.netdev_budget=1200
sysctl -w net.core.netdev_budget_usecs=8000  # time budget companion (us)

# Also increase RX/TX ring buffer sizes on the NIC
ethtool -g mlx5_0          # current ring sizes
ethtool -G mlx5_0 rx 8192 tx 8192  # maximize buffers

# Monitor softirq time per CPU with mpstat
mpstat -P ALL 1 5
# CPU    ... %soft ...
# 0          45.2   ← 45% of CPU0 time in softirq context: dangerously high
# 1           2.1
# ...
# CPU0 is your softirq bottleneck; fix IRQ affinity first
# (mpstat -I SCPU shows per-CPU softirq rates by type, e.g. NET_RX/s)

# Use sar for historical softirq breakdown
sar -u ALL 1 10 | head -20
sar -n SOCK 1 5   # socket statistics
nstat -az         # netstat counters in one shot (from iproute2)

RPS/RFS — Steering Packets to the Right CPU

RPS (Receive Packet Steering) and RFS (Receive Flow Steering) are kernel features that distribute packet processing across CPUs in software — useful when the NIC has fewer hardware queues than CPU cores. For RDMA workloads, RPS is typically not used (RDMA bypasses the kernel IP stack), but for the accompanying RoCEv2 control traffic or any kernel-bypass datapath that falls back to the kernel stack, RPS configuration matters.

configuring RPS for multi-queue NICs

# Check how many queues the NIC has
ethtool -l mlx5_0
# Current hardware settings:
# RX: 16
# TX: 16

# For a 128-core system, enable RPS to spread across all CPUs
for rx in /sys/class/net/mlx5_0/queues/rx-*/rps_cpus; do
  # Set to all 128 CPUs: four comma-separated 32-bit hex groups
  echo ffffffff,ffffffff,ffffffff,ffffffff > $rx
done

# Enable RFS (steers packets to the CPU running the receiving socket)
sysctl -w net.core.rps_sock_flow_entries=32768
for rx in /sys/class/net/mlx5_0/queues/rx-*/rps_flow_cnt; do
  echo 2048 > $rx
done
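
Hand-writing rps_cpus masks is error-prone once you want a subset of CPUs (for example, only the NIC's NUMA node). A minimal sketch for generating the mask string, assuming the kernel's comma-separated 32-bit hex group format with the highest group first; the function name is hypothetical.

```python
def rps_mask(cpus) -> str:
    """Build an rps_cpus bitmask string for the given iterable of CPU numbers."""
    bits = 0
    for cpu in cpus:
        bits |= 1 << cpu
    if bits == 0:
        return "0"
    groups = []
    while bits:
        groups.append(f"{bits & 0xFFFFFFFF:08x}")  # lowest 32 bits first
        bits >>= 32
    return ",".join(reversed(groups))  # kernel expects highest group first


if __name__ == "__main__":
    print(rps_mask(range(0, 32)))   # CPUs 0-31 only
    print(rps_mask(range(128)))     # all 128 CPUs
```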

CPU Scheduler Pressure § 08

Networking stress — especially RDMA completion processing and userspace polling loops — manifests in the CPU scheduler before it shows up as packet drops or OOM events. PSI (Pressure Stall Information) is the modern, authoritative way to measure this.

PSI — Pressure Stall Information

Available since Linux 4.20, PSI measures the fraction of time in which tasks were stalled waiting for a resource. For CPU: stalled means runnable but not running. For memory: stalled means waiting for a page fault or reclaim. For I/O: stalled means waiting for I/O to complete.

/proc/pressure — reading PSI

cat /proc/pressure/cpu
# some avg10=2.45 avg60=1.89 avg300=0.92 total=1234567890
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0
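#
# "some" = at least one task was stalled
# "full" = ALL non-idle tasks were stalled at the same time. This is meaningful for
#          memory and io; for system-wide CPU it is undefined, and kernels that
#          print the line (5.13+) report zeros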
# avg10/60/300 = exponential moving average over 10s/60s/5min (as %)
# total = cumulative microseconds of stall time

cat /proc/pressure/memory
# some avg10=12.44 avg60=8.32 avg300=3.11 total=98765432
# full avg10=4.21  avg60=2.10 avg300=0.88 total=23456789
#
# avg10=12.44% means: 12% of the last 10s, at least one task
# was stalled waiting for memory. That's significant.

cat /proc/pressure/io
# some avg10=0.22 avg60=0.18 avg300=0.05 total=112233

# Practical thresholds for alerts:
# cpu.some.avg10    > 30%  → CPU saturation
# memory.some.avg10 > 10%  → memory pressure worth investigating
# memory.full.avg10 > 1%   → severe memory pressure, OOM risk
# io.some.avg10     > 10%  → I/O bottleneck
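
These thresholds are easy to turn into an automated check. The sketch below parses the PSI text format and applies the alert levels listed above; the threshold table and function names are illustrative and should be tuned per workload.

```python
# (resource, kind) -> avg10 percentage above which to alert.
THRESHOLDS = {
    ("cpu", "some"): 30.0,
    ("memory", "some"): 10.0,
    ("memory", "full"): 1.0,
    ("io", "some"): 10.0,
}


def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse a /proc/pressure/* file into {'some': {...}, 'full': {...}}."""
    out = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *pairs = line.split()
        out[kind] = {key: float(val) for key, val in (p.split("=") for p in pairs)}
    return out


def psi_alerts(resource: str, text: str) -> list[tuple[str, float]]:
    """Return (kind, avg10) for every threshold the given resource exceeds."""
    psi = parse_psi(text)
    alerts = []
    for (res, kind), limit in THRESHOLDS.items():
        if res == resource and kind in psi and psi[kind]["avg10"] > limit:
            alerts.append((kind, psi[kind]["avg10"]))
    return alerts


if __name__ == "__main__":
    for resource in ("cpu", "memory", "io"):
        with open(f"/proc/pressure/{resource}") as fh:
            print(resource, psi_alerts(resource, fh.read()))
```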

Run Queue Depth and Load Average

run queue and load analysis

# vmstat: r column is run queue depth (processes wanting CPU right now)
vmstat 1 10
# procs ---memory--- ---swap-- -----io---- -system-- ------cpu-----
#  r  b  swpd  free   si   so  bi   bo     in   cs  us sy id wa st
# 48  2     0 892344   0    0   0    12   89234 231234 24 31 44  0  0
# ↑                                        ↑          ↑  ↑
# r=48 with 32 CPUs → 16 tasks waiting    high irqs  hi sys = softirq heavy

# sar for historical run queue data
sar -q 1 10
# runq-sz  plist-sz  ldavg-1  ldavg-5  ldavg-15  blocked
#      42      1523    40.22    35.18     22.41        8
# 42 tasks in run queue with 32 CPUs → CPU-saturated

# Check scheduler statistics for specific process
cat /proc/<pid>/schedstat
# 1234567890 987654321 4521
# ^time on CPU (ns) ^wait time (ns) ^involuntary context switches
# High wait/run ratio = process is CPU-starved (scheduler pressure)
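
Because schedstat values are cumulative since process start, the ratio is most useful over an interval. A minimal sketch, assuming the three-field format of /proc/<pid>/schedstat (run time in ns, run-queue wait in ns, timeslice count); the function names are hypothetical.

```python
def parse_schedstat(text: str) -> dict[str, int]:
    """Parse the three fields of /proc/<pid>/schedstat."""
    run_ns, wait_ns, timeslices = (int(x) for x in text.split()[:3])
    return {"run_ns": run_ns, "wait_ns": wait_ns, "timeslices": timeslices}


def wait_ratio(before: dict[str, int], after: dict[str, int]) -> float:
    """Run-queue wait time divided by on-CPU time between two samples."""
    run = after["run_ns"] - before["run_ns"]
    wait = after["wait_ns"] - before["wait_ns"]
    if run > 0:
        return wait / run
    # No CPU time at all: either fully starved or simply idle.
    return float("inf") if wait > 0 else 0.0
```

Sampling a polling thread once per second and alerting when the ratio stays above a chosen level (say 0.5) flags CPU starvation earlier than load average does.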

NUMA Locality — The Topology Tax § 09

On multi-socket servers and systems with PCIe-attached GPUs, NUMA topology is one of the most impactful and most commonly ignored performance factors. Every memory access, DMA operation, and interrupt has a home node. Crossing NUMA node boundaries adds 30–120ns of latency per access. At RDMA bandwidths, this compounds rapidly.

The Three Topology Mismatches

CPU-NIC mismatch: Application threads running on NUMA node 1, while the NIC is on NUMA node 0's PCIe root complex. Every receive buffer posted by the application and every DMA by the NIC crosses the QPI/UPI interconnect.

GPU-NIC mismatch: GPU on one NUMA node, NIC on another. GPUDirect RDMA performs best when both share a PCIe switch or root complex, so that transfers never touch the inter-socket link or the CPU.

Memory allocation mismatch: RDMA registration pins physical pages. If those pages are on a remote NUMA node (because the allocating process ran on a different node), every DMA access pays remote memory latency.

NUMA topology example — 2-socket server with GPU and NIC

  Socket 0 (NUMA Node 0)              Socket 1 (NUMA Node 1)
  ┌────────────────────────┐           ┌────────────────────────┐
  │ CPUs: 0–31             │           │ CPUs: 32–63            │
  │ RAM: 256 GB            │   ←QPI→   │ RAM: 256 GB            │
  │                        │           │                        │
  │ PCIe Root Complex 0    │           │ PCIe Root Complex 1    │
  │ ├─ NIC (ConnectX-7)    │           │ ├─ GPU 0 (A100)        │
  │ └─ NVMe 0              │           │ └─ GPU 1 (A100)        │
  └────────────────────────┘           └────────────────────────┘

  MISMATCH: GPU is on Node 1, NIC is on Node 0.
  GPUDirect RDMA must cross the QPI/UPI link for every transfer.
  → Bandwidth is bounded by the inter-socket link and by cross-socket
    peer-to-peer forwarding, not by the PCIe slot
  → Latency penalty ~80ns per DMA access

  CORRECT: Either move the NIC to a PCIe slot on Node 1,
  OR, where the hardware cannot move, allocate the NIC QP buffers in Node 1 memory.

NUMA Monitoring Commands

NUMA topology and access monitoring

# Show NUMA topology overview
numactl --hardware
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 ... 31
# node 0 size: 257842 MB
# node 0 free: 189234 MB
# node distances:
# node   0   1
#   0:  10  21   ← local=10, remote=21 (2.1× penalty)
#   1:  21  10

# Show current NUMA statistics — remote memory accesses
numastat
#                            node0    node1
# numa_hit                12345678  9876543
# numa_miss                  89234     1234  ← processes got remote memory
# numa_foreign               1234    89234
# interleave_hit              4567      891
# local_node              12256444  9875309
# other_node                 89234     1234  ← !! cross-NUMA allocation

# Show per-process NUMA memory usage
numastat -p <pid>
numastat -s              # sort by total memory

# Where is the NIC on the NUMA topology?
cat /sys/bus/pci/devices/0000:04:00.0/numa_node

# Pin a process to the correct NUMA node (where the NIC is)
numactl --cpunodebind=0 --membind=0 ./my_rdma_app

# For MPI / HPC: use --membind and --cpunodebind together
numactl --cpunodebind=0 --membind=0 mpirun -np 32 ./collective_op

# perf stat for NUMA balancing activity (scheduler and migration tracepoints; needs root)
perf stat -e sched:sched_move_numa,sched:sched_stick_numa,sched:sched_swap_numa,migrate:mm_migrate_pages -a sleep 5
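
The numastat counters can also be reduced to one ratio per node for dashboards. The sketch below reads the per-node files under /sys/devices/system/node/nodeN/numastat, which carry the same counters as the numastat command; the miss-ratio definition and function names are assumptions for illustration.

```python
from pathlib import Path


def parse_node_numastat(text: str) -> dict[str, int]:
    """Parse 'name value' lines from a per-node numastat file."""
    return {
        name: int(value)
        for name, value in (line.split() for line in text.splitlines() if line.strip())
    }


def miss_ratio(stats: dict[str, int]) -> float:
    """Fraction of allocations on this node that were intended for another node."""
    total = stats["numa_hit"] + stats["numa_miss"]
    return stats["numa_miss"] / total if total else 0.0


def report(root: str = "/sys/devices/system/node") -> None:
    """Print the miss ratio for every NUMA node found under root."""
    for node in sorted(Path(root).glob("node[0-9]*")):
        stats = parse_node_numastat((node / "numastat").read_text())
        print(f"{node.name}: miss ratio {miss_ratio(stats):.4%}")


if __name__ == "__main__":
    report()
```

The counters are cumulative since boot, so comparing two samples taken around a job gives a cleaner signal than the absolute ratio.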

🚨 GPUDirect RDMA topology requirement

GPUDirect RDMA (nvidia_peermem or nv_peer_mem) performs best when the GPU and the RDMA NIC sit under the same PCIe switch or root complex, or at minimum on the same NUMA node. If they're on different sockets, GPUDirect will still "work", but every transfer crosses the UPI/QPI link and both root complexes, so achieved bandwidth typically falls well short of the PCIe Gen4 ×16 line rate (~32 GB/s per direction) and latency rises. Use nvidia-smi topo -m to see the exact topology matrix before deploying RDMA workloads.

GPU-Side Corroboration § 10

When debugging RDMA performance problems on GPU-accelerated nodes, the GPU has its own visibility into what's happening. GPU-side counters can confirm or contradict what the host networking stack is reporting — a "healthy GPU" with a "sick NIC" is a different problem from both being sick.

Key GPU Metrics to Track

GPU monitoring for RDMA corroboration

# nvidia-smi: comprehensive GPU state
nvidia-smi dmon -s pucvmet -d 1 -c 30
# Fields: power/temperature, utilization, clocks, power/thermal violations,
#         memory, ECC and PCIe replay errors, PCIe Rx/Tx throughput

# BAR1 usage — critical for GPUDirect RDMA
# BAR1 is the GPU's PCIe aperture for peer mapping; if full, new RDMA
# registrations will fail even if GPU VRAM has free space
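nvidia-smi --query-gpu=bar1.used,bar1.free,bar1.total --format=csv,noheader
# 12288 MiB, 4096 MiB, 16384 MiB
# bar1.free = 4096 MiB: getting low if multiple processes register buffers
# If your driver rejects these field names, nvidia-smi -q -d MEMORY
# prints the same figures under "BAR1 Memory Usage"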

# GPU memory breakdown
nvidia-smi --query-gpu=memory.used,memory.free,memory.total --format=csv

# NVLink status and throughput (for multi-GPU nodes with NVLink)
nvidia-smi nvlink -s -i 0      # per-link state and speed for GPU 0
nvidia-smi nvlink -gt d -i 0   # data throughput counters for GPU 0 (recent drivers)

# XID errors — GPU-side error events (mapped to dmesg)
dmesg | grep -i "NVRM: Xid"
# NVRM: Xid (PCI:0000:41:00): 79, ...  → GPU has fallen off the bus
# NVRM: Xid (PCI:0000:41:00): 48, ...  → Double Bit ECC Error
# NVRM: Xid (PCI:0000:41:00): 45, ...  → Preemptive channel removal

# GPU memory registration failures (userspace, check app logs)
# ibv_reg_mr() failing on a GPU buffer while VRAM still has room → suspect BAR1 exhaustion

# PCIe link generation and width via nvidia-smi
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current \
           --format=csv,noheader
# 4, 16   ← PCIe Gen4, x16; if it says "3, 8" under load, the link downgraded
#           (an idle GPU may report a lower gen because of link power saving)
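
Since BAR1 exhaustion only shows up as a ratio, it can help to turn the CSV from the BAR1 query above into an explicit headroom check. The following is a sketch under stated assumptions: it expects "used, free, total" rows in MiB on stdin, the function name is made up for illustration, and the 25% default threshold is an arbitrary starting point rather than a vendor recommendation.

```shell
#!/bin/sh
# Reads "used, free, total" CSV rows (one per GPU, values like "4096 MiB") on stdin
# and flags GPUs whose free BAR1 share is below a threshold percentage.
# $1 = minimum acceptable free percentage (default 25, an arbitrary choice)
bar1_headroom() {
    awk -F', *' -v min="${1:-25}" '
        {
            free = $2 + 0; total = $3 + 0   # "+ 0" drops the " MiB" suffix
            if (total <= 0) next            # skip "[N/A]" or malformed rows
            pct = 100 * free / total
            printf "GPU%d bar1_free=%dMiB (%.0f%%) %s\n", NR - 1, free, pct, (pct < min ? "LOW" : "ok")
        }'
}

# Example:
#   nvidia-smi --query-gpu=bar1.used,bar1.free,bar1.total --format=csv,noheader | bar1_headroom 30
```

The GPU index printed here is simply the row number, which matches nvidia-smi's ordering only when no rows are filtered out upstream.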

Topology Verification

nvidia-smi topology matrix

nvidia-smi topo -m
#         GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity
# GPU0     X      NV4     NV4     NV4     PIX     0-31
# GPU1    NV4      X      NV4     NV4     PIX     0-31
# GPU2    NV4     NV4      X      NV4     SYS     32-63
# GPU3    NV4     NV4     NV4      X      SYS     32-63
# mlx5_0  PIX     PIX     SYS     SYS      X      0-31
#
# NV4  = connected via a bonded set of 4 NVLinks (best)
# PIX  = path crosses at most one PCIe bridge (good: GPUDirect optimal)
# PXB  = path crosses multiple PCIe bridges but no host bridge (usually fine)
# PHB  = path crosses a PCIe host bridge, typically the CPU (GPUDirect suboptimal)
# SYS  = path crosses the inter-socket interconnect, i.e. cross-NUMA (GPUDirect worst case)
#
# GPU2 and GPU3 show SYS to mlx5_0 → they will underperform in RDMA
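
Reading the matrix by eye gets error-prone on 8-GPU nodes, so the check can be scripted. This sketch assumes the plain-text layout shown above (a header row whose first cell is blank, then one row per device); some driver versions decorate the header with terminal escape codes, which this does not handle. The function name is hypothetical, and treating everything other than PIX and PXB as suboptimal is a simplification.

```shell
#!/bin/sh
# Reads `nvidia-smi topo -m` style text on stdin and reports each GPU's link
# type to the NIC column named in $1 (for example mlx5_0).
gpu_nic_links() {
    awk -v nic="$1" '
        col == 0 {
            # Header row: its first cell is blank, so data rows are shifted by one field.
            for (i = 1; i <= NF; i++) if ($i == nic) col = i + 1
            next
        }
        $1 ~ /^GPU[0-9]+$/ {
            verdict = ($col == "PIX" || $col == "PXB") ? "ok" : "suboptimal"
            print $1, $col, verdict
        }'
}

# Example: nvidia-smi topo -m | gpu_nic_links mlx5_0
```

If the NIC name never appears in a header row, the function prints nothing, which is worth treating as a failure in automation rather than as "all clear".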

Firmware / Device Reset Signals § 11

RDMA NICs — particularly Mellanox/NVIDIA ConnectX series — expose a rich set of firmware-level health signals that sit below the Linux driver. These signals often appear before the driver-level counters show problems, making them the best early warning system for NIC health.

mlx5 Health Reporter

mlx5 devlink health reporter

# List all health reporters for the mlx5 device
devlink health show pci/0000:04:00.0
# pci/0000:04:00.0:
#   reporter fw_fatal
#     state healthy  error 0  recover 0  grace_period 500  auto_recover true
#   reporter fw
#     state healthy  error 0  recover 0  grace_period 500  auto_recover true
#   reporter rx
#     state healthy  error 0  recover 0  grace_period 500  auto_recover true
#   reporter tx
#     state error  error 3  recover 2   ← TX reporter has errors!
#                                          NIC attempted 2 self-recoveries

# Dump the health report from the TX reporter
devlink health dump show pci/0000:04:00.0 reporter tx
# Shows: which SQ (send queue) errored, error syndrome, timestamp

# Retrieve the reporter's current diagnostic state (read-only, non-destructive)
devlink health diagnose pci/0000:04:00.0 reporter fw

# Force recovery if auto-recover is disabled
devlink health recover pci/0000:04:00.0 reporter tx
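
For fleet-wide checks, the text form of devlink health show can be reduced to just the reporters that are not healthy. A minimal sketch, assuming the two-line "reporter NAME" / "state ..." layout shown above; the function name is hypothetical, and a JSON-based parser would be sturdier where one is available.

```shell
#!/bin/sh
# Reads `devlink health show` text on stdin and prints one line per reporter
# whose state is anything other than "healthy".
unhealthy_reporters() {
    awk '
        $1 == "reporter" { name = $2; next }
        $1 == "state" && $2 != "healthy" {
            errs = "?"; recs = "?"
            # Start at field 3 so the state value itself ("error") is not
            # mistaken for the "error <count>" key.
            for (i = 3; i < NF; i++) {
                if ($i == "error")   errs = $(i + 1)
                if ($i == "recover") recs = $(i + 1)
            }
            print name, "state=" $2, "errors=" errs, "recoveries=" recs
        }'
}

# Example: devlink health show pci/0000:04:00.0 | unhealthy_reporters
```

An empty result means every reporter listed was healthy at that instant; the error and recover counts still deserve a look because a reporter can return to healthy after a self-recovery.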

Firmware Fatal Syndrome Logs

dmesg — mlx5 firmware syndrome messages

# Fatal syndrome: NIC firmware detected an unrecoverable internal error
[ 12345.678901] mlx5_core 0000:04:00.0: Fatal FW syndrome 0x1FF1FF reported
[ 12345.678910] mlx5_core 0000:04:00.0: internal state dump:
[ 12345.678920] mlx5_core 0000:04:00.0: 0x00000000 0x00000000 0xff1ff2aa ...
[ 12345.678950] mlx5_core 0000:04:00.0: device is going to be reset

# Non-fatal health check failure
[ 9876.543210] mlx5_core 0000:04:00.0: health: nic state:2 sxp:0 rxp:0 sxb:0 rxb:0
[ 9876.543215] mlx5_core 0000:04:00.0: Unexpected event sxp timeout on port 1

# Grabbing syndrome codes for Mellanox support
dmesg | grep -i "syndrome\|fatal\|fw_fatal\|mlx5.*error\|mlx5.*reset" | tail -50

# mst and the mlx* tools ship in NVIDIA's MFT (Mellanox Firmware Tools) bundle;
# the distro mstflint package only carries the open-source, mst*-named subset
mst start
mlxfwmanager --query         # current FW version and PSID
mlxfwreset -d /dev/mst/mt4123_pciconf0 query  # reset capability

# Check if NIC firmware is up to date
mlxfwmanager --online-query-psid <psid>  # compare to latest available FW for that PSID

cgroup / systemd-oomd Layer § 12

Several recent distributions (Fedora since 34, Ubuntu desktop since 22.04) enable systemd-oomd by default. This daemon monitors PSI memory pressure and swap usage across cgroups and kills entire cgroups before the kernel OOM killer fires. On RDMA workloads, this creates a subtle trap: your job might be killed by systemd-oomd rather than the kernel OOM killer, in which case the evidence is in the journal and there is no kernel OOM report in dmesg.

systemd-oomd Kill Events

detecting systemd-oomd kills

# Check if systemd-oomd is running
systemctl status systemd-oomd

# View oomd logs
journalctl -u systemd-oomd -f

# oomd kill event in the journal (illustrative; exact wording varies by systemd version):
# Killed /system.slice/myapp.service due to memory pressure for /system.slice
#   being 89.45% > 60.00% for > 30s with reclaim activity

# Dump oomd's current view: monitored cgroups, their pressure, and limits
oomctl     # systemd-oomd's CLI tool (systemd 247+)

# Check per-cgroup memory pressure PSI
cat /sys/fs/cgroup/system.slice/myapp.service/memory.pressure
# some avg10=78.43 avg60=45.23 avg300=12.11 total=9876543210
# full avg10=23.11 avg60=15.67 avg300=4.89 total=2345678901
# This is the signal oomd uses — before kernel OOM fires

# Current cgroup memory usage
cat /sys/fs/cgroup/system.slice/myapp.service/memory.current
cat /sys/fs/cgroup/system.slice/myapp.service/memory.max
cat /sys/fs/cgroup/system.slice/myapp.service/memory.events
# low 0
# high 234        ← hit MemoryHigh limit 234 times (throttled + reclaim)
# max 12          ← hit MemoryMax limit 12 times
# oom 0
# oom_kill 0
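
To alert before oomd acts, the avg10 figure can be pulled out of a pressure file and compared against a limit. This is a sketch with assumptions flagged: the function name is invented, the default limit of 60 simply mirrors oomd's default DefaultMemoryPressureLimit, and whether you should watch the some or the full line depends on your systemd version, so the line type is a parameter rather than something hard-coded.

```shell
#!/bin/sh
# Usage: psi_over_limit FILE [KIND] [LIMIT]
#   FILE  = a PSI file such as /proc/pressure/memory or a cgroup's memory.pressure
#   KIND  = "some" or "full" (default "full"; verify which one matters on your system)
#   LIMIT = percentage threshold for avg10 (default 60)
psi_over_limit() {
    awk -v kind="${2:-full}" -v limit="${3:-60}" '
        $1 == kind {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == "avg10") {
                    status = (kv[2] + 0 > limit + 0) ? "OVER" : "ok"
                    printf "%s avg10=%s limit=%s %s\n", kind, kv[2], limit, status
                }
            }
        }' "$1"
}

# Example: psi_over_limit /sys/fs/cgroup/system.slice/myapp.service/memory.pressure some 60
```

A single sample over the limit is not the same as oomd's trigger, which also requires the pressure to persist for the configured duration, so in practice you would poll this and alert on consecutive hits.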

Configuring oomd Thresholds

/etc/systemd/oomd.conf — tuning oomd for RDMA workloads

[OOM]
# Swap usage threshold before considering killing (default: 90%)
SwapUsedLimit=90%

# Memory pressure (PSI) threshold over 10s avg to consider killing
DefaultMemoryPressureLimit=60%

# How long to wait at/above the pressure limit before acting
DefaultMemoryPressureDurationSec=30s

# For RDMA workloads: increase duration and pressure threshold
# to avoid premature kills during burst allocation phases:
# DefaultMemoryPressureLimit=80%
# DefaultMemoryPressureDurationSec=60s

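✅ Exempt critical services from oomd

For services that absolutely must not be killed by oomd (RDMA daemons, MPI managers, or persistent communication endpoints), set ManagedOOMPreference=avoid (deprioritize) or ManagedOOMPreference=omit (never select) in their unit file; both need systemd 248 or newer. Note that ManagedOOMMemoryPressure=kill and ManagedOOMSwap=kill do the opposite: they opt a cgroup in to oomd monitoring. OOMPolicy=continue is a separate setting that keeps the unit running when one of its processes is OOM-killed instead of stopping the whole service. However, be careful: exempting too many services defeats oomd's purpose and means you'll fall back to the kernel OOM killer anyway, just later and with more damage.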
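
One way to express such an exemption without editing the packaged unit is a drop-in that sets ManagedOOMPreference= (available from systemd 248). The helper below is a sketch: the function name, the drop-in file name 50-oomd-avoid.conf, and the optional config-root argument are all hypothetical, the last one existing only so the function can be tried against a scratch directory.

```shell
#!/bin/sh
# Write a drop-in asking systemd-oomd to avoid a unit when picking kill candidates.
# $1 = unit name (for example myapp.service)
# $2 = systemd config root (hypothetical parameter, defaults to /etc/systemd/system)
# Prints the path of the file it wrote.
write_oomd_avoid_dropin() {
    dir="${2:-/etc/systemd/system}/$1.d"
    mkdir -p "$dir" || return 1
    cat > "$dir/50-oomd-avoid.conf" <<'EOF'
[Service]
ManagedOOMPreference=avoid
EOF
    echo "$dir/50-oomd-avoid.conf"
}

# Example (as root): write_oomd_avoid_dropin myapp.service && systemctl daemon-reload
```

Using avoid rather than omit is a deliberate middle ground here: the unit is still a last-resort candidate, which keeps oomd useful if everything else has already been reclaimed.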

Synthesis — A Unified Monitoring Playbook § 13

All of these subsystems emit signals, but they're only useful if you have a coherent playbook for correlating them. Here's how the most common RDMA-under-pressure failure chains look when you instrument all the layers simultaneously:

The "Throughput Cliff" Pattern

RDMA throughput collapses suddenly rather than degrading gradually. Check in this order:

#	Layer	Signal	Command
1	PCIe	AER corrected error burst or link retrain	`dmesg | grep -E "AER|LnkSta|retrain"`
2	NIC Health	mlx5 health reporter `tx` in error state	`devlink health show`
3	NIC Stats	`tx_timeout` or `link_down_events_phy` > 0	`ethtool -S <netdev>` (the interface name, not `mlx5_0`)
4	RDMA	`rnr_nak_retry_err` or `port_xmit_discards` increasing	`grep . /sys/class/infiniband/mlx5_0/ports/1/{counters,hw_counters}/*`
5	IOMMU	DMAR fault or DMA map failure	`dmesg | grep -iE "DMAR|iommu|DMA map"`
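
Several rows in this table hinge on a counter increasing, which a single read cannot show. A minimal sketch of a two-snapshot comparison follows; the function names are hypothetical, it assumes each file in the directory holds one integer, and the directory you point it at (for example a port's counters or hw_counters directory) is your choice.

```shell
#!/bin/sh
# Print "name value" for every readable counter file in directory $1.
snapshot_counters() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        v=$(cat "$f" 2>/dev/null) || continue
        echo "$(basename "$f") $v"
    done
}

# Compare two snapshot files ($1 = earlier, $2 = later) and print only the
# counters whose value grew, with the size of the increase.
counter_increases() {
    awk 'NR == FNR { old[$1] = $2; next }
         ($1 in old) && ($2 - old[$1] > 0) { printf "%s +%d\n", $1, $2 - old[$1] }' "$1" "$2"
}

# Example:
#   d=/sys/class/infiniband/mlx5_0/ports/1/counters
#   snapshot_counters "$d" > /tmp/before; sleep 5
#   snapshot_counters "$d" > /tmp/after
#   counter_increases /tmp/before /tmp/after
```

Keeping the snapshots as plain files also gives you a timestamped artifact to attach to an incident, which fits the correlate-by-time approach of this playbook.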

The "Slow Creep" Pattern

Performance degrades over hours. Usually a resource leak or NUMA drift. Check:

#	Layer	Signal	Command
1	Memory	`memory.some.avg10` slowly rising over time	`cat /proc/pressure/memory`
2	NUMA	`numa_miss` / `other_node` counts increasing	`numastat`
3	GPU BAR1	`bar1.free` approaching 0	`nvidia-smi --query-gpu=bar1.free --format=csv`
4	Softirq	`time_squeeze` increasing on specific CPU	`watch -n1 'awk ... /proc/net/softnet_stat'`
5	cgroup	`memory.events high` counter rising	`cat .../memory.events`

A Minimal One-Shot Diagnostic Script

rdma-health-check.sh — collect all signals at once

#!/bin/bash
# Collect all RDMA-adjacent health signals in one snapshot
NIC="mlx5_0"                      # RDMA device name
PCI="0000:04:00.0"
IB_PORT="/sys/class/infiniband/${NIC}/ports/1"
# ethtool needs the network interface name, not the RDMA device name
NETDEV=$(ls "/sys/class/infiniband/${NIC}/device/net/" 2>/dev/null | head -1)
TS=$(date +%Y%m%d_%H%M%S)

# Plain echo does not expand "\n" in bash, so print section headers with printf
section() { printf '\n--- %s ---\n' "$1"; }

echo "=== RDMA Health Snapshot: $TS ==="

section "PCIe Link State"
lspci -vv -s "$PCI" 2>/dev/null | grep -E "LnkSta:|LnkCap:"

section "PCIe AER Error Counts (non-zero)"
# aer_dev_{correctable,nonfatal,fatal} hold one "<error name> <count>" line per error type
grep -H -v ' 0$' /sys/bus/pci/devices/"$PCI"/aer_dev_* 2>/dev/null

section "NIC Health Reporters"
devlink health show "pci/$PCI" 2>/dev/null

section "RDMA Port Counters (non-zero)"
for f in "$IB_PORT"/counters/* "$IB_PORT"/hw_counters/*; do
  v=$(cat "$f" 2>/dev/null)
  [ -n "$v" ] && [ "$v" != "0" ] && echo "  $(basename "$f"): $v"
done

section "ethtool Counters (non-zero)"
# Counter names can contain digits (rx0_packets), so match on the value only
ethtool -S "$NETDEV" 2>/dev/null | grep -vE ':[[:space:]]+0$' | head -30

section "PSI Pressure"
grep -H . /proc/pressure/{cpu,memory,io}

section "Softnet Stats"
# Columns are hex; convert in bash because strtonum() is gawk-only
cpu=0
while read -r total dropped squeeze _; do
  d=$((16#$dropped)); s=$((16#$squeeze))
  if (( d > 0 || s > 0 )); then
    echo "CPU$cpu: total=$((16#$total)) dropped=$d squeeze=$s"
  fi
  cpu=$((cpu + 1))
done < /proc/net/softnet_stat

section "NUMA Stats"
numastat 2>/dev/null | head -15

section "GPU BAR1 + PCIe Link"
nvidia-smi --query-gpu=name,bar1.used,bar1.free,pcie.link.gen.current,pcie.link.width.current \
  --format=csv 2>/dev/null

section "Recent dmesg (errors only)"
# --since needs util-linux 2.37 or newer
dmesg --level=err,crit,alert,emerg --since="10 minutes ago" 2>/dev/null | tail -30

END

The hosts that fail silently are the hardest to debug — and they're the ones where every individual subsystem looks "fine" until you look at all of them together and see the cascade. The goal of this monitoring stack is not to generate more dashboards; it's to make the invisible causal chain between a PCIe AER event and an RDMA throughput collapse legible in under five minutes.

Instrument early. Log continuously. Correlate timestamps aggressively. The root cause is almost always in the layer you weren't watching.