A systems problem hiding below GPU utilization
When inference slows down, the first metrics people check are usually GPU utilization, HBM bandwidth, CPU utilization, network throughput, and queue depth. Those are necessary, but not sufficient. A node can look healthy at those layers while the CPU memory-management subsystem is constantly coordinating translation changes underneath the workload.
This is the motivation behind tlb-invalidation-lab: making TLB invalidation activity observable, attributable, and measurable in modern AI workloads.
From PCIe errors to TLB invalidation
There is a broader pattern in AI infrastructure: second-order system-maintenance costs become first-order performance variables at scale.
- PCIe correctable errors → data movement degradation
- TLB invalidation activity → address translation disruption
- scheduler pressure → execution instability
The PCIe problem is about data movement integrity. TLB invalidation is about translation stability. Both can compress effective compute capacity without looking like a clean hardware failure.
What actually happens during a TLB invalidation?
CPUs cache virtual-to-physical translations in Translation Lookaside Buffers. When Linux changes mappings, it must make sure CPUs do not keep using stale translations. That means invalidating affected TLB entries locally and, in many cases, coordinating with other CPUs that may have run the same address space.
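Before any custom instrumentation, x86 Linux already exposes one coarse signal for the cross-CPU case: remote invalidations arrive as TLB shootdown interrupts, counted per CPU in /proc/interrupts. A minimal sampler (a standalone sketch, not part of the repo; arm64 broadcasts invalidations in hardware and has no equivalent row):

```python
# Sample the aggregate TLB shootdown count from /proc/interrupts (x86 Linux).
import time

def tlb_shootdowns():
    with open("/proc/interrupts") as f:
        for line in f:
            if line.lstrip().startswith("TLB:"):
                # After the "TLB:" label: one count per CPU, then a description.
                return sum(int(tok) for tok in line.split()[1:] if tok.isdigit())
    return 0

before = tlb_shootdowns()
time.sleep(5)
print(f"TLB shootdowns/sec: {(tlb_shootdowns() - before) / 5:.1f}")
```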
Why AI inference workloads can trigger this
Inference is often presented as a GPU-dominant workload, but the host still manages memory, mappings, pinned buffers, runtime allocations, and device coordination.
Host-side triggers
- mmap(), munmap(), and mprotect() churn (see the sketch after this list)
- memory-mapped model files or state
- allocator pressure from multi-threaded runtimes
- page migration, compaction, and tiered memory behavior
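As a concrete example of the first trigger, here is a minimal churn generator in the spirit of the repo's benchmarks (a standalone sketch, not the repo's actual benchmark code): it maps, touches, and unmaps anonymous memory in a loop, forcing the kernel to tear down translations on every iteration.

```python
# Minimal mmap/munmap churn generator (sketch, not the repo's benchmark).
# Each iteration maps a region, writes one byte per page so translations
# actually get created, then unmaps it, forcing the kernel to invalidate
# the now-stale TLB entries for that range.
import mmap
import time

PAGE = mmap.PAGESIZE
REGION = 256 * PAGE  # 1 MiB with 4 KiB pages

def churn(seconds=10):
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        buf = mmap.mmap(-1, REGION)       # anonymous mapping
        for off in range(0, REGION, PAGE):
            buf[off] = 1                  # fault in each page
        buf.close()                       # munmap -> TLB invalidation
        iterations += 1
    print(f"{iterations} map/touch/unmap cycles in {seconds}s")

churn()
```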
Accelerator-adjacent triggers
- GPU Unified Virtual Memory activity
- RDMA buffer registration and deregistration
- pinned memory lifecycle changes
- KV cache movement and paging strategies
The central point is not that every inference system is dominated by TLB invalidations. The point is that when this path is active, most infrastructure stacks do not attribute the resulting latency to translation maintenance.
The P99 problem: invalidation bursts create stragglers
Invalidation activity is dangerous because it is bursty. A small amount of steady overhead might be tolerable. A burst that interrupts CPU threads at the wrong time can create queue instability and amplify tail latency.
```text
TLB invalidation burst
    ↓
CPU interruption / barrier cost / remote coordination
    ↓
runtime threads lose scheduling rhythm
    ↓
batching becomes less stable
    ↓
queue depth rises
    ↓
P99 latency spikes
```
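To make "bursty" concrete, here is a hypothetical helper (not part of the repo's tools) that flags burst windows in a per-second invalidation series by comparing each second against a trailing baseline:

```python
# Hypothetical burst detector: flags seconds where the invalidation rate
# exceeds a multiple of the trailing average. Not part of the repo's tools.
from collections import deque

def burst_windows(rates_per_sec, window=30, factor=4.0, floor=100.0):
    """Yield indices of seconds whose rate is >= factor * trailing mean."""
    history = deque(maxlen=window)
    for i, rate in enumerate(rates_per_sec):
        baseline = sum(history) / len(history) if history else 0.0
        if history and rate >= max(floor, factor * baseline):
            yield i
        history.append(rate)

# Example: a steady 200/s series with a 6,000/s spike at t=40
series = [200.0] * 60
series[40] = 6000.0
print(list(burst_windows(series)))  # -> [40]
```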
Kernel instrumentation: tracepoints, not printk spam
The repo uses kernel patch sketches to show where TLB invalidation and MMU notifier behavior could be surfaced. The preferred path is tracepoints: structured, filterable, and suitable for time-series collection.
TLB invalidation tracepoint sketch
```c
TRACE_EVENT(tlb_invalidation,
    TP_PROTO(struct mm_struct *mm,
             unsigned long start,
             unsigned long end,
             int target_cpus),
    TP_ARGS(mm, start, end, target_cpus),
    TP_STRUCT__entry(
        __field(pid_t, pid)
        __field(pid_t, tgid)
        __array(char, comm, TASK_COMM_LEN)
        __field(unsigned long, start)
        __field(unsigned long, end)
        __field(unsigned long, range_size)
        __field(int, target_cpus)
    ),
    TP_fast_assign(
        __entry->pid = current->pid;
        __entry->tgid = current->tgid;
        memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
        __entry->start = start;
        __entry->end = end;
        __entry->range_size = end > start ? end - start : 0;
        __entry->target_cpus = target_cpus;
    ),
    TP_printk("pid=%d tgid=%d comm=%s start=0x%lx end=0x%lx size=%lu cpus=%d",
              __entry->pid,
              __entry->tgid,
              __entry->comm,
              __entry->start,
              __entry->end,
              __entry->range_size,
              __entry->target_cpus)
);
```
MMU notifier tracepoint sketch
```c
TRACE_EVENT(mmu_notifier_invalidation,
    TP_PROTO(struct mm_struct *mm,
             unsigned long start,
             unsigned long end,
             int phase),
    TP_ARGS(mm, start, end, phase),
    TP_STRUCT__entry(
        __field(pid_t, pid)
        __field(pid_t, tgid)
        __array(char, comm, TASK_COMM_LEN)
        __field(unsigned long, start)
        __field(unsigned long, end)
        __field(unsigned long, range_size)
        __field(int, phase)
    ),
    TP_fast_assign(
        __entry->pid = current->pid;
        __entry->tgid = current->tgid;
        memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
        __entry->start = start;
        __entry->end = end;
        __entry->range_size = end > start ? end - start : 0;
        __entry->phase = phase;
    ),
    TP_printk("pid=%d tgid=%d comm=%s start=0x%lx end=0x%lx size=%lu phase=%d",
              __entry->pid,
              __entry->tgid,
              __entry->comm,
              __entry->start,
              __entry->end,
              __entry->range_size,
              __entry->phase)
);
```
The important design choice is avoiding high-volume logging in hot paths. Tracepoints can be disabled by default and enabled only when investigating a node or running a controlled benchmark.
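If these sketches were applied to a lab kernel, enabling and reading them would go through tracefs. The sketch below assumes the tracepoint lands under an "mm" event group, which is an assumption of this example, not something the kernel guarantees:

```python
# Sketch: enable the (hypothetical) tlb_invalidation tracepoint and stream
# events from trace_pipe. Paths assume tracefs mounted at /sys/kernel/tracing
# and assume the patched kernel places the event under the "mm" group.
TRACEFS = "/sys/kernel/tracing"
EVENT = f"{TRACEFS}/events/mm/tlb_invalidation/enable"  # hypothetical path

def set_enabled(path, on=True):
    with open(path, "w") as f:
        f.write("1" if on else "0")

set_enabled(EVENT)
try:
    with open(f"{TRACEFS}/trace_pipe") as pipe:
        for _ in range(20):           # read a handful of events, then stop
            print(pipe.readline(), end="")
finally:
    set_enabled(EVENT, on=False)      # leave the hot path quiet afterwards
```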
Architecture probes: x86-64 and arm64
The repo includes educational architecture probe modules that show the instruction-level reality behind invalidation. These are lab-only examples, not replacements for Linux MM code.
x86-64: invlpg
```c
static inline void lab_invlpg(void *addr)
{
    asm volatile("invlpg (%0)" : : "r" (addr) : "memory");
}
```
The instruction invalidates the TLB entry for the page containing the address. Real Linux code chooses between targeted invalidation, PCID-aware behavior, full flushes, and hardware-assisted paths depending on architecture support and workload shape.
arm64: TLBI + barriers
```asm
dsb  ishst              // ensure prior page-table stores are visible
tlbi vaae1is, xN        // invalidate VA in xN, all ASIDs, inner shareable
dsb  ish                // wait for the invalidation to complete everywhere
isb                     // resynchronize the instruction stream
```
The barriers are part of the cost. arm64 invalidation is not just a single conceptual operation; ordering and visibility matter.
The observability workflow
The goal is not merely to count invalidations. The goal is to correlate invalidation bursts with latency, GPU behavior, and scheduler pressure.
One-command experiment
```bash
./run_experiment.sh
```
Manual workflow
```bash
# Build and run synthetic memory-management stressors
cd benchmarks && make && ./run_bench.sh

# Collect kernel trace data
sudo ./tools/collect_trace.sh --duration 30

# Parse timestamps and generate time-series
python3 tools/parse_trace.py --input out/tlb_trace.txt --out out/tlb_timeseries.csv

# Sample GPU and scheduler behavior
python3 tools/sample_gpu.py --duration 30 --interval 1
python3 tools/sample_sched.py --duration 30 --interval 1

# Correlate invalidation activity with latency/GPU/scheduler signals
python3 tools/correlate_latency.py \
    --tlb out/tlb_timeseries.csv \
    --gpu out/gpu_timeseries.csv \
    --sched out/sched_timeseries.csv

# Generate plots
python3 tools/plot_correlation.py --input out/correlation.csv
```
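For orientation, this is the kind of reduction the parsing step performs conceptually (a sketch, not the repo's actual parse_trace.py): bucket raw ftrace lines into per-second invalidation counts, keyed on the timestamp field of the standard ftrace line format.

```python
# Sketch of the parsing step (not the repo's actual code): reduce ftrace
# output to per-second event counts suitable for time-series correlation.
import re
import sys
from collections import Counter

# Standard ftrace lines carry a "<seconds>.<usecs>:" timestamp before the
# event name, e.g. "bench-1234 [003] d..1 5012.345678: tlb_invalidation: ..."
TS = re.compile(r"\s(\d+)\.\d+: tlb_invalidation:")

def per_second_counts(lines):
    counts = Counter()
    for line in lines:
        m = TS.search(line)
        if m:
            counts[int(m.group(1))] += 1
    return counts

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        for second, n in sorted(per_second_counts(f).items()):
            print(f"{second},{n}")
```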
What the metrics mean
| Metric | Meaning | Why it matters |
|---|---|---|
| invalidations/sec | How frequently translation entries are invalidated | Captures bursty translation churn |
| bytes invalidated/sec | Total virtual address range invalidated over time | Separates tiny churn from large range invalidation |
| target_cpu_count | Approximate cross-core fanout | Higher fanout often means more coordination cost |
| per-process labels | Process-level attribution | Useful on multi-tenant inference nodes |
| normalized score | Invalidations per second per core | Avoids falsely flagging large servers as unhealthy |
Health score normalization
```python
def health_status(invalidations_per_second: float, cpu_count: int) -> str:
    # Normalize by core count so large servers are not falsely flagged.
    normalized_invalidations = invalidations_per_second / cpu_count
    if normalized_invalidations < 5:
        return "green"
    elif normalized_invalidations < 20:
        return "yellow"
    return "red"
```
Raw thresholds are misleading. A 256-core server will naturally produce different aggregate counts than an 8-core workstation. Normalizing per core gives a more portable starting point.
Example synthetic result
The repo includes example data to demonstrate the workflow. These numbers are synthetic and are not a claim about any particular production system.
| Workload | Threads | Invalidations/sec | Per-core rate (inv/s/core) | P99 latency | Interpretation |
|---|---|---|---|---|---|
| baseline | 1 | 180 | 2.8 | 4.1 ms | Low translation churn |
| mmap churn | 8 | 2,900 | 45.3 | 17.8 ms | High invalidation activity |
| mprotect churn | 16 | 5,800 | 90.6 | 31.4 ms | Severe synthetic stress |
| mixed sampler | 16 | 4,200 | 65.6 | 24.7 ms | Scheduler pressure also elevated |
Prometheus metrics with process attribution
Aggregate-only metrics are not enough on multi-tenant nodes. The exporter should expose top processes while limiting cardinality.
```text
tlb_invalidations_total{pid="2341",comm="python"} 18420
tlb_invalidation_rate{pid="2341",comm="python"} 612.4
tlb_bytes_invalidated_total{pid="2341",comm="python"} 734003200
tlb_health_score{pid="2341",comm="python"} 82
tlb_invalidations_total{pid="other",comm="other"} 9012
```
The cardinality guard matters. You do not want a debugging exporter to become the next source of production pain.
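A minimal sketch of that guard using the prometheus_client library (the repo's exporter may be structured differently): the top N processes by rate get their own series, and everything else is folded into a single bounded pid="other" series.

```python
# Sketch of a cardinality-guarded exporter using prometheus_client.
# Not the repo's exporter; it shows the top-N-plus-"other" idea only.
import time
from prometheus_client import Gauge, start_http_server

TOP_N = 10
RATE = Gauge("tlb_invalidation_rate", "TLB invalidations per second",
             ["pid", "comm"])

def publish(rates):
    """rates: {(pid, comm): invalidations_per_second} for one interval."""
    RATE.clear()  # drop the previous interval's label sets
    ranked = sorted(rates.items(), key=lambda kv: kv[1], reverse=True)
    for (pid, comm), rate in ranked[:TOP_N]:
        RATE.labels(pid=str(pid), comm=comm).set(rate)
    # One bounded residual series instead of one series per process.
    RATE.labels(pid="other", comm="other").set(sum(r for _, r in ranked[TOP_N:]))

start_http_server(9400)  # arbitrary port for this sketch
while True:              # a real exporter would sample real per-process data
    publish({(2341, "python"): 612.4})
    time.sleep(15)
```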
QEMU bootstrap: reducing adoption friction
Kernel observability projects often fail because the first step is “build a custom kernel and good luck.” The repo includes QEMU scaffolding so users can test the concept in a lab environment.
```bash
cd vm
./run_qemu.sh
```
The current VM path should be understood as a kernel boot smoke test unless a full rootfs/initramfs is provided. A future improvement is a tiny BusyBox initramfs that mounts tracefs and runs the benchmark suite inside the guest.
What this can reveal
Useful signals
- Invalidation bursts preceding P99 spikes
- GPU utilization dips during high MMU activity
- Specific processes responsible for translation churn
- Scheduler pressure amplifying invalidation effects
Common confounders
- CPU runqueue pressure (see the PSI sketch after this list)
- page cache behavior
- NUMA locality changes
- allocator changes unrelated to GPU execution
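One way to check the first confounder is the kernel's pressure stall information (PSI). A minimal sketch, assuming a PSI-enabled kernel; the repo's sample_sched.py may use different signals:

```python
# Sketch: sample CPU pressure (PSI) to rule runqueue pressure in or out
# before attributing latency to invalidation activity. Requires CONFIG_PSI.
def cpu_pressure_avg10():
    with open("/proc/pressure/cpu") as f:
        for line in f:
            if line.startswith("some"):
                # Format: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
                fields = dict(tok.split("=") for tok in line.split()[1:])
                return float(fields["avg10"])
    return 0.0

print(f"CPU pressure (avg10): {cpu_pressure_avg10():.2f}%")
```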
What this is not
This project is intentionally conservative. It does not claim to fix Linux memory management, eliminate invalidations, or prove that TLB invalidation is the bottleneck in every AI inference system.
The broader thesis
AI infrastructure is often discussed as if the stack is only compute, memory bandwidth, and networking. But at scale, the machine’s internal coordination costs become visible:
Compute is not just FLOPs. Memory is not just bandwidth. Networking is not just throughput. The system also pays for consistency, translation, ordering, and recovery.
PCIe errors expose data movement degradation. TLB invalidations expose address translation disruption. Scheduler pressure exposes execution instability. These are hidden variables in effective capacity.
The better question
Not "is the GPU busy?" but "what is this node paying in translation maintenance and cross-CPU coordination underneath the workload?" Answering that requires attribution, which is what this lab is built to provide.
Repository
The project is here:
github.com/manishklach/tlb-invalidation-lab
Use the patch sketches and architecture probes only on lab machines, VMs, or disposable test kernels.