A systems problem hiding below GPU utilization
When inference slows down, the first metrics people check are usually GPU utilization, HBM bandwidth, CPU utilization, network throughput, and queue depth. Those are necessary, but not sufficient. A node can look healthy at those layers while the CPU memory-management subsystem is constantly coordinating translation changes underneath the workload.
This is the motivation behind tlb-invalidation-lab: making TLB invalidation activity observable, attributable, and measurable in modern AI workloads.
From PCIe errors to TLB invalidation
There is a broader pattern in AI infrastructure: second-order system-maintenance costs become first-order performance variables at scale.
- PCIe correctable errors → data movement degradation
- TLB invalidation activity → address translation disruption
- scheduler pressure → execution instability
The PCIe problem is about data movement integrity. TLB invalidation is about translation stability. Both can compress effective compute capacity without looking like a clean hardware failure.
What actually happens during a TLB invalidation?
CPUs cache virtual-to-physical translations in Translation Lookaside Buffers. When Linux changes mappings, it must make sure CPUs do not keep using stale translations. That means invalidating affected TLB entries locally and, in many cases, coordinating with other CPUs that may have run the same address space.
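Before any custom instrumentation, x86 Linux already exposes one coarse signal for the cross-CPU case: remote invalidations arrive as TLB shootdown interrupts, counted per CPU in /proc/interrupts. A minimal sampler (a standalone sketch, not part of the repo; arm64 broadcasts invalidations in hardware and has no equivalent row):

```python
# Sample the aggregate TLB shootdown count from /proc/interrupts (x86 Linux).
import time

def tlb_shootdowns():
    with open("/proc/interrupts") as f:
        for line in f:
            if line.lstrip().startswith("TLB:"):
                # After the "TLB:" label: one count per CPU, then a description.
                return sum(int(tok) for tok in line.split()[1:] if tok.isdigit())
    return 0

before = tlb_shootdowns()
time.sleep(5)
print(f"TLB shootdowns/sec: {(tlb_shootdowns() - before) / 5:.1f}")
```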
Why AI inference workloads can trigger this
Inference is often presented as a GPU-dominant workload, but the host still manages memory, mappings, pinned buffers, runtime allocations, and device coordination.
Host-side triggers
- mmap(), munmap(), and mprotect() churn (see the sketch after this list)
- memory-mapped model files or state
- allocator pressure from multi-threaded runtimes
- page migration, compaction, and tiered memory behavior
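As a concrete example of the first trigger, here is a minimal churn generator in the spirit of the repo's benchmarks (a standalone sketch, not the repo's actual benchmark code): it maps, touches, and unmaps anonymous memory in a loop, forcing the kernel to tear down translations on every iteration.

```python
# Minimal mmap/munmap churn generator (sketch, not the repo's benchmark).
# Each iteration maps a region, writes one byte per page so translations
# actually get created, then unmaps it, forcing the kernel to invalidate
# the now-stale TLB entries for that range.
import mmap
import time

PAGE = mmap.PAGESIZE
REGION = 256 * PAGE  # 1 MiB with 4 KiB pages

def churn(seconds=10):
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        buf = mmap.mmap(-1, REGION)       # anonymous mapping
        for off in range(0, REGION, PAGE):
            buf[off] = 1                  # fault in each page
        buf.close()                       # munmap -> TLB invalidation
        iterations += 1
    print(f"{iterations} map/touch/unmap cycles in {seconds}s")

churn()
```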
Accelerator-adjacent triggers
- GPU Unified Virtual Memory activity
- RDMA buffer registration and deregistration
- pinned memory lifecycle changes
- KV cache movement and paging strategies
The central point is not that every inference system is dominated by TLB invalidations. The point is that when this path is active, most infrastructure stacks do not attribute the resulting latency to translation maintenance.
The P99 problem: invalidation bursts create stragglers
Invalidation activity is dangerous because it is bursty. A small amount of steady overhead might be tolerable. A burst that interrupts CPU threads at the wrong time can create queue instability and amplify tail latency.
```text
TLB invalidation burst
    ↓
CPU interruption / barrier cost / remote coordination
    ↓
runtime threads lose scheduling rhythm
    ↓
batching becomes less stable
    ↓
queue depth rises
    ↓
P99 latency spikes
```
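To make "bursty" concrete, here is a hypothetical helper (not part of the repo's tools) that flags burst windows in a per-second invalidation series by comparing each second against a trailing baseline:

```python
# Hypothetical burst detector: flags seconds where the invalidation rate
# exceeds a multiple of the trailing average. Not part of the repo's tools.
from collections import deque

def burst_windows(rates_per_sec, window=30, factor=4.0, floor=100.0):
    """Yield indices of seconds whose rate is >= factor * trailing mean."""
    history = deque(maxlen=window)
    for i, rate in enumerate(rates_per_sec):
        baseline = sum(history) / len(history) if history else 0.0
        if history and rate >= max(floor, factor * baseline):
            yield i
        history.append(rate)

# Example: a steady 200/s series with a 6,000/s spike at t=40
series = [200.0] * 60
series[40] = 6000.0
print(list(burst_windows(series)))  # -> [40]
```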
Kernel instrumentation: tracepoints, not printk spam
The repo uses kernel patch sketches to show where TLB invalidation and MMU notifier behavior could be surfaced. The preferred path is tracepoints: structured, filterable, and suitable for time-series collection.
TLB invalidation tracepoint sketch
```c
TRACE_EVENT(tlb_invalidation,
    TP_PROTO(struct mm_struct *mm,
             unsigned long start,
             unsigned long end,
             int target_cpus),
    TP_ARGS(mm, start, end, target_cpus),
    TP_STRUCT__entry(
        __field(pid_t, pid)
        __field(pid_t, tgid)
        __array(char, comm, TASK_COMM_LEN)
        __field(unsigned long, start)
        __field(unsigned long, end)
        __field(unsigned long, range_size)
        __field(int, target_cpus)
    ),
    TP_fast_assign(
        __entry->pid = current->pid;
        __entry->tgid = current->tgid;
        memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
        __entry->start = start;
        __entry->end = end;
        __entry->range_size = end > start ? end - start : 0;
        __entry->target_cpus = target_cpus;
    ),
    TP_printk("pid=%d tgid=%d comm=%s start=0x%lx end=0x%lx size=%lu cpus=%d",
              __entry->pid,
              __entry->tgid,
              __entry->comm,
              __entry->start,
              __entry->end,
              __entry->range_size,
              __entry->target_cpus)
);
```
MMU notifier tracepoint sketch
```c
TRACE_EVENT(mmu_notifier_invalidation,
    TP_PROTO(struct mm_struct *mm,
             unsigned long start,
             unsigned long end,
             int phase),
    TP_ARGS(mm, start, end, phase),
    TP_STRUCT__entry(
        __field(pid_t, pid)
        __field(pid_t, tgid)
        __array(char, comm, TASK_COMM_LEN)
        __field(unsigned long, start)
        __field(unsigned long, end)
        __field(unsigned long, range_size)
        __field(int, phase)
    ),
    TP_fast_assign(
        __entry->pid = current->pid;
        __entry->tgid = current->tgid;
        memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
        __entry->start = start;
        __entry->end = end;
        __entry->range_size = end > start ? end - start : 0;
        __entry->phase = phase;
    ),
    TP_printk("pid=%d tgid=%d comm=%s start=0x%lx end=0x%lx size=%lu phase=%d",
              __entry->pid,
              __entry->tgid,
              __entry->comm,
              __entry->start,
              __entry->end,
              __entry->range_size,
              __entry->phase)
);
```
The important design choice is avoiding high-volume logging in hot paths. Tracepoints can be disabled by default and enabled only when investigating a node or running a controlled benchmark.
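If these sketches were applied to a lab kernel, enabling and reading them would go through tracefs. The sketch below assumes the tracepoint lands under an "mm" event group, which is an assumption of this example, not something the kernel guarantees:

```python
# Sketch: enable the (hypothetical) tlb_invalidation tracepoint and stream
# events from trace_pipe. Paths assume tracefs mounted at /sys/kernel/tracing
# and assume the patched kernel places the event under the "mm" group.
TRACEFS = "/sys/kernel/tracing"
EVENT = f"{TRACEFS}/events/mm/tlb_invalidation/enable"  # hypothetical path

def set_enabled(path, on=True):
    with open(path, "w") as f:
        f.write("1" if on else "0")

set_enabled(EVENT)
try:
    with open(f"{TRACEFS}/trace_pipe") as pipe:
        for _ in range(20):           # read a handful of events, then stop
            print(pipe.readline(), end="")
finally:
    set_enabled(EVENT, on=False)      # leave the hot path quiet afterwards
```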
Architecture probes: x86-64 and arm64
The repo includes educational architecture probe modules that show the instruction-level reality behind invalidation. These are lab-only examples, not replacements for Linux MM code.
x86-64: invlpg
```c
static inline void lab_invlpg(void *addr)
{
    asm volatile("invlpg (%0)" : : "r" (addr) : "memory");
}
```
The instruction invalidates the TLB entry for the page containing the address. Real Linux code chooses between targeted invalidation, PCID-aware behavior, full flushes, and hardware-assisted paths depending on architecture support and workload shape.
arm64: TLBI + barriers
```asm
dsb  ishst              // ensure prior page-table stores are visible
tlbi vaae1is, xN        // invalidate VA in xN, all ASIDs, inner shareable
dsb  ish                // wait for the invalidation to complete everywhere
isb                     // resynchronize the instruction stream
```
The barriers are part of the cost. arm64 invalidation is not just a single conceptual operation; ordering and visibility matter.
The observability workflow
The goal is not merely to count invalidations. The goal is to correlate invalidation bursts with latency, GPU behavior, and scheduler pressure.
One-command experiment
```bash
./run_experiment.sh
```
Manual workflow
```bash
# Build and run synthetic memory-management stressors
cd benchmarks && make && ./run_bench.sh

# Collect kernel trace data
sudo ./tools/collect_trace.sh --duration 30

# Parse timestamps and generate time-series
python3 tools/parse_trace.py --input out/tlb_trace.txt --out out/tlb_timeseries.csv

# Sample GPU and scheduler behavior
python3 tools/sample_gpu.py --duration 30 --interval 1
python3 tools/sample_sched.py --duration 30 --interval 1

# Correlate invalidation activity with latency/GPU/scheduler signals
python3 tools/correlate_latency.py \
    --tlb out/tlb_timeseries.csv \
    --gpu out/gpu_timeseries.csv \
    --sched out/sched_timeseries.csv

# Generate plots
python3 tools/plot_correlation.py --input out/correlation.csv
```
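For orientation, this is the kind of reduction the parsing step performs conceptually (a sketch, not the repo's actual parse_trace.py): bucket raw ftrace lines into per-second invalidation counts, keyed on the timestamp field of the standard ftrace line format.

```python
# Sketch of the parsing step (not the repo's actual code): reduce ftrace
# output to per-second event counts suitable for time-series correlation.
import re
import sys
from collections import Counter

# Standard ftrace lines carry a "<seconds>.<usecs>:" timestamp before the
# event name, e.g. "bench-1234 [003] d..1 5012.345678: tlb_invalidation: ..."
TS = re.compile(r"\s(\d+)\.\d+: tlb_invalidation:")

def per_second_counts(lines):
    counts = Counter()
    for line in lines:
        m = TS.search(line)
        if m:
            counts[int(m.group(1))] += 1
    return counts

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        for second, n in sorted(per_second_counts(f).items()):
            print(f"{second},{n}")
```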
What the metrics mean
| Metric | Meaning | Why it matters |
|---|---|---|
| invalidations/sec | How frequently translation entries are invalidated | Captures bursty translation churn |
| bytes invalidated/sec | Total virtual address range invalidated over time | Separates tiny churn from large range invalidation |
| target_cpu_count | Approximate cross-core fanout | Higher fanout often means more coordination cost |
| per-process labels | Process-level attribution | Useful on multi-tenant inference nodes |
| normalized score | Invalidations per second per core | Avoids falsely flagging large servers as unhealthy |
Health score normalization
```python
def health_status(invalidations_per_second: float, cpu_count: int) -> str:
    # Normalize by core count so large servers are not falsely flagged.
    normalized_invalidations = invalidations_per_second / cpu_count
    if normalized_invalidations < 5:
        return "green"
    elif normalized_invalidations < 20:
        return "yellow"
    return "red"
```
Raw thresholds are misleading. A 256-core server will naturally produce different aggregate counts than an 8-core workstation. Normalizing per core gives a more portable starting point.
Example synthetic result
The repo includes example data to demonstrate the workflow. These numbers are synthetic and are not a claim about any particular production system.
| Workload | Threads | Invalidations/sec | Per-core rate (inv/s/core) | P99 latency | Interpretation |
|---|---|---|---|---|---|
| baseline | 1 | 180 | 2.8 | 4.1 ms | Low translation churn |
| mmap churn | 8 | 2,900 | 45.3 | 17.8 ms | High invalidation activity |
| mprotect churn | 16 | 5,800 | 90.6 | 31.4 ms | Severe synthetic stress |
| mixed sampler | 16 | 4,200 | 65.6 | 24.7 ms | Scheduler pressure also elevated |
Prometheus metrics with process attribution
Aggregate-only metrics are not enough on multi-tenant nodes. The exporter should expose top processes while limiting cardinality.
```text
tlb_invalidations_total{pid="2341",comm="python"} 18420
tlb_invalidation_rate{pid="2341",comm="python"} 612.4
tlb_bytes_invalidated_total{pid="2341",comm="python"} 734003200
tlb_health_score{pid="2341",comm="python"} 82
tlb_invalidations_total{pid="other",comm="other"} 9012
```
The cardinality guard matters. You do not want a debugging exporter to become the next source of production pain.
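A minimal sketch of that guard using the prometheus_client library (the repo's exporter may be structured differently): the top N processes by rate get their own series, and everything else is folded into a single bounded pid="other" series.

```python
# Sketch of a cardinality-guarded exporter using prometheus_client.
# Not the repo's exporter; it shows the top-N-plus-"other" idea only.
import time
from prometheus_client import Gauge, start_http_server

TOP_N = 10
RATE = Gauge("tlb_invalidation_rate", "TLB invalidations per second",
             ["pid", "comm"])

def publish(rates):
    """rates: {(pid, comm): invalidations_per_second} for one interval."""
    RATE.clear()  # drop the previous interval's label sets
    ranked = sorted(rates.items(), key=lambda kv: kv[1], reverse=True)
    for (pid, comm), rate in ranked[:TOP_N]:
        RATE.labels(pid=str(pid), comm=comm).set(rate)
    # One bounded residual series instead of one series per process.
    RATE.labels(pid="other", comm="other").set(sum(r for _, r in ranked[TOP_N:]))

start_http_server(9400)  # arbitrary port for this sketch
while True:              # a real exporter would sample real per-process data
    publish({(2341, "python"): 612.4})
    time.sleep(15)
```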
QEMU bootstrap: reducing adoption friction
Kernel observability projects often fail because the first step is “build a custom kernel and good luck.” The repo includes QEMU scaffolding so users can test the concept in a lab environment.
```bash
cd vm
./run_qemu.sh
```
The current VM path should be understood as a kernel boot smoke test unless a full rootfs/initramfs is provided. A future improvement is a tiny BusyBox initramfs that mounts tracefs and runs the benchmark suite inside the guest.
What this can reveal
Useful signals
- Invalidation bursts preceding P99 spikes
- GPU utilization dips during high MMU activity
- Specific processes responsible for translation churn
- Scheduler pressure amplifying invalidation effects
Common confounders
- CPU runqueue pressure (see the PSI sketch after this list)
- page cache behavior
- NUMA locality changes
- allocator changes unrelated to GPU execution
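One way to check the first confounder is the kernel's pressure stall information (PSI). A minimal sketch, assuming a PSI-enabled kernel; the repo's sample_sched.py may use different signals:

```python
# Sketch: sample CPU pressure (PSI) to rule runqueue pressure in or out
# before attributing latency to invalidation activity. Requires CONFIG_PSI.
def cpu_pressure_avg10():
    with open("/proc/pressure/cpu") as f:
        for line in f:
            if line.startswith("some"):
                # Format: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
                fields = dict(tok.split("=") for tok in line.split()[1:])
                return float(fields["avg10"])
    return 0.0

print(f"CPU pressure (avg10): {cpu_pressure_avg10():.2f}%")
```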
What this is not
This project is intentionally conservative. It does not claim to fix Linux memory management, eliminate invalidations, or prove that TLB invalidation is the bottleneck in every AI inference system.
The broader thesis
AI infrastructure is often discussed as if the stack is only compute, memory bandwidth, and networking. But at scale, the machine’s internal coordination costs become visible:
Compute is not just FLOPs. Memory is not just bandwidth. Networking is not just throughput. The system also pays for consistency, translation, ordering, and recovery.
PCIe errors expose data movement degradation. TLB invalidations expose address translation disruption. Scheduler pressure exposes execution instability. These are hidden variables in effective capacity.
The better question
Not "is the GPU busy?" but "what is this node paying in translation maintenance and cross-CPU coordination underneath the workload?" Answering that requires attribution, which is what this lab is built to provide.
Repository
The project is here:
github.com/manishklach/tlb-invalidation-lab
Use the patch sketches and architecture probes only on lab machines, VMs, or disposable test kernels.