Workload model: agentic AI as a kernel-latency workload
An agent runtime is not a batch job. It is a real-time loop: plan the next action, invoke a tool, observe the result, update state, plan again. Each iteration is short. Each iteration crosses the kernel boundary many times. And iterations are synchronous — the agent cannot begin step N+1 until step N completes. This makes latency, not throughput, the primary performance dimension.
The hot path for a single agent step, at the syscall level, is a chain of kernel crossings: reads and writes on sockets and pipes for the model API, fork and execve for tool subprocesses, openat and stat storms across the working tree, and futex wakeups between the orchestrator and its workers.
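To make that chain concrete, here is a minimal sketch (my addition, not from the original text) of one tool invocation in userspace C. Every annotated call is a kernel crossing the agent step pays for:

```c
/* Hypothetical sketch: the kernel crossings behind one agent tool call.
 * run_tool() forks, execs argv[0], captures the child's stdout over a
 * pipe, and reaps the child. Each annotated line is a syscall. */
#include <assert.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static int run_tool(char *const argv[], char *out, size_t outsz) {
    int pfd[2];
    if (pipe(pfd) < 0)                    /* syscall: pipe2 */
        return -1;
    pid_t pid = fork();                   /* syscall: clone — copy_process() */
    if (pid < 0)
        return -1;
    if (pid == 0) {                       /* child: becomes the tool */
        dup2(pfd[1], STDOUT_FILENO);      /* syscall: dup2 */
        close(pfd[0]);
        close(pfd[1]);
        execv(argv[0], argv);             /* syscall: execve — load_elf_binary() */
        _exit(127);                       /* only reached if exec failed */
    }
    close(pfd[1]);
    ssize_t n, total = 0;
    while ((n = read(pfd[0], out + total, outsz - 1 - total)) > 0)
        total += n;                       /* syscall: read on the pipe */
    out[total] = '\0';
    close(pfd[0]);
    int status;
    waitpid(pid, &status, 0);             /* syscall: wait4 — sleeps until SIGCHLD */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Every agent step repeats some variant of this chain; the fork/exec cost breakdown later in this section applies to the clone and execve lines above.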
The critical insight is latency amplification through repetition: a 2ms penalty that would be irrelevant in a batch job is applied hundreds of times per agent session. Sum a realistic per-step budget of syscall overhead, page faults, scheduler wakeups, and journal commits across a few hundred steps, and you arrive at something like 22 seconds of kernel overhead that is invisible to every application-level profiler. The model inference time was fine. The framework code was fine. The kernel was the bottleneck the whole time.
Scheduler wakeup latency: the hidden tax
Agent runtimes are structurally multi-process and multi-threaded. A typical step involves: the orchestrator thread dispatching a task to a worker pool, the worker spawning a subprocess for a tool, reading its output over a pipe, signaling a completion futex, and the orchestrator waking to process the result. That is five scheduler events per tool call, minimum.
The Linux CFS scheduler provides excellent average-case fairness and throughput. It is not optimized for the pattern agentic workloads impose: very short CPU bursts (10–500µs) with very high wakeup frequency (thousands per second) from a single latency-sensitive orchestrator thread.
Where wakeup latency comes from
Between the sched_wakeup event and the sched_switch that actually runs the thread, the kernel must select a target CPU, acquire that CPU's runqueue lock, and, if the wakeup crosses cores, deliver a rescheduling IPI and wait for the target CPU to context-switch. Each stage is cheap in isolation; under load, every one of them becomes a tail-latency source.
The subprocess spawn penalty
Many agent tools execute as subprocesses: shell commands, Python scripts, MCP servers, external binaries. Each fork() + execve() carries a large kernel cost that is easy to underestimate:
# Measured on a modern server: agent tool spawn breakdown
# (perf stat -e sched:sched_process_fork,sched:sched_process_exec)
fork():
  copy_process()        ~180µs   # mm_struct dup, fd table copy
  dup_mm() + cow setup   ~90µs   # page table copy, not the pages
  sched_fork()           ~15µs   # new task_struct init
execve():
  bprm_mm_init()         ~60µs   # new address space
  load_elf_binary()     ~240µs   # ELF parsing, mmap segments
  flush_old_exec()       ~45µs   # TLB full shootdown
  setup_arg_pages()      ~20µs   # stack mapping
------------------------------------------
Total fork+exec:       ~650µs per tool invocation
× 200 tool calls:       130ms pure kernel overhead
The flush_old_exec() call in execve() tears down the old address space and flushes its TLB entries, sending IPIs to every CPU in the mm's cpumask — that is, every CPU that has run this address space. On a 128-core server this can mean up to 127 IPIs, each requiring a cross-core acknowledgment. On a NUMA machine, remote-node IPIs take 3–5× longer than local ones. This single call can consume 300–500µs on a heavily loaded machine and is triggered by every subprocess spawn.
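A userspace mitigation these numbers suggest (my addition, not from the original text): spawn tools with posix_spawn(), which glibc implements via clone(CLONE_VM | CLONE_VFORK), skipping the page-table duplication that dominates fork() for a large orchestrator process:

```c
/* Sketch: posix_spawn avoids duplicating the parent's page tables,
 * which is the dominant fork() cost when the agent process is large. */
#include <assert.h>
#include <spawn.h>
#include <sys/wait.h>

extern char **environ;

static int spawn_tool(const char *path, char *const argv[]) {
    pid_t pid;
    /* vfork-like clone path: no mm_struct duplication, no COW setup */
    int rc = posix_spawn(&pid, path, NULL, NULL, argv, environ);
    if (rc != 0)
        return -1;
    int status;
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

This does not touch the execve() side of the cost (ELF load, TLB flush), but it removes the copy_process() and COW-setup lines from the table above.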
The wakeup latency histogram: what perf shows
# bpftrace: measure wakeup latency for agent orchestrator threads
$ bpftrace -e '
tracepoint:sched:sched_wakeup { @ts[args->pid] = nsecs; }
tracepoint:sched:sched_switch {
if (@ts[args->next_pid]) {
$lat = (nsecs - @ts[args->next_pid]) / 1000;
@wakeup_us = hist($lat);
delete(@ts[args->next_pid]);
}
}'
# Example output for a Claude Code agent session:
@wakeup_us:
[1, 2) ████████████████████ 8432 # fast path: same CPU
[2, 4) ██████████ 4210 # cross-core same socket
[4, 8) ████████ 3109 # different socket
[8, 16) ████ 1580 # under moderate load
[16, 32) ███ 940 # runqueue lock contention
[32, 64) ██ 470 # NUMA migration triggered
[64, 128) █ 220 # heavy load
[128, 256) 48 # severe contention
[256, ∞) 12 # outliers — kill these
The long tail above 64µs is where agent step latency suffers. These are not random events; they cluster around specific causes: NUMA migration, runqueue lock contention, and CFS declining to preempt on wakeup when the orchestrator's vruntime has drifted ahead of the other tasks on its CPU.
The NUMA migration trap
When an agent spawns a tool process that runs briefly on a remote NUMA node, the NUMA balancer may later decide to migrate the orchestrator thread toward that node to "follow" the memory access pattern. This migration makes the orchestrator worse: it is now far from its own hot pages and far from the GPU it may need to dispatch work to. The sched_migrate_task tracepoint fires when this happens but carries no semantic information about whether the migration is helping or hurting.
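A blunt but effective countermeasure (my suggestion, not from the original text): pin the orchestrator thread to a fixed CPU so that neither the NUMA balancer nor periodic load balancing can move it away from its hot cache lines. A sketch using sched_setaffinity(), with CPU 0 as an arbitrary example:

```c
/* Sketch: pin the calling (orchestrator) thread to one CPU and
 * verify the pin took effect by reading the mask back. */
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 = calling thread; fails if `cpu` is offline or disallowed */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return -1;
    cpu_set_t check;
    CPU_ZERO(&check);
    if (sched_getaffinity(0, sizeof(check), &check) != 0)
        return -1;
    /* confirm exactly one CPU remains in the mask */
    return (CPU_COUNT(&check) == 1 && CPU_ISSET(cpu, &check)) ? 0 : -1;
}
```

The trade-off: a pinned thread cannot escape a noisy neighbor on its CPU, so pinning works best when the orchestrator's CPU is also shielded (cpusets, isolcpus) from background load.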
Memory: page-cache thrash, mmap faults, and TLB pressure
Agentic AI workloads have a distinctive memory access signature: a small hot set (the retrieval index, the active context, recent tool outputs) accessed with high frequency, surrounded by a large cold set (repos being scanned, historical logs, documents being parsed once) that constantly churns through the page cache.
3.1 Page-cache thrash: semantic blindness
Linux's page cache uses an LRU-based replacement policy with an active/inactive split. The kernel has no concept of why a page was accessed — whether it contains a hot retrieval index shard that will be needed in the next 100 milliseconds, or a cold log entry that will never be read again. A single large find . -name "*.py" | xargs grep "pattern" tool call can evict gigabytes of hot index pages that took seconds to fault in.
3.2 mmap fault storms: the worst-case retrieval pattern
Vector indexes — FAISS, Chroma, Weaviate's mmap files — are typically memory-mapped. The first access to a page in a mapped file triggers a minor fault (page in page cache, just not mapped into this process's page table) or a major fault (page not in page cache, requires a disk read). After a page-cache eviction, what were minor faults become major faults.
# Perf output during agent retrieval after page-cache eviction
$ perf stat -e major-faults,minor-faults \
    -p $AGENT_PID -- sleep 10
Performance counter stats for process id '18421':
2,847,201 minor-faults # expected: pte updates
14,820 major-faults # BAD: every one is a disk read
# 14820 × ~0.5ms = 7.4s stalled
10.003215441 seconds time elapsed
14,820 major faults at ~0.5ms each is 7.4 seconds of stall time invisible to the agent. The retrieval call returns "slow" with no further attribution.
3.3 TLB pressure: the mmap tax no one talks about
Every mmap region requires TLB entries. Each context switch between processes (every subprocess spawn, every thread wakeup) may require TLB invalidations. On x86_64 without PCID in use — common in VMs that mask the feature — every cross-process context switch flushes the entire TLB. For an agent constantly switching between its orchestrator, tool workers, and subprocesses, this means re-walking the page tables for its entire hot working set after every switch, at roughly 100–200 cycles per miss.
PCID (Process Context Identifiers) lets the kernel tag TLB entries per address space so a context switch need not flush them. The kernel uses it automatically when the CPU exposes the feature (CPUID.01H:ECX, bit 17), but many hypervisors hide it from guests. With PCID active, context switches do not flush the TLB unless a PCID is recycled; this alone can reduce wakeup latency 20–40% for agent workloads. Verify with: grep pcid /proc/cpuinfo
VFS & dentry cache: metadata pressure and lock contention
The Virtual Filesystem layer is where every file-related syscall begins. It maintains the dentry cache (directory entry cache — pathname component lookups) and the inode cache (file metadata). For agent workloads that constantly open, stat, read, and close files across large repository trees, the dentry cache is a critical hot path that is frequently overlooked.
4.1 Dentry cache exhaustion: the silent metadata bottleneck
A typical agent tool call to search a large codebase might access tens of thousands of file paths. Each path component requires a dentry lookup: /home, /home/user, /home/user/repo, /home/user/repo/src, etc. Each lookup first tries the lockless RCU walk; on a miss, a concurrent rename, or a mount crossing, it falls back to the locked slow path, taking hash-bucket bit locks and d_lock. Under parallel tool execution, those locks become a serialization point:
# flamegraph showing dentry lock contention during parallel grep
# (captured with: perf record -ag -- sleep 5, then stackcollapse)
sys_openat
└─ do_filp_open
   └─ path_openat
      └─ link_path_walk
         └─ lookup_fast           ← 34% of CPU time here
            └─ __d_lookup         ← RCU fast path missed
               └─ hlist_bl_lock   ← spinning on this bucket lock
At 34% of CPU time spent in pathname lookup, parallel tool calls are not CPU-bound on useful work — they are contending on a hash bucket spinlock in the dentry cache.
4.2 The stat() cascade
Agents frequently invoke tools that call stat() on many files: file-listing tools, context-gathering tools, build tools checking timestamps. Each stat() traverses the dentry tree to the target file, acquires inode locks to read metadata, and may trigger a VFS attribute revalidation if the filesystem's cache has expired.
# strace snippet: agent tool doing directory enumeration
openat(AT_FDCWD, "/repo/src", O_RDONLY|O_DIRECTORY) = 8 # 0.12ms
newfstatat(8, "main.py", ...) = 0 # 0.08ms
newfstatat(8, "utils.py", ...) = 0 # 0.08ms
newfstatat(8, "config.py", ...) = 0 # 0.09ms
# ... × 2,400 files
# = 192ms of stat() calls before tool work begins
# mount with relatime reduces atime update locks
# but does nothing for the dentry traversal
If the agent runtime uses inotify or fanotify to watch directories (common in development-tool agents), every open(), read(), write(), and close() by any process in a watched directory generates an event the kernel must queue for the watcher; stat() alone does not generate events, but attribute changes do. This adds kernel work to each file operation and, under heavy tool parallelism, can overflow the inotify event queue, at which point events are dropped silently and only an IN_Q_OVERFLOW marker is delivered.
4.3 The rename/unlink journal penalty
Agents that write state files atomically (write to temp, rename to final) pay a journal commit cost on every rename on ext4 or XFS. The vfs_rename() path acquires multiple directory inode locks and records the rename in the journal; if the application then syncs, jbd2 must force the transaction to commit before returning. For an agent writing 50 state files per second, this is up to 50 journal commits per second — a throughput limit that is completely invisible in CPU utilization but visible in latency spikes.
# Measure journal commit latency with ftrace
$ echo 'jbd2:jbd2_commit_locking' > /sys/kernel/debug/tracing/set_event
$ echo 'jbd2:jbd2_commit_flushing' >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace_pipe | head -40
# Output during agent state writes:
jbd2/sda1-8  [003] 1234.567: jbd2_commit_locking:  dev 8,1 transaction 44821
jbd2/sda1-8  [003] 1234.571: jbd2_commit_flushing: dev 8,1 transaction 44821  ← +4.2ms per state write
jbd2/sda1-8  [003] 1234.576: jbd2_commit_locking:  dev 8,1 transaction 44822
jbd2/sda1-8  [003] 1234.579: jbd2_commit_flushing: dev 8,1 transaction 44822  ← +3.8ms
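A cheaper atomic-write pattern available today (my sketch, not from the original): fdatasync() the temp file before rename(), which skips the pure-metadata journal work that fsync() forces, while still guaranteeing the renamed file's contents survive a crash:

```c
/* Sketch: write-temp-then-rename with fdatasync instead of fsync.
 * fdatasync omits timestamp-only metadata flushes, which is enough
 * when only the file contents must be durable before the rename. */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int atomic_write(const char *path, const char *data, size_t len) {
    char tmp[512];
    if (snprintf(tmp, sizeof tmp, "%s.tmp", path) >= (int)sizeof tmp)
        return -1;
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len ||
        fdatasync(fd) != 0) {       /* data durable, less journal work */
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);
    return rename(tmp, path);       /* atomic replace of the old state */
}
```

For full crash safety across the rename itself, an fsync() on the containing directory is still needed; batching many state files per directory sync amortizes that cost.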
Block I/O: small random reads and blk-mq pathology
Agent workloads impose an I/O profile that block layer defaults are not optimized for: small (4KB–64KB), random, latency-sensitive reads with bursty, fsync-heavy writes. The blk-mq subsystem and NVMe SSD firmware are tuned primarily for throughput workloads or for mixed steady-state. Neither profile matches agent I/O.
The blk plug problem for agent reads
The blk plug is a per-task queue that batches block requests before sending them to the hardware queue. It was designed to improve throughput on spinning disks and remains the default behavior. For NVMe SSDs with deep hardware queues and 50µs base latency, plugging a request for up to 200µs to collect a larger batch more than doubles the latency of that read. Agents reading from vector indexes suffer this penalty on every cache-miss read:
# Force latency-optimized I/O for a vector index fd
# Method 1: io_uring with RWF_NOWAIT (return immediately if not in cache)
sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, index_fd, buf, 4096, offset);
io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);  /* always punt to io-wq, off the orchestrator thread */
sqe->rw_flags = RWF_NOWAIT;                /* non-blocking: complete with -EAGAIN if not cached */
# Method 2: set scheduler to none for NVMe drives (system-wide)
echo none > /sys/block/nvme0n1/queue/scheduler
# Method 3: tune blk-mq nr_requests for lower latency
echo 64 > /sys/block/nvme0n1/queue/nr_requests  # default is 256
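The cache-or-fail pattern also works with plain syscalls, no ring required: preadv2() with RWF_NOWAIT returns -EAGAIN instead of blocking when the page is not cached, letting the agent overlap the miss with other work. A sketch (my addition):

```c
/* Sketch: try a non-blocking read from the page cache; -EAGAIN means
 * the page is cold and the caller should fall back to an async path. */
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

/* returns bytes read, 0 at EOF, -EAGAIN if not cached, other -errno */
static ssize_t read_cached_only(int fd, void *buf, size_t len, off_t off) {
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    ssize_t n = preadv2(fd, &iov, 1, off, RWF_NOWAIT);
    return n >= 0 ? n : -errno;
}
```

Requires kernel ≥ 4.14 and glibc ≥ 2.26; on older systems preadv2 fails with EOPNOTSUPP and the caller should fall back to a plain pread.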
fsync storms from trace/log persistence
Agents generate continuous trace output: observations, tool outputs, reasoning steps, state snapshots. Frameworks that call fsync() after each write serialize every subsequent write behind a journal flush. On ext4 with data=ordered (default), fsync() blocks until the journal commits and data pages are written. Fifty trace writes per second with fsync becomes fifty journal commits per second:
# Measure fsync cost distribution
$ bpftrace -e '
tracepoint:syscalls:sys_enter_fsync { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_fsync {
$lat = (nsecs - @start[tid]) / 1000;
@fsync_us = hist($lat);
delete(@start[tid]);
}'
@fsync_us:
[64, 128) ████████████████ 1,284 # normal journal commit
[128, 256) ████████ 612 # under write pressure
[256, 512) ████ 320 # journal checkpoint needed
[512, 1024) ██ 95 # jbd2 waiting for I/O
[1024, 2048) █ 28 # device queue saturated
[2048, ∞) 4 # severe stall
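The userspace fix available today (my sketch, not from the original): buffer trace events and make durability a policy, paying one fdatasync() per flush window rather than one per event:

```c
/* Sketch: append trace events to a buffered stream and pay for one
 * fdatasync per batch, not one per event. With flush_every = 50 and
 * 50 events/s this is ~1 journal commit/s instead of 50. */
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

struct tracelog {
    FILE *f;
    int pending;
    int flush_every;    /* durability window, measured in events */
};

static int trace_event(struct tracelog *t, const char *line) {
    if (fputs(line, t->f) == EOF || fputc('\n', t->f) == EOF)
        return -1;
    if (++t->pending >= t->flush_every) {
        fflush(t->f);                /* push libc buffer to the kernel */
        fdatasync(fileno(t->f));     /* one journal commit for the batch */
        t->pending = 0;
    }
    return 0;
}
```

The crash-durability gap is bounded by flush_every events, the same trade the O_SYNC_WINDOW proposal later in this piece makes in time rather than event count.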
The central missing feature: step-level attribution
The kernel already emits all the events needed to fully characterize agent step latency: sched:sched_switch, exceptions:page_fault_user, block:block_rq_issue, syscalls:sys_enter_read, jbd2:jbd2_commit_flushing. The fundamental missing piece is a stable identity that connects all these events back to the logical unit the user cares about: the agent step.
Without this, a slow step collapses to "the agent was slow." With it, you get:
# Ideal output: per-step kernel attribution report
agent-step report: step_id=142 tool="search_codebase"
─────────────────────────────────────────────
syscall overhead:        1.8 ms  (open × 847, stat × 2400)
dentry cache misses:     0.9 ms  (312 cold lookups)
mmap major faults:       7.4 ms  (major faults on vector_index.db)
block I/O p99:           9.2 ms  (vector_index.db reads)
scheduler wakeup tax:    0.6 ms  (fork+exec × 2)
journal commit stalls:   0.0 ms  (no fsync this step)
TLB shootdown time:      0.3 ms  (3 context switches)
─────────────────────────────────────────────
total kernel overhead:  20.2 ms
verdict: mmap fault storm — vector_index.db evicted from page cache
recommendation: madvise(MADV_WILLNEED) before step or pin with mlock()
Kernel-level patch directions
The goal is not to make the kernel "AI-aware" in a vague way. The goal is to add the minimal set of hooks, hints, and identity propagation needed to measure, debug, and optimize agentic latency. Each patch below is motivated by a concrete failure mode described above.
Add a u64 agent_step_id field to task_struct, settable via prctl(), inherited across fork() by default, and readable by any tracepoint via current. The field is zero when not in use — zero overhead for non-agent workloads.
/* include/linux/sched.h */
struct task_struct {
/* ... existing fields ... */
u64 agent_step_id; /* 0 = unset */
};
/* kernel/sys.c — prctl handler */
case PR_SET_AGENT_STEP_ID:
current->agent_step_id = arg2;
return 0;
/* Userspace usage */
prctl(PR_SET_AGENT_STEP_ID, step_id,
0, 0, 0);
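Until such a patch lands, a runtime can probe for it and degrade gracefully. In this sketch (my addition), PR_SET_AGENT_STEP_ID is the hypothetical constant from the patch; today's kernels reject unknown prctl options with EINVAL, which the wrapper treats as "attribution unavailable" rather than an error:

```c
/* Sketch: set the (proposed, not upstream) per-task step id.
 * PR_SET_AGENT_STEP_ID and its value are hypothetical; unpatched
 * kernels reject unknown prctl options with EINVAL. */
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/prctl.h>

#ifndef PR_SET_AGENT_STEP_ID
#define PR_SET_AGENT_STEP_ID 0x53544550   /* hypothetical ("STEP") */
#endif

/* returns 1 if set, 0 if the kernel lacks the patch, -1 on error */
static int set_step_id(uint64_t step_id) {
    if (prctl(PR_SET_AGENT_STEP_ID, step_id, 0, 0, 0) == 0)
        return 1;
    return errno == EINVAL ? 0 : -1;
}
```

The orchestrator calls set_step_id() at the top of every step; tooling that groups by PID today would group by step id once the probe returns 1.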
Extend mm_filemap_fault with inode number, page offset, fault type (minor/major), and the new agent_step_id. This closes the mmap fault storm attribution gap entirely and requires only a tracepoint field addition.
/* mm/filemap.c */
TRACE_EVENT(mm_filemap_fault,
    TP_PROTO(struct inode *inode, pgoff_t index,
             bool is_major, u64 latency_ns,
             u64 agent_step_id /* NEW */),
    TP_ARGS(inode, index, is_major, latency_ns, agent_step_id),
    TP_STRUCT__entry(
        __field(unsigned long, ino)
        __field(pgoff_t,       index)
        __field(bool,          major)
        __field(u64,           lat_ns)
        __field(u64,           step)
    ),
    TP_fast_assign(
        __entry->ino    = inode->i_ino;
        __entry->index  = index;
        __entry->major  = is_major;
        __entry->lat_ns = latency_ns;
        __entry->step   = agent_step_id;
    ),
    TP_printk("ino=%lu index=%lu major=%d lat_ns=%llu step=%llu",
              __entry->ino, (unsigned long)__entry->index,
              __entry->major, __entry->lat_ns, __entry->step)
);
Add a new madvise() hint MADV_AGENT_HOT that marks a mapped region as high-priority for the page reclaim algorithm. Pages marked this way are moved to the top of the active LRU list and are exempt from reclaim as long as the marking holds.
/* include/uapi/linux/mman.h */
#define MADV_AGENT_HOT 25 /* hint: critical for next step */
#define MADV_AGENT_COLD 26 /* hint: ok to evict aggressively */
/* Usage for vector index */
madvise(index_mmap, index_size,
MADV_AGENT_HOT);
/* Usage for one-shot scan output. Note: madvise advice codes are
 * values, not flags — they cannot be OR-ed, so issue two calls */
madvise(scan_buf, scan_size, MADV_SEQUENTIAL);
madvise(scan_buf, scan_size, MADV_AGENT_COLD);
Add IOSQE_LATENCY_PRIORITY to io_uring SQE flags. When set, the blk-mq layer submits the request without plugging and moves it to the front of the software staging queue. This gives agent retrieval reads p99 latency reduction without affecting background I/O.
/* include/uapi/linux/io_uring.h */
#define IOSQE_LATENCY_PRIORITY (1U << 7)
/* io_uring/rw.c */
if (req->flags & IOSQE_LATENCY_PRIORITY) {
kiocb->ki_flags |= IOCB_HIPRI;
blk_plug_invalidate(current_plug);
}
/* Userspace: latency-first read */
sqe->flags |= IOSQE_LATENCY_PRIORITY;
Introduce O_SYNC_WINDOW open flag: writes to this fd are durable within a configurable time window (e.g. 100ms) rather than on every fsync(). Combines journal commits from concurrent trace writers, reducing journal pressure 10–50× for typical agent log rates.
/* Motivation: agent writes 50 trace events/s */
/* Each fsync = 1 journal commit = 3-8ms stall */
/* O_SYNC_WINDOW groups commits in 100ms window */
fd = open("trace.log",
O_WRONLY | O_APPEND | O_SYNC_WINDOW,
0644);
/* kernel batches fsync calls from this fd */
/* max durability gap = sync_window_ms sysctl */
/* safe: on crash, at most 100ms of traces lost */
Add SCHED_NORMAL_LATENCY policy variant: like SCHED_NORMAL but with a configurable wakeup latency budget. The scheduler preferentially places these threads on a CPU with a short runqueue and avoids cross-NUMA migration. Not realtime — no preemption guarantees — just latency-aware placement.
/* kernel/sched/policy.h */
#define SCHED_NORMAL_LATENCY 7
struct sched_attr attr = {
.size = sizeof(attr),
.sched_policy = SCHED_NORMAL_LATENCY,
.sched_runtime = 500000, /* 500µs burst budget */
.sched_deadline= 2000000, /* 2ms wakeup target */
};
sched_setattr(orchestrator_tid, &attr, 0);
The safest upstream sequence is: (1) ship tracepoint enrichment patches first — they are always welcomed; (2) add the madvise hints as they extend an existing ABI cleanly; (3) propose the io_uring flag — io_uring maintainer Jens Axboe is receptive to latency hints; (4) task_struct changes require broader review. The scheduler policy change is the hardest sell and would need extensive benchmark data from real agent workloads before any maintainer would consider it.
eBPF tooling: observe before optimizing
Before proposing any kernel policy change, a credible implementation must ship with observability tooling. The first tool answers exactly one question: which kernel subsystem made this agent step slow? This tool can be built today with existing tracepoints, without any of the patches above.
#!/usr/bin/env bpftrace
/*
* agent-step-profiler.bt
* Attribute kernel latency to agent steps by PID.
 * Usage: bpftrace agent-step-profiler.bt $AGENT_PID
 * (the PID becomes positional parameter $1; -p does not set $1)
*
* Today: groups by PID. After Patch A: groups by agent_step_id.
*/
BEGIN { printf("Tracing agent kernel path. Ctrl-C to report.\n"); }
/* ── SCHEDULER ─────────────────────────────────────── */
tracepoint:sched:sched_wakeup /args->pid == $1/ {
    /* fires in the waker's context; key by the woken task's pid */
    @sched_wake_ts[args->pid] = nsecs;
}
tracepoint:sched:sched_switch /args->next_pid == $1/ {
    if (@sched_wake_ts[args->next_pid]) {
        @sched_us = hist((nsecs - @sched_wake_ts[args->next_pid]) / 1000);
        @sched_total_us += (nsecs - @sched_wake_ts[args->next_pid]) / 1000;
        delete(@sched_wake_ts[args->next_pid]);
    }
}
/* ── PAGE FAULTS ────────────────────────────────────── */
/* no *_return tracepoint exists for faults; use a kretprobe pair */
kprobe:handle_mm_fault /pid == $1/ {
    @fault_ts[tid] = nsecs;
    @fault_count++;
}
kretprobe:handle_mm_fault /@fault_ts[tid]/ {
    @fault_us = hist((nsecs - @fault_ts[tid]) / 1000);
    @fault_total_us += (nsecs - @fault_ts[tid]) / 1000;
    delete(@fault_ts[tid]);
}
/* ── BLOCK I/O ──────────────────────────────────────── */
tracepoint:block:block_rq_issue /pid == $1/ {
@bio_ts[args->sector] = nsecs;
}
tracepoint:block:block_rq_complete {
if (@bio_ts[args->sector]) {
$lat = (nsecs - @bio_ts[args->sector]) / 1000;
@bio_us = hist($lat);
@bio_total_us += $lat;
delete(@bio_ts[args->sector]);
}
}
/* ── VFS METADATA ───────────────────────────────────── */
tracepoint:syscalls:sys_enter_newfstatat /pid == $1/ {
@stat_count++;
}
tracepoint:syscalls:sys_enter_openat /pid == $1/ {
@open_ts[tid] = nsecs;
}
tracepoint:syscalls:sys_exit_openat /pid == $1/ {
if (@open_ts[tid]) {
@open_us += (nsecs - @open_ts[tid]) / 1000;
delete(@open_ts[tid]);
}
}
END {
printf("\n═══ AGENT KERNEL PROFILE ═══\n");
printf("Scheduler wakeup total: %lld µs\n", @sched_total_us);
printf("Page fault total: %lld µs (%lld faults)\n",
@fault_total_us, @fault_count);
printf("Block I/O total: %lld µs\n", @bio_total_us);
printf("VFS open() total: %lld µs\n", @open_us);
printf("VFS stat() count: %lld\n", @stat_count);
printf("\n── Wakeup latency histogram (µs) ──\n");
print(@sched_us);
printf("\n── Page fault latency histogram (µs) ──\n");
print(@fault_us);
printf("\n── Block I/O latency histogram (µs) ──\n");
print(@bio_us);
}
This script runs against any agent process today with no kernel changes. Run it during a representative agent session, then compare the three totals. Whichever dominates is where optimization effort should go first.
Once the dominant subsystem is identified, add targeted bpftrace scripts to drill down further. For page fault storms: trace the specific file inode causing faults. For block I/O: trace by device and request size. For scheduler: trace by runqueue depth at wakeup time. The data shapes the patch — not the other way around.
Connecting to the perf ecosystem
# Full kernel profile for an agent session using perf
$ perf record \
-e sched:sched_switch \
-e sched:sched_wakeup \
-e exceptions:page_fault_user \
-e block:block_rq_issue \
-e block:block_rq_complete \
-e filemap:mm_filemap_add_to_page_cache \
-e writeback:writeback_dirty_page \
-e syscalls:sys_enter_openat \
-e syscalls:sys_enter_newfstatat \
--call-graph dwarf \
-p $AGENT_PID \
-- sleep 30
# Generate flamegraph
$ perf script | stackcollapse-perf.pl | flamegraph.pl > agent_kernel.svg
# Aggregate by subsystem
$ perf stat \
-e sched:sched_migrate_task \
-e tlb:tlb_flush \
-e major-faults,minor-faults \
-e block:block_rq_issue \
-p $AGENT_PID \
-- sleep 30
Conclusion: agentic AI is an OS workload
The performance model for AI systems is changing. Batch training throughput was always a GPU problem. Inference throughput is still primarily a GPU problem. But agentic AI — the loops of tool calls, retrieval queries, state updates, and subprocess orchestration that power AI assistants, coding agents, and autonomous systems — is increasingly a kernel problem.
The kernel subsystems that matter most for agent latency, in order of typical impact:
| Rank | Subsystem | Primary failure mode | Best first fix |
|---|---|---|---|
| 1 | mm / page cache | mmap fault storms after page-cache eviction by cold tool scans | madvise(MADV_WILLNEED) before retrieval step |
| 2 | Scheduler | Subprocess spawn TLB flush storms; NUMA migration of orchestrator | Enable PCID; pin orchestrator thread to fixed CPU with pthread_setaffinity |
| 3 | blk-mq / NVMe | blk plug adds 200µs to latency-sensitive small reads | Set scheduler to none for NVMe devices; use RWF_NOWAIT |
| 4 | VFS / dcache | dentry lock contention under parallel tool file scanning | Reduce parallel tool concurrency; use O_PATH for stat-only operations |
| 5 | Writeback / journal | fsync-per-trace-write causes 50 journal commits/second | Buffer trace writes; call fdatasync() not fsync(); consider O_DSYNC |
The system thesis is simple: agentic AI is an OS workload. The next optimization layer is not only the framework or the model — it is the kernel path underneath every agent step. The kernel should be able to explain why every step was slow. Today it cannot. The patches above are the path to closing that gap.