Workload model: agentic AI as a kernel-latency workload
An agent runtime is not a batch job. It is a real-time loop: plan the next action, invoke a tool, observe the result, update state, plan again. Each iteration is short. Each iteration crosses the kernel boundary many times. And iterations are synchronous — the agent cannot begin step N+1 until step N completes. This makes latency, not throughput, the primary performance dimension.
The hot path for a single agent step, at the syscall level, is a chain of kernel crossings: reads and writes on sockets and pipes for the model API, fork and execve for tool subprocesses, openat and stat storms across the working tree, and futex wakeups between the orchestrator and its workers.
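To make that chain concrete, here is a minimal sketch (my addition, not from the original text) of one tool invocation in userspace C. Every annotated call is a kernel crossing the agent step pays for:

```c
/* Hypothetical sketch: the kernel crossings behind one agent tool call.
 * run_tool() forks, execs argv[0], captures the child's stdout over a
 * pipe, and reaps the child. Each annotated line is a syscall. */
#include <assert.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static int run_tool(char *const argv[], char *out, size_t outsz) {
    int pfd[2];
    if (pipe(pfd) < 0)                    /* syscall: pipe2 */
        return -1;
    pid_t pid = fork();                   /* syscall: clone — copy_process() */
    if (pid < 0)
        return -1;
    if (pid == 0) {                       /* child: becomes the tool */
        dup2(pfd[1], STDOUT_FILENO);      /* syscall: dup2 */
        close(pfd[0]);
        close(pfd[1]);
        execv(argv[0], argv);             /* syscall: execve — load_elf_binary() */
        _exit(127);                       /* only reached if exec failed */
    }
    close(pfd[1]);
    ssize_t n, total = 0;
    while ((n = read(pfd[0], out + total, outsz - 1 - total)) > 0)
        total += n;                       /* syscall: read on the pipe */
    out[total] = '\0';
    close(pfd[0]);
    int status;
    waitpid(pid, &status, 0);             /* syscall: wait4 — sleeps until SIGCHLD */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Every agent step repeats some variant of this chain; the fork/exec cost breakdown later in this section applies to the clone and execve lines above.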
The critical insight is latency amplification through repetition: a 2ms penalty that would be irrelevant in a batch job is applied hundreds of times per agent session. Sum a realistic per-step budget of syscall overhead, page faults, scheduler wakeups, and journal commits across a few hundred steps, and you arrive at something like 22 seconds of kernel overhead that is invisible to every application-level profiler. The model inference time was fine. The framework code was fine. The kernel was the bottleneck the whole time.
Scheduler wakeup latency: the hidden tax
Agent runtimes are structurally multi-process and multi-threaded. A typical step involves: the orchestrator thread dispatching a task to a worker pool, the worker spawning a subprocess for a tool, reading its output over a pipe, signaling a completion futex, and the orchestrator waking to process the result. That is five scheduler events per tool call, minimum.
The Linux CFS scheduler provides excellent average-case fairness and throughput. It is not optimized for the pattern agentic workloads impose: very short CPU bursts (10–500µs) with very high wakeup frequency (thousands per second) from a single latency-sensitive orchestrator thread.
Where wakeup latency comes from
Between the sched_wakeup event and the sched_switch that actually runs the thread, the kernel must select a target CPU, acquire that CPU's runqueue lock, and, if the wakeup crosses cores, deliver a rescheduling IPI and wait for the target CPU to context-switch. Each stage is cheap in isolation; under load, every one of them becomes a tail-latency source.
The subprocess spawn penalty
Many agent tools execute as subprocesses: shell commands, Python scripts, MCP servers, external binaries. Each fork() + execve() carries a large kernel cost that is easy to underestimate:
# Measured on a modern server: agent tool spawn breakdown
# (perf stat -e sched:sched_process_fork,sched:sched_process_exec)
fork():
  copy_process()        ~180µs   # mm_struct dup, fd table copy
  dup_mm() + cow setup   ~90µs   # page table copy, not the pages
  sched_fork()           ~15µs   # new task_struct init
execve():
  bprm_mm_init()         ~60µs   # new address space
  load_elf_binary()     ~240µs   # ELF parsing, mmap segments
  flush_old_exec()       ~45µs   # TLB full shootdown
  setup_arg_pages()      ~20µs   # stack mapping
------------------------------------------
Total fork+exec:       ~650µs per tool invocation
× 200 tool calls:       130ms pure kernel overhead
The flush_old_exec() call in execve() tears down the old address space and flushes its TLB entries, sending IPIs to every CPU in the mm's cpumask — that is, every CPU that has run this address space. On a 128-core server this can mean up to 127 IPIs, each requiring a cross-core acknowledgment. On a NUMA machine, remote-node IPIs take 3–5× longer than local ones. This single call can consume 300–500µs on a heavily loaded machine and is triggered by every subprocess spawn.
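A userspace mitigation these numbers suggest (my addition, not from the original text): spawn tools with posix_spawn(), which glibc implements via clone(CLONE_VM | CLONE_VFORK), skipping the page-table duplication that dominates fork() for a large orchestrator process:

```c
/* Sketch: posix_spawn avoids duplicating the parent's page tables,
 * which is the dominant fork() cost when the agent process is large. */
#include <assert.h>
#include <spawn.h>
#include <sys/wait.h>

extern char **environ;

static int spawn_tool(const char *path, char *const argv[]) {
    pid_t pid;
    /* vfork-like clone path: no mm_struct duplication, no COW setup */
    int rc = posix_spawn(&pid, path, NULL, NULL, argv, environ);
    if (rc != 0)
        return -1;
    int status;
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

This does not touch the execve() side of the cost (ELF load, TLB flush), but it removes the copy_process() and COW-setup lines from the table above.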
The wakeup latency histogram: what perf shows
# bpftrace: measure wakeup latency for agent orchestrator threads
$ bpftrace -e '
tracepoint:sched:sched_wakeup { @ts[args->pid] = nsecs; }
tracepoint:sched:sched_switch {
if (@ts[args->next_pid]) {
$lat = (nsecs - @ts[args->next_pid]) / 1000;
@wakeup_us = hist($lat);
delete(@ts[args->next_pid]);
}
}'
# Example output for a Claude Code agent session:
@wakeup_us:
[1, 2) ████████████████████ 8432 # fast path: same CPU
[2, 4) ██████████ 4210 # cross-core same socket
[4, 8) ████████ 3109 # different socket
[8, 16) ████ 1580 # under moderate load
[16, 32) ███ 940 # runqueue lock contention
[32, 64) ██ 470 # NUMA migration triggered
[64, 128) █ 220 # heavy load
[128, 256) 48 # severe contention
[256, ∞) 12 # outliers — kill these
The long tail above 64µs is where agent step latency suffers. These are not random events; they cluster around specific causes: NUMA migration, runqueue lock contention, and CFS declining to preempt on wakeup when the orchestrator's vruntime has drifted ahead of the other tasks on its CPU.
The NUMA migration trap
When an agent spawns a tool process that runs briefly on a remote NUMA node, the NUMA balancer may later decide to migrate the orchestrator thread toward that node to "follow" the memory access pattern. This migration makes the orchestrator worse: it is now far from its own hot pages and far from the GPU it may need to dispatch work to. The sched_migrate_task tracepoint fires when this happens but carries no semantic information about whether the migration is helping or hurting.
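A blunt but effective countermeasure (my suggestion, not from the original text): pin the orchestrator thread to a fixed CPU so that neither the NUMA balancer nor periodic load balancing can move it away from its hot cache lines. A sketch using sched_setaffinity(), with CPU 0 as an arbitrary example:

```c
/* Sketch: pin the calling (orchestrator) thread to one CPU and
 * verify the pin took effect by reading the mask back. */
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 = calling thread; fails if `cpu` is offline or disallowed */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return -1;
    cpu_set_t check;
    CPU_ZERO(&check);
    if (sched_getaffinity(0, sizeof(check), &check) != 0)
        return -1;
    /* confirm exactly one CPU remains in the mask */
    return (CPU_COUNT(&check) == 1 && CPU_ISSET(cpu, &check)) ? 0 : -1;
}
```

The trade-off: a pinned thread cannot escape a noisy neighbor on its CPU, so pinning works best when the orchestrator's CPU is also shielded (cpusets, isolcpus) from background load.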
Memory: page-cache thrash, mmap faults, and TLB pressure
Agentic AI workloads have a distinctive memory access signature: a small hot set (the retrieval index, the active context, recent tool outputs) accessed with high frequency, surrounded by a large cold set (repos being scanned, historical logs, documents being parsed once) that constantly churns through the page cache.
3.1 Page-cache thrash: semantic blindness
Linux's page cache uses an LRU-based replacement policy with an active/inactive split. The kernel has no concept of why a page was accessed — whether it contains a hot retrieval index shard that will be needed in the next 100 milliseconds, or a cold log entry that will never be read again. A single large find . -name "*.py" | xargs grep "pattern" tool call can evict gigabytes of hot index pages that took seconds to fault in.
3.2 mmap fault storms: the worst-case retrieval pattern
Vector indexes — FAISS, Chroma, Weaviate's mmap files — are typically memory-mapped. The first access to a page in a mapped file triggers a minor fault (page in page cache, just not mapped into this process's page table) or a major fault (page not in page cache, requires a disk read). After a page-cache eviction, what were minor faults become major faults.
# Perf output during agent retrieval after page-cache eviction
$ perf stat -e major-faults,minor-faults \
    -p $AGENT_PID -- sleep 10
Performance counter stats for process id '18421':
2,847,201 minor-faults # expected: pte updates
14,820 major-faults # BAD: every one is a disk read
# 14820 × ~0.5ms = 7.4s stalled
10.003215441 seconds time elapsed
14,820 major faults at ~0.5ms each is 7.4 seconds of stall time invisible to the agent. The retrieval call returns "slow" with no further attribution.
3.3 TLB pressure: the mmap tax no one talks about
Every mmap region requires TLB entries. Each context switch between processes (every subprocess spawn, every thread wakeup) may require TLB invalidations. On x86_64 without PCID in use — common in VMs that mask the feature — every cross-process context switch flushes the entire TLB. For an agent constantly switching between its orchestrator, tool workers, and subprocesses, this means re-walking the page tables for its entire hot working set after every switch, at roughly 100–200 cycles per miss.
PCID (Process Context Identifiers) lets the kernel tag TLB entries per address space so a context switch need not flush them. The kernel uses it automatically when the CPU exposes the feature (CPUID.01H:ECX, bit 17), but many hypervisors hide it from guests. With PCID active, context switches do not flush the TLB unless a PCID is recycled; this alone can reduce wakeup latency 20–40% for agent workloads. Verify with: grep pcid /proc/cpuinfo
VFS & dentry cache: metadata pressure and lock contention
The Virtual Filesystem layer is where every file-related syscall begins. It maintains the dentry cache (directory entry cache — pathname component lookups) and the inode cache (file metadata). For agent workloads that constantly open, stat, read, and close files across large repository trees, the dentry cache is a critical hot path that is frequently overlooked.
4.1 Dentry cache exhaustion: the silent metadata bottleneck
A typical agent tool call to search a large codebase might access tens of thousands of file paths. Each path component requires a dentry lookup: /home, /home/user, /home/user/repo, /home/user/repo/src, etc. Each lookup first tries the lockless RCU walk; on a miss, a concurrent rename, or a mount crossing, it falls back to the locked slow path, taking hash-bucket bit locks and d_lock. Under parallel tool execution, those locks become a serialization point:
# flamegraph showing dentry lock contention during parallel grep
# (captured with: perf record -ag -- sleep 5, then stackcollapse)
sys_openat
└─ do_filp_open
   └─ path_openat
      └─ link_path_walk
         └─ lookup_fast           ← 34% of CPU time here
            └─ __d_lookup         ← RCU fast path missed
               └─ hlist_bl_lock   ← spinning on this bucket lock
At 34% of CPU time spent in pathname lookup, parallel tool calls are not CPU-bound on useful work — they are contending on a hash bucket spinlock in the dentry cache.
4.2 The stat() cascade
Agents frequently invoke tools that call stat() on many files: file-listing tools, context-gathering tools, build tools checking timestamps. Each stat() traverses the dentry tree to the target file, acquires inode locks to read metadata, and may trigger a VFS attribute revalidation if the filesystem's cache has expired.
# strace snippet: agent tool doing directory enumeration
openat(AT_FDCWD, "/repo/src", O_RDONLY|O_DIRECTORY) = 8 # 0.12ms
newfstatat(8, "main.py", ...) = 0 # 0.08ms
newfstatat(8, "utils.py", ...) = 0 # 0.08ms
newfstatat(8, "config.py", ...) = 0 # 0.09ms
# ... × 2,400 files
# = 192ms of stat() calls before tool work begins
# mount with relatime reduces atime update locks
# but does nothing for the dentry traversal
If the agent runtime uses inotify or fanotify to watch directories (common in development-tool agents), every open(), read(), write(), and close() by any process in a watched directory generates an event the kernel must queue for the watcher; stat() alone does not generate events, but attribute changes do. This adds kernel work to each file operation and, under heavy tool parallelism, can overflow the inotify event queue, at which point events are dropped silently and only an IN_Q_OVERFLOW marker is delivered.
4.3 The rename/unlink journal penalty
Agents that write state files atomically (write to temp, rename to final) pay a journal commit cost on every rename on ext4 or XFS. The vfs_rename() path acquires multiple directory inode locks and records the rename in the journal; if the application then syncs, jbd2 must force the transaction to commit before returning. For an agent writing 50 state files per second, this is up to 50 journal commits per second — a throughput limit that is completely invisible in CPU utilization but visible in latency spikes.
# Measure journal commit latency with ftrace
$ echo 'jbd2:jbd2_commit_locking' > /sys/kernel/debug/tracing/set_event
$ echo 'jbd2:jbd2_commit_flushing' >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace_pipe | head -40
# Output during agent state writes:
jbd2/sda1-8  [003] 1234.567: jbd2_commit_locking:  dev 8,1 transaction 44821
jbd2/sda1-8  [003] 1234.571: jbd2_commit_flushing: dev 8,1 transaction 44821  ← +4.2ms per state write
jbd2/sda1-8  [003] 1234.576: jbd2_commit_locking:  dev 8,1 transaction 44822
jbd2/sda1-8  [003] 1234.579: jbd2_commit_flushing: dev 8,1 transaction 44822  ← +3.8ms
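A cheaper atomic-write pattern available today (my sketch, not from the original): fdatasync() the temp file before rename(), which skips the pure-metadata journal work that fsync() forces, while still guaranteeing the renamed file's contents survive a crash:

```c
/* Sketch: write-temp-then-rename with fdatasync instead of fsync.
 * fdatasync omits timestamp-only metadata flushes, which is enough
 * when only the file contents must be durable before the rename. */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int atomic_write(const char *path, const char *data, size_t len) {
    char tmp[512];
    if (snprintf(tmp, sizeof tmp, "%s.tmp", path) >= (int)sizeof tmp)
        return -1;
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len ||
        fdatasync(fd) != 0) {       /* data durable, less journal work */
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);
    return rename(tmp, path);       /* atomic replace of the old state */
}
```

For full crash safety across the rename itself, an fsync() on the containing directory is still needed; batching many state files per directory sync amortizes that cost.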
Block I/O: small random reads and blk-mq pathology
Agent workloads impose an I/O profile that block layer defaults are not optimized for: small (4KB–64KB), random, latency-sensitive reads with bursty, fsync-heavy writes. The blk-mq subsystem and NVMe SSD firmware are tuned primarily for throughput workloads or for mixed steady-state. Neither profile matches agent I/O.
The blk plug problem for agent reads
The blk plug is a per-task queue that batches block requests before sending them to the hardware queue. It was designed to improve throughput on spinning disks and remains the default behavior. For NVMe SSDs with deep hardware queues and 50µs base latency, plugging a request for up to 200µs to collect a larger batch more than doubles the latency of that read. Agents reading from vector indexes suffer this penalty on every cache-miss read:
# Force latency-optimized I/O for a vector index fd
# Method 1: io_uring with RWF_NOWAIT (return immediately if not in cache)
sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, index_fd, buf, 4096, offset);
io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);  /* always punt to io-wq, off the orchestrator thread */
sqe->rw_flags = RWF_NOWAIT;                /* non-blocking: complete with -EAGAIN if not cached */
# Method 2: set scheduler to none for NVMe drives (system-wide)
echo none > /sys/block/nvme0n1/queue/scheduler
# Method 3: tune blk-mq nr_requests for lower latency
echo 64 > /sys/block/nvme0n1/queue/nr_requests  # default is 256
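The cache-or-fail pattern also works with plain syscalls, no ring required: preadv2() with RWF_NOWAIT returns -EAGAIN instead of blocking when the page is not cached, letting the agent overlap the miss with other work. A sketch (my addition):

```c
/* Sketch: try a non-blocking read from the page cache; -EAGAIN means
 * the page is cold and the caller should fall back to an async path. */
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

/* returns bytes read, 0 at EOF, -EAGAIN if not cached, other -errno */
static ssize_t read_cached_only(int fd, void *buf, size_t len, off_t off) {
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    ssize_t n = preadv2(fd, &iov, 1, off, RWF_NOWAIT);
    return n >= 0 ? n : -errno;
}
```

Requires kernel ≥ 4.14 and glibc ≥ 2.26; on older systems preadv2 fails with EOPNOTSUPP and the caller should fall back to a plain pread.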
fsync storms from trace/log persistence
Agents generate continuous trace output: observations, tool outputs, reasoning steps, state snapshots. Frameworks that call fsync() after each write serialize every subsequent write behind a journal flush. On ext4 with data=ordered (default), fsync() blocks until the journal commits and data pages are written. Fifty trace writes per second with fsync becomes fifty journal commits per second:
# Measure fsync cost distribution
$ bpftrace -e '
tracepoint:syscalls:sys_enter_fsync { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_fsync {
$lat = (nsecs - @start[tid]) / 1000;
@fsync_us = hist($lat);
delete(@start[tid]);
}'
@fsync_us:
[64, 128) ████████████████ 1,284 # normal journal commit
[128, 256) ████████ 612 # under write pressure
[256, 512) ████ 320 # journal checkpoint needed
[512, 1024) ██ 95 # jbd2 waiting for I/O
[1024, 2048) █ 28 # device queue saturated
[2048, ∞) 4 # severe stall
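The userspace fix available today (my sketch, not from the original): buffer trace events and make durability a policy, paying one fdatasync() per flush window rather than one per event:

```c
/* Sketch: append trace events to a buffered stream and pay for one
 * fdatasync per batch, not one per event. With flush_every = 50 and
 * 50 events/s this is ~1 journal commit/s instead of 50. */
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

struct tracelog {
    FILE *f;
    int pending;
    int flush_every;    /* durability window, measured in events */
};

static int trace_event(struct tracelog *t, const char *line) {
    if (fputs(line, t->f) == EOF || fputc('\n', t->f) == EOF)
        return -1;
    if (++t->pending >= t->flush_every) {
        fflush(t->f);                /* push libc buffer to the kernel */
        fdatasync(fileno(t->f));     /* one journal commit for the batch */
        t->pending = 0;
    }
    return 0;
}
```

The crash-durability gap is bounded by flush_every events, the same trade the O_SYNC_WINDOW proposal later in this piece makes in time rather than event count.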
The central missing feature: step-level attribution
The kernel already emits all the events needed to fully characterize agent step latency: sched:sched_switch, exceptions:page_fault_user, block:block_rq_issue, syscalls:sys_enter_read, jbd2:jbd2_commit_flushing. The fundamental missing piece is a stable identity that connects all these events back to the logical unit the user cares about: the agent step.
Without this, a slow step collapses to "the agent was slow." With it, you get:
# Ideal output: per-step kernel attribution report
agent-step report: step_id=142 tool="search_codebase"
─────────────────────────────────────────────
syscall overhead:        1.8 ms  (open × 847, stat × 2400)
dentry cache misses:     0.9 ms  (312 cold lookups)
mmap major faults:       7.4 ms  (major faults on vector_index.db)
block I/O p99:           9.2 ms  (vector_index.db reads)
scheduler wakeup tax:    0.6 ms  (fork+exec × 2)
journal commit stalls:   0.0 ms  (no fsync this step)
TLB shootdown time:      0.3 ms  (3 context switches)
─────────────────────────────────────────────
total kernel overhead:  20.2 ms
verdict: mmap fault storm — vector_index.db evicted from page cache
recommendation: madvise(MADV_WILLNEED) before step or pin with mlock()
Kernel-level patch directions
The goal is not to make the kernel "AI-aware" in a vague way. The goal is to add the minimal set of hooks, hints, and identity propagation needed to measure, debug, and optimize agentic latency. Each patch below is motivated by a concrete failure mode described above.
Add a u64 agent_step_id field to task_struct, settable via prctl(), inherited across fork() by default, and readable by any tracepoint via current. The field is zero when not in use — zero overhead for non-agent workloads.
/* include/linux/sched.h */
struct task_struct {
/* ... existing fields ... */
u64 agent_step_id; /* 0 = unset */
};
/* kernel/sys.c — prctl handler */
case PR_SET_AGENT_STEP_ID:
current->agent_step_id = arg2;
return 0;
/* Userspace usage */
prctl(PR_SET_AGENT_STEP_ID, step_id,
0, 0, 0);
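Until such a patch lands, a runtime can probe for it and degrade gracefully. In this sketch (my addition), PR_SET_AGENT_STEP_ID is the hypothetical constant from the patch; today's kernels reject unknown prctl options with EINVAL, which the wrapper treats as "attribution unavailable" rather than an error:

```c
/* Sketch: set the (proposed, not upstream) per-task step id.
 * PR_SET_AGENT_STEP_ID and its value are hypothetical; unpatched
 * kernels reject unknown prctl options with EINVAL. */
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/prctl.h>

#ifndef PR_SET_AGENT_STEP_ID
#define PR_SET_AGENT_STEP_ID 0x53544550   /* hypothetical ("STEP") */
#endif

/* returns 1 if set, 0 if the kernel lacks the patch, -1 on error */
static int set_step_id(uint64_t step_id) {
    if (prctl(PR_SET_AGENT_STEP_ID, step_id, 0, 0, 0) == 0)
        return 1;
    return errno == EINVAL ? 0 : -1;
}
```

The orchestrator calls set_step_id() at the top of every step; tooling that groups by PID today would group by step id once the probe returns 1.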
Extend mm_filemap_fault with inode number, page offset, fault type (minor/major), and the new agent_step_id. This closes the mmap fault storm attribution gap entirely and requires only a tracepoint field addition.
/* mm/filemap.c */
TRACE_EVENT(mm_filemap_fault,
    TP_PROTO(struct inode *inode, pgoff_t index,
             bool is_major, u64 latency_ns,
             u64 agent_step_id /* NEW */),
    TP_ARGS(inode, index, is_major, latency_ns, agent_step_id),
    TP_STRUCT__entry(
        __field(unsigned long, ino)
        __field(pgoff_t,       index)
        __field(bool,          major)
        __field(u64,           lat_ns)
        __field(u64,           step)
    ),
    TP_fast_assign(
        __entry->ino    = inode->i_ino;
        __entry->index  = index;
        __entry->major  = is_major;
        __entry->lat_ns = latency_ns;
        __entry->step   = agent_step_id;
    ),
    TP_printk("ino=%lu index=%lu major=%d lat_ns=%llu step=%llu",
              __entry->ino, (unsigned long)__entry->index,
              __entry->major, __entry->lat_ns, __entry->step)
);
Add a new madvise() hint MADV_AGENT_HOT that marks a mapped region as high-priority for the page reclaim algorithm. Pages marked this way are moved to the top of the active LRU list and are exempt from reclaim as long as the marking holds.
/* include/uapi/linux/mman.h */
#define MADV_AGENT_HOT 25 /* hint: critical for next step */
#define MADV_AGENT_COLD 26 /* hint: ok to evict aggressively */
/* Usage for vector index */
madvise(index_mmap, index_size,
MADV_AGENT_HOT);
/* Usage for one-shot scan output. Note: madvise advice codes are
 * values, not flags — they cannot be OR-ed, so issue two calls */
madvise(scan_buf, scan_size, MADV_SEQUENTIAL);
madvise(scan_buf, scan_size, MADV_AGENT_COLD);
Add IOSQE_LATENCY_PRIORITY to io_uring SQE flags. When set, the blk-mq layer submits the request without plugging and moves it to the front of the software staging queue. This gives agent retrieval reads p99 latency reduction without affecting background I/O.
/* include/uapi/linux/io_uring.h */
#define IOSQE_LATENCY_PRIORITY (1U << 7)
/* io_uring/rw.c */
if (req->flags & IOSQE_LATENCY_PRIORITY) {
kiocb->ki_flags |= IOCB_HIPRI;
blk_plug_invalidate(current_plug);
}
/* Userspace: latency-first read */
sqe->flags |= IOSQE_LATENCY_PRIORITY;
Introduce O_SYNC_WINDOW open flag: writes to this fd are durable within a configurable time window (e.g. 100ms) rather than on every fsync(). Combines journal commits from concurrent trace writers, reducing journal pressure 10–50× for typical agent log rates.
/* Motivation: agent writes 50 trace events/s */
/* Each fsync = 1 journal commit = 3-8ms stall */
/* O_SYNC_WINDOW groups commits in 100ms window */
fd = open("trace.log",
O_WRONLY | O_APPEND | O_SYNC_WINDOW,
0644);
/* kernel batches fsync calls from this fd */
/* max durability gap = sync_window_ms sysctl */
/* safe: on crash, at most 100ms of traces lost */
Add SCHED_NORMAL_LATENCY policy variant: like SCHED_NORMAL but with a configurable wakeup latency budget. The scheduler preferentially places these threads on a CPU with a short runqueue and avoids cross-NUMA migration. Not realtime — no preemption guarantees — just latency-aware placement.
/* kernel/sched/policy.h */
#define SCHED_NORMAL_LATENCY 7
struct sched_attr attr = {
.size = sizeof(attr),
.sched_policy = SCHED_NORMAL_LATENCY,
.sched_runtime = 500000, /* 500µs burst budget */
.sched_deadline= 2000000, /* 2ms wakeup target */
};
sched_setattr(orchestrator_tid, &attr, 0);
The safest upstream sequence is: (1) ship tracepoint enrichment patches first — they are always welcomed; (2) add the madvise hints as they extend an existing ABI cleanly; (3) propose the io_uring flag — io_uring maintainer Jens Axboe is receptive to latency hints; (4) task_struct changes require broader review. The scheduler policy change is the hardest sell and would need extensive benchmark data from real agent workloads before any maintainer would consider it.
eBPF tooling: observe before optimizing
Before proposing any kernel policy change, a credible implementation must ship with observability tooling. The first tool answers exactly one question: which kernel subsystem made this agent step slow? This tool can be built today with existing tracepoints, without any of the patches above.
#!/usr/bin/env bpftrace
/*
* agent-step-profiler.bt
* Attribute kernel latency to agent steps by PID.
 * Usage: bpftrace agent-step-profiler.bt $AGENT_PID
 * (the PID becomes positional parameter $1; -p does not set $1)
*
* Today: groups by PID. After Patch A: groups by agent_step_id.
*/
BEGIN { printf("Tracing agent kernel path. Ctrl-C to report.\n"); }
/* ── SCHEDULER ─────────────────────────────────────── */
tracepoint:sched:sched_wakeup /args->pid == $1/ {
    /* fires in the waker's context; key by the woken task's pid */
    @sched_wake_ts[args->pid] = nsecs;
}
tracepoint:sched:sched_switch /args->next_pid == $1/ {
    if (@sched_wake_ts[args->next_pid]) {
        @sched_us = hist((nsecs - @sched_wake_ts[args->next_pid]) / 1000);
        @sched_total_us += (nsecs - @sched_wake_ts[args->next_pid]) / 1000;
        delete(@sched_wake_ts[args->next_pid]);
    }
}
/* ── PAGE FAULTS ────────────────────────────────────── */
/* no *_return tracepoint exists for faults; use a kretprobe pair */
kprobe:handle_mm_fault /pid == $1/ {
    @fault_ts[tid] = nsecs;
    @fault_count++;
}
kretprobe:handle_mm_fault /@fault_ts[tid]/ {
    @fault_us = hist((nsecs - @fault_ts[tid]) / 1000);
    @fault_total_us += (nsecs - @fault_ts[tid]) / 1000;
    delete(@fault_ts[tid]);
}
/* ── BLOCK I/O ──────────────────────────────────────── */
tracepoint:block:block_rq_issue /pid == $1/ {
@bio_ts[args->sector] = nsecs;
}
tracepoint:block:block_rq_complete {
if (@bio_ts[args->sector]) {
$lat = (nsecs - @bio_ts[args->sector]) / 1000;
@bio_us = hist($lat);
@bio_total_us += $lat;
delete(@bio_ts[args->sector]);
}
}
/* ── VFS METADATA ───────────────────────────────────── */
tracepoint:syscalls:sys_enter_newfstatat /pid == $1/ {
@stat_count++;
}
tracepoint:syscalls:sys_enter_openat /pid == $1/ {
@open_ts[tid] = nsecs;
}
tracepoint:syscalls:sys_exit_openat /pid == $1/ {
if (@open_ts[tid]) {
@open_us += (nsecs - @open_ts[tid]) / 1000;
delete(@open_ts[tid]);
}
}
END {
printf("\n═══ AGENT KERNEL PROFILE ═══\n");
printf("Scheduler wakeup total: %lld µs\n", @sched_total_us);
printf("Page fault total: %lld µs (%lld faults)\n",
@fault_total_us, @fault_count);
printf("Block I/O total: %lld µs\n", @bio_total_us);
printf("VFS open() total: %lld µs\n", @open_us);
printf("VFS stat() count: %lld\n", @stat_count);
printf("\n── Wakeup latency histogram (µs) ──\n");
print(@sched_us);
printf("\n── Page fault latency histogram (µs) ──\n");
print(@fault_us);
printf("\n── Block I/O latency histogram (µs) ──\n");
print(@bio_us);
}
This script runs against any agent process today with no kernel changes. Run it during a representative agent session, then compare the three totals. Whichever dominates is where optimization effort should go first.
Once the dominant subsystem is identified, add targeted bpftrace scripts to drill down further. For page fault storms: trace the specific file inode causing faults. For block I/O: trace by device and request size. For scheduler: trace by runqueue depth at wakeup time. The data shapes the patch — not the other way around.
Connecting to the perf ecosystem
# Full kernel profile for an agent session using perf
$ perf record \
-e sched:sched_switch \
-e sched:sched_wakeup \
-e exceptions:page_fault_user \
-e block:block_rq_issue \
-e block:block_rq_complete \
-e filemap:mm_filemap_add_to_page_cache \
-e writeback:writeback_dirty_page \
-e syscalls:sys_enter_openat \
-e syscalls:sys_enter_newfstatat \
--call-graph dwarf \
-p $AGENT_PID \
-- sleep 30
# Generate flamegraph
$ perf script | stackcollapse-perf.pl | flamegraph.pl > agent_kernel.svg
# Aggregate by subsystem
$ perf stat \
-e sched:sched_migrate_task \
-e tlb:tlb_flush \
-e major-faults,minor-faults \
-e block:block_rq_issue \
-p $AGENT_PID \
-- sleep 30
Conclusion: agentic AI is an OS workload
The performance model for AI systems is changing. Batch training throughput was always a GPU problem. Inference throughput is still primarily a GPU problem. But agentic AI — the loops of tool calls, retrieval queries, state updates, and subprocess orchestration that power AI assistants, coding agents, and autonomous systems — is increasingly a kernel problem.
The kernel subsystems that matter most for agent latency, in order of typical impact:
| Rank | Subsystem | Primary failure mode | Best first fix |
|---|---|---|---|
| 1 | mm / page cache | mmap fault storms after page-cache eviction by cold tool scans | madvise(MADV_WILLNEED) before retrieval step |
| 2 | Scheduler | Subprocess spawn TLB flush storms; NUMA migration of orchestrator | Enable PCID; pin orchestrator thread to fixed CPU with pthread_setaffinity |
| 3 | blk-mq / NVMe | blk plug adds 200µs to latency-sensitive small reads | Set scheduler to none for NVMe devices; use RWF_NOWAIT |
| 4 | VFS / dcache | dentry lock contention under parallel tool file scanning | Reduce parallel tool concurrency; use O_PATH for stat-only operations |
| 5 | Writeback / journal | fsync-per-trace-write causes 50 journal commits/second | Buffer trace writes; call fdatasync() not fsync(); consider O_DSYNC |
The system thesis is simple: agentic AI is an OS workload. The next optimization layer is not only the framework or the model — it is the kernel path underneath every agent step. The kernel should be able to explain why every step was slow. Today it cannot. The patches above are the path to closing that gap.