Why hardware alone cannot fix this
We keep buying faster GPUs and denser NUMA nodes, yet agent tool-call latency stubbornly sits at 50–200 ms per invocation. The ceiling is not silicon — it is the kernel. A CPU-centric study across Haystack RAG, Toolformer, ChemCrow, LangChain, and SWE-Agent pipelines found that tool processing on CPUs can consume up to 90.6% of total end-to-end latency, and that CPU dynamic energy reaches 44% of the total at large batch sizes — even when a GPU handles inference.
The kernel mediates every single one of those tool calls. More cores do not help when the scheduler serialises short bursts behind long-running inference threads. More NUMA memory does not help when munmap() on a model tensor broadcasts an IPI shootdown to every core on every socket.
EEVDF + sched_ext
The Problem
CFS — and its replacement, EEVDF — distribute CPU time fairly across all runnable tasks. That is precisely the wrong policy for an agent. A tool call is a 2–10 ms burst that gates the next LLM token; a batch inference kernel is a 500 ms compute hog. The scheduler treats both as equal citizens and round-robins between them, imposing 50–200 ms wait times on the burst that should have been prioritised: with a dozen runnable inference threads ahead of it, a 2 ms burst waits through a full timeslice from each before it ever runs.
The deeper issue is group-scheduling overhead under cgroup colocation. Research on densely packed workloads has demonstrated that context-switch overhead compounds multiplicatively: the average cost of an individual switch rises at the same time as the switch rate, and the two multiply, leaving nodes effectively useless even when the orchestrator's placement is theoretically sound.
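To check whether a node is already in this regime, one cheap probe (a minimal sketch, not from the LAGS paper) is to diff the getrusage(2) context-switch counters around a single tool call; involuntary switches are the preemptions whose cost and rate compound under dense colocation:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/resource.h>

/* Diff rusage counters around a tool call. A high involuntary count
 * (ru_nivcsw) means the burst is being preempted by colocated threads
 * rather than running to completion. */
void report_ctx_switches(void (*tool_call)(void))
{
    struct rusage before, after;

    getrusage(RUSAGE_THREAD, &before);
    tool_call();
    getrusage(RUSAGE_THREAD, &after);

    printf("voluntary: %ld  involuntary: %ld\n",
           after.ru_nvcsw - before.ru_nvcsw,
           after.ru_nivcsw - before.ru_nivcsw);
}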
Proposed Patch — Agent Scheduling Class
The sched_ext BPF scheduler hook (merged in Linux 6.12) allows a fully custom dispatch policy without touching core kernel code. The proposed patch registers a new agent workload class that exploits a structural property agents have: they can declare their intent before a tool call fires, letting the kernel pre-stage dispatch decisions.
/* sched_ext ops — short agent bursts get a local idle core first */
struct sched_ext_ops agent_ops = {
    .select_cpu = agent_select_cpu,
    .enqueue    = agent_enqueue,
    .dispatch   = agent_dispatch,
    .name       = "agent_class",
};

s32 agent_select_cpu(struct task_struct *p, s32 prev_cpu, u64 wake_flags)
{
    if (task_is_agent_burst(p)) {
        s32 cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE);
        if (cpu >= 0)
            return cpu;
    }
    return prev_cpu;
}

void agent_enqueue(struct task_struct *p, u64 enq_flags)
{
    if (task_is_agent_burst(p))
        /* burst queue — drains before the batch DSQ */
        scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
    else
        /* batch inference gets a 4× timeslice at lower priority;
         * BATCH_DSQ is created in ops.init via scx_bpf_create_dsq() */
        scx_bpf_dispatch(p, BATCH_DSQ, SCX_SLICE_DFL * 4, enq_flags);
}

/* Userspace: agent runtime calls this before fork()-ing a tool.
 * sched_setattr(2) has no glibc wrapper, hence the raw syscall. */
syscall(SYS_sched_setattr, pid, &(struct sched_attr){
    .size          = sizeof(struct sched_attr),
    .sched_policy  = SCHED_EXT,
    .sched_flags   = SCHED_FLAG_AGENT_CLASS,  /* proposed flag */
    .sched_nice    = -10,       /* latency-nice extension (RFC v3) */
    .sched_runtime = 8000000,   /* 8 ms burst credit */
}, 0);
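The sketch above leaves task_is_agent_burst() undefined. One plausible backing, assuming the agent runtime tags burst TIDs from userspace before each tool call, is a BPF hash map consulted on every hook; the map and helper below are illustrative, not part of sched_ext:

/* Hypothetical backing for task_is_agent_burst(): the runtime marks a
 * TID as a burst before fork()-ing the tool; the scheduler hooks consult
 * the map on every wakeup and enqueue. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 8192);
    __type(key, __u32);   /* TID */
    __type(value, __u8);  /* 1 = short agent burst */
} burst_tasks SEC(".maps");

static bool task_is_agent_burst(struct task_struct *p)
{
    __u32 tid = p->pid;  /* kernel-side pid == userspace TID */
    __u8 *tagged = bpf_map_lookup_elem(&burst_tasks, &tid);

    return tagged && *tagged;
}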
TLB Shootdowns, NUMA, and Huge Pages
The Problem
An LLM inference process allocates tens of gigabytes for KV cache and model weights. When that process unmaps a region — or when the page cache evicts it — the kernel must invalidate TLB entries on every core that might have cached the mapping. On an 8-socket NUMA system, this TLB shootdown broadcasts an IPI to every core on every socket.
This overhead is systematically misattributed to storage or network in profiling traces. The measured reality: TLB shootdowns cause up to a 15.5× slowdown on modern multi-socket servers (Linux 6.5.7). At model-serving scale, with hundreds of agent sandboxes unmapping and remapping tensors in parallel, the IPI broadcast storm becomes the dominant latency source — and it is invisible in standard profiling because it shows up as CPU idle time, not wait time.
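One way to make that storm visible is to sample the kernel's own shootdown counters around an unmap. This is a sketch assuming the x86 /proc/interrupts layout, not something from the cited papers:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sum the per-CPU "TLB shootdowns" row of /proc/interrupts (x86).
 * Sampling before and after a large munmap() exposes the IPI storm
 * that profilers only report as idle time. */
unsigned long long tlb_shootdowns(void)
{
    FILE *f = fopen("/proc/interrupts", "r");
    char line[8192];
    unsigned long long total = 0;

    if (!f)
        return 0;
    while (fgets(line, sizeof(line), f)) {
        char *row = strstr(line, "TLB:");
        if (!row)
            continue;
        /* per-CPU counts follow the label; stop at the description text */
        for (char *tok = strtok(row + 4, " \t"); tok; tok = strtok(NULL, " \t")) {
            char *end;
            unsigned long long v = strtoull(tok, &end, 10);
            if (end == tok)
                break;
            total += v;
        }
        break;
    }
    fclose(f);
    return total;
}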
Proposed Patch — MAP_FPR + numaPTE Replication
/* ── MAP_FPR: fast page recycling flag (mm/mmap.c) ── */
#define MAP_FPR 0x200000  /* instructs kernel: skip shootdown on page recycle */

/* Tensor allocator uses MAP_FPR + MAP_HUGETLB for model weights */
void *tensor_buf = mmap(NULL, 4UL << 30, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FPR | MAP_HUGETLB,
                        -1, 0);

/* mm/rmap.c — elide the IPI if the page stays in the recycling pool */
static bool try_skip_tlb_flush(struct page *page, struct vm_area_struct *vma)
{
    if (!(vma->vm_flags & VM_FPR))
        return false;
    /* same physical page returning to the same fpr_id pool → safe to skip */
    if (page_in_fpr_pool(page, vma->fpr_id))
        return true;
    return false;
}

/* ── numaPTE: replicate page tables per NUMA socket ── */
void numa_pte_replicate(struct mm_struct *mm, pte_t *pte, int src_nid)
{
    int nid;

    for_each_online_node(nid) {
        if (nid == src_nid)
            continue;
        /* replicate to socket-local memory → remote TLB miss finds a local PTE */
        copy_pte_to_node(mm, pte, nid);
    }
}

/* ── THP enforcement for LLM model regions ── */
madvise(model_weights, model_size, MADV_HUGEPAGE);
madvise(kv_cache, kv_size, MADV_HUGEPAGE);
/* 2 MB TLB entries → 512× fewer entries for a 1 GB allocation */
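Since MADV_HUGEPAGE is advisory, a deployment should verify the kernel actually honoured it. A minimal check, assuming Linux's smaps_rollup interface, reads the AnonHugePages total for the process:

#include <stdio.h>
#include <string.h>

/* Verify that MADV_HUGEPAGE took effect: AnonHugePages in
 * /proc/self/smaps_rollup should approach the size of the model regions.
 * A near-zero value means the region fell back to 4 KB pages
 * (memory fragmentation, or THP disabled system-wide). */
long anon_huge_kb(void)
{
    FILE *f = fopen("/proc/self/smaps_rollup", "r");
    char line[256];
    long kb = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f))
        if (sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
            break;
    fclose(f);
    return kb;
}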
AgentCgroup — Intent-Driven Resource Control
The Problem
Traditional resource controllers were designed for predictable workloads: set a memory limit and a CPU quota, and the controller enforces them. Agent workloads violate every assumption. Research characterising Claude Code across 144 software-engineering tasks found 1.8× variance in resource use across identical retries, and up to 20× variance across different tasks. A compiler invocation can burst to gigabytes within milliseconds.
The traditional fallback — kill and restart — is uniquely catastrophic for agents: it destroys in-process LLM context that cannot be checkpointed, and it imposes a triple penalty of slow recovery, lost context, and non-deterministic re-execution.
Proposed Patch — Intent-Driven Cgroup API
/* ── Intent declaration API ── */
struct agent_resource_intent {
    __u64 mem_hint_bytes;  /* expected peak RSS */
    __u32 cpu_burst_ms;    /* expected CPU burst duration */
    __u32 io_burst_kbps;   /* expected I/O rate */
    __u32 intent_type;     /* COMPILE | TEST | INSTALL | SEARCH | READ */
    __u32 flags;
};

/* Agent runtime calls this BEFORE fork()-ing the tool subprocess */
int agent_cgroup_declare_intent(int cgroup_fd,
                                const struct agent_resource_intent *intent);

/* Example — agent declares a compiler invocation */
agent_cgroup_declare_intent(cg_fd, &(struct agent_resource_intent){
    .mem_hint_bytes = 4ULL << 30,  /* 4 GB expected for javac */
    .cpu_burst_ms   = 30000,       /* 30 s compile window */
    .io_burst_kbps  = 512000,
    .intent_type    = AGENT_INTENT_COMPILE,
});

/* ── memcg_bpf_ops hook — back-pressure instead of OOM kill ── */
int agent_get_high_delay_ms(struct mem_cgroup *memcg)
{
    if (memcg_in_retry_phase(memcg))
        return 50;  /* gentle throttle — preserves LLM context in the parent */
    return 0;       /* normal path */
}

/* ── cgroup.freeze instead of SIGKILL on overrun ── */
/*   echo 1 > /sys/fs/cgroup/agent-1/tool-calls/cgroup.freeze          */
/*   then migrate the LLM context to a new cgroup and unfreeze —       */
/*   no restart needed                                                 */
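The freeze-and-migrate path at the end of the sketch needs no new kernel code; it can be driven from userspace with the standard cgroup v2 files. A minimal illustration follows, with hypothetical paths and overrun policy:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static int cg_write(const char *path, const char *val)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, val, strlen(val));
    close(fd);
    return n < 0 ? -1 : 0;
}

int handle_overrun(pid_t tool_pid)
{
    /* 1. Freeze the runaway tool-call cgroup — no SIGKILL, state intact */
    cg_write("/sys/fs/cgroup/agent-1/tool-calls/cgroup.freeze", "1");

    /* 2. Migrate the process to a cgroup with a larger memory.high;
     *    moving out of a frozen cgroup implicitly thaws the task */
    char pid_str[16];
    snprintf(pid_str, sizeof(pid_str), "%d", tool_pid);
    cg_write("/sys/fs/cgroup/agent-1/tool-calls-large/cgroup.procs", pid_str);

    /* 3. Unfreeze the original group for subsequent tool calls */
    return cg_write("/sys/fs/cgroup/agent-1/tool-calls/cgroup.freeze", "0");
}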
Zero-Copy Async I/O for Sandboxed Agents
The Problem
A multi-agent pipeline issues thousands of tool calls per second. Each tool call reads files, spawns subprocesses, writes logs, and sends data over sockets. Every one of these is a kernel crossing. io_uring's shared ring buffer model eliminates the per-operation crossing — but in containerised environments, the default Docker seccomp profile explicitly blocks io_uring_setup.
This means agents inside sandboxes — exactly where you want them — lose all async I/O benefits and fall back to blocking syscalls. This is not a hypothetical: Google's security team documented that 60% of kernel bug bounty submissions in 2022 targeted io_uring, which caused the blanket ban. The fix is not "allow everything" — it is surgical opcode-level policy via eBPF LSM.
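To make concrete what the ban costs, here is the submit-once pattern a sandboxed agent loses. This uses the standard liburing API and is not part of the proposed patch:

#include <liburing.h>

/* One ring setup, then reads complete without a syscall per operation
 * (and with SQPOLL, without a syscall per batch either). Under the
 * default Docker seccomp profile, io_uring_queue_init() fails outright. */
int ring_read(int fd, void *buf, unsigned len)
{
    struct io_uring ring;
    struct io_uring_cqe *cqe;

    if (io_uring_queue_init(64, &ring, 0) < 0)
        return -1;  /* blocked inside a default-profile container */

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, len, 0);
    io_uring_submit(&ring);

    io_uring_wait_cqe(&ring, &cqe);
    int res = cqe->res;
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return res;
}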
Proposed Patch — Opcode-Level eBPF Policy + memfd IPC
/* ── per-container io_uring opcode allowlist via BPF LSM ── */
SEC("lsm/io_uring_cmd")
int BPF_PROG(agent_iouring_policy, struct io_uring_cmd *cmd)
{
    __u32 op = BPF_CORE_READ(cmd, sqe, opcode);

    switch (op) {
    case IORING_OP_READ:
    case IORING_OP_WRITE:
    case IORING_OP_RECV:
    case IORING_OP_SEND:
    case IORING_OP_SEND_ZC:          /* zero-copy send — safe, high-value */
    case IORING_OP_POLL_ADD:
    case IORING_OP_TIMEOUT:
        return 0;                    /* ALLOW */
    case IORING_OP_PROVIDE_BUFFERS:  /* CVE history — block */
    case IORING_OP_REMOVE_BUFFERS:
    case IORING_OP_REGISTER:
        return -EPERM;               /* DENY */
    default:
        return -EPERM;
    }
}

/* ── zero-copy inter-agent IPC via memfd shared memory ── */
int agent_ipc_channel_create(const char *name, size_t sz)
{
    int fd = memfd_create(name, MFD_CLOEXEC | MFD_ALLOW_SEALING);

    ftruncate(fd, sz);
    /* seal the size — prevents resize attacks */
    fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_SEAL);
    return fd;  /* pass to peer agent over an SCM_RIGHTS unix socket */
}

/* Sender: write the tensor result into the shared region */
void *shm = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, ipc_fd, 0);
memcpy(shm, tensor_result, tensor_size);  /* zero extra copies */

/* Receiver: map the same fd — no data copy between agent processes */
void *received = mmap(NULL, sz, PROT_READ, MAP_SHARED, ipc_fd, 0);
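The sketch hand-waves the "pass to peer agent over SCM_RIGHTS" step; for completeness, the standard control-message dance looks like this (error handling omitted):

#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send the sealed memfd to a peer agent over a connected unix socket.
 * The fd arrives as a fresh descriptor onto the same sealed memory —
 * the mapping itself is never copied. */
ssize_t send_ipc_fd(int sock, int ipc_fd)
{
    char dummy = 0;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = {
        .msg_iov        = &iov,
        .msg_iovlen     = 1,
        .msg_control    = u.buf,
        .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &ipc_fd, sizeof(int));

    return sendmsg(sock, &msg, 0);
}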
Stateful Policies and Semantic Observability
The Problem
Seccomp is the primary isolation boundary for containerised agents — used by Docker, Kubernetes, gVisor, Firecracker, and Android. Its filtering model is static: an allowlist of syscall numbers, evaluated by a classic BPF (cBPF) program that has no memory of previous calls, cannot follow pointer arguments, and cannot adapt to the agent's phase. Security teams face a binary choice: over-restrict (block io_uring, break performance) or under-restrict (allow everything, widen the attack surface).
Simultaneously, the observability story for agents is broken. Standard APM tools trace at the function or HTTP level. Agent-specific issues — prompt injection driving unexpected syscalls, reasoning loops burning CPU, coordination bottlenecks between agents — are invisible to existing observability stacks.
Proposed Patch — Stateful Phase-Adaptive seccomp
/* ── Per-agent phase map — updated by the agent runtime ── */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 4096);
    __type(key, __u32);    /* agent PID */
    __type(value, __u32);  /* phase: PLAN | TOOL | REFLECT | IDLE */
} agent_phase_map SEC(".maps");

/* ── Audit ring buffer ── */
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 22);  /* 4 MB ring */
} audit_ring SEC(".maps");

SEC("seccomp")
int agent_seccomp_filter(struct seccomp_data *ctx)
{
    __u32 pid = bpf_get_current_pid_tgid() >> 32;
    __u32 *phasep = bpf_map_lookup_elem(&agent_phase_map, &pid);
    __u32 phase = phasep ? *phasep : AGENT_PHASE_IDLE;

    switch (phase) {
    case AGENT_PHASE_PLAN:
        /* LLM inference: mmap, futex, epoll, read/write */
        return plan_phase_policy(ctx->nr);
    case AGENT_PHASE_TOOL:
        /* Tool execution: subprocess, network, filesystem, io_uring (safe ops) */
        return tool_phase_policy(ctx->nr);
    case AGENT_PHASE_REFLECT:
        /* Post-tool analysis: minimal surface — no network, no exec */
        return reflect_phase_policy(ctx->nr);
    default:
        audit_unexpected_syscall(&audit_ring, pid, ctx->nr);
        return SECCOMP_RET_ERRNO | EPERM;
    }
}

/* ── Agent runtime updates its phase before each action ── */
void agent_set_phase(enum agent_phase p)
{
    __u32 pid = getpid();

    bpf_map_update_elem(phase_map_fd, &pid, &p, BPF_ANY);
}

/* ── AgentSight uprobe for TLS intent capture ── */
/* bpftime userspace eBPF — 10× lower overhead than kernel uprobes */
SEC("uprobe/libssl.so:SSL_write")
int capture_llm_intent(struct pt_regs *ctx)
{
    /* intercept TLS write → extract LLM prompt → correlate with kernel trace */
    return emit_intent_event(ctx);
}
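On the runtime side, usage is a thin wrapper around each action. The sketch below reuses agent_set_phase() from above; tool_argv and run_tool() are illustrative:

#include <sys/wait.h>
#include <unistd.h>

/* Wrap one tool call: widen the syscall surface only for its duration.
 * Note: a real runtime would also register the child's PID in
 * agent_phase_map, since the filter keys policy on the calling PID. */
void run_tool(char *const tool_argv[])
{
    agent_set_phase(AGENT_PHASE_TOOL);

    pid_t pid = fork();
    if (pid == 0)
        execvp(tool_argv[0], tool_argv);   /* child runs under TOOL policy */

    int status;
    waitpid(pid, &status, 0);

    agent_set_phase(AGENT_PHASE_REFLECT);  /* shrink surface immediately after */
}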
The Unified Picture
These five patches are not independent; they form a coherent, layered system. The scheduler knows the agent's phase via sched_ext hints. The memory manager avoids shootdowns for tensor allocations via MAP_FPR. The cgroup controller has been told the expected burst shape in advance via agent_cgroup_declare_intent. io_uring is safely available inside the sandbox via the opcode-level BPF policy. And eBPF observes the semantic intent behind every syscall via stateful seccomp and AgentSight.
Together they address what the AgentCgroup authors identified as the three fundamental mismatches with existing kernel abstractions: granularity (process-level limits vs. tool-call-level demand), reactivity (history-based prediction vs. sub-second bursts), and adaptability (static limits vs. non-deterministic execution).
Research + Upstream
AgentCgroup: Understanding and Controlling OS Resources of AI Agents — Zheng et al., arXiv:2602.09345 (Feb 2026)
AgentSight: System-Level Observability for AI Agents Using eBPF — eunomia-bpf, arXiv:2508.02736 (2025)
Mitigating context switching in densely packed Linux clusters with LAGS — arXiv:2508.15703 (Aug 2025)
Skip TLB flushes for reused pages within mmaps (MAP_FPR) — arXiv:2409.10946 (Sep 2024)
numaPTE: Managing Page-Tables and TLBs on NUMA Systems — arXiv:2401.15558
Towards Agentic OS: LLM Agent Framework for Linux Schedulers (SchedCP) — arXiv:2509.01245 (Sep 2025)
Composable OS Kernel Architectures for Autonomous Intelligence — arXiv:2508.00604 (Aug 2025)
A CPU-Centric Perspective on Agentic AI — arXiv:2511.00739 (Nov 2025)
Programmable System Call Security with eBPF — arXiv:2302.10366
bpftime: Userspace eBPF Runtime — arXiv:2311.07923