Why hardware alone cannot fix this
We keep buying faster GPUs and denser NUMA nodes, yet agent tool-call latency stubbornly sits at 50–200 ms per invocation. The ceiling is not silicon — it is the kernel. A CPU-centric study across Haystack RAG, Toolformer, ChemCrow, LangChain, and SWE-Agent pipelines found that tool processing on CPUs can consume up to 90.6% of total end-to-end latency, and that CPU dynamic energy reaches 44% of the total at large batch sizes — even when a GPU handles inference.
The kernel mediates every single one of those tool calls. More cores do not help when the scheduler serialises short bursts behind long-running inference threads. More NUMA memory does not help when munmap() on a model tensor broadcasts an IPI shootdown to every core on every socket.
EEVDF + sched_ext
The Problem
CFS — and its replacement, EEVDF — distribute CPU time fairly across all runnable tasks. That is precisely the wrong policy for an agent. A tool call is a 2–10 ms burst that gates the next LLM token; a batch inference kernel is a 500 ms compute hog. The scheduler treats both as equal citizens and round-robins between them, imposing 50–200 ms wait times on the burst that should have been prioritised: with a dozen runnable inference threads ahead of it, a 2 ms burst waits through a full timeslice from each before it ever runs.
The deeper issue is group-scheduling overhead under cgroup colocation. Research on densely packed workloads has demonstrated that context-switch overhead compounds multiplicatively: the average cost of an individual switch rises at the same time as the switch rate, and the two multiply, leaving nodes effectively useless even when the orchestrator's placement is theoretically sound.
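To check whether a node is already in this regime, one cheap probe (a minimal sketch, not from the LAGS paper) is to diff the getrusage(2) context-switch counters around a single tool call; involuntary switches are the preemptions whose cost and rate compound under dense colocation:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/resource.h>

/* Diff rusage counters around a tool call. A high involuntary count
 * (ru_nivcsw) means the burst is being preempted by colocated threads
 * rather than running to completion. */
void report_ctx_switches(void (*tool_call)(void))
{
    struct rusage before, after;

    getrusage(RUSAGE_THREAD, &before);
    tool_call();
    getrusage(RUSAGE_THREAD, &after);

    printf("voluntary: %ld  involuntary: %ld\n",
           after.ru_nvcsw - before.ru_nvcsw,
           after.ru_nivcsw - before.ru_nivcsw);
}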
Proposed Patch — Agent Scheduling Class
The sched_ext BPF scheduler hook (merged in Linux 6.12) allows a fully custom dispatch policy without touching core kernel code. The proposed patch registers a new agent workload class that exploits a structural property agents have: they can declare their intent before a tool call fires, letting the kernel pre-stage dispatch decisions.
/* sched_ext ops — short agent bursts get a local idle core first */
struct sched_ext_ops agent_ops = {
    .select_cpu = agent_select_cpu,
    .enqueue    = agent_enqueue,
    .dispatch   = agent_dispatch,
    .name       = "agent_class",
};

s32 agent_select_cpu(struct task_struct *p, s32 prev_cpu, u64 wake_flags)
{
    if (task_is_agent_burst(p)) {
        s32 cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE);
        if (cpu >= 0)
            return cpu;
    }
    return prev_cpu;
}

void agent_enqueue(struct task_struct *p, u64 enq_flags)
{
    if (task_is_agent_burst(p))
        /* burst queue — drains before the batch DSQ */
        scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
    else
        /* batch inference gets a 4× timeslice at lower priority;
         * BATCH_DSQ is created in ops.init via scx_bpf_create_dsq() */
        scx_bpf_dispatch(p, BATCH_DSQ, SCX_SLICE_DFL * 4, enq_flags);
}

/* Userspace: agent runtime calls this before fork()-ing a tool.
 * sched_setattr(2) has no glibc wrapper, hence the raw syscall. */
syscall(SYS_sched_setattr, pid, &(struct sched_attr){
    .size          = sizeof(struct sched_attr),
    .sched_policy  = SCHED_EXT,
    .sched_flags   = SCHED_FLAG_AGENT_CLASS,  /* proposed flag */
    .sched_nice    = -10,       /* latency-nice extension (RFC v3) */
    .sched_runtime = 8000000,   /* 8 ms burst credit */
}, 0);
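The sketch above leaves task_is_agent_burst() undefined. One plausible backing, assuming the agent runtime tags burst TIDs from userspace before each tool call, is a BPF hash map consulted on every hook; the map and helper below are illustrative, not part of sched_ext:

/* Hypothetical backing for task_is_agent_burst(): the runtime marks a
 * TID as a burst before fork()-ing the tool; the scheduler hooks consult
 * the map on every wakeup and enqueue. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 8192);
    __type(key, __u32);   /* TID */
    __type(value, __u8);  /* 1 = short agent burst */
} burst_tasks SEC(".maps");

static bool task_is_agent_burst(struct task_struct *p)
{
    __u32 tid = p->pid;  /* kernel-side pid == userspace TID */
    __u8 *tagged = bpf_map_lookup_elem(&burst_tasks, &tid);

    return tagged && *tagged;
}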
TLB Shootdowns, NUMA, and Huge Pages
The Problem
An LLM inference process allocates tens of gigabytes for KV cache and model weights. When that process unmaps a region — or when the page cache evicts it — the kernel must invalidate TLB entries on every core that might have cached the mapping. On an 8-socket NUMA system, this TLB shootdown broadcasts an IPI to every core on every socket.
This overhead is systematically misattributed to storage or network in profiling traces. The measured reality: TLB shootdowns cause up to a 15.5× slowdown on modern multi-socket servers (Linux 6.5.7). At model-serving scale, with hundreds of agent sandboxes unmapping and remapping tensors in parallel, the IPI broadcast storm becomes the dominant latency source — and it is invisible in standard profiling because it shows up as CPU idle time, not wait time.
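One way to make that storm visible is to sample the kernel's own shootdown counters around an unmap. This is a sketch assuming the x86 /proc/interrupts layout, not something from the cited papers:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sum the per-CPU "TLB shootdowns" row of /proc/interrupts (x86).
 * Sampling before and after a large munmap() exposes the IPI storm
 * that profilers only report as idle time. */
unsigned long long tlb_shootdowns(void)
{
    FILE *f = fopen("/proc/interrupts", "r");
    char line[8192];
    unsigned long long total = 0;

    if (!f)
        return 0;
    while (fgets(line, sizeof(line), f)) {
        char *row = strstr(line, "TLB:");
        if (!row)
            continue;
        /* per-CPU counts follow the label; stop at the description text */
        for (char *tok = strtok(row + 4, " \t"); tok; tok = strtok(NULL, " \t")) {
            char *end;
            unsigned long long v = strtoull(tok, &end, 10);
            if (end == tok)
                break;
            total += v;
        }
        break;
    }
    fclose(f);
    return total;
}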
Proposed Patch — MAP_FPR + numaPTE Replication
/* ── MAP_FPR: fast page recycling flag (mm/mmap.c) ── */
#define MAP_FPR 0x200000  /* instructs kernel: skip shootdown on page recycle */

/* Tensor allocator uses MAP_FPR + MAP_HUGETLB for model weights */
void *tensor_buf = mmap(NULL, 4UL << 30, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FPR | MAP_HUGETLB,
                        -1, 0);

/* mm/rmap.c — elide the IPI if the page stays in the recycling pool */
static bool try_skip_tlb_flush(struct page *page, struct vm_area_struct *vma)
{
    if (!(vma->vm_flags & VM_FPR))
        return false;
    /* same physical page returning to the same fpr_id pool → safe to skip */
    if (page_in_fpr_pool(page, vma->fpr_id))
        return true;
    return false;
}

/* ── numaPTE: replicate page tables per NUMA socket ── */
void numa_pte_replicate(struct mm_struct *mm, pte_t *pte, int src_nid)
{
    int nid;

    for_each_online_node(nid) {
        if (nid == src_nid)
            continue;
        /* replicate to socket-local memory → remote TLB miss finds a local PTE */
        copy_pte_to_node(mm, pte, nid);
    }
}

/* ── THP enforcement for LLM model regions ── */
madvise(model_weights, model_size, MADV_HUGEPAGE);
madvise(kv_cache, kv_size, MADV_HUGEPAGE);
/* 2 MB TLB entries → 512× fewer entries for a 1 GB allocation */
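Since MADV_HUGEPAGE is advisory, a deployment should verify the kernel actually honoured it. A minimal check, assuming Linux's smaps_rollup interface, reads the AnonHugePages total for the process:

#include <stdio.h>
#include <string.h>

/* Verify that MADV_HUGEPAGE took effect: AnonHugePages in
 * /proc/self/smaps_rollup should approach the size of the model regions.
 * A near-zero value means the region fell back to 4 KB pages
 * (memory fragmentation, or THP disabled system-wide). */
long anon_huge_kb(void)
{
    FILE *f = fopen("/proc/self/smaps_rollup", "r");
    char line[256];
    long kb = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f))
        if (sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
            break;
    fclose(f);
    return kb;
}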
AgentCgroup — Intent-Driven Resource Control
The Problem
Traditional resource controllers were designed for predictable workloads: set a memory limit and a CPU quota, and the controller enforces them. Agent workloads violate every assumption. Research characterising Claude Code across 144 software-engineering tasks found 1.8× variance in resource use across identical retries, and up to 20× variance across different tasks. A compiler invocation can burst to gigabytes within milliseconds.
The traditional fallback — kill and restart — is uniquely catastrophic for agents: it destroys in-process LLM context that cannot be checkpointed, and it imposes a triple penalty of slow recovery, lost context, and non-deterministic re-execution.
Proposed Patch — Intent-Driven Cgroup API
/* ── Intent declaration API ── */
struct agent_resource_intent {
    __u64 mem_hint_bytes;  /* expected peak RSS */
    __u32 cpu_burst_ms;    /* expected CPU burst duration */
    __u32 io_burst_kbps;   /* expected I/O rate */
    __u32 intent_type;     /* COMPILE | TEST | INSTALL | SEARCH | READ */
    __u32 flags;
};

/* Agent runtime calls this BEFORE fork()-ing the tool subprocess */
int agent_cgroup_declare_intent(int cgroup_fd,
                                const struct agent_resource_intent *intent);

/* Example — agent declares a compiler invocation */
agent_cgroup_declare_intent(cg_fd, &(struct agent_resource_intent){
    .mem_hint_bytes = 4ULL << 30,  /* 4 GB expected for javac */
    .cpu_burst_ms   = 30000,       /* 30 s compile window */
    .io_burst_kbps  = 512000,
    .intent_type    = AGENT_INTENT_COMPILE,
});

/* ── memcg_bpf_ops hook — back-pressure instead of OOM kill ── */
int agent_get_high_delay_ms(struct mem_cgroup *memcg)
{
    if (memcg_in_retry_phase(memcg))
        return 50;  /* gentle throttle — preserves LLM context in the parent */
    return 0;       /* normal path */
}

/* ── cgroup.freeze instead of SIGKILL on overrun ── */
/*   echo 1 > /sys/fs/cgroup/agent-1/tool-calls/cgroup.freeze          */
/*   then migrate the LLM context to a new cgroup and unfreeze —       */
/*   no restart needed                                                 */
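The freeze-and-migrate path at the end of the sketch needs no new kernel code; it can be driven from userspace with the standard cgroup v2 files. A minimal illustration follows, with hypothetical paths and overrun policy:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static int cg_write(const char *path, const char *val)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, val, strlen(val));
    close(fd);
    return n < 0 ? -1 : 0;
}

int handle_overrun(pid_t tool_pid)
{
    /* 1. Freeze the runaway tool-call cgroup — no SIGKILL, state intact */
    cg_write("/sys/fs/cgroup/agent-1/tool-calls/cgroup.freeze", "1");

    /* 2. Migrate the process to a cgroup with a larger memory.high;
     *    moving out of a frozen cgroup implicitly thaws the task */
    char pid_str[16];
    snprintf(pid_str, sizeof(pid_str), "%d", tool_pid);
    cg_write("/sys/fs/cgroup/agent-1/tool-calls-large/cgroup.procs", pid_str);

    /* 3. Unfreeze the original group for subsequent tool calls */
    return cg_write("/sys/fs/cgroup/agent-1/tool-calls/cgroup.freeze", "0");
}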
Zero-Copy Async I/O for Sandboxed Agents
The Problem
A multi-agent pipeline issues thousands of tool calls per second. Each tool call reads files, spawns subprocesses, writes logs, and sends data over sockets. Every one of these is a kernel crossing. io_uring's shared ring buffer model eliminates the per-operation crossing — but in containerised environments, the default Docker seccomp profile explicitly blocks io_uring_setup.
This means agents inside sandboxes — exactly where you want them — lose all async I/O benefits and fall back to blocking syscalls. This is not a hypothetical: Google's security team documented that 60% of kernel bug bounty submissions in 2022 targeted io_uring, which caused the blanket ban. The fix is not "allow everything" — it is surgical opcode-level policy via eBPF LSM.
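To make concrete what the ban costs, here is the submit-once pattern a sandboxed agent loses. This uses the standard liburing API and is not part of the proposed patch:

#include <liburing.h>

/* One ring setup, then reads complete without a syscall per operation
 * (and with SQPOLL, without a syscall per batch either). Under the
 * default Docker seccomp profile, io_uring_queue_init() fails outright. */
int ring_read(int fd, void *buf, unsigned len)
{
    struct io_uring ring;
    struct io_uring_cqe *cqe;

    if (io_uring_queue_init(64, &ring, 0) < 0)
        return -1;  /* blocked inside a default-profile container */

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, len, 0);
    io_uring_submit(&ring);

    io_uring_wait_cqe(&ring, &cqe);
    int res = cqe->res;
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return res;
}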
Proposed Patch — Opcode-Level eBPF Policy + memfd IPC
/* ── per-container io_uring opcode allowlist via BPF LSM ── */
SEC("lsm/io_uring_cmd")
int BPF_PROG(agent_iouring_policy, struct io_uring_cmd *cmd)
{
    __u32 op = BPF_CORE_READ(cmd, sqe, opcode);

    switch (op) {
    case IORING_OP_READ:
    case IORING_OP_WRITE:
    case IORING_OP_RECV:
    case IORING_OP_SEND:
    case IORING_OP_SEND_ZC:          /* zero-copy send — safe, high-value */
    case IORING_OP_POLL_ADD:
    case IORING_OP_TIMEOUT:
        return 0;                    /* ALLOW */
    case IORING_OP_PROVIDE_BUFFERS:  /* CVE history — block */
    case IORING_OP_REMOVE_BUFFERS:
    case IORING_OP_REGISTER:
        return -EPERM;               /* DENY */
    default:
        return -EPERM;
    }
}

/* ── zero-copy inter-agent IPC via memfd shared memory ── */
int agent_ipc_channel_create(const char *name, size_t sz)
{
    int fd = memfd_create(name, MFD_CLOEXEC | MFD_ALLOW_SEALING);

    ftruncate(fd, sz);
    /* seal the size — prevents resize attacks */
    fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_SEAL);
    return fd;  /* pass to peer agent over an SCM_RIGHTS unix socket */
}

/* Sender: write the tensor result into the shared region */
void *shm = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, ipc_fd, 0);
memcpy(shm, tensor_result, tensor_size);  /* zero extra copies */

/* Receiver: map the same fd — no data copy between agent processes */
void *received = mmap(NULL, sz, PROT_READ, MAP_SHARED, ipc_fd, 0);
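The sketch hand-waves the "pass to peer agent over SCM_RIGHTS" step; for completeness, the standard control-message dance looks like this (error handling omitted):

#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send the sealed memfd to a peer agent over a connected unix socket.
 * The fd arrives as a fresh descriptor onto the same sealed memory —
 * the mapping itself is never copied. */
ssize_t send_ipc_fd(int sock, int ipc_fd)
{
    char dummy = 0;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = {
        .msg_iov        = &iov,
        .msg_iovlen     = 1,
        .msg_control    = u.buf,
        .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &ipc_fd, sizeof(int));

    return sendmsg(sock, &msg, 0);
}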
Stateful Policies and Semantic Observability
The Problem
Seccomp is the primary isolation boundary for containerised agents — used by Docker, Kubernetes, gVisor, Firecracker, and Android. Its filtering model is static: an allowlist of syscall numbers, evaluated by a classic BPF (cBPF) program that has no memory of previous calls, cannot follow pointer arguments, and cannot adapt to the agent's phase. Security teams face a binary choice: over-restrict (block io_uring, break performance) or under-restrict (allow everything, widen the attack surface).
Simultaneously, the observability story for agents is broken. Standard APM tools trace at the function or HTTP level. Agent-specific issues — prompt injection driving unexpected syscalls, reasoning loops burning CPU, coordination bottlenecks between agents — are invisible to existing observability stacks.
Proposed Patch — Stateful Phase-Adaptive seccomp
/* ── Per-agent phase map — updated by the agent runtime ── */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 4096);
    __type(key, __u32);    /* agent PID */
    __type(value, __u32);  /* phase: PLAN | TOOL | REFLECT | IDLE */
} agent_phase_map SEC(".maps");

/* ── Audit ring buffer ── */
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 22);  /* 4 MB ring */
} audit_ring SEC(".maps");

SEC("seccomp")
int agent_seccomp_filter(struct seccomp_data *ctx)
{
    __u32 pid = bpf_get_current_pid_tgid() >> 32;
    __u32 *phasep = bpf_map_lookup_elem(&agent_phase_map, &pid);
    __u32 phase = phasep ? *phasep : AGENT_PHASE_IDLE;

    switch (phase) {
    case AGENT_PHASE_PLAN:
        /* LLM inference: mmap, futex, epoll, read/write */
        return plan_phase_policy(ctx->nr);
    case AGENT_PHASE_TOOL:
        /* Tool execution: subprocess, network, filesystem, io_uring (safe ops) */
        return tool_phase_policy(ctx->nr);
    case AGENT_PHASE_REFLECT:
        /* Post-tool analysis: minimal surface — no network, no exec */
        return reflect_phase_policy(ctx->nr);
    default:
        audit_unexpected_syscall(&audit_ring, pid, ctx->nr);
        return SECCOMP_RET_ERRNO | EPERM;
    }
}

/* ── Agent runtime updates its phase before each action ── */
void agent_set_phase(enum agent_phase p)
{
    __u32 pid = getpid();

    bpf_map_update_elem(phase_map_fd, &pid, &p, BPF_ANY);
}

/* ── AgentSight uprobe for TLS intent capture ── */
/* bpftime userspace eBPF — 10× lower overhead than kernel uprobes */
SEC("uprobe/libssl.so:SSL_write")
int capture_llm_intent(struct pt_regs *ctx)
{
    /* intercept TLS write → extract LLM prompt → correlate with kernel trace */
    return emit_intent_event(ctx);
}
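On the runtime side, usage is a thin wrapper around each action. The sketch below reuses agent_set_phase() from above; tool_argv and run_tool() are illustrative:

#include <sys/wait.h>
#include <unistd.h>

/* Wrap one tool call: widen the syscall surface only for its duration.
 * Note: a real runtime would also register the child's PID in
 * agent_phase_map, since the filter keys policy on the calling PID. */
void run_tool(char *const tool_argv[])
{
    agent_set_phase(AGENT_PHASE_TOOL);

    pid_t pid = fork();
    if (pid == 0)
        execvp(tool_argv[0], tool_argv);   /* child runs under TOOL policy */

    int status;
    waitpid(pid, &status, 0);

    agent_set_phase(AGENT_PHASE_REFLECT);  /* shrink surface immediately after */
}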
The Unified Picture
These five patches are not independent; they form a coherent, layered system. The scheduler knows the agent's phase via sched_ext hints. The memory manager avoids shootdowns for tensor allocations via MAP_FPR. The cgroup controller has been told the expected burst shape in advance via agent_cgroup_declare_intent. io_uring is safely available inside the sandbox via the opcode-level BPF policy. And eBPF observes the semantic intent behind every syscall via stateful seccomp and AgentSight.
Together they address what the AgentCgroup authors identified as the three fundamental mismatches with existing kernel abstractions: granularity (process-level limits vs. tool-call-level demand), reactivity (history-based prediction vs. sub-second bursts), and adaptability (static limits vs. non-deterministic execution).
Research + Upstream
AgentCgroup: Understanding and Controlling OS Resources of AI Agents — Zheng et al., arXiv:2602.09345 (Feb 2026)
AgentSight: System-Level Observability for AI Agents Using eBPF — eunomia-bpf, arXiv:2508.02736 (2025)
Mitigating context switching in densely packed Linux clusters with LAGS — arXiv:2508.15703 (Aug 2025)
Skip TLB flushes for reused pages within mmaps (MAP_FPR) — arXiv:2409.10946 (Sep 2024)
numaPTE: Managing Page-Tables and TLBs on NUMA Systems — arXiv:2401.15558
Towards Agentic OS: LLM Agent Framework for Linux Schedulers (SchedCP) — arXiv:2509.01245 (Sep 2025)
Composable OS Kernel Architectures for Autonomous Intelligence — arXiv:2508.00604 (Aug 2025)
A CPU-Centric Perspective on Agentic AI — arXiv:2511.00739 (Nov 2025)
Programmable System Call Security with eBPF — arXiv:2302.10366
bpftime: Userspace eBPF Runtime — arXiv:2311.07923