MANISH AI
Kernel Architecture Series · Part II
↖ Follows: "The Kernel is the Bottleneck" (Scheduler · Memory · Cgroup · io_uring · eBPF)

The Data Path

Four more Linux subsystems that silently throttle every agent loop — and the patches that fix them
Published: May 1, 2026
Kernel: Linux 6.12+
Focus: Network · Storage · Process spawning · Interrupt isolation
Read time: 23 min
Level: Senior / Principal Engineer
00

Where Part I left off

Part I of this series identified five kernel subsystems that fail under agentic workloads: the scheduler, memory manager, cgroup v2, io_uring IPC, and eBPF/seccomp. Those patches make the kernel smarter about deciding which process runs and how much memory it gets. But agents don't just think — they act. Every tool call reads a file, spawns a process, writes a result over a socket. The data path is a second, entirely distinct failure surface.

This post covers four more subsystems that are quietly destroying data-path performance for agentic workloads today. Together with Part I, they form a complete picture of the kernel changes needed to serve the agentic computing era.

Subsystem 6 — Network stack: XDP · AF_XDP · BIG TCP · multi-agent packet routing
Subsystem 7 — Storage / VFS: NVMe multi-queue · BPF storage · BranchFS filesystem branching
Subsystem 8 — Process spawning: fork() COW · userfaultfd · branch() syscall · sandbox startup
Subsystem 9 — Interrupt isolation: NOHZ_FULL · IPI suppression · IRQ affinity · task isolation
The same thesis applies. Just as with the five subsystems in Part I, the bottlenecks here are not hardware — they are 1990s-era design assumptions colliding with a workload pattern that did not exist until 2024: thousands of concurrent, short-lived, isolated, LLM-driven processes with highly bursty, correlated I/O.

01 / 04

The Network Stack

The Problem

Multi-agent systems communicate constantly: orchestrator to sub-agent, agent to tool server, agent to retrieval system, agent to agent. Each message crosses a network socket. The Linux TCP/IP stack was designed in an era when 10 Mbps was fast. It has been patched incrementally — SO_REUSEPORT, BIG TCP, GSO, GRO — but its fundamental architecture has not changed. The cost is brutal: Linux requires 4 to 8 CPU cores to fully saturate a single 100 Gbps NIC. Kernel-bypass solutions like DPDK cut this to ~40 cycles per message vs. the kernel stack's ~2,400 cycles — a roughly 60× advantage — but they require application rewrites and exclusive NIC ownership, which is incompatible with multi-tenant agent orchestration.

The measurement reality is stark. Standard socket I/O achieves P50 latency of 18.4 µs for a typical inter-agent message. XDP cuts that to 2.1 µs. DPDK reaches 850 ns. The kernel stack also collapses at 2M packets/sec — which a busy multi-agent cluster hits trivially — while XDP holds through 5M+ pps. This is not a theoretical gap; it shows up directly as agent loop tail latency, because the P99 of inter-agent communication determines the P99 of the entire plan-act loop.

60× — cycle advantage of DPDK vs the kernel stack, per inter-agent message
8.8× — lower P50 latency with XDP vs a standard socket (2.1 µs vs 18.4 µs)
4–8 — CPU cores Linux wastes today to saturate one 100 Gbps NIC

Architecture: The Three-Speed Network Stack

Fig 6 — Linux network ingress path: kernel stack vs XDP vs DPDK, with agent socket profiles
[Figure: packets enter at the NIC/driver, where the XDP hook is the earliest kernel attach point. Three paths: (1) kernel TCP/IP stack — IP routing, TCP/UDP layer, socket buffer copy, context switch to user; ~2,400 cycles/msg, P50 18.4 µs. (2) AF_XDP — XDP_REDIRECT into a zero-copy UMEM RX/TX ring, no protocol stack; ~200 cycles/msg, P50 2.1 µs (8.8×). (3) DPDK / kernel-bypass — PMD polls the NIC directly, user-space TCP stack, no kernel involvement, but exclusive NIC ownership; ~40 cycles/msg, P50 850 ns (21.6×). Joyride (proposed) is a transparent DPDK-userspace bridge preserving kernel socket compatibility, with dynamic fallback kernel stack ↔ AF_XDP ↔ DPDK by traffic class, no app changes needed.]

Root Cause: The Missing Middle

The kernel network stack's fundamental architecture dates to BSD 4.3. Every packet traverses the full IP, TCP, socket buffer, and system call path regardless of whether the application needs any of those layers. For agent-to-agent communication on the same physical host, this means a message travels all the way out to the NIC, through the network stack, back through the loopback interface, and through the full stack again — paying ~4,800 cycles for what should be a shared memory write. Even on multi-host deployments, the per-packet overhead compounds with NUMA socket mismatch when the NIC's interrupt is handled by a different socket than the agent process.

The fragmentation problem. High-performance alternatives (DPDK, RDMA) require application rewrites, specific hardware, or exclusive NIC access. This is incompatible with multi-tenant agent orchestration where a single NIC serves dozens of agent containers simultaneously. The kernel needs a transparent bypass path, not a fork-the-ecosystem solution.
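As a baseline, the kernel-stack numbers above are easy to reproduce yourself. The sketch below is illustrative and not part of any proposed patch: it measures one-way loopback latency with plain UDP sockets, assuming an echo server is listening on 127.0.0.1:40001; the 256-byte message size stands in for a typical small inter-agent payload.

tools/agent-net/udp_lo_bench.c [illustrative] C
/* ── Baseline: UDP round-trip over loopback, full kernel stack both ways ── */
#include <arpa/inet.h>
#include <stdio.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in peer = {
        .sin_family = AF_INET,
        .sin_port   = htons(40001),          /* echo server assumed here */
    };
    peer.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    connect(s, (struct sockaddr *)&peer, sizeof(peer));

    char msg[256] = "ping";                  /* typical small agent message */
    for (int i = 0; i < 10000; i++) {
        double t0 = now_us();
        send(s, msg, sizeof(msg), 0);
        recv(s, msg, sizeof(msg), 0);        /* blocks: full stack both ways */
        printf("%.1f\n", (now_us() - t0) / 2.0);  /* one-way estimate, µs */
    }
    close(s);
    return 0;
}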

Proposed Patch A — AF_XDP Agent Socket Class

The proposed patch introduces an agent-specific socket class that automatically routes eligible inter-agent traffic through an AF_XDP fast path. The key innovation is the XDP steering program: it classifies packets by destination port and agent cgroup ID and redirects them into per-agent UMEM rings, bypassing the TCP/IP stack while preserving full socket API compatibility for the application.

net/xdp/agent_socket.bpf.c · tools/agent-net/xdp_agent_steer.c C / XDP-BPF
/* ── XDP steering program: classify agent traffic, bypass stack ── */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);
    __uint(max_entries, 10000);        /* one slot per agent port */
    __type(key, __u32);
    __type(value, __u32);              /* AF_XDP socket fd, set from userspace */
} agent_xsk_map SEC(".maps");

SEC("xdp")
int agent_xdp_steer(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;

    if ((void *)(eth + 1) > data_end) return XDP_PASS;

    struct iphdr *ip = data + sizeof(*eth);
    if ((void *)(ip + 1) > data_end) return XDP_PASS;
    if (ip->protocol != IPPROTO_UDP) return XDP_PASS;

    struct udphdr *udp = (void *)ip + ip->ihl * 4;
    if ((void *)(udp + 1) > data_end) return XDP_PASS;

    /* Agent-to-agent traffic uses port range 40000-49999 */
    __u16 dport = bpf_ntohs(udp->dest);
    if (dport < 40000 || dport > 49999) return XDP_PASS;

    /* Look up the target agent's AF_XDP socket */
    __u32 key = dport - 40000;
    if (!bpf_map_lookup_elem(&agent_xsk_map, &key))
        return XDP_PASS;  /* not registered: fall through to kernel stack */

    /* Zero-copy redirect into agent's UMEM ring */
    return bpf_redirect_map(&agent_xsk_map, key, XDP_PASS);
}

/* ── Userspace: agent registers its AF_XDP socket (libxdp + libbpf) ── */
#include <sys/mman.h>
#include <bpf/bpf.h>
#include <xdp/xsk.h>

#define NUM_FRAMES 4096

int agent_net_init(int agent_port, const char *ifname, int queue_id,
                   int xsk_map_fd)
{
    struct xsk_umem *umem;
    struct xsk_socket *xsk;
    struct xsk_ring_prod fq, tx;
    struct xsk_ring_cons cq, rx;

    /* Allocate UMEM — shared between kernel and userspace, zero-copy */
    struct xsk_umem_config umem_cfg = {
        .fill_size      = 4096,
        .comp_size      = 4096,
        .frame_size     = XSK_UMEM__DEFAULT_FRAME_SIZE,
        .frame_headroom = 0,
    };
    size_t umem_size = NUM_FRAMES * XSK_UMEM__DEFAULT_FRAME_SIZE;
    void *umem_area = mmap(NULL, umem_size,
                           PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

    xsk_umem__create(&umem, umem_area, umem_size, &fq, &cq, &umem_cfg);

    /* Bind AF_XDP socket to a NIC queue — this is what enables zero-copy */
    struct xsk_socket_config xsk_cfg = {
        .rx_size      = 4096,
        .tx_size      = 4096,
        .libbpf_flags = XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD,
        .xdp_flags    = XDP_FLAGS_DRV_MODE,   /* driver mode: lowest overhead */
        .bind_flags   = XDP_ZEROCOPY,
    };
    xsk_socket__create(&xsk, ifname, queue_id, umem, &rx, &tx, &xsk_cfg);

    /* Register socket in the BPF map so XDP steering finds it */
    __u32 key    = agent_port - 40000;
    int   xsk_fd = xsk_socket__fd(xsk);
    bpf_map_update_elem(xsk_map_fd, &key, &xsk_fd, BPF_ANY);
    return 0;
}
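Once registered, the agent's receive path polls the UMEM rings directly — no syscall per message in the common case. A minimal consumption loop using the standard xsk ring accessors from libxdp; handle_agent_message() is a hypothetical application callback, and the rings and UMEM area are the ones set up above:

tools/agent-net/rx_loop.c [illustrative sketch] C
/* ── RX loop: consume agent messages from the UMEM ring ── */
void agent_rx_loop(struct xsk_ring_cons *rx, struct xsk_ring_prod *fq,
                   void *umem_area)
{
    for (;;) {
        __u32 idx_rx, idx_fq;
        __u32 rcvd = xsk_ring_cons__peek(rx, 64, &idx_rx);
        if (!rcvd)
            continue;       /* busy-poll; a recvfrom() wakeup would sleep instead */

        /* Fill ring is sized >= RX ring here, so the reserve cannot fail */
        xsk_ring_prod__reserve(fq, rcvd, &idx_fq);
        for (__u32 i = 0; i < rcvd; i++) {
            const struct xdp_desc *d = xsk_ring_cons__rx_desc(rx, idx_rx + i);
            void *pkt = xsk_umem__get_data(umem_area, d->addr);

            handle_agent_message(pkt, d->len);   /* application callback */

            /* Recycle the frame back to the kernel via the fill ring */
            *xsk_ring_prod__fill_addr(fq, idx_fq + i) = d->addr;
        }
        xsk_ring_cons__release(rx, rcvd);
        xsk_ring_prod__submit(fq, rcvd);
    }
}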

Proposed Patch B — Loopback Fast Path for Same-Host Agents

Same-host inter-agent communication (the dominant case in single-machine deployments) should never leave the CPU. The proposed loopback fast path patch introduces a new socket option SO_AGENT_LOCAL that routes traffic through a shared memory ring buffer when source and destination are in the same agent cgroup, bypassing the NIC entirely and reducing latency to sub-microsecond.

net/core/agent_lo.c C
/* ── Same-host agent fast path: bypass loopback NIC ── */
#define SO_AGENT_LOCAL  75   /* new socket option */

struct agent_lo_ring {
    atomic_t     head, tail;
    size_t       size;
    void        *buf;          /* backed by memfd, mapped in both agents */
    struct sock *peer_sk;      /* the receiving side's socket */
};

/* sendmsg path: if peer is in same agent cgroup → write to ring directly */
int agent_lo_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
{
    struct agent_lo_ring *ring = sock_get_agent_ring(sk);
    if (!ring || !agent_same_cgroup(sk, ring->peer_sk))
        return sock_sendmsg_default(sk, msg);  /* fall through */

    /* Lock-free enqueue into the shared ring — no syscall on the peer
     * side; payload is copied out of the iterator via copy_from_iter() */
    return agent_ring_enqueue(ring, &msg->msg_iter, len);
}

/* Agent opts in: setsockopt(fd, SOL_SOCKET, SO_AGENT_LOCAL, &peer_fd, ...) */
int agent_lo_connect(int local_fd, int peer_fd, size_t ring_sz)
{
    int shm = memfd_create("agent-lo-ring", MFD_CLOEXEC);
    ftruncate(shm, ring_sz);
    /* Map same physical pages in both sockets — truly zero-copy */
    setsockopt(local_fd, SOL_SOCKET, SO_AGENT_LOCAL, &shm, sizeof(shm));
    setsockopt(peer_fd,  SOL_SOCKET, SO_AGENT_LOCAL, &shm, sizeof(shm));
    return shm;
}
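The enqueue discipline behind agent_ring_enqueue() is an ordinary single-producer/single-consumer ring. A userspace-flavored sketch of the same logic with C11 atomics — illustrative of the proposed semantics, with the kernel side doing the equivalent copy via copy_from_iter():

tools/agent-net/spsc_ring.c [illustrative sketch] C
/* ── SPSC ring semantics behind agent_ring_enqueue ── */
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

struct spsc_ring {
    _Atomic uint32_t head;     /* bytes produced; written by sender only */
    _Atomic uint32_t tail;     /* bytes consumed; written by receiver only */
    uint32_t         size;     /* power of two */
    uint8_t         *buf;      /* memfd-backed region, double-mapped so a
                                  record never wraps (standard ring trick) */
};

/* Producer side: returns bytes written, 0 if the ring is full */
static size_t ring_enqueue(struct spsc_ring *r, const void *data, uint32_t len)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    uint32_t need = sizeof(uint32_t) + len;

    if (r->size - (head - tail) < need)
        return 0;                        /* full: caller falls back to sockets */

    uint32_t off = head & (r->size - 1);
    memcpy(r->buf + off, &len, sizeof(len));          /* length header */
    memcpy(r->buf + off + sizeof(len), data, len);    /* payload */

    /* Release ordering: payload is visible before head advances, so the
     * receiver needs no lock and no syscall to observe the message */
    atomic_store_explicit(&r->head, head + need, memory_order_release);
    return len;
}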
XDP — merged Linux 4.8 · AF_XDP — merged Linux 4.18
BIG TCP — merged Linux 5.19
XDP agent steering BPF — prototype stage
SO_AGENT_LOCAL loopback — RFC needed
Joyride transparent bridge — KISV '25 paper

02 / 04

Storage & Filesystem

The Problem: Two Independent Failures

The storage subsystem fails agentic workloads in two completely different ways. The first failure is well-known but under-fixed: on modern NVMe drives with hardware latency of ~80 µs, the Linux software stack — VFS, page cache, block layer, driver — adds comparable overhead. A systematic study measured that the filesystem (ext4) and the user-to-kernel-space transition together account for the majority of software overhead per random 512-byte read. With new Optane-class NVMe reaching ~8 µs hardware latency, the software path overhead becomes dominant. Agents doing frequent small reads — reading tool output files, checking test results, reading log snippets — pay this tax on every operation.

The second failure is less discussed and more novel: agents performing parallel exploration (running multiple solution strategies simultaneously and committing the winner) have no kernel-supported way to create isolated filesystem snapshots. Current agent frameworks resort to git stashing, temporary directories, or container clones. Research from Multikernel Technologies (March 2026) demonstrates that these ad hoc solutions incur race windows between steps, error-prone cleanup on partial failure, and ordering dependencies that defeat correctness. What agents actually need is a first-class filesystem branching primitive.
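For contrast, the ad hoc approach frameworks use today looks roughly like the following per-strategy overlayfs clone — paths are illustrative and error handling is elided. Every step is a separate syscall with a failure window between it and the next, which is precisely the race-and-cleanup problem a kernel branching primitive removes:

tools/agent-fs/adhoc_overlay.c [illustrative] C
/* ── Today's workaround: one overlayfs view per exploration branch ── */
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>

/* Unlike branch(), nothing here is atomic: a crash between any two steps
 * leaves stale mounts and half-built directories behind. */
int adhoc_branch_workspace(const char *workspace, int branch_id)
{
    char upper[256], work[256], merged[256], opts[1024];

    snprintf(upper,  sizeof(upper),  "/tmp/br%d/upper",  branch_id);
    snprintf(work,   sizeof(work),   "/tmp/br%d/work",   branch_id);
    snprintf(merged, sizeof(merged), "/tmp/br%d/merged", branch_id);
    mkdir(upper, 0700); mkdir(work, 0700); mkdir(merged, 0700); /* + parents */

    snprintf(opts, sizeof(opts), "lowerdir=%s,upperdir=%s,workdir=%s",
             workspace, upper, work);
    /* Race window: the workspace can change between mount and first read */
    return mount("overlay", merged, "overlay", 0, opts);
}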

50% — of NVMe I/O latency is now software overhead on Optane-class devices
<350 µs — BranchFS filesystem branch creation, independent of directory size
O(1) — branch creation cost with the proposed branch() syscall + BranchFS

Architecture: VFS Fast Path + BranchFS

Fig 7 — Storage path today vs proposed, and BranchFS agent exploration lifecycle
[Figure: left panel, read() latency breakdown — user space / syscall entry, VFS layer ~12 µs, ext4 filesystem ~18 µs (dominant layer), block layer + elevator ~6 µs, NVMe driver submit ~4 µs, NVMe hardware ~8–80 µs; software adds ~42 µs on an 8 µs Optane device; fix: BPF storage hooks + io_uring O_DIRECT bypass. Right panel, BranchFS fork → explore → commit lifecycle — a frozen origin workspace branches into CoW filesystem views A/B/C, each with its own process group; the first branch to commit wins and is merged back to the origin, siblings are auto-invalidated with -ESTALE or aborted; O(1) branch creation, <350 µs, atomic commit. Proposed: branch() syscall (Linux RFC, March 2026).]

Proposed Patch A — BPF Storage Hooks (exokernel-inspired)

The BPF storage hook approach attaches eBPF programs at three points in the I/O path to short-circuit unnecessary processing. For agent workloads doing O_DIRECT reads (common for reading tool output files), the page cache and VFS metadata path add pure overhead. The patch allows an eBPF program to declare "this file region is stable for the next N milliseconds" — enabling the kernel to skip VFS revalidation and serve directly from a registered buffer pool.

fs/bpf_storage.c · include/linux/bpf_storage.h C / BPF
/* ── BPF storage hook: skip VFS revalidation for stable agent files ── */
/* bpf_get_file() and agent_buf_read() are new helpers introduced by this
 * patch; bpf_override_return() is likewise extended to this attach point
 * (today it is restricted to error-injection kprobes). */
SEC("fentry/__x64_sys_read")
int BPF_PROG(agent_read_fast, struct pt_regs *regs)
{
    int fd = PT_REGS_PARM1(regs);
    struct file *f = bpf_get_file(fd);
    if (!f) return 0;

    /* If file is in agent read-stable map → bypass page cache validation */
    __u64 ino = BPF_CORE_READ(f, f_inode, i_ino);
    struct agent_file_hint *hint = bpf_map_lookup_elem(&stable_files, &ino);
    if (hint && hint->expires_ns > bpf_ktime_get_ns()) {
        /* Serve from registered buffer — no VFS, no page cache lookup */
        bpf_override_return(regs, agent_buf_read(hint, regs));
    }
    return 0;
}

/* Agent marks a file stable before a batch of reads
 * (stable_files_fd is the userspace fd of the stable_files map) */
int agent_file_mark_stable(int fd, __u64 ttl_ms)
{
    struct stat st;
    struct timespec ts;

    fstat(fd, &st);
    clock_gettime(CLOCK_MONOTONIC, &ts);
    struct agent_file_hint hint = {
        .expires_ns = ts.tv_sec * 1000000000ULL + ts.tv_nsec
                      + ttl_ms * 1000000ULL,
        .buf_fd     = agent_register_buf(fd),
    };
    return bpf_map_update_elem(stable_files_fd, &st.st_ino, &hint, BPF_ANY);
}

/* ── NVMe multi-queue: align queues to agent NUMA socket ── */
/* Kernel-side sketch: set_irq_affinity() and blk_mq_map_hw_queues() stand
 * in for the real irq_set_affinity() and blk-mq queue-mapping paths */
void agent_nvme_queue_affinity(int agent_numa_node)
{
    /* Pin NVMe completion interrupt to agent's NUMA socket —
     * avoids the cross-socket NUMA penalty on I/O completion */
    set_irq_affinity(nvme_irq, numa_node_to_cpumask(agent_numa_node));
    blk_mq_map_hw_queues(q, nvme_dev, agent_numa_node);
}
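The io_uring O_DIRECT half of the fix requires no new kernel code. A minimal sketch with liburing — one registered (fixed) buffer, one read; the 4 KiB alignment is what O_DIRECT typically demands, and error handling is elided:

tools/agent-fs/uring_direct_read.c [available today] C
/* ── io_uring O_DIRECT read with a registered buffer ── */
#include <fcntl.h>
#include <liburing.h>
#include <stdlib.h>
#include <unistd.h>

int agent_direct_read(const char *path, size_t len)
{
    struct io_uring ring;
    io_uring_queue_init(64, &ring, 0);

    /* O_DIRECT requires block-aligned buffers and offsets */
    void *buf;
    posix_memalign(&buf, 4096, len);
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    io_uring_register_buffers(&ring, &iov, 1);      /* pin once, reuse */

    int fd = open(path, O_RDONLY | O_DIRECT);

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read_fixed(sqe, fd, buf, len, 0, /*buf_index=*/0);
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);     /* page cache never touched */
    int res = cqe->res;                 /* bytes read, or negative errno */
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
    return res;
}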

Proposed Patch B — branch() Syscall for Agentic Exploration

The branch() syscall is the most architecturally significant proposal in this series. It creates N child processes simultaneously, each in an independent copy-on-write filesystem view backed by BranchFS, with first-commit-wins coordination enforced in kernel space. The winning branch atomically commits its filesystem changes; siblings receive -ESTALE on their next commit attempt and are cleanly terminated.

kernel/branch.c · include/uapi/linux/branch.h [proposed RFC, March 2026] C
/* ── branch() syscall interface (proposed) ── */
struct branch_attr {
    __u32  n_branches;     /* number of parallel exploration paths */
    __u32  flags;          /* BR_MEMORY | BR_COMMIT_FIRST_WINS | BR_NESTABLE */
    __s32  mount_fd;       /* BranchFS mount to branch from */
    __u64  timeout_ns;     /* auto-abort if no commit within timeout */
};

/*
 * Returns: n_branches pids to parent (all children return 0)
 * Each child gets: own CoW filesystem view + own process group
 * First BR_COMMIT wins; siblings get -ESTALE on next commit
 */
long sys_branch(pid_t __user *pids, const struct branch_attr __user *attr);

/* ── Usage: parallel code generation + test ── */
int agent_explore_parallel(int workspace_fd)
{
    pid_t pids[4];
    struct branch_attr ba = {
        .n_branches = 4,
        .flags      = BR_MEMORY | BR_COMMIT_FIRST_WINS | BR_NESTABLE,
        .mount_fd   = workspace_fd,
        .timeout_ns = 30000000000ULL,  /* 30 second timeout */
    };

    long rc = syscall(SYS_BRANCH, pids, &ba);
    if (rc > 0) {
        /* Parent: pids[] now holds all child PIDs */
        /* Dispatch different strategies to each child */
        agent_dispatch_strategy(pids[0], STRATEGY_ITERATIVE);
        agent_dispatch_strategy(pids[1], STRATEGY_RECURSIVE);
        agent_dispatch_strategy(pids[2], STRATEGY_FUNCTIONAL);
        agent_dispatch_strategy(pids[3], STRATEGY_OPTIMIZED);
        return wait_for_first_commit(pids, 4);
    }
    if (rc == 0) {
        /* Child: run exploration, then commit */
        generate_and_test();
        long commit_rc = syscall(SYS_BRANCH_COMMIT, BR_COMMIT);
        if (commit_rc < 0 && errno == ESTALE) {
            /* Another sibling won — clean exit
             * (glibc's syscall() returns -1 and sets errno) */
            _exit(0);
        }
    }
    return rc;
}
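wait_for_first_commit() is left to the caller above. Under the proposed semantics — the winner exits cleanly after its commit succeeds, siblings exit after observing -ESTALE — a plausible userspace helper (hypothetical, assuming the winner is the first clean exit) might look like:

tools/agent-fs/first_commit.c [hypothetical helper] C
/* ── Reap the winning branch, signal the rest ── */
#include <signal.h>
#include <sys/wait.h>

int wait_for_first_commit(pid_t *pids, int n)
{
    int winner = -1;

    for (int reaped = 0; reaped < n; reaped++) {
        int status;
        pid_t pid = wait(&status);
        if (pid < 0)
            break;

        /* First child to exit cleanly is the committer in the common
         * case: siblings only see -ESTALE after the winner commits.
         * The kernel has already invalidated them; signaling just
         * spares them from running out the timeout. */
        if (winner < 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0) {
            winner = (int)pid;
            for (int i = 0; i < n; i++)
                if (pids[i] != pid)
                    kill(pids[i], SIGTERM);
        }
    }
    return winner;   /* pid of the committed branch, or -1 */
}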
NVMe multi-queue — available since Linux 3.13
io_uring O_DIRECT — available today
BPF storage hooks — experimental (arXiv 2102.12922)
BranchFS FUSE impl — open source, github.com/multikernel/branchfs
branch() syscall — RFC under review, Agentic OS Workshop March 2026

03 / 04

Process Spawning

The Problem

Every agent tool call that runs external code — a compiler, a test suite, a shell script, a Python interpreter — requires spawning a sandboxed process. The dominant industry answer is to wrap each in a container or microVM. But as Multikernel Technologies documented, this is an abstraction inversion: the process is already the OS's unit of isolation. Containers add a daemon, an image, volume mount configuration, and startup latency. MicroVMs add a guest kernel, a VMM, and boot overhead. For interactive agent workflows, the round-trip to a VM for every tool call is impractical.

The correct primitive is fork() with copy-on-write: the child instantly inherits the parent's loaded libraries, model weights, and warm caches, sharing pages via CoW until it writes. Sub-millisecond sandboxes for AI agents via copy-on-write forking (ZeroBoot, March 2026) demonstrates this is already achievable. But two kernel gaps remain: first, there is no kernel-enforced way to guarantee that a frozen "template" process stays immutable while being forked by hundreds of children; second, userfaultfd-based copy-on-demand is still too slow for the cold-page case when a child's first write touches a new region.

<1 ms — CoW fork() sandbox creation vs 200–800 ms container start
90% — reduction in sandbox cold-start with pre-warmed fork pools (GKE data)
~0 — additional memory per CoW fork until first write (vs MBs for a container overlay)

Architecture: Fork-Pool with Frozen Templates

Fig 8 — CoW fork pool architecture for sub-millisecond agent sandbox spawning
[Figure: a frozen template process (FROZEN · MMAP_NORESERVE · immutable, with Python + libs + agent SDK + model weights loaded) feeds a pre-warmed fork pool (proposed kernel support) of SIGSTOP'd CoW children holding no unique pages yet; on a tool call such as run_tests(), one fork is SIGCONT'd and reaches its first test in <1 ms. Proposed: clone(CLONE_FROZEN_TEMPLATE | CLONE_AGENT_SANDBOX) — the kernel enforces template immutability and applies the seccomp profile atomically.]
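The pool half of this architecture needs none of the proposed flags — it is plain fork() plus job-control signals, buildable today. A minimal sketch: run_tool_request() is a hypothetical callback into the warm interpreter, and the piece today's kernel cannot promise (and CLONE_FROZEN_TEMPLATE would) is that the template's pages stay unmutated between fill and claim.

tools/agent-spawn/fork_pool.c [buildable today] C
/* ── Pre-warmed fork pool with today's primitives ── */
#include <signal.h>
#include <unistd.h>

#define POOL_SIZE 16

static pid_t pool[POOL_SIZE];

/* Template process calls this once libraries/weights are loaded */
void pool_fill(void)
{
    for (int i = 0; i < POOL_SIZE; i++) {
        pid_t pid = fork();              /* CoW: no pages copied yet */
        if (pid == 0) {
            raise(SIGSTOP);              /* park until a tool call claims us */
            run_tool_request();          /* warm interpreter, warm caches */
            _exit(0);
        }
        pool[i] = pid;
    }
}

/* Orchestrator claims a parked fork: sub-millisecond to first instruction */
pid_t pool_claim(int i)
{
    kill(pool[i], SIGCONT);
    return pool[i];
}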

Proposed Patch — CLONE_FROZEN_TEMPLATE + userfaultfd Fast Path

The patch introduces two new clone() flags. CLONE_FROZEN_TEMPLATE marks the child as a read-only template that can be forked but not directly written to — the kernel enforces this, preventing accidental mutation. CLONE_AGENT_SANDBOX atomically applies the pre-compiled seccomp BPF profile and cgroup assignment at fork time rather than requiring a separate seccomp syscall, eliminating the race window between fork and policy application that current container runtimes close with expensive locking.

kernel/fork.c · include/uapi/linux/sched.h C
/* ── New clone flags for agent sandbox spawning ── */
#define CLONE_FROZEN_TEMPLATE   0x00800000  /* child is CoW template, read-only */
#define CLONE_AGENT_SANDBOX     0x01000000  /* apply seccomp + cgroup atomically */

/* ── Template creation: freeze the Python + SDK process (via clone3) ── */
pid_t template_pid = syscall(SYS_clone3,
    &(struct clone_args){
        .flags       = CLONE_FROZEN_TEMPLATE,
        .exit_signal = SIGCHLD,
    }, sizeof(struct clone_args));

/* Kernel enforces: any write to a template VMA returns -EPERM */
/* Template pages never get CoW'd — they are permanently shared */

/* ── Sandbox spawn from template: <1 ms total ── */
struct agent_sandbox_spec spec = {
    .seccomp_prog_fd  = precompiled_seccomp_fd,   /* pre-compiled BPF prog */
    .cgroup_fd        = agent_tool_cgroup_fd,
    .mount_ns_fd      = workspace_ns_fd,
    .net_ns_fd        = agent_net_ns_fd,
    .uid_map          = "0 100000 65536",
};

/* .agent_spec is a proposed extension field of struct clone_args */
pid_t sandbox_pid = syscall(SYS_clone3,
    &(struct clone_args){
        .flags       = CLONE_AGENT_SANDBOX | CLONE_NEWNS |
                       CLONE_NEWPID | CLONE_NEWNET,
        .exit_signal = SIGCHLD,
        .agent_spec  = (__u64)&spec,  /* atomic: seccomp+cgroup at clone time */
    }, sizeof(struct clone_args));

/* ── userfaultfd fast path: pre-populate CoW pages speculatively ── */
int ufd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
ioctl(ufd, UFFDIO_API, &(struct uffdio_api){
    .api      = UFFD_API,     /* features are negotiated here, not at open */
    .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP |
                UFFD_FEATURE_MINOR_HUGETLBFS,
});
/* Register agent heap region — kernel notifies us of first-write faults */
struct uffdio_register reg = {
    .range = { .start = heap_base, .len = heap_size },
    .mode  = UFFDIO_REGISTER_MODE_WP,
};
ioctl(ufd, UFFDIO_REGISTER, &reg);
/* Pre-populate likely-written pages before the tool starts */
ioctl(ufd, UFFDIO_COPY, &(struct uffdio_copy){
    .dst  = heap_base,
    .src  = template_heap_base,
    .len  = likely_write_region,
    .mode = UFFDIO_COPY_MODE_WP,  /* write-protect, not CoW on first fault */
});
fork() CoW — available since Linux 2.0
userfaultfd — merged Linux 4.3
clone3() with clone_args — merged Linux 5.3
CLONE_FROZEN_TEMPLATE — proposed RFC
CLONE_AGENT_SANDBOX atomic seccomp — proposed RFC
ZeroBoot CoW sandbox — open source (March 2026)

04 / 04

Interrupt Isolation

The Problem

Linux's periodic timer interrupt (the "tick") fires at 250–1000 Hz on every active CPU. This interrupt serves essential kernel housekeeping: timekeeping, task scheduling, bandwidth enforcement, RCU garbage collection. For general-purpose workloads, it is harmless. For a core running LLM inference — executing dense matrix operations in tight loops — each tick is a cache-evicting interrupt that adds latency noise to every token generation window.

The situation is compounded by inter-processor interrupts (IPIs). In Linux, IPIs are used broadly for TLB shootdowns (covered in Part I), scheduler balancing, and RCU grace period synchronization. Research on sub-0.5 µs deterministic response (arxiv 2509.03855) found that even cores explicitly isolated with isolcpus and NOHZ_FULL still receive periodic ticks and arbitrary IPIs from other subsystems. Existing tools reduce but cannot eliminate these interruptions. The result for agent deployments is measurable jitter in P95+ token generation latency and interference between co-located agent contexts.

1,000 — tick interrupts per second, per core, at HZ=1000 (common server config)
3 sources — residual interrupts even with isolcpus + NOHZ_FULL: tick, IPI, NMI
<0.5 µs — target response determinism achievable with full IPI suppression (2025 research)

Architecture: Four-Layer Interrupt Isolation Stack

Fig 9 — Interrupt isolation: from noisy defaults to agent-optimised core partitioning
[Figure: left, default configuration — every core takes the timer tick (HZ=250–1000: scheduling, timekeeping, RCU, bandwidth), TLB shootdown IPIs from any core, scheduler-balancing IPIs, RCU grace-period IPIs, and device IRQs (NIC, NVMe, USB); result: jitter every ~1–4 ms, P99 token-latency spikes, cache eviction. Right, patched agent cores — NOHZ_FULL + task_isolation=1 (0 ticks while the LLM runs on an isolated core), MAP_FPR from Part I suppressing TLB IPIs, sched_setaffinity + SD_ASYM_PACKING, CONFIG_RCU_NOCB_CPU offloading RCU, and irqaffinity= migrating IRQs to housekeeping CPUs; result: <0.5 µs deterministic jitter, flat P99, warm L1/L2, no IPIs on LLM cores.]

Proposed Patch — agent_isolate() Syscall + eBPF IPI Filter

Rather than requiring operators to manually assemble the eight-parameter boot command line combination that achieves full core isolation, the proposed patch introduces a single agent_isolate() helper that atomically applies the full isolation stack to a set of cores. The second part introduces an eBPF IPI filter: an eBPF program attached to the IPI dispatch path that can selectively suppress or defer IPI types on isolated cores without requiring kernel recompilation.

kernel/agent_isolate.c · boot: kernel cmdline configuration C + bash
/* ── Boot-time: configure isolation partition ── */
/* Kernel cmdline — isolate cores 8-31 for agent inference, keep 0-7 for housekeeping */

isolcpus=domain,managed_irq,8-31
nohz_full=8-31
rcu_nocbs=8-31
irqaffinity=0-7
rcu_nocb_poll

/* After boot: agent_isolate() applies remaining runtime configuration */

/* ── agent_isolate() proposed API ── */
struct agent_isolation_spec {
    cpu_set_t  inference_cpus;      /* cores for LLM inference */
    cpu_set_t  tool_cpus;           /* cores for tool execution */
    cpu_set_t  housekeeping_cpus;   /* cores for OS housekeeping */
    __u32      flags;               /* AI_ISO_TICK | AI_ISO_IPI | AI_ISO_RCU */
};

int agent_isolate(const struct agent_isolation_spec *spec)
{
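    /* Kernel-side sketch: for_each_irq(), tick_nohz_full_add_cpus_to(), and
     * agent_ipi_bpf_attach() are illustrative names for the new plumbing
     * this patch would add — none of them exist upstream today. */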
    /* 1. Apply task_isolation to inference cores */
    cpumask_copy(&task_isolation_map, &spec->inference_cpus);

    /* 2. Move all existing IRQs off inference cores */
    for_each_irq(irq) {
        if (irq_can_set_affinity(irq))
            irq_set_affinity(irq, &spec->housekeeping_cpus);
    }

    /* 3. Enable per-core nohz_full dynamically (no reboot) */
    tick_nohz_full_add_cpus_to(&spec->inference_cpus);

    /* 4. Load eBPF IPI suppression filter */
    return agent_ipi_bpf_attach(spec);
}

/* ── eBPF IPI filter: suppress TLB + sched-balance IPIs on inference cores ── */
SEC("ipi/tlb_flush")
int suppress_tlb_ipi_on_inference_cores(struct ipi_data *d)
{
    int cpu = smp_processor_id();
    if (cpumask_test_cpu(cpu, &inference_cores) &&
        d->type == IPI_TYPE_TLB_FLUSH &&
        !ipi_is_urgent(d))
    {
        /* Defer: queue in per-cpu ring, flush at scheduling boundary */
        agent_ipi_defer(cpu, d);
        return IPI_SUPPRESS;  /* don't interrupt inference thread now */
    }
    return IPI_DELIVER;
}

/* ── Runtime: pin LLM inference thread to isolated partition ── */
cpu_set_t inference_set;
CPU_ZERO(&inference_set);
for (int c = 8; c <= 31; c++) CPU_SET(c, &inference_set);
sched_setaffinity(llm_inference_pid, sizeof(cpu_set_t), &inference_set);
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0);
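Whether the isolation actually holds is observable from the housekeeping cores with nothing but procfs. One way to verify: sample the local-timer (LOC) row of /proc/interrupts before and after a window and confirm the isolated CPUs stop ticking. The snippet assumes CPUs 8-31 are online and that column order matches CPU number:

verify: tick suppression check [housekeeping shell] bash
# Local-timer (LOC) interrupt deltas per CPU over a 10 s window —
# isolated cores 8-31 should show ~0 while inference is running.
snap() { grep '^ *LOC:' /proc/interrupts | cut -d: -f2; }

before=( $(snap) ); sleep 10; after=( $(snap) )
for cpu in $(seq 8 31); do
    echo "cpu$cpu: $(( ${after[cpu]} - ${before[cpu]} )) ticks"
done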
Available today (partially). The boot command line parameters — isolcpus, nohz_full, rcu_nocbs, irqaffinity — are all available in Linux 6.x kernels today and deployable immediately. The agent_isolate() runtime API and eBPF IPI filter are the new contributions requiring upstream RFC.
isolcpus + NOHZ_FULL — Linux 4.17+
rcu_nocbs + irqaffinity — available today
task_isolation (PR_TASK_ISOLATION) — out-of-tree patch, RFC pending
agent_isolate() runtime API — new proposal
eBPF IPI suppression hook — new proposal

05

The Complete Picture

Across both posts in this series, we have identified nine kernel subsystems where 1990s-era design assumptions collide with the agentic AI workload pattern. Each failure is independently measurable. Each proposed patch is independently deployable. And they compose: the sched_ext agent policy from Part I ensures tool call processes get CPU quickly; MAP_FPR from Part I prevents TLB shootdowns when they read their results; the BPF storage hooks from this post cut VFS overhead on those reads; the interrupt isolation from this post keeps the LLM inference core quiet while all of this happens.

Implementation priority across both posts.
Deploy today (config only): interrupt isolation boot parameters, THP madvise for tensors.
Deploy this quarter (BPF, no ABI change): sched_ext agent policy, XDP steering, io_uring opcode allowlist.
Submit RFC: MAP_FPR, BPF storage hints, CLONE_FROZEN_TEMPLATE.
Long-cycle upstream: AgentCgroup intent API, branch() syscall, stateful seccomp, agent_isolate() runtime API.
← Part I: The CPU Path — Scheduler · Memory · Cgroup · io_uring · eBPF — how the kernel decides who runs, on what memory
Part II: The Data Path (this post) — Network · Storage · Process spawning · Interrupts — how the kernel moves data, spawns sandboxes, stays quiet
What this series does not cover. GPU scheduling (CUDA stream priorities, MPS), distributed consensus between agent clusters, KV cache eviction strategies, model weight checkpointing, and cross-datacenter network topology. Those are real problems — but they live above the kernel boundary, in userspace runtimes and orchestration layers. The nine subsystems documented across these two posts are the foundation those layers sit on. Get the foundation right first.

06

References

Joyride: Rethinking Linux's network stack design for better performance, security, and reliability — Du & Nikolaev, KISV '25, arXiv 2509.25015

A Wake-Up Call for Kernel-Bypass on Modern Hardware — Jasny et al., DAMON '25 (TU Darmstadt)

Understanding Delays in AF_XDP-based Applications — arXiv 2402.10513 (Feb 2024)

BPF for storage: an exokernel-inspired approach — arXiv 2102.12922

A Low-latency Kernel I/O Stack for Ultra-Low Latency SSDs (Async I/O Stack) — Lee et al., USENIX ATC '19

Rearchitecting Linux Storage Stack for µs Latency and High Throughput (blk-switch) — Hwang et al., Cornell, OSDI '21

Fork, Explore, Commit: OS Primitives for Agentic Exploration — Wang et al., Agentic OS Workshop March 2026, arXiv 2602.08199

Processes Are All You Need for AI Sandboxing — Multikernel Technologies, March 2026

Sub-millisecond sandboxes for AI agents via copy-on-write forking (ZeroBoot) — Open Source Projects, March 2026

Agentic AI on Kubernetes and GKE (Agent Sandbox, Pod Snapshots) — Google Cloud Blog, November 2025

Towards Deterministic Sub-0.5 µs Response on Linux through Interrupt Isolation — arXiv 2509.03855 (Sep 2025)

A full task-isolation mode for the kernel — LWN.net, April 2020 (Alex Belits RFC)

AgentCgroup: Understanding and Controlling OS Resources of AI Agents — Zheng et al., arXiv 2602.09345 (Feb 2026)

A CPU-Centric Perspective on Agentic AI — arXiv 2511.00739 (Nov 2025)

Express Data Path — Wikipedia / kernel.org documentation (XDP merged Linux 4.8, AF_XDP Linux 4.18)