Abstract

Agentic AI workloads are not merely CPU-bound, memory-bound, or I/O-bound. They are completion-path bound. A tool reply, NVMe read, timer expiry, or asynchronous I/O completion enters the kernel through an interrupt or completion path, traverses softirq or block completion machinery, wakes a userspace waiter, and depends on the scheduler to resume the agent control loop. Each stage may add only tens of microseconds — but agent loops execute hundreds or thousands of steps. This creates latency amplification: small kernel delays compound into visible task latency. This note proposes a kernel-stack design that treats agent steps as first-class latency-critical completion chains rather than unrelated IRQs, softirqs, block completions, and wakeups.

§ 01 From Device IRQs to Completion Chains

The mistake is to ask, "How do we make NIC IRQs faster?" or "How do we make NVMe IRQs faster?" The better question is:

How do we make the complete device → kernel → scheduler → userspace path predictable for an agent step?

In an agentic runtime, a tool result may arrive through the network, a retrieval page from NVMe, a local database operation from io_uring, or a timer may simply expire. Different sources — same conceptual path:

asynchronous event completes
  ↓
kernel completion handling
  ↓
waiter becomes runnable
  ↓
scheduler places it on a CPU
  ↓
userspace agent loop resumes

The kernel already has strong machinery here: the generic IRQ layer abstracts device interrupts, NAPI handles network event processing via interrupt-then-poll, and blk-mq exploits multi-queue parallelism for fast storage. The issue is not that these systems are bad. It is that they are optimized for general-purpose fairness and throughput, not per-agent-step tail latency.

[Fig 1 diagram] Unified completion chain: Event Source (NIC / NVMe / Timer) → IRQ / Poll (MSI-X, NAPI, blk) → Completion (skb / req / CQE) → Wakeup (wake_up_process()) → Agent Resumes (epoll / io_uring). Proposed: agent_id + step_id + latency_class + deadline/budget, propagated along the chain.
Fig 1 — The real optimization target is the completion chain, not any single IRQ source.

§ 02 Network Path: IRQ → NAPI → Socket → Wakeup

For a networked tool response, the simplified path:

NIC DMA writes packet into RX ring
  ↓  MSI-X interrupt
  ↓  generic IRQ layer
  ↓  driver hardirq handler → napi_schedule()
  ↓  NET_RX softirq → net_rx_action()
  ↓  driver NAPI poll receives packets
  ↓  TCP/IP stack attaches data to socket
  ↓  epoll/io_uring waiter is woken
  ↓  scheduler runs agent thread
  ↓  userspace parses tool reply

NAPI is the correct model for high-rate network events: device interrupts, then host-scheduled polling. But for an agent loop, excessive batching or delayed ksoftirqd execution becomes part of the user-visible step latency.

[Fig 2 diagram] Network and disk share a completion tail: NETWORK (NIC MSI-X → NAPI / softirq → socket ready) and DISK (NVMe MSI-X → blk-mq complete → CQE / read ready) both converge on Wakeup → scheduler → Agent userspace.
Fig 2 — NIC and NVMe enter through different subsystems but converge at wakeup and scheduling.

§ 03 Disk Path: NVMe IRQ → blk-mq → bio/io_uring → Wakeup

A vector index read, small SQLite/LMDB page fetch, local document retrieval, or mmap major fault can all arrive through the block completion path:

NVMe controller completes command
  ↓  MSI-X interrupt → nvme_irq()
  ↓  blk_mq_complete_request()
  ↓  bio_endio() / request completion
  ↓  io_uring CQE or blocking read wakeup
  ↓  scheduler runs agent thread
  ↓  userspace resumes with data

blk-mq is designed for modern high-IOPS storage using software and hardware queues to exploit parallelism. But the agentic case is not simply "maximize IOPS." It often needs small, blocking, latency-critical reads to complete predictably even when background writeback, logging, indexing, or scans are active.

§ 04 Where the Kernel Currently Lacks Context

Today, the kernel sees only local, per-event information:

Subsystem        | What it sees                                 | What the agent runtime needs
generic IRQ      | device interrupt vector, irq_desc, handler   | is this IRQ completing an agent-critical step?
NAPI / softirq   | packet budget, poll state, backlog           | will batching delay a waiting control loop?
blk-mq           | request queues, hw queues, completion        | is this small read blocking a step?
scheduler        | task runnable state, CPU selection, fairness | should this agent resume on a warm, IRQ-quiet CPU?
epoll / io_uring | completion queue or readiness event          | which agent step did this unblock?
Key gap: the kernel cannot distinguish an agent-critical completion from background telemetry, log flushes, bulk scans, or maintenance I/O.

§ 05 Proposed Abstraction: Agent Completion Context

The central proposal: add a small, budgeted, propagation-friendly context object for agent-critical completion chains.

struct agent_latency_ctx {
    u64 agent_id;        /* runtime / session / agent identity */
    u64 step_id;         /* current reasoning / tool step */
    u8  latency_class;   /* normal | agent-interactive | background */
    u64 deadline_ns;     /* optional soft deadline */
    u64 budget_ns;       /* cgroup-controlled latency budget */
};

This context is attached at the userspace boundary and propagated only along paths that can prove causality:

Userspace Sources
  • io_uring SQE tag
  • socket option / cmsg
  • epoll wait context
  • mmap / page fault context
  • prctl / sched_attr hint
Kernel Propagation
  • skb metadata (network)
  • request / bio metadata (block)
  • wake_q / task_struct wakeup context
  • tracepoints for attribution

§ 06 Patch Stack: RFC-Style Series

[PATCH 0/8] agent-latency: propagate latency-critical completion context

[PATCH 1/8] sched: add task flag for agent latency-sensitive control loops
[PATCH 2/8] sched: add agent_home_cpu and wakeup placement hint
[PATCH 3/8] irq: add IRQ-to-wakeup attribution tracepoints
[PATCH 4/8] net: allow socket-level agent latency tagging
[PATCH 5/8] net/napi: add latency-bounded processing for agent-tagged flows
[PATCH 6/8] block/blk-mq: propagate agent latency tags through request completion
[PATCH 7/8] io_uring: expose agent_step_id on SQEs and CQEs
[PATCH 8/8] sched: bias wakeups away from IRQ/softirq-hot CPUs

6.1 — Scheduler Flag

#define SCHED_FLAG_AGENT_LATENCY 0x10000000

static inline bool task_agent_latency(const struct task_struct *p)
{
    return p->sched_flags & SCHED_FLAG_AGENT_LATENCY;
}

6.2 — Wakeup Placement

/* Conceptual hook near try_to_wake_up() / select_task_rq_fair().
 * Do not replace EEVDF. Bias wakeup CPU selection for latency-sensitive
 * control loops.
 */
if (task_agent_latency(p)) {
    int home = p->agent_home_cpu;

    if (cpu_online(home)           &&
        !cpu_irq_hot(home)         &&
        !cpu_softirq_pressure(home) &&
        cpumask_test_cpu(home, p->cpus_ptr)) {
        cpu = home;
    } else {
        cpu = select_nearby_quiet_cpu(p);
    }
}

6.3 — Completion Latency Class

enum completion_latency_class {
    COMPLETION_NORMAL            = 0,
    COMPLETION_AGENT_INTERACTIVE = 1,
    COMPLETION_BACKGROUND        = 2,
};

6.4 — blk-mq Request Metadata

struct request {
    /* existing fields ... */
#ifdef CONFIG_AGENT_LATENCY
    u64 agent_id;
    u64 agent_step_id;
    u8  completion_latency_class;
#endif
};

6.5 — Network skb Metadata

struct sk_buff {
    /* existing fields ... */
#ifdef CONFIG_AGENT_LATENCY
    u64 agent_id;
    u64 agent_step_id;
    u8  completion_latency_class;
#endif
};

§ 07 NAPI and Softirq: Latency-Bounded, Not Batch-Free

The proposal is not "disable NAPI" or "turn off batching." That harms throughput and may trigger IRQ storms. Instead, the policy should be latency-bounded batching for tagged flows:

/* Conceptual net_rx_action() policy.
 * If an agent-critical waiter is blocked behind this NAPI instance,
 * cap the service window and wake the agent before continuing
 * background batching.
 */
if (napi_has_agent_waiter(napi)) {
    budget      = min(budget, AGENT_NAPI_PACKET_BUDGET);
    time_budget = min(time_budget, AGENT_NAPI_USEC_BUDGET);
}

work_done = napi_poll(napi, budget);

if (agent_waiter_ready(napi))
    agent_wake_latency_critical_waiter(napi);

This keeps NAPI's core model intact while preventing long polling bursts from hiding a ready agent behind background work.

§ 08 IRQ/Softirq CPU Topology: Separate Control and Completion

Agentic systems benefit from explicitly separating CPUs that run the control loop from CPUs that absorb IRQ, softirq, and block completions.

[Fig 3 diagram] Control / completion CPU split: COMPLETION CPUs absorb NIC IRQs, NVMe IRQs, NAPI / softirq, blk-mq completions, and background batching; a targeted wakeup hands off to the AGENT HOME CPU, which runs the agent control loop, epoll / io_uring return, parse + next decision, and cache-hot execution.
Fig 3 — The completion CPU absorbs noisy work; the agent CPU resumes the control loop predictably.

§ 09 eBPF Measurement Plan

Before policy changes, expose enough attribution to prove the bottleneck. These bpftrace snippets are minimal starting points.

9.1 — IRQ Handler Duration

bpftrace -e '
tracepoint:irq:irq_handler_entry
{
  @irq_start[args->irq] = nsecs;
}
tracepoint:irq:irq_handler_exit
/@irq_start[args->irq]/
{
  printf("irq=%d duration=%dus\n",
    args->irq, (nsecs - @irq_start[args->irq]) / 1000);
  delete(@irq_start[args->irq]);
}'

9.2 — Softirq Runtime

bpftrace -e '
tracepoint:irq:softirq_entry
{
  @soft_start[args->vec] = nsecs;
}
tracepoint:irq:softirq_exit
/@soft_start[args->vec]/
{
  printf("softirq vec=%d duration=%dus\n",
    args->vec, (nsecs - @soft_start[args->vec]) / 1000);
  delete(@soft_start[args->vec]);
}'

9.3 — Wakeup-to-Run Latency

bpftrace -e '
tracepoint:sched:sched_wakeup
{
  @wake[args->pid] = nsecs;
}
tracepoint:sched:sched_switch
/@wake[args->next_pid]/
{
  printf("pid=%d wake_to_run=%dus\n",
    args->next_pid, (nsecs - @wake[args->next_pid]) / 1000);
  delete(@wake[args->next_pid]);
}'

9.4 — epoll Gate Latency

bpftrace -e '
tracepoint:syscalls:sys_enter_epoll_wait { @epoll[tid] = nsecs; }
tracepoint:syscalls:sys_exit_epoll_wait
/@epoll[tid]/
{
  printf("tid=%d epoll_wait=%dus ret=%d\n",
    tid, (nsecs - @epoll[tid]) / 1000, args->ret);
  delete(@epoll[tid]);
}'

9.5 — io_uring Gate Latency

bpftrace -e '
tracepoint:syscalls:sys_enter_io_uring_enter { @uring[tid] = nsecs; }
tracepoint:syscalls:sys_exit_io_uring_enter
/@uring[tid]/
{
  printf("tid=%d io_uring_enter=%dus ret=%d\n",
    tid, (nsecs - @uring[tid]) / 1000, args->ret);
  delete(@uring[tid]);
}'

§ 10 Expected Latency Budget

Actual numbers depend on hardware, kernel config, driver, queue depth, and system load. The key insight is the shape of the tail: each stage is small alone, but the chain repeats many times per agent session.

Stage                   | Typical uncontended | Tail under pressure   | Agent-aware target
IRQ entry / handler     | ~1–10 µs            | 10–50+ µs             | bounded, attributed
NAPI / softirq          | ~5–50 µs            | 100–500+ µs           | latency-bounded batching
blk-mq completion       | ~5–30 µs            | 100+ µs + device tail | completion priority + steering
wakeup-to-run           | ~20–200 µs          | 500+ µs               | warm, IRQ-quiet CPU
epoll / io_uring return | ~5–50 µs            | 100+ µs               | step-attributed gate latency

§ 11 What Not To Do First

The proposal deliberately avoids blunt mechanisms that look attractive but are unsafe to deploy broadly.

Do NOT start with:
  ✗  replacing EEVDF / CFS
  ✗  running agents as SCHED_FIFO priority 99
  ✗  globally disabling interrupt coalescing
  ✗  globally disabling NAPI batching
  ✗  polling every device all the time
  ✗  bypassing the kernel completely
Those approaches may improve one benchmark and damage system stability, fairness, thermals, or multi-tenant behavior. The deployable path is a budgeted latency class plus observability.

§ 12 Proposed Evaluation

A credible experiment compares default Linux behavior against progressive isolation and finally the proposed kernel changes. Metrics must focus on p99 / p999 step latency, not average throughput.

Experiment A — baseline
  default IRQ affinity · default scheduler · default NAPI/block

Experiment B — manual tuning
  move IRQs off agent CPU · pin thread · chrt/renice · interrupt moderation

Experiment C — observability only
  agent_step_id tracepoints · epoll/io_uring gate timing · per-step attribution

Experiment D — policy patches
  agent wakeup placement · latency-bounded NAPI
  blk-mq completion class · IRQ/softirq-hot CPU avoidance

Metric                            | Why it matters
agent step p50 / p99 / p999       | captures latency amplification across the full chain
wakeup-to-run latency             | isolates the scheduler's contribution
softirq runtime + ksoftirqd share | identifies deferred completion noise
NVMe completion-to-CQE latency    | captures disk-side completion tail
epoll / io_uring blocked time     | shows user-visible wait gates

§ 13 Final Thesis

The kernel should treat agent steps as first-class latency-critical completion chains — not as unrelated IRQs, softirqs, I/O completions, and wakeups.

The next useful primitive is not just a process, thread, socket, or request. It is the agent step: a bounded, causally connected unit of work that crosses userspace, IRQ handling, network, block I/O, memory, and scheduler paths. Making that path predictable is a kernel problem, and the kernel has the right place to solve it.

References

  [1] Linux kernel documentation: generic IRQ layer — abstraction of interrupt handling for device drivers.
  [2] Linux kernel documentation: NAPI — event handling model where device interrupts schedule NAPI processing.
  [3] Linux kernel documentation: blk-mq — multi-queue block I/O mechanism for high-IOPS storage devices.
  [4] Linux kernel documentation: EEVDF scheduler — fair scheduling using virtual runtime, lag, and virtual deadlines.