A proposal for agent-aware completion chains across IRQ, softirq, NAPI, blk-mq, wakeups, and scheduling — treating agent steps as first-class latency-critical work units in the Linux kernel.
The mistake is to ask, "How do we make NIC IRQs faster?" or "How do we make NVMe IRQs faster?" The better question is: how do we make the whole completion-to-resume chain fast and predictable for a single agent step, regardless of which device produced the event?
In an agentic runtime, a tool result may arrive through the network, a retrieval page from NVMe, a local database operation from io_uring, or a timer may simply expire. Different sources, same conceptual path:
asynchronous event completes
↓
kernel completion handling
↓
waiter becomes runnable
↓
scheduler places it on a CPU
↓
userspace agent loop resumes
The kernel already has strong machinery here: the generic IRQ layer abstracts device interrupts, NAPI handles network event processing via interrupt-then-poll, and blk-mq exploits multi-queue parallelism for fast storage. The issue is not that these systems are bad. It is that they are optimized for general-purpose fairness and throughput, not per-agent-step tail latency.
For a networked tool response, the simplified path:
NIC DMA writes packet into RX ring
↓ MSI-X interrupt
↓ generic IRQ layer
↓ driver hardirq handler → napi_schedule()
↓ NET_RX softirq → net_rx_action()
↓ driver NAPI poll receives packets
↓ TCP/IP stack attaches data to socket
↓ epoll/io_uring waiter is woken
↓ scheduler runs agent thread
↓ userspace parses tool reply
NAPI is the correct model for high-rate network events: device interrupts, then host-scheduled polling. But for an agent loop, excessive batching or delayed softirq processing means a single ready reply can sit behind a long burst of unrelated packets before its waiter is woken.
A vector index read, small SQLite/LMDB page fetch, local document retrieval, or mmap major fault can all arrive through the block completion path:
NVMe controller completes command
↓ MSI-X interrupt → nvme_irq()
↓ blk_mq_complete_request()
↓ bio_endio() / request completion
↓ io_uring CQE or blocking read wakeup
↓ scheduler runs agent thread
↓ userspace resumes with data
blk-mq is designed for modern high-IOPS storage using software and hardware queues to exploit parallelism. But the agentic case is not simply "maximize IOPS." It often needs small, blocking, latency-critical reads to complete predictably even when background writeback, logging, indexing, or scans are active.
Today, the kernel sees only local, per-event information:
| Subsystem | What it sees | What the agent runtime needs |
|---|---|---|
| generic IRQ | device interrupt vector, irq_desc, handler | is this IRQ completing an agent-critical step? |
| NAPI / softirq | packet budget, poll state, backlog | will batching delay a waiting control loop? |
| blk-mq | request queues, hw queues, completion | is this small read blocking a step? |
| scheduler | task runnable state, CPU selection, fairness | should this agent resume on a warm, IRQ-quiet CPU? |
| epoll / io_uring | completion queue or readiness event | which agent step did this unblock? |
The central proposal: add a small, budgeted, propagation-friendly context object for agent-critical completion chains.
struct agent_latency_ctx {
	u64 agent_id;		/* runtime / session / agent identity */
	u64 step_id;		/* current reasoning / tool step */
	u8  latency_class;	/* normal | agent-interactive | background */
	u64 deadline_ns;	/* optional soft deadline */
	u64 budget_ns;		/* cgroup-controlled latency budget */
};
This context is attached at the userspace boundary and propagated only along paths that can prove causality:
[PATCH 0/8] agent-latency: propagate latency-critical completion context
[PATCH 1/8] sched: add task flag for agent latency-sensitive control loops
[PATCH 2/8] sched: add agent_home_cpu and wakeup placement hint
[PATCH 3/8] irq: add IRQ-to-wakeup attribution tracepoints
[PATCH 4/8] net: allow socket-level agent latency tagging
[PATCH 5/8] net/napi: add latency-bounded processing for agent-tagged flows
[PATCH 6/8] block/blk-mq: propagate agent latency tags through request completion
[PATCH 7/8] io_uring: expose agent_step_id on SQEs and CQEs
[PATCH 8/8] sched: bias wakeups away from IRQ/softirq-hot CPUs
#define SCHED_FLAG_AGENT_LATENCY 0x10000000
static inline bool task_agent_latency(const struct task_struct *p)
{
	return p->sched_flags & SCHED_FLAG_AGENT_LATENCY;
}

/*
 * Conceptual hook near try_to_wake_up() / select_task_rq_fair().
 * Do not replace EEVDF. Bias wakeup CPU selection for latency-sensitive
 * control loops.
 */
if (task_agent_latency(p)) {
	int home = p->agent_home_cpu;

	if (cpu_online(home) &&
	    !cpu_irq_hot(home) &&
	    !cpu_softirq_pressure(home) &&
	    cpumask_test_cpu(home, p->cpus_ptr))
		cpu = home;
	else
		cpu = select_nearby_quiet_cpu(p);
}
enum completion_latency_class {
	COMPLETION_NORMAL = 0,
	COMPLETION_AGENT_INTERACTIVE = 1,
	COMPLETION_BACKGROUND = 2,
};

struct request {
	/* existing fields ... */
#ifdef CONFIG_AGENT_LATENCY
	u64 agent_id;
	u64 agent_step_id;
	u8  completion_latency_class;
#endif
};

struct sk_buff {
	/* existing fields ... */
#ifdef CONFIG_AGENT_LATENCY
	u64 agent_id;
	u64 agent_step_id;
	u8  completion_latency_class;
#endif
};
The proposal is not "disable NAPI" or "turn off batching." That harms throughput and may trigger IRQ storms. Instead, the policy should be latency-bounded batching for tagged flows:
/*
 * Conceptual net_rx_action() policy. If an agent-critical waiter is
 * blocked behind this NAPI instance, cap the service window and wake
 * the agent before continuing background batching.
 */
if (napi_has_agent_waiter(napi)) {
	budget = min(budget, AGENT_NAPI_PACKET_BUDGET);
	time_budget = min(time_budget, AGENT_NAPI_USEC_BUDGET);
}

work_done = napi_poll(napi, budget);

if (agent_waiter_ready(napi))
	agent_wake_latency_critical_waiter(napi);
This keeps NAPI's core model intact while preventing long polling bursts from hiding a ready agent behind background work.
Agentic systems benefit from explicitly separating CPUs that run the control loop from CPUs that absorb IRQ, softirq, and block completions.
Before policy changes, expose enough attribution to prove the bottleneck. These bpftrace snippets are minimal starting points.
bpftrace -e '
tracepoint:irq:irq_handler_entry
{
@irq_start[args->irq] = nsecs;
}
tracepoint:irq:irq_handler_exit
/@irq_start[args->irq]/
{
printf("irq=%d duration=%dus\n",
args->irq, (nsecs - @irq_start[args->irq]) / 1000);
delete(@irq_start[args->irq]);
}'
bpftrace -e '
tracepoint:irq:softirq_entry
{
@soft_start[args->vec] = nsecs;
}
tracepoint:irq:softirq_exit
/@soft_start[args->vec]/
{
printf("softirq vec=%d duration=%dus\n",
args->vec, (nsecs - @soft_start[args->vec]) / 1000);
delete(@soft_start[args->vec]);
}'
bpftrace -e '
tracepoint:sched:sched_wakeup
{
@wake[args->pid] = nsecs;
}
tracepoint:sched:sched_switch
/@wake[args->next_pid]/
{
printf("pid=%d wake_to_run=%dus\n",
args->next_pid, (nsecs - @wake[args->next_pid]) / 1000);
delete(@wake[args->next_pid]);
}'
bpftrace -e '
tracepoint:syscalls:sys_enter_epoll_wait { @epoll[tid] = nsecs; }
tracepoint:syscalls:sys_exit_epoll_wait
/@epoll[tid]/
{
printf("tid=%d epoll_wait=%dus ret=%d\n",
tid, (nsecs - @epoll[tid]) / 1000, args->ret);
delete(@epoll[tid]);
}'
bpftrace -e '
tracepoint:syscalls:sys_enter_io_uring_enter { @uring[tid] = nsecs; }
tracepoint:syscalls:sys_exit_io_uring_enter
/@uring[tid]/
{
printf("tid=%d io_uring_enter=%dus ret=%d\n",
tid, (nsecs - @uring[tid]) / 1000, args->ret);
delete(@uring[tid]);
}'
Actual numbers depend on hardware, kernel config, driver, queue depth, and system load. The key insight is the shape of the tail: each stage is small alone, but the chain repeats many times per agent session.
| Stage | Typical uncontended | Tail under pressure | Agent-aware target |
|---|---|---|---|
| IRQ entry / handler | ~1–10 µs | 10–50+ µs | bounded, attributed |
| NAPI / softirq | ~5–50 µs | 100–500+ µs | latency-bounded batching |
| blk-mq completion | ~5–30 µs | 100+ µs + device tail | completion priority + steering |
| wakeup-to-run | ~20–200 µs | 500+ µs | warm CPU, IRQ-quiet CPU |
| epoll / io_uring return | ~5–50 µs | 100+ µs | step-attributed gate latency |
The proposal deliberately avoids blunt mechanisms that look attractive but are unsafe to deploy broadly.
Do NOT start with:
✗ replacing EEVDF / CFS
✗ running agents as SCHED_FIFO priority 99
✗ globally disabling interrupt coalescing
✗ globally disabling NAPI batching
✗ polling every device all the time
✗ bypassing the kernel completely
A credible experiment compares default Linux behavior against progressive isolation and finally the proposed kernel changes. Metrics must focus on p99 / p999 step latency, not average throughput.
Experiment A — baseline
default IRQ affinity · default scheduler · default NAPI/block
Experiment B — manual tuning
move IRQs off agent CPU · pin thread · chrt/renice · interrupt moderation
Experiment C — observability only
agent_step_id tracepoints · epoll/io_uring gate timing · per-step attribution
Experiment D — policy patches
agent wakeup placement · latency-bounded NAPI
blk-mq completion class · IRQ/softirq-hot CPU avoidance
| Metric | Why it matters |
|---|---|
| agent step p50 / p99 / p999 | captures latency amplification across the full chain |
| wakeup-to-run latency | isolates the scheduler's contribution |
| softirq runtime + ksoftirqd share | identifies deferred completion noise |
| NVMe completion-to-CQE latency | captures disk-side completion tail |
| epoll / io_uring blocked time | shows user-visible wait gates |
The next useful primitive is not just a process, thread, socket, or request. It is the agent step: a bounded, causally connected unit of work that crosses userspace, IRQ handling, network, block I/O, memory, and scheduler paths. Making that path predictable is a kernel problem, and the kernel has the right place to solve it.