Abstract
Agentic AI workloads are dominated by control-loop latency: tool calls, small responses, metadata lookups, event-loop wakeups, and short bursts of parsing. In this regime, interrupts are not background noise. They are part of the application’s critical path.
This note follows a single completion event from device MSI-X delivery to userspace wakeup. It maps the path onto x86 architectural state, estimates stage-level latency from nanoseconds to milliseconds, and provides eBPF probes that make the path measurable in production. The thesis is simple: agentic systems need attributable IRQ paths. Without attribution, every stall collapses into “Python is slow” or “the agent is slow,” even when the root cause is hardirq placement, softirq backlog, scheduler wakeup delay, or small-I/O completion jitter.
1. Agentic AI as a Latency-Amplified OS Workload
Agentic systems execute repeated control loops: plan, call a tool, wait for completion, parse output, update state, and continue. Each step is small, but the number of steps is large. The result is latency amplification.
```text
agent_step_latency = L_model_step
                   + L_tool_rpc
                   + L_irq_delivery
                   + L_softirq_or_threaded_irq
                   + L_scheduler_wakeup
                   + L_userspace_parse

total_task_latency ≈ Σ agent_step_latency over hundreds or thousands of steps
```
2. End-to-End IRQ → Userspace Path
The critical path begins when a device completes work. For a networked tool call this is commonly a NIC RX completion; for local retrieval it may be an NVMe completion. The device raises an MSI-X interrupt, the local APIC accepts it, the CPU uses the interrupt vector to index the IDT, and Linux dispatches through the generic IRQ layer.
```text
Device → MSI-X → Local APIC → IDT[vector] → x86 entry stub
  → irq_desc → driver top half → softirq/NAPI or threaded IRQ
  → socket/block completion → wake_up_process()
  → scheduler → epoll/io_uring/read returns
```
3. x86 Register and Control-State Map
Intel and AMD differ in implementation details, but the x86-64 architectural interrupt model is broadly shared: the CPU receives a vector, indexes the IDT using IDTR, saves architectural state, switches stacks on privilege transition, and transfers control to the kernel entry stub.
| State | Role | Why it matters for latency |
|---|---|---|
| IDTR | Points to the Interrupt Descriptor Table. | IDT lookup is fast, but entry-path placement affects cache footprint. |
| RIP / CS / RFLAGS | Saved interrupted state. | Interrupts can preempt agent code at arbitrary instruction boundaries. |
| RSP / SS / TSS.RSP0 / IST | Stack switch on privilege transition. | Stack/cache locality affects entry/exit overhead. |
| APIC IRR / ISR | Pending and in-service interrupt tracking. | Backlog appears as delayed delivery and priority interactions. |
| APIC TPR | Priority threshold. | Priority policy can defer lower-priority interrupts. |
| APIC EOI | Signals interrupt completion. | EOI timing interacts with interrupt moderation and re-entry. |
4. Cycle-Level Latency Annotations
Exact latency depends on CPU frequency, cache state, device, interrupt moderation, kernel config, PREEMPT_RT, workload, and contention. The table below gives practical order-of-magnitude values for reasoning and instrumentation.
| Stage | Quiet-system order | Tail under contention | Typical cause of tail |
|---|---|---|---|
| Device CQ update | 100 ns – 2 µs | 5–20 µs | PCIe/device queue congestion |
| MSI-X delivery to Local APIC | 100 ns – 1 µs | 2–10 µs | interrupt remapping, posted interrupt delay |
| IDT vector dispatch | 50–300 ns | <1 µs | cache miss on entry path |
| x86 entry + state save | 0.5–3 µs | 5–15 µs | mitigations, cache/TLB state, stack locality |
| generic IRQ dispatch | 1–5 µs | 10–40 µs | shared IRQ, lock contention, tracing overhead |
| driver top half | 1–15 µs | 50–200 µs | excess work in hardirq, queue handling |
| NAPI / softirq | 5–50 µs | 100 µs – 2 ms | budget exhaustion, ksoftirqd backlog, wrong CPU |
| TCP/block completion | 2–50 µs | 100 µs – 1 ms | socket locks, epoll wakeups, block queueing |
| scheduler wakeup to run | 5–30 µs | 100 µs – 5 ms | runqueue pressure, migration, noisy cores |
| userspace event-loop resume | 1–20 µs | 100 µs+ | runtime contention, GC, parser work |
5. eBPF Tracing Code for Each Stage
The first step is observability. The probes below are intentionally practical: they can be run with bpftrace on many Linux systems and later hardened into libbpf tools.
5.1 IRQ Entry/Exit Latency by IRQ Number
```bpftrace
#!/usr/bin/env bpftrace

tracepoint:irq:irq_handler_entry
{
    @irq_start[args->irq] = nsecs;
    @irq_name[args->irq] = str(args->name);
}

tracepoint:irq:irq_handler_exit
/@irq_start[args->irq]/
{
    $delta = nsecs - @irq_start[args->irq];
    @irq_latency_us[args->irq, @irq_name[args->irq]] = hist($delta / 1000);
    delete(@irq_start[args->irq]);
}

interval:s:10
{
    print(@irq_latency_us);
    clear(@irq_latency_us);
}
```
5.2 Softirq / NAPI Tail Latency
```bpftrace
#!/usr/bin/env bpftrace

tracepoint:irq:softirq_entry
{
    @soft_start[args->vec] = nsecs;
}

tracepoint:irq:softirq_exit
/@soft_start[args->vec]/
{
    @softirq_us[args->vec] = hist((nsecs - @soft_start[args->vec]) / 1000);
    delete(@soft_start[args->vec]);
}

// On newer kernels the per-NAPI poll function is __napi_poll;
// check /proc/kallsyms and adjust the symbol name if needed.
kprobe:napi_poll
{
    @napi_start[tid] = nsecs;
}

kretprobe:napi_poll
/@napi_start[tid]/
{
    @napi_poll_us[comm] = hist((nsecs - @napi_start[tid]) / 1000);
    delete(@napi_start[tid]);
}
```
5.3 Scheduler Wakeup-to-Run Latency for Agent Processes
```bpftrace
#!/usr/bin/env bpftrace

// "agent" and "python" are example process names; substitute the
// comm of your agent runtime.
tracepoint:sched:sched_wakeup
/str(args->comm) == "agent" || str(args->comm) == "python"/
{
    @wakeup[args->pid] = nsecs;
}

tracepoint:sched:sched_switch
/@wakeup[args->next_pid]/
{
    @wakeup_to_run_us[args->next_comm] = hist((nsecs - @wakeup[args->next_pid]) / 1000);
    delete(@wakeup[args->next_pid]);
}
```
5.4 Syscall Latency: open/read/stat/mmap/epoll
```bpftrace
#!/usr/bin/env bpftrace

tracepoint:raw_syscalls:sys_enter
/comm == "agent" || comm == "python"/
{
    @sys_start[tid] = nsecs;
    @sys_id[tid] = args->id;
}

tracepoint:raw_syscalls:sys_exit
/@sys_start[tid]/
{
    @syscall_us[@sys_id[tid]] = hist((nsecs - @sys_start[tid]) / 1000);
    delete(@sys_start[tid]);
    delete(@sys_id[tid]);
}
```
5.5 Block I/O Completion Latency for Small Reads
```bpftrace
#!/usr/bin/env bpftrace

tracepoint:block:block_rq_issue
{
    @rq_start[args->dev, args->sector] = nsecs;
    @rq_bytes[args->dev, args->sector] = args->bytes;
}

tracepoint:block:block_rq_complete
/@rq_start[args->dev, args->sector]/
{
    $bytes = @rq_bytes[args->dev, args->sector];
    // Note: this captures all small requests, including writes;
    // filter on args->rwbs to restrict it to reads.
    if ($bytes <= 65536) {
        @small_read_us[args->dev] = hist((nsecs - @rq_start[args->dev, args->sector]) / 1000);
    }
    delete(@rq_start[args->dev, args->sector]);
    delete(@rq_bytes[args->dev, args->sector]);
}
```
5.6 Page Fault Latency During mmap-backed Retrieval
```bpftrace
#!/usr/bin/env bpftrace

// exceptions:page_fault_user is x86-specific.
tracepoint:exceptions:page_fault_user
/comm == "agent" || comm == "python"/
{
    @pf_count[comm] = count();
}

kprobe:handle_mm_fault
/comm == "agent" || comm == "python"/
{
    @mm_fault_start[tid] = nsecs;
}

kretprobe:handle_mm_fault
/@mm_fault_start[tid]/
{
    @mm_fault_us[comm] = hist((nsecs - @mm_fault_start[tid]) / 1000);
    delete(@mm_fault_start[tid]);
}
```
5.7 End-to-End Approximation: IRQ Exit to Agent Running
```bpftrace
#!/usr/bin/env bpftrace

tracepoint:irq:irq_handler_exit
{
    @last_irq_exit[cpu] = nsecs;
}

tracepoint:sched:sched_switch
/str(args->next_comm) == "agent" || str(args->next_comm) == "python"/
{
    if (@last_irq_exit[cpu]) {
        @irq_to_agent_run_us[args->next_comm] = hist((nsecs - @last_irq_exit[cpu]) / 1000);
    }
}
```
This last probe is an approximation because a task may be woken by work caused by an IRQ on a different CPU. A production version should tag request IDs or socket cookies and correlate across CPUs.
6. Kernel Patch Sketches
Once the path is measurable, the next step is reducing variance. These are not drop-in upstream patches; they are RFC-grade sketches that show where kernel work could go.
6.1 Per-Step IRQ Attribution
```c
/* include/linux/agent_latency.h */
struct agent_step_ctx {
	u64 step_id;
	u64 deadline_ns;
	u32 latency_budget_us;
};

/* task_struct extension, guarded by CONFIG_AGENT_LATENCY */
struct task_struct {
	...
	struct agent_step_ctx *agent_ctx;
};

/* kernel/irq/handle.c - conceptual */
void note_agent_irq_completion(struct irq_desc *desc)
{
	struct agent_step_ctx *ctx = current->agent_ctx;

	if (!ctx)
		return;

	trace_agent_irq_stage(ctx->step_id,
			      desc->irq_data.irq,
			      ktime_get_ns());
}
```
6.2 Latency-First IRQ Placement
```c
/* Prefer IRQ CPUs near the agent control-loop CPU. Conceptual:
 * cpu_is_noisy() is a placeholder for a noise heuristic that does
 * not exist upstream. */
static int agent_select_irq_cpu(struct irq_desc *desc,
				struct task_struct *agent)
{
	int cpu = task_cpu(agent);

	if (cpu_online(cpu) && !cpu_is_noisy(cpu))
		return cpu;

	return cpumask_first_and(agent->cpus_ptr,
				 housekeeping_cpumask(HK_TYPE_DOMAIN));
}
```
6.3 Softirq Budget Guardrail for Agent Cores
```c
/* net/core/dev.c - conceptual */
if (cpu_has_agent_latency_task(smp_processor_id())) {
	budget = min_t(int, budget, sysctl_agent_napi_budget);
	time_limit = min_t(u64, time_limit, AGENT_SOFTIRQ_BUDGET_NS);
}
```
7. Evaluation Plan
A serious evaluation should measure distributions, not averages. Agent systems care about p99 and p999 because rare stalls occur repeatedly across task execution.
| Metric | Why it matters | Probe/tool |
|---|---|---|
| IRQ handler latency | Hardirq path cost | irq_handler_entry/exit |
| Softirq/NAPI duration | Deferred network work tail | softirq_entry/exit, napi_poll |
| Wakeup-to-run latency | Scheduler contribution | sched_wakeup + sched_switch |
| Small read latency | Retrieval/index path | block_rq_issue/complete |
| mmap fault latency | Memory-mapped index behavior | handle_mm_fault |
| Agent step p99/p999 | User-visible loop stability | application step timestamps |
Experiment matrix:

Baseline:
- irqbalance=on
- default IRQ affinity
- default NAPI/softirq behavior
- unpinned agent process

Tuned:
- IRQ affinity pinned away from the agent core
- agent process CPU-pinned
- NAPI budget bounded on agent cores
- io_uring or epoll completions tagged by step_id

Report:
- p50/p95/p99/p999 agent step latency
- IRQ-to-agent wakeup histogram
- softirq runtime histogram
- CPU migration count
- small I/O completion histogram
Conclusion
For agentic AI workloads, interrupts are not invisible plumbing. A tool response can traverse MSI-X, APIC, IDT dispatch, x86 kernel entry, generic IRQ handling, driver top halves, softirq/NAPI, scheduler wakeups, and runtime event loops before the agent can continue. Each stage is small. The chain is not.
The next useful kernel optimization for agent systems is not only faster handlers; it is attribution: knowing which interrupt, which softirq, which completion, and which scheduler decision delayed a specific agent step.