Systems Note · Linux IRQ Path · Agentic AI

From IRQ to Userspace: The Kernel Latency Path Behind Agentic AI Jitter

A technical walkthrough of how a device interrupt travels through x86 APIC/IDT state, Linux IRQ handling, softirq/NAPI, scheduler wakeup, and finally into an agent process—with cycle-level latency annotations and eBPF tracing code for each stage.

USENIX-style draft · x86 Intel/AMD register map · eBPF + bpftrace probes · 10 min read

Abstract

Agentic AI workloads are dominated by control-loop latency: tool calls, small responses, metadata lookups, event-loop wakeups, and short bursts of parsing. In this regime, interrupts are not background noise. They are part of the application’s critical path.

This note follows a single completion event from device MSI-X delivery to userspace wakeup. It maps the path onto x86 architectural state, estimates stage-level latency from nanoseconds to milliseconds, and provides eBPF probes that make the path measurable in production. The thesis is simple: agentic systems need attributable IRQ paths. Without attribution, every stall collapses into “Python is slow” or “the agent is slow,” even when the root cause is hardirq placement, softirq backlog, scheduler wakeup delay, or small-I/O completion jitter.

1. Agentic AI as a Latency-Amplified OS Workload

Agentic systems execute repeated control loops: plan, call a tool, wait for completion, parse output, update state, and continue. Each step is small, but the number of steps is large. The result is latency amplification.

agent_step_latency = L_model_step
                   + L_tool_rpc
                   + L_irq_delivery
                   + L_softirq_or_threaded_irq
                   + L_scheduler_wakeup
                   + L_userspace_parse

total_task_latency ≈ Σ agent_step_latency over hundreds or thousands of steps
Key observation: a 20–80 µs delay that would be irrelevant in a throughput server becomes visible when multiplied across thousands of tool completions and wakeups.
Figure 1 (diagram): latency amplification in an agent loop. Each completion is small; repeated thousands of times, kernel jitter becomes user-visible. Tool reply (2–8 KB JSON) → IRQ path (2–80 µs) → Softirq (5–200 µs tail) → Scheduler (10 µs–2 ms tail) → Userspace agent (epoll/io_uring returns). (IRQ + softirq + scheduler + parse) × N steps; sub-millisecond jitter compounds into seconds when N = hundreds or thousands.
Figure 1. Agent workloads transform small kernel delays into visible tail latency through repeated control-loop execution.

2. End-to-End IRQ → Userspace Path

The critical path begins when a device completes work. For a networked tool call this is commonly a NIC RX completion. For local retrieval it may be NVMe completion. The device raises an MSI-X interrupt, the local APIC accepts it, the CPU uses the interrupt vector to index the IDT, and Linux dispatches through the generic IRQ layer.

Figure 2 (diagram): IRQ delivery path, device completion to agent wakeup. Approximate latencies are workload- and hardware-dependent; use them as order-of-magnitude anchors.

1. Device completion (NIC/NVMe CQ update): ~100 ns–2 µs
2. PCIe MSI-X (interrupt message): ~100 ns–1 µs
3. Local APIC (vector arbitration): ~100 ns–1 µs
4. IDT lookup (IDTR + vector): ~50–300 ns
5. x86 entry stub (state save/swap): ~0.5–3 µs
6. Linux irq_desc (generic IRQ layer): ~1–5 µs
7. Driver top half (ack + defer): ~1–15 µs
8. Bottom half (softirq/NAPI): ~5–200 µs tail
9. Completion path (TCP/block wakeup): ~2–50 µs
10. Scheduler (wakeup → run): ~5 µs–2 ms
11. Userspace agent (epoll/io_uring returns)
Figure 2. The interrupt-to-userspace path. In quiet systems, the early hardware path is nanosecond-to-microsecond scale; the large tail comes from bottom halves, queueing, and scheduler wakeup.
Device → MSI-X → Local APIC → IDT[vector] → x86 entry stub
       → irq_desc → driver top half → softirq/NAPI or threaded IRQ
       → socket/block completion → wake_up_process()
       → scheduler → epoll/io_uring/read returns

3. x86 Register and Control-State Map

Intel and AMD differ in implementation details, but the x86-64 architectural interrupt model is broadly shared: the CPU receives a vector, indexes the IDT using IDTR, saves architectural state, switches stacks on privilege transition, and transfers control to the kernel entry stub.

Figure 3 (diagram): x86 architectural state touched during interrupt delivery.
CPU core: IDTR (IDT base + limit), RIP (interrupted instruction), RFLAGS (IF, flags saved), CS/SS (privilege level), RSP (user/kernel stack), TSS (RSP0 / IST stack).
Local APIC: vector (interrupt number), IRR (pending requests), ISR (in-service IRQs), TPR (priority threshold), EOI (completion signal).
Kernel entry: IDT gate (handler address), pt_regs (saved state), irq_desc (Linux IRQ object), handler (driver top half), softirq (deferred work).
Figure 3. Register/control-state map for x86 interrupt dispatch. The model applies to Intel and AMD x86-64; vendor differences usually appear below this abstraction in microarchitecture and APIC implementation details.
State | Role | Why it matters for latency
IDTR | Points to the Interrupt Descriptor Table. | IDT lookup is fast, but entry-path placement affects cache footprint.
RIP / CS / RFLAGS | Saved interrupted state. | Interrupts can preempt agent code at arbitrary instruction boundaries.
RSP / SS / TSS.RSP0 / IST | Stack switch on privilege transition. | Stack/cache locality affects entry/exit overhead.
APIC IRR / ISR | Pending and in-service interrupt tracking. | Backlog appears as delayed delivery and priority interactions.
APIC TPR | Priority threshold. | Priority policy can defer lower-priority interrupts.
APIC EOI | Signals interrupt completion. | EOI timing interacts with interrupt moderation and re-entry.

4. Cycle-Level Latency Annotations

Exact latency depends on CPU frequency, cache state, device, interrupt moderation, kernel config, PREEMPT_RT, workload, and contention. The table below gives practical order-of-magnitude values for reasoning and instrumentation.

Stage | Quiet-system order | Tail under contention | Typical cause of tail
Device CQ update | 100 ns – 2 µs | 5–20 µs | PCIe/device queue congestion
MSI-X delivery to Local APIC | 100 ns – 1 µs | 2–10 µs | interrupt remapping, posted-interrupt delay
IDT vector dispatch | 50–300 ns | <1 µs | cache miss on entry path
x86 entry + state save | 0.5–3 µs | 5–15 µs | mitigations, cache/TLB state, stack locality
generic IRQ dispatch | 1–5 µs | 10–40 µs | shared IRQ, lock contention, tracing overhead
driver top half | 1–15 µs | 50–200 µs | excess work in hardirq, queue handling
NAPI / softirq | 5–50 µs | 100 µs – 2 ms | budget exhaustion, ksoftirqd backlog, wrong CPU
TCP/block completion | 2–50 µs | 100 µs – 1 ms | socket locks, epoll wakeups, block queueing
scheduler wakeup to run | 5–30 µs | 100 µs – 5 ms | runqueue pressure, migration, noisy cores
userspace event-loop resume | 1–20 µs | 100 µs+ | runtime contention, GC, parser work
Design implication: optimize the median for throughput, but optimize p99/p999 for agents. A system with 20 µs median IRQ-to-wakeup latency and 2 ms p999 latency will feel unstable under multi-step tool execution.

5. eBPF Tracing Code for Each Stage

The first step is observability. The following probes are intentionally practical: they can be run with bpftrace on many Linux systems and then hardened into libbpf tools later.

5.1 IRQ Entry/Exit Latency by IRQ Number

#!/usr/bin/env bpftrace

tracepoint:irq:irq_handler_entry
{
  @irq_start[args->irq] = nsecs;
  // args->name is a __data_loc string; copy it with str() so the map
  // stores the handler name, not the data-location offset
  @irq_name[args->irq] = str(args->name);
}

tracepoint:irq:irq_handler_exit
/@irq_start[args->irq]/
{
  $delta = nsecs - @irq_start[args->irq];
  @irq_latency_us[args->irq, @irq_name[args->irq]] = hist($delta / 1000);
  delete(@irq_start[args->irq]);
}

interval:s:10
{
  print(@irq_latency_us);
  clear(@irq_latency_us);
}

5.2 Softirq / NAPI Tail Latency

#!/usr/bin/env bpftrace

tracepoint:irq:softirq_entry
{
  @soft_start[args->vec] = nsecs;
}

tracepoint:irq:softirq_exit
/@soft_start[args->vec]/
{
  @softirq_us[args->vec] = hist((nsecs - @soft_start[args->vec]) / 1000);
  delete(@soft_start[args->vec]);
}

// napi_poll may be static/inlined on some kernels; if attach fails,
// try kprobe:__napi_poll or the tracepoint napi:napi_poll instead
kprobe:napi_poll
{
  @napi_start[tid] = nsecs;
}

kretprobe:napi_poll
/@napi_start[tid]/
{
  @napi_poll_us[comm] = hist((nsecs - @napi_start[tid]) / 1000);
  delete(@napi_start[tid]);
}

5.3 Scheduler Wakeup-to-Run Latency for Agent Processes

#!/usr/bin/env bpftrace

// comm matching is exact and truncated to TASK_COMM_LEN (15 chars);
// adjust for your runtime binary name (e.g. "python3", "node")
tracepoint:sched:sched_wakeup
/str(args->comm) == "agent" || str(args->comm) == "python"/
{
  @wakeup[args->pid] = nsecs;
}

tracepoint:sched:sched_switch
/@wakeup[args->next_pid]/
{
  @wakeup_to_run_us[args->next_comm] = hist((nsecs - @wakeup[args->next_pid]) / 1000);
  delete(@wakeup[args->next_pid]);
}

5.4 Syscall Latency: open/read/stat/mmap/epoll

#!/usr/bin/env bpftrace

tracepoint:raw_syscalls:sys_enter
/comm == "agent" || comm == "python"/
{
  @sys_start[tid] = nsecs;
  @sys_id[tid] = args->id;
}

tracepoint:raw_syscalls:sys_exit
/@sys_start[tid]/
{
  @syscall_us[@sys_id[tid]] = hist((nsecs - @sys_start[tid]) / 1000);
  delete(@sys_start[tid]);
  delete(@sys_id[tid]);
}

5.5 Block I/O Completion Latency for Small Reads

#!/usr/bin/env bpftrace

// note: this captures writes as well; filter on args->rwbs if you
// want read requests only
tracepoint:block:block_rq_issue
{
  @rq_start[args->dev, args->sector] = nsecs;
  @rq_bytes[args->dev, args->sector] = args->bytes;
}

tracepoint:block:block_rq_complete
/@rq_start[args->dev, args->sector]/
{
  $bytes = @rq_bytes[args->dev, args->sector];
  if ($bytes <= 65536) {
    @small_read_us[args->dev] = hist((nsecs - @rq_start[args->dev, args->sector]) / 1000);
  }
  delete(@rq_start[args->dev, args->sector]);
  delete(@rq_bytes[args->dev, args->sector]);
}

5.6 Page Fault Latency During mmap-backed Retrieval

#!/usr/bin/env bpftrace

tracepoint:exceptions:page_fault_user
/comm == "agent" || comm == "python"/
{
  @pf_count[comm] = count();
}

kprobe:handle_mm_fault
/comm == "agent" || comm == "python"/
{
  @mm_fault_start[tid] = nsecs;
}

kretprobe:handle_mm_fault
/@mm_fault_start[tid]/
{
  @mm_fault_us[comm] = hist((nsecs - @mm_fault_start[tid]) / 1000);
  delete(@mm_fault_start[tid]);
}

5.7 End-to-End Approximation: IRQ Exit to Agent Running

#!/usr/bin/env bpftrace

tracepoint:irq:irq_handler_exit
{
  @last_irq_exit[cpu] = nsecs;
}

tracepoint:sched:sched_switch
/str(args->next_comm) == "agent" || str(args->next_comm) == "python"/
{
  if (@last_irq_exit[cpu]) {
    @irq_to_agent_run_us[args->next_comm] = hist((nsecs - @last_irq_exit[cpu]) / 1000);
    // consume the timestamp so one IRQ exit is not attributed
    // to multiple later wakeups
    delete(@last_irq_exit[cpu]);
  }
}

This last probe is an approximation because a task may be woken by work caused by an IRQ on a different CPU. A production version should tag request IDs or socket cookies and correlate across CPUs.

6. Kernel Patch Sketches

Once the path is measurable, the next step is reducing variance. These are not drop-in upstream patches; they are RFC-grade sketches that show where kernel work could go.

6.1 Per-Step IRQ Attribution

/* include/linux/agent_latency.h */
struct agent_step_ctx {
    u64 step_id;
    u64 deadline_ns;
    u32 latency_budget_us;
};

/* task_struct extension, guarded by CONFIG_AGENT_LATENCY */
struct task_struct {
    ...
    struct agent_step_ctx *agent_ctx;
};

/* kernel/irq/handle.c - conceptual */
void note_agent_irq_completion(struct irq_desc *desc)
{
    struct agent_step_ctx *ctx = current->agent_ctx;

    if (!ctx)
        return;

    trace_agent_irq_stage(ctx->step_id,
                          desc->irq_data.irq,
                          ktime_get_ns());
}

6.2 Latency-First IRQ Placement

/* Prefer IRQ CPUs near the agent control-loop CPU. */
static int agent_select_irq_cpu(struct irq_desc *desc,
                                struct task_struct *agent)
{
    int cpu = task_cpu(agent);

    /* cpu_is_noisy() is a hypothetical predicate for cores carrying
     * heavy softirq/ksoftirqd load; no such upstream helper exists. */
    if (cpu_online(cpu) && !cpu_is_noisy(cpu))
        return cpu;

    return cpumask_first_and(agent->cpus_ptr, housekeeping_cpumask(HK_TYPE_DOMAIN));
}

6.3 Softirq Budget Guardrail for Agent Cores

/* net/core/dev.c - conceptual */
if (cpu_has_agent_latency_task(smp_processor_id())) {
    budget = min_t(int, budget, sysctl_agent_napi_budget);
    time_limit = min_t(u64, time_limit, AGENT_SOFTIRQ_BUDGET_NS);
}
Principle: agent cores should not become softirq dumping grounds. The goal is not to suppress networking or storage completions; it is to make deferred work explicit, bounded, and attributable.

7. Evaluation Plan

A serious evaluation should measure distributions, not averages. Agent systems care about p99 and p999 because rare stalls occur repeatedly across task execution.

Metric | Why it matters | Probe/tool
IRQ handler latency | Hardirq path cost | irq_handler_entry/exit
Softirq/NAPI duration | Deferred network work tail | softirq_entry/exit, napi_poll
Wakeup-to-run latency | Scheduler contribution | sched_wakeup + sched_switch
Small read latency | Retrieval/index path | block_rq_issue/complete
mmap fault latency | Memory-mapped index behavior | handle_mm_fault
Agent step p99/p999 | User-visible loop stability | application step timestamps
Experiment matrix:

Baseline:
  irqbalance=on
  default IRQ affinity
  default NAPI/softirq behavior
  unpinned agent process

Tuned:
  IRQ affinity pinned away from agent core
  agent process CPU-pinned
  NAPI budget bounded on agent cores
  io_uring or epoll completions tagged by step_id

Report:
  p50/p95/p99/p999 agent step latency
  IRQ-to-agent wakeup histogram
  softirq runtime histogram
  CPU migration count
  small I/O completion histogram

Conclusion

For agentic AI workloads, interrupts are not invisible plumbing. A tool response can traverse MSI-X, APIC, IDT dispatch, x86 kernel entry, generic IRQ handling, driver top halves, softirq/NAPI, scheduler wakeups, and runtime event loops before the agent can continue. Each stage is small. The chain is not.

The next useful kernel optimization for agent systems is not only faster handlers; it is attribution: knowing which interrupt, which softirq, which completion, and which scheduler decision delayed a specific agent step.