Abstract
Agentic AI workloads are dominated by control-loop latency: tool calls, small responses, metadata lookups, event-loop wakeups, and short bursts of parsing. In this regime, interrupts are not background noise. They are part of the application’s critical path.
This note follows a single completion event from device MSI-X delivery to userspace wakeup. It maps the path onto x86 architectural state, estimates stage-level latency from nanoseconds to milliseconds, and provides eBPF probes that make the path measurable in production. The thesis is simple: agentic systems need attributable IRQ paths. Without attribution, every stall collapses into “Python is slow” or “the agent is slow,” even when the root cause is hardirq placement, softirq backlog, scheduler wakeup delay, or small-I/O completion jitter.
1. Agentic AI as a Latency-Amplified OS Workload
Agentic systems execute repeated control loops: plan, call a tool, wait for completion, parse output, update state, and continue. Each step is small, but the number of steps is large. The result is latency amplification.
```text
agent_step_latency = L_model_step
                   + L_tool_rpc
                   + L_irq_delivery
                   + L_softirq_or_threaded_irq
                   + L_scheduler_wakeup
                   + L_userspace_parse

total_task_latency ≈ Σ agent_step_latency over hundreds or thousands of steps
```
2. End-to-End IRQ → Userspace Path
The critical path begins when a device completes work. For a networked tool call this is commonly a NIC RX completion; for local retrieval it may be an NVMe completion. The device raises an MSI-X interrupt, the local APIC accepts it, the CPU uses the interrupt vector to index the IDT, and Linux dispatches through the generic IRQ layer.
```text
Device → MSI-X → Local APIC → IDT[vector] → x86 entry stub
  → irq_desc → driver top half → softirq/NAPI or threaded IRQ
  → socket/block completion → wake_up_process()
  → scheduler → epoll/io_uring/read returns
```
3. x86 Register and Control-State Map
Intel and AMD differ in implementation details, but the x86-64 architectural interrupt model is broadly shared: the CPU receives a vector, indexes the IDT using IDTR, saves architectural state, switches stacks on privilege transition, and transfers control to the kernel entry stub.
| State | Role | Why it matters for latency |
|---|---|---|
| IDTR | Points to the Interrupt Descriptor Table. | IDT lookup is fast, but entry-path placement affects cache footprint. |
| RIP / CS / RFLAGS | Saved interrupted state. | Interrupts can preempt agent code at arbitrary instruction boundaries. |
| RSP / SS / TSS.RSP0 / IST | Stack switch on privilege transition. | Stack/cache locality affects entry/exit overhead. |
| APIC IRR / ISR | Pending and in-service interrupt tracking. | Backlog appears as delayed delivery and priority interactions. |
| APIC TPR | Priority threshold. | Priority policy can defer lower-priority interrupts. |
| APIC EOI | Signals interrupt completion. | EOI timing interacts with interrupt moderation and re-entry. |
4. Cycle-Level Latency Annotations
Exact latency depends on CPU frequency, cache state, device, interrupt moderation, kernel config, PREEMPT_RT, workload, and contention. The table below gives practical order-of-magnitude values for reasoning and instrumentation.
| Stage | Quiet-system order | Tail under contention | Typical cause of tail |
|---|---|---|---|
| Device CQ update | 100 ns – 2 µs | 5–20 µs | PCIe/device queue congestion |
| MSI-X delivery to Local APIC | 100 ns – 1 µs | 2–10 µs | interrupt remapping, posted interrupt delay |
| IDT vector dispatch | 50–300 ns | <1 µs | cache miss on entry path |
| x86 entry + state save | 0.5–3 µs | 5–15 µs | mitigations, cache/TLB state, stack locality |
| generic IRQ dispatch | 1–5 µs | 10–40 µs | shared IRQ, lock contention, tracing overhead |
| driver top half | 1–15 µs | 50–200 µs | excess work in hardirq, queue handling |
| NAPI / softirq | 5–50 µs | 100 µs – 2 ms | budget exhaustion, ksoftirqd backlog, wrong CPU |
| TCP/block completion | 2–50 µs | 100 µs – 1 ms | socket locks, epoll wakeups, block queueing |
| scheduler wakeup to run | 5–30 µs | 100 µs – 5 ms | runqueue pressure, migration, noisy cores |
| userspace event-loop resume | 1–20 µs | 100 µs+ | runtime contention, GC, parser work |
5. eBPF Tracing Code for Each Stage
The first step is observability. The probes below are intentionally practical: they can be run with bpftrace on many Linux systems and later hardened into libbpf tools.
5.1 IRQ Entry/Exit Latency by IRQ Number
```bpftrace
#!/usr/bin/env bpftrace

tracepoint:irq:irq_handler_entry
{
    @irq_start[args->irq] = nsecs;
    @irq_name[args->irq] = str(args->name);
}

tracepoint:irq:irq_handler_exit
/@irq_start[args->irq]/
{
    $delta = nsecs - @irq_start[args->irq];
    @irq_latency_us[args->irq, @irq_name[args->irq]] = hist($delta / 1000);
    delete(@irq_start[args->irq]);
}

interval:s:10
{
    print(@irq_latency_us);
    clear(@irq_latency_us);
}
```
5.2 Softirq / NAPI Tail Latency
```bpftrace
#!/usr/bin/env bpftrace

tracepoint:irq:softirq_entry
{
    @soft_start[args->vec] = nsecs;
}

tracepoint:irq:softirq_exit
/@soft_start[args->vec]/
{
    @softirq_us[args->vec] = hist((nsecs - @soft_start[args->vec]) / 1000);
    delete(@soft_start[args->vec]);
}

// On newer kernels the per-NAPI poll function is __napi_poll;
// check /proc/kallsyms and adjust the symbol name if needed.
kprobe:napi_poll
{
    @napi_start[tid] = nsecs;
}

kretprobe:napi_poll
/@napi_start[tid]/
{
    @napi_poll_us[comm] = hist((nsecs - @napi_start[tid]) / 1000);
    delete(@napi_start[tid]);
}
```
5.3 Scheduler Wakeup-to-Run Latency for Agent Processes
```bpftrace
#!/usr/bin/env bpftrace

// "agent" and "python" are example process names; substitute the
// comm of your agent runtime.
tracepoint:sched:sched_wakeup
/str(args->comm) == "agent" || str(args->comm) == "python"/
{
    @wakeup[args->pid] = nsecs;
}

tracepoint:sched:sched_switch
/@wakeup[args->next_pid]/
{
    @wakeup_to_run_us[args->next_comm] = hist((nsecs - @wakeup[args->next_pid]) / 1000);
    delete(@wakeup[args->next_pid]);
}
```
5.4 Syscall Latency: open/read/stat/mmap/epoll
```bpftrace
#!/usr/bin/env bpftrace

tracepoint:raw_syscalls:sys_enter
/comm == "agent" || comm == "python"/
{
    @sys_start[tid] = nsecs;
    @sys_id[tid] = args->id;
}

tracepoint:raw_syscalls:sys_exit
/@sys_start[tid]/
{
    @syscall_us[@sys_id[tid]] = hist((nsecs - @sys_start[tid]) / 1000);
    delete(@sys_start[tid]);
    delete(@sys_id[tid]);
}
```
5.5 Block I/O Completion Latency for Small Reads
```bpftrace
#!/usr/bin/env bpftrace

tracepoint:block:block_rq_issue
{
    @rq_start[args->dev, args->sector] = nsecs;
    @rq_bytes[args->dev, args->sector] = args->bytes;
}

tracepoint:block:block_rq_complete
/@rq_start[args->dev, args->sector]/
{
    $bytes = @rq_bytes[args->dev, args->sector];
    // Note: this captures all small requests, including writes;
    // filter on args->rwbs to restrict it to reads.
    if ($bytes <= 65536) {
        @small_read_us[args->dev] = hist((nsecs - @rq_start[args->dev, args->sector]) / 1000);
    }
    delete(@rq_start[args->dev, args->sector]);
    delete(@rq_bytes[args->dev, args->sector]);
}
```
5.6 Page Fault Latency During mmap-backed Retrieval
```bpftrace
#!/usr/bin/env bpftrace

// exceptions:page_fault_user is x86-specific.
tracepoint:exceptions:page_fault_user
/comm == "agent" || comm == "python"/
{
    @pf_count[comm] = count();
}

kprobe:handle_mm_fault
/comm == "agent" || comm == "python"/
{
    @mm_fault_start[tid] = nsecs;
}

kretprobe:handle_mm_fault
/@mm_fault_start[tid]/
{
    @mm_fault_us[comm] = hist((nsecs - @mm_fault_start[tid]) / 1000);
    delete(@mm_fault_start[tid]);
}
```
5.7 End-to-End Approximation: IRQ Exit to Agent Running
```bpftrace
#!/usr/bin/env bpftrace

tracepoint:irq:irq_handler_exit
{
    @last_irq_exit[cpu] = nsecs;
}

tracepoint:sched:sched_switch
/str(args->next_comm) == "agent" || str(args->next_comm) == "python"/
{
    if (@last_irq_exit[cpu]) {
        @irq_to_agent_run_us[args->next_comm] = hist((nsecs - @last_irq_exit[cpu]) / 1000);
    }
}
```
This last probe is an approximation because a task may be woken by work caused by an IRQ on a different CPU. A production version should tag request IDs or socket cookies and correlate across CPUs.
6. Kernel Patch Sketches
Once the path is measurable, the next step is reducing variance. These are not drop-in upstream patches; they are RFC-grade sketches that show where kernel work could go.
6.1 Per-Step IRQ Attribution
```c
/* include/linux/agent_latency.h */
struct agent_step_ctx {
	u64 step_id;
	u64 deadline_ns;
	u32 latency_budget_us;
};

/* task_struct extension, guarded by CONFIG_AGENT_LATENCY */
struct task_struct {
	...
	struct agent_step_ctx *agent_ctx;
};

/* kernel/irq/handle.c - conceptual */
void note_agent_irq_completion(struct irq_desc *desc)
{
	struct agent_step_ctx *ctx = current->agent_ctx;

	if (!ctx)
		return;

	trace_agent_irq_stage(ctx->step_id,
			      desc->irq_data.irq,
			      ktime_get_ns());
}
```
6.2 Latency-First IRQ Placement
```c
/* Prefer IRQ CPUs near the agent control-loop CPU. Conceptual:
 * cpu_is_noisy() is a placeholder for a noise heuristic that does
 * not exist upstream. */
static int agent_select_irq_cpu(struct irq_desc *desc,
				struct task_struct *agent)
{
	int cpu = task_cpu(agent);

	if (cpu_online(cpu) && !cpu_is_noisy(cpu))
		return cpu;

	return cpumask_first_and(agent->cpus_ptr,
				 housekeeping_cpumask(HK_TYPE_DOMAIN));
}
```
6.3 Softirq Budget Guardrail for Agent Cores
```c
/* net/core/dev.c - conceptual */
if (cpu_has_agent_latency_task(smp_processor_id())) {
	budget = min_t(int, budget, sysctl_agent_napi_budget);
	time_limit = min_t(u64, time_limit, AGENT_SOFTIRQ_BUDGET_NS);
}
```
7. Evaluation Plan
A serious evaluation should measure distributions, not averages. Agent systems care about p99 and p999 because rare stalls occur repeatedly across task execution.
| Metric | Why it matters | Probe/tool |
|---|---|---|
| IRQ handler latency | Hardirq path cost | irq_handler_entry/exit |
| Softirq/NAPI duration | Deferred network work tail | softirq_entry/exit, napi_poll |
| Wakeup-to-run latency | Scheduler contribution | sched_wakeup + sched_switch |
| Small read latency | Retrieval/index path | block_rq_issue/complete |
| mmap fault latency | Memory-mapped index behavior | handle_mm_fault |
| Agent step p99/p999 | User-visible loop stability | application step timestamps |
Experiment matrix:

Baseline:
- irqbalance=on
- default IRQ affinity
- default NAPI/softirq behavior
- unpinned agent process

Tuned:
- IRQ affinity pinned away from the agent core
- agent process CPU-pinned
- NAPI budget bounded on agent cores
- io_uring or epoll completions tagged by step_id

Report:
- p50/p95/p99/p999 agent step latency
- IRQ-to-agent wakeup histogram
- softirq runtime histogram
- CPU migration count
- small I/O completion histogram
Conclusion
For agentic AI workloads, interrupts are not invisible plumbing. A tool response can traverse MSI-X, APIC, IDT dispatch, x86 kernel entry, generic IRQ handling, driver top halves, softirq/NAPI, scheduler wakeups, and runtime event loops before the agent can continue. Each stage is small. The chain is not.
The next useful kernel optimization for agent systems is not only faster handlers; it is attribution: knowing which interrupt, which softirq, which completion, and which scheduler decision delayed a specific agent step.