ARM vs x86 for Agentic AI: The Kernel Stack Changes That Actually Matter
If you want agents to feel fast, do not start with FLOPS charts. Start with try_to_wake_up(), IRQ and softirq locality, TLB stability, NUMA discipline, io_uring completion delivery, schedutil, and cpuidle.
Abstract. Agentic AI is a control-plane workload masquerading as an inference workload. The hot path is often not matrix math but a sequence of kernel-mediated boundaries: IRQ completion, softirq or block completion, scheduler wakeup, C-state exit, frequency selection, cache warmup, page-table stability, userspace parsing, and the next outbound operation. This produces a pathological performance shape: small sub-millisecond penalties repeated across hundreds or thousands of dependent steps. The Linux kernel today does not have a strong notion of a step-critical orchestration thread. It sees short bursts, low average utilization, occasional blocking, and abundant opportunities to rebalance, downclock, migrate, reclaim, or defer work. Those are reasonable defaults for generic workloads and often the wrong defaults for agents. My view is that the highest-leverage stack changes on both ARM and x86 sit in five surfaces: scheduler wakeup placement, IRQ-to-userspace completion locality, memory and TLB stability, I/O runtime completion semantics, and power-state control. The architecture-specific failure mode differs. On ARM, the repeated debt is usually DVFS lag, cpuidle exit latency, and heterogeneous-cluster misplacement. On x86, it is more often NUMA drift, interrupt spray, TLB shootdown noise, and housekeeping interference across large core topologies.
1. The Wrong Mental Model
Most people still profile agent systems as if they were either training jobs or steady-state token generators. They are neither.
The execution pattern of an agent is a long dependency chain of short critical sections. A runtime issues work to a model server, a tool process, a remote RPC endpoint, a filesystem, a browser session, or a local sandbox. Then it blocks in epoll_wait(), io_uring_enter(), a futex, an eventfd, a timerfd, or a socket receive path. The next visible unit of user experience is therefore determined by what the kernel does with the completion, not only by what the model did before the completion.
```
agent step:
    submit work
    block on event source
    device or remote side completes
    IRQ fires
    softirq / blk-mq / task_work / wake_q path runs
    target task becomes runnable
    scheduler chooses CPU
    task returns to userspace
    parse / plan / emit next operation
    repeat
```

```
critical latency =
      completion_to_runnable
    + runnable_to_oncpu
    + idle_exit
    + freq_ramp
    + cache_and_tlb_rewarm
    + userspace_resume
```
That means the relevant optimization target is not “faster model kernel launch” in isolation. It is minimizing the variance and mean of the completion-to-next-issue loop. If you lose 80 microseconds at wakeup, 120 microseconds to frequency ramp, 60 microseconds to migration and cold-cache effects, and 50 microseconds to an avoidable TLB disturbance, you have already lost more than many agent frameworks spend on the actual control logic for a step.
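The decomposition above can be made concrete with a tiny accounting helper. This is a sketch: the field names mirror the pseudocode, not any existing kernel API, and the example numbers are the illustrative figures from the paragraph above.

```c
#include <stdint.h>

/* Per-step boundary costs, in nanoseconds. Field names mirror the
 * decomposition above; this is illustrative accounting, not a kernel API. */
struct step_lat {
    uint64_t completion_to_runnable;
    uint64_t runnable_to_oncpu;
    uint64_t idle_exit;
    uint64_t freq_ramp;
    uint64_t cache_and_tlb_rewarm;
    uint64_t userspace_resume;
};

/* Total kernel-mediated boundary cost for one agent step. */
static inline uint64_t step_total_ns(const struct step_lat *s)
{
    return s->completion_to_runnable + s->runnable_to_oncpu +
           s->idle_exit + s->freq_ramp +
           s->cache_and_tlb_rewarm + s->userspace_resume;
}
```

Summing the example above (80 us wakeup, 120 us ramp, 110 us combined migration, cold-cache, and TLB cost) gives 310 us of pure boundary debt per step before any control logic runs.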
Key reframing: the operating system is part of the agent runtime. Linux is not a neutral substrate here; it actively shapes how quickly one reasoning step can become the next.
2. The Critical Path Through Linux
The chain below is where I would focus the first six months of serious systems work. Every stage is measurable, and every stage has architecture-specific failure modes.
| Stage | Kernel structures involved | Typical failure mode | What I would change |
|---|---|---|---|
| Completion arrival | MSI/MSI-X, IRQ descriptors, blk-mq tags, NAPI poll lists | IRQ lands on a distant or noisy CPU | Preserve a home CPU or home cluster for step-critical control threads and bias completion toward it |
| Deferred completion | softirq, task_work, workqueues, ksoftirqd | Completion falls out of the fast path into overloaded generic contexts | Stronger affinity propagation and reduced handoff count for latency-marked flows |
| Wakeup | wake_q, ttwu, rq selection, EEVDF/CFS placement | Task wakes on the wrong CPU or after unnecessary migration | Locality-preserving wakeup placement and anti-migration windows |
| Resume on CPU | cpuidle, cpufreq, schedutil, uclamp, idle governors | CPU is in the wrong idle state or at the wrong frequency | Windowed performance floors and shallow-idle bias during active control loops |
| Userspace run | TLB, page tables, slab, page allocator, page cache | Cold TLBs, remote memory, reclaim noise, mmap churn | Step-local memory stability accounting and control-plane-local allocators |
3. Scheduler: Wakeup Determinism Beats Average Fairness
The default scheduler objective is fairness and high aggregate utilization, not minimal boundary latency for bursty orchestration threads. For agents, the highest-value scheduler improvement is not a new general-purpose class for everything. It is a precise recognition of step-critical threads whose wakeup-to-run delay dominates user-visible latency even when they consume little CPU time overall.
3.1 The Wakeup Path I Care About
On Linux, the important path is not just “the task is runnable.” It is what happens inside the wakeup path: try_to_wake_up(), CPU selection, runqueue locking, task enqueue, preemption decision, and whether the destination CPU matches the locality of the completion source and the task’s prior hot state. If an agent runtime wakes on CPU 5, but the relevant socket completion arrived on CPU 17, and the task last touched hot metadata on CPU 6 in the same LLC slice, generic balancing has already lost important context.
```
idealized agent-aware wakeup policy:
    if task has active_step_window:
        prefer previous_cpu if cache-hot and not overloaded
        else prefer completion_cpu if same LLC / cluster
        else prefer task.home_cpu
        suppress load-balance migration unless latency budget violated
    else:
        use normal policy
```
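The same policy can be written as a straight-line selection function. Everything here is hypothetical: none of these struct fields or names exist in the kernel today; this just makes the decision order in the pseudocode above explicit.

```c
#include <stdbool.h>

/* Hypothetical per-task wakeup context -- a sketch of the policy above,
 * not a real kernel interface. */
struct wake_ctx {
    bool active_step_window;     /* task is inside a step-critical window */
    int  prev_cpu;               /* where the task last ran */
    int  completion_cpu;         /* where the completion arrived */
    int  home_cpu;               /* preferred home CPU */
    bool prev_cpu_cache_hot;
    bool prev_cpu_overloaded;
    bool completion_cpu_same_llc;
};

/* Returns -1 to mean "fall back to the normal placement policy". */
static int agent_select_wake_cpu(const struct wake_ctx *c)
{
    if (!c->active_step_window)
        return -1;                      /* normal policy */
    if (c->prev_cpu_cache_hot && !c->prev_cpu_overloaded)
        return c->prev_cpu;             /* warm caches win */
    if (c->completion_cpu_same_llc)
        return c->completion_cpu;       /* stay near the completion */
    return c->home_cpu;                 /* fall back to the home CPU */
}
```

The ordering is the point: prior cache state outranks completion locality, which outranks the static home, and the whole policy evaporates once the step window closes.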
3.2 What I Would Add
Latency-tagged control threads
Add a task or cgroup hint so the runtime can mark only the orchestration thread, not the whole process tree. This hint should decay in short windows rather than acting like permanent priority inflation.
Home-CPU semantics
Maintain a preferred home CPU or cluster chosen from recent completion locality, cache warmth, and memory locality. This is weaker than hard pinning and stronger than best-effort fairness.
Anti-migration windows
For tens or hundreds of microseconds after a completion, avoid migrations triggered by transient queue imbalance. That time range is exactly where many agents perform parse-plan-dispatch work.
Deterministic wakeup slots
For extremely latency-sensitive deployments, add a mode where specific CPUs reserve micro-windows for orchestration wakeups, reducing runqueue variability and preemption debt.
That last item is important because it moves the design from “please try to be local” toward “the system can bound where and when this thread will run.” It is also one of the cleanest differentiators versus GPU-centric systems thinking, which usually assumes the host scheduler is background noise.
4. IRQ, Softirq, and Completion Path: This Is Where Agent Latency Starts
In many agent traces, the real latency pathology begins long before userspace sees anything. It begins when device completion and deferred work lose locality or fall off the fast path into generic worker contexts.
4.1 Network and Storage Should Not Be Treated Symmetrically
Agent runtimes often mix network completions, local file reads, subprocess I/O, browser automation traffic, and model-server RPCs. Each one touches different kernel paths: NAPI and GRO for network, blk-mq for storage, task_work for io_uring, eventfd/futex for internal runtimes. Their common requirement is not maximum throughput. It is minimal variance between completion and resumption.
| Path | Kernel fast path | Where it goes wrong | Preferred fix |
|---|---|---|---|
| TCP/QUIC RPC reply | IRQ → NAPI poll → socket receive → wakeup | RPS/RFS, GRO batching, or ksoftirqd detours add jitter | Affinity propagation to the agent home core and bounded batching for latency-marked sockets |
| Local NVMe read | IRQ → blk-mq completion → wakeup or task_work | Completion CPU differs from consumer CPU; queue depth tuned for throughput not latency | Per-queue affinity and fast completion steering to the control thread locality island |
| io_uring tool chain | submission queue → device work → CQE → task_work / eventfd | CQE delivery coalesced too aggressively or consumed on the wrong CPU | Latency-mode CQE delivery and userspace consumption aligned to a home CPU |
4.2 The Design Principle
I would preserve completion provenance. The kernel should remember enough about where a completion came from, who is likely to consume it, and which CPU or cluster is best suited to finish the step. Today that information gets diluted across layers. A wakeup knows there is a task. The scheduler knows there are runnable queues. The power subsystem knows there is utilization. But nothing explicitly carries “this exact thread is finishing an active control loop and should resume on a warm nearby CPU now.”
My rule: the number of CPU handoffs between device completion and userspace resumption should be treated as a first-class SLO for agent platforms.
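Treating handoff count as an SLO implies actually computing it per step. A minimal sketch, assuming a trace pipeline has already recorded the ordered list of CPUs that touched one completion (IRQ CPU, softirq CPU, wakeup CPU, on-CPU CPU):

```c
#include <stddef.h>

/* Count cross-CPU handoffs along one completion's path to userspace.
 * Illustrative metric computation, not a kernel API. */
static unsigned int count_handoffs(const int *cpus, size_t n)
{
    unsigned int handoffs = 0;

    for (size_t i = 1; i < n; i++)
        if (cpus[i] != cpus[i - 1])
            handoffs++;
    return handoffs;
}
```

A perfectly local path scores zero; every nonzero increment is a candidate for affinity work.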
5. ARM: The Main Enemy Is Overeager Power Management
ARM systems often have excellent energy efficiency and very good raw performance-per-watt, but agent workloads expose the downside of policy loops built for smoother utilization curves. The mistake is to interpret a short blocked interval as meaningful quiescence when it is actually a phase boundary in a still-active control loop.
5.1 schedutil Is Usually Too Reactive for Bursty Orchestration
On ARM, schedutil frequently makes the “right” decision for the wrong workload: the task blocked, utilization dropped, so the frequency should decay. An agent runtime may then wake tens or hundreds of microseconds later and immediately pay ramp latency. The exact microarchitectural cost varies by SoC and firmware, but the pattern is consistent.
```
agent blocks on completion
  -> util_est decays
  -> schedutil reduces target frequency
  -> cpuidle picks deeper state
  -> next completion arrives
  -> IRQ handled on sibling cluster
  -> control task wakes cold
  -> latency debt repeats
```
I would add a kernel-visible active step window with an expiration timestamp. During the window, schedutil would apply a utilization floor for the task or cgroup, and cpuidle would bias away from deeper states on the relevant CPU or cluster. This is not the same as running the performance governor. It is a short-lived guardrail.
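In sketch form, the guardrail is two clamp functions consulting one expiring window. The struct fields, the 0..1024 utilization scale convention, and both thresholds are assumptions for illustration, not existing kernel state:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical short-lived guardrail: while a step window is live, floor
 * the utilization schedutil sees and cap cpuidle state depth. */
struct step_window {
    uint64_t expires_ns;     /* window end, renewed by the runtime */
    uint32_t util_floor;     /* minimum effective utilization (0..1024) */
    int      max_idle_state; /* deepest cpuidle state allowed in-window */
};

static inline bool window_live(const struct step_window *w, uint64_t now_ns)
{
    return now_ns < w->expires_ns;
}

/* schedutil side: raise the effective utilization during the window. */
static uint32_t effective_util(const struct step_window *w,
                               uint64_t now_ns, uint32_t util)
{
    if (window_live(w, now_ns) && util < w->util_floor)
        return w->util_floor;
    return util;
}

/* cpuidle side: cap idle-state depth during the window. */
static int clamp_idle_state(const struct step_window *w,
                            uint64_t now_ns, int requested_state)
{
    if (window_live(w, now_ns) && requested_state > w->max_idle_state)
        return w->max_idle_state;
    return requested_state;
}
```

The key property is that both overrides vanish automatically once `expires_ns` passes, so a runtime that stops renewing the window cannot pin the machine at high power.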
5.2 big.LITTLE Placement Needs Much Tighter Semantics
Capacity-aware scheduling is necessary but not sufficient. Agent control threads should not bounce between LITTLE and big cores mid-conversation simply because the instantaneous load picture changed. If a control thread is the sink for network and tool completions, it should stay on a home island until the active step window decays, unless there is a compelling thermal or contention reason to move it.
5.3 The Deterministic Scheduling Embodiment
For a more aggressive ARM design, I would create a deterministic scheduling mode for one or more orchestration cores:
```
per-cluster orchestration core:
    reserve a fixed micro-timeslice every T microseconds
    admit only latency-marked control threads into that slot
    align IRQ steering for marked sockets / queues to the same cluster
    hold minimum freq during slot-bearing active windows
    disallow deep idle while slot reservation is live
```
This matters because it converts an otherwise probabilistic wakeup path into a partially time-bounded one. NVIDIA-style host orchestration usually assumes the CPU side is a best-effort feeder for the GPU. A deterministic orchestration slot says the opposite: for agent systems, host-side control latency is part of the product, so we reserve scheduling structure for it.
5.4 cpuidle Should Be Completion-Cadence Aware
The idle-state question is not “how idle is the CPU in aggregate?” but “what is the recent cadence of completions for this latency-marked flow?” If the same control thread has been reactivated every 200 microseconds to 3 milliseconds for the last several steps, deep idle entry is probably a mistake even if aggregate CPU utilization remains low.
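A cadence-aware gate could look like the following sketch. The ring size, the 4x exit-latency safety margin, and the use of the minimum recent gap are all assumed policy choices, not anything the kernel implements today:

```c
#include <stdbool.h>
#include <stdint.h>

#define CADENCE_SLOTS 8

/* Recent inter-completion gaps for one latency-marked flow. */
struct cadence {
    uint64_t gaps_ns[CADENCE_SLOTS];
    unsigned int next;
    unsigned int count;
};

static void cadence_record(struct cadence *c, uint64_t gap_ns)
{
    c->gaps_ns[c->next] = gap_ns;
    c->next = (c->next + 1) % CADENCE_SLOTS;
    if (c->count < CADENCE_SLOTS)
        c->count++;
}

/* Allow deep idle only if even the shortest recent gap comfortably exceeds
 * the state's exit latency (4x margin, an assumed safety factor). */
static bool deep_idle_ok(const struct cadence *c, uint64_t exit_latency_ns)
{
    if (c->count == 0)
        return true;        /* no history: defer to the normal governor */

    uint64_t min_gap = UINT64_MAX;
    for (unsigned int i = 0; i < c->count; i++)
        if (c->gaps_ns[i] < min_gap)
            min_gap = c->gaps_ns[i];
    return min_gap > 4 * exit_latency_ns;
}
```

Using the minimum rather than the mean gap is deliberate: one early completion paying a full idle-exit penalty is exactly the tail event this section is trying to avoid.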
6. x86: The Main Enemy Is Topology and Noise
x86 systems usually do not fail because they lack raw speed. They fail because large topologies create too many ways to waste locality. Multiple NUMA nodes, large LLC slices, SMT siblings, broad interrupt spray, background daemons, remote page cache, shootdowns, and housekeeping traffic can make a monster server behave worse than a smaller machine for interactive orchestration.
6.1 NUMA Locality Must Include the Control Plane
Most NUMA guidance is built around throughput threads. Agent runtimes need a different policy: keep the orchestration thread, its frequently touched metadata, its page cache working set, its socket buffers, and its short-lived tool I/O state on the same NUMA home. The thread itself may be light; the penalty of remoteness is not.
```
control-plane NUMA policy:
    choose home node for agent runtime
    allocate hot metadata on home node
    prefer wakeup CPUs in home node
    steer IRQs for relevant queues to home node
    keep page cache readahead and socket memory local when possible
```
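The "choose home node" step can be driven by observed completion locality rather than static configuration. A sketch, assuming the caller supplies per-CPU completion counters and a CPU-to-node map (this is pure selection logic, not libnuma usage):

```c
#include <stddef.h>

#define MAX_NODES 64    /* assumed bound for the sketch */

/* Pick the NUMA home for the control plane as the node that has absorbed
 * the most recent completions. Illustrative policy, not a kernel API. */
static int pick_home_node(const unsigned long *completions_per_cpu,
                          const int *cpu_to_node,
                          size_t ncpus, int nnodes)
{
    unsigned long mass[MAX_NODES] = {0};

    if (nnodes > MAX_NODES)
        return -1;
    for (size_t cpu = 0; cpu < ncpus; cpu++)
        mass[cpu_to_node[cpu]] += completions_per_cpu[cpu];

    int best = 0;
    for (int n = 1; n < nnodes; n++)
        if (mass[n] > mass[best])
            best = n;
    return best;
}
```

Once a home is chosen, the remaining bullets in the policy are enforcement: allocate, wake, and steer toward that node until the evidence changes.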
6.2 TLB Shootdowns Are an Underappreciated Tax
Agent frameworks often create exactly the kind of memory behavior that produces shootdown noise: short-lived subprocesses, mmap-heavy tools, sandbox remaps, JITs, browser automation sessions, shared-memory buffers, and background memory pressure. On x86, a badly timed TLB shootdown or page-table invalidation can perturb the very control thread you are trying to wake quickly.
I would add step-scoped accounting for:
- tlb_flush events occurring between completion and next issue
- remote IPI traffic attributable to address-space changes in neighboring workloads
- migrations that coincide with page-table refill and cache rewarm events
This is one of those domains where average counters are misleading. You need correlation with agent step boundaries, not just system totals.
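The correlation itself is simple once events carry timestamps. A sketch of the attribution step, assuming a tracer has collected flush timestamps and per-step completion and next-issue times:

```c
#include <stddef.h>
#include <stdint.h>

/* Count TLB-flush events landing inside one step's critical interval
 * [completion_ns, next_issue_ns). Attributing flushes to step boundaries,
 * instead of reading system-wide totals, is the point of the exercise. */
static unsigned int flushes_in_step(const uint64_t *flush_ts, size_t nflush,
                                    uint64_t completion_ns,
                                    uint64_t next_issue_ns)
{
    unsigned int hits = 0;

    for (size_t i = 0; i < nflush; i++)
        if (flush_ts[i] >= completion_ns && flush_ts[i] < next_issue_ns)
            hits++;
    return hits;
}
```

A system-wide flush counter can look benign while every single flush happens to land inside a critical interval; this per-step view is what separates the two cases.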
6.3 IRQ Isolation Is More Important Than Raw Core Count
On x86, I would much rather have four cores with disciplined IRQ isolation, bounded housekeeping, and stable NUMA locality than sixty-four cores with indiscriminate interrupt delivery. Agentic responsiveness collapses when orchestration CPUs share their immediate neighborhood with unrelated packet floods, storage storms, or timer-heavy background services.
My x86 thesis: once the machine is big enough, the main performance problem is coordination complexity, not arithmetic throughput.
7. Memory Management: Keep the Control Plane Warm
Agent runtimes are not purely CPU-latency problems. They are also memory hygiene problems. The control thread often touches a surprisingly rich working set: token state, parser buffers, JSON trees, tool descriptors, socket state, page cache metadata, sandbox bookkeeping, and policy state for what to do next. None of this is individually large. Together it is fragile.
7.1 What I Would Change
Local allocators for control threads
Small hot allocations for orchestration threads should bias toward node-local, CPU-local caches and avoid incidental cross-node refill where possible.
Reclaim guardrails
Background reclaim and compaction should become less willing to perturb pages touched repeatedly inside active step windows, especially on systems with ample free memory.
Page-cache stability
Tool artifacts, prompt templates, policy files, and other repeatedly opened metadata should be protected from pointless churn caused by broader file-cache pressure.
TLB disturbance accounting
Expose per-cgroup or per-task counters for TLB flushes and page-table disruption during marked orchestration windows.
This is also where future CXL and heterogeneous-memory systems become interesting. Even without redesigning everything around new memory tiers, the kernel should know that some agent state is latency-critical control state and should stay close to CPU caches and low-latency memory.
8. I/O Runtime: io_uring Is Useful, But It Still Needs Kernel Help
Many people see io_uring and assume the problem is solved. It is not. io_uring reduces some userspace/kernel transition overhead and enables efficient async composition, but the completion path still runs through the same underlying locality and scheduling machinery. A CQE delivered after the wrong batching decision or consumed on the wrong CPU is still late.
8.1 What I Would Want from io_uring for Agents
- A latency mode that prioritizes timely CQE visibility over aggressive completion batching.
- Better affinity coupling between submission-side control threads and completion-side wakeups.
- Step-aware observability joining CQE timestamps to scheduler, cpuidle, and cpufreq events.
- Cleaner policy hooks so runtimes can say “this queue belongs to a step-critical orchestration loop.”
```
desired derived metrics:
    cqe_visible_ns - device_complete_ns
    wakeup_ns - cqe_visible_ns
    oncpu_ns - wakeup_ns
    next_issue_ns - oncpu_ns
    migrations_in_between
    freq_transitions_in_between
    tlb_flushes_in_between
```
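The timestamp differences above reduce to one small derivation step. The record layout is hypothetical, assembled from tracepoints by a hypothetical collector rather than provided by any io_uring uapi:

```c
#include <stdint.h>

/* One step's timestamps; names mirror the derived-metric list above. */
struct step_ts {
    uint64_t device_complete_ns;
    uint64_t cqe_visible_ns;
    uint64_t wakeup_ns;
    uint64_t oncpu_ns;
    uint64_t next_issue_ns;
};

struct step_metrics {
    uint64_t cqe_delay_ns;      /* cqe_visible - device_complete */
    uint64_t wake_delay_ns;     /* wakeup - cqe_visible */
    uint64_t sched_delay_ns;    /* oncpu - wakeup */
    uint64_t userspace_ns;      /* next_issue - oncpu */
};

static struct step_metrics derive(const struct step_ts *t)
{
    struct step_metrics m = {
        .cqe_delay_ns   = t->cqe_visible_ns - t->device_complete_ns,
        .wake_delay_ns  = t->wakeup_ns - t->cqe_visible_ns,
        .sched_delay_ns = t->oncpu_ns - t->wakeup_ns,
        .userspace_ns   = t->next_issue_ns - t->oncpu_ns,
    };
    return m;
}
```

Each component maps to a different owner: `cqe_delay_ns` indicts batching policy, `wake_delay_ns` the completion path, `sched_delay_ns` placement, and `userspace_ns` the runtime itself.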
9. The Measurement Model
Before changing policy, I would build a measurement pipeline that treats the agent step as the unit of analysis. Linux already exposes most of what is needed through tracepoints, BPF, PMU events, and perf infrastructure.
| Metric | Collection source | Why it matters |
|---|---|---|
| completion_to_runnable_ns | IRQ, softirq, blk-mq, wakeup tracepoints | Measures the boundary cost before the task is runnable |
| runnable_to_oncpu_ns | sched_wakeup + sched_switch | Shows scheduler contention or placement mistakes |
| migrations_per_step | sched_switch + task CPU history | Correlates locality loss with step latency |
| freq_drop_between_steps | cpufreq transitions, scheduler utilization | Captures DVFS overreaction on ARM and sometimes x86 |
| deep_idle_entries_during_active_window | cpuidle tracepoints | Explains wakeup penalties in bursty control loops |
| tlb_flushes_per_step | mm/tlb tracepoints, IPIs, perf | Makes memory-management interference visible |
| remote_irq_ratio | IRQ CPU vs task home CPU | Direct signal of affinity failure |
tracepoints I would join:

```
irq_handler_entry / irq_handler_exit
softirq_entry / softirq_exit
napi_poll
blk_mq_complete_request
sched_wakeup
sched_switch
mm_tlb_flush_*
cpu_frequency
cpu_idle
io_uring_cqe or task_work related hooks
```

BPF key:

```
{agent_id, step_id, task_pid, home_cpu, completion_cpu, wake_cpu}
```
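The BPF key above can be stated as a plain struct with an explicit join rule. The layout and the split between identity fields and attribute fields are assumptions of this sketch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Candidate map key for per-step attribution. The CPU fields are recorded
 * attributes of the step, not part of the join identity. */
struct step_key {
    uint32_t agent_id;
    uint32_t step_id;
    uint32_t task_pid;
    int32_t  home_cpu;
    int32_t  completion_cpu;
    int32_t  wake_cpu;
};

/* Two events belong to the same step iff their identity fields match. */
static bool step_key_eq(const struct step_key *a, const struct step_key *b)
{
    return a->agent_id == b->agent_id &&
           a->step_id  == b->step_id  &&
           a->task_pid == b->task_pid;
}
```

Keeping the CPU fields out of the equality test matters: an IRQ event and the resulting wakeup may legitimately record different CPUs, and that disagreement is exactly the locality failure the join is supposed to surface.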
10. A Concrete Kernel Design: Agent Latency Control Plane
The cleanest way to think about all of this is as a tiny cross-subsystem control plane in the kernel. Not a giant new scheduler. Not a magic governor. A small piece of shared state that says: this task or cgroup is in an active orchestration window, here is its home locality, and here is its latency budget.
```c
struct agent_window {
    u64 expires_ns;
    u32 agent_id;
    s16 home_cpu;
    s16 home_node;
    u32 latency_budget_us;
    u16 flags; /* anti_migrate, shallow_idle, dvfs_floor, deterministic_slot */
};
```
Subsystem interactions would look like this:
- Userspace runtime marks a control thread or cgroup and renews an active window when issuing or receiving a step-critical operation.
- IRQ or completion paths consult the active window to bias CPU selection and delivery locality.
- The scheduler prefers the home CPU, completion CPU, or same-LLC CPU while the window remains valid.
- cpufreq and cpuidle apply bounded policy overrides only for the relevant CPUs and only while the window remains active.
- Memory management accounts for TLB disturbance and optionally shields hot control-plane pages from needless perturbation.
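Two of those consumers can be sketched end to end. This restates `struct agent_window` from above in standard C types; the flag bit values and both helper names are assumptions of the sketch, not proposed uapi:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag encodings for the flags field described above. */
#define AW_ANTI_MIGRATE       (1u << 0)
#define AW_SHALLOW_IDLE       (1u << 1)
#define AW_DVFS_FLOOR         (1u << 2)
#define AW_DETERMINISTIC_SLOT (1u << 3)

struct agent_window {
    uint64_t expires_ns;
    uint32_t agent_id;
    int16_t  home_cpu;
    int16_t  home_node;
    uint32_t latency_budget_us;
    uint16_t flags;
};

/* Scheduler-side consult: veto a load-balance migration while the window
 * is live and the anti-migrate flag is set. */
static bool window_vetoes_migration(const struct agent_window *w,
                                    uint64_t now_ns)
{
    return now_ns < w->expires_ns && (w->flags & AW_ANTI_MIGRATE);
}

/* Wakeup-side consult: prefer the home CPU while the window is live. */
static int window_preferred_cpu(const struct agent_window *w,
                                uint64_t now_ns, int fallback_cpu)
{
    if (now_ns < w->expires_ns && w->home_cpu >= 0)
        return w->home_cpu;
    return fallback_cpu;
}
```

Every consumer degrades to existing behavior the moment the window expires, which is what keeps this a hint layer rather than a new scheduling class.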
11. If I Had to Prioritize the Patch Queue
I would not start by changing everything. I would stage the work in a way that makes proof easy.
| Phase | Primary subsystem | Deliverable | Proof of value |
|---|---|---|---|
| 1 | Observability | BPF step tracer joining completion, wakeup, frequency, idle, and TLB events | Tail-latency decomposition per agent step |
| 2 | Scheduler | Home CPU + anti-migration window for marked threads | Lower runnable_to_oncpu and fewer migrations per step |
| 3 | IRQ / completion | Affinity propagation for marked queues and sockets | Lower completion_to_runnable and remote_irq_ratio |
| 4 | ARM power path | Windowed uclamp / schedutil floor + shallow idle bias | Reduced freq_drop_between_steps and idle-exit debt |
| 5 | x86 topology hygiene | NUMA-home control plane + IRQ / housekeeping isolation | Lower remote memory and interrupt noise for orchestration cores |
| 6 | Memory | TLB disturbance accounting and control-plane reclaim guardrails | Reduced long-tail stalls correlated with mm activity |
12. Architecture-Specific Bottom Line
ARM
- Introduce short-lived active step windows integrated with schedutil and cpuidle.
- Keep orchestration threads on a stable big-core or stable cluster home until the window expires.
- Route IRQs and completion work to the same locality island.
- Consider deterministic orchestration slots for the most latency-sensitive deployments.
x86
- Choose a NUMA home for the control plane and keep wakeups local.
- Isolate agent-critical CPUs from indiscriminate interrupt spray and housekeeping.
- Measure and reduce TLB and page-table disturbance within step windows.
- Treat topology discipline as more important than raw core count.
13. Final View
If you ask me what part of the ARM or x86 kernel stack I would change first to boost agentic performance, the answer is not “the model driver” and not “some generic AI acceleration path.” It is the part of Linux that decides what happens between one step and the next. That means wakeup placement, completion locality, power-state continuity, and memory stability.
On ARM, the real bug is often that the machine keeps trying to save power in the middle of an active reasoning conversation. On x86, the real bug is often that the machine has so much topology and background noise that it cannot preserve locality for a tiny but critical control thread. In both cases, the winning design is the same: teach the kernel that short orchestration gaps are not dead time. They are active time with deferred visibility.
The short version: if I had to redesign only three surfaces, I would redesign scheduler wakeups, IRQ-to-userspace completion, and memory-stability accounting, then attach bounded power policy on top. That is where the real step-function gains are.