ARM vs x86 for Agentic AI: The Kernel Stack Changes That Actually Matter
If you want agents to feel fast, do not start with FLOPS charts. Start with try_to_wake_up(), IRQ and softirq locality, TLB stability, NUMA discipline, io_uring completion delivery, schedutil, and cpuidle.
Abstract. Agentic AI is a control-plane workload masquerading as an inference workload. The hot path is often not matrix math but a sequence of kernel-mediated boundaries: IRQ completion, softirq or block completion, scheduler wakeup, C-state exit, frequency selection, cache warmup, page-table stability, userspace parsing, and the next outbound operation. This produces a pathological performance shape: small sub-millisecond penalties repeated across hundreds or thousands of dependent steps. The Linux kernel today does not have a strong notion of a step-critical orchestration thread. It sees short bursts, low average utilization, occasional blocking, and abundant opportunities to rebalance, downclock, migrate, reclaim, or defer work. Those are reasonable defaults for generic workloads and often the wrong defaults for agents. My view is that the highest-leverage stack changes on both ARM and x86 sit in five surfaces: scheduler wakeup placement, IRQ-to-userspace completion locality, memory and TLB stability, I/O runtime completion semantics, and power-state control. The architecture-specific failure mode differs. On ARM, the repeated debt is usually DVFS lag, cpuidle exit latency, and heterogeneous-cluster misplacement. On x86, it is more often NUMA drift, interrupt spray, TLB shootdown noise, and housekeeping interference across large core topologies.
1. The Wrong Mental Model
Most people still profile agent systems as if they were either training jobs or steady-state token generators. They are neither.
The execution pattern of an agent is a long dependency chain of short critical sections. A runtime issues work to a model server, a tool process, a remote RPC endpoint, a filesystem, a browser session, or a local sandbox. Then it blocks in epoll_wait(), io_uring_enter(), a futex, an eventfd, a timerfd, or a socket receive path. The next visible unit of user experience is therefore determined by what the kernel does with the completion, not only by what the model did before the completion.
```
agent step:
    submit work
    block on event source
    device or remote side completes
    IRQ fires
    softirq / blk-mq / task_work / wake_q path runs
    target task becomes runnable
    scheduler chooses CPU
    task returns to userspace
    parse / plan / emit next operation
    repeat
```

```
critical latency =
      completion_to_runnable
    + runnable_to_oncpu
    + idle_exit
    + freq_ramp
    + cache_and_tlb_rewarm
    + userspace_resume
```
That means the relevant optimization target is not “faster model kernel launch” in isolation. It is minimizing the variance and mean of the completion-to-next-issue loop. If you lose 80 microseconds at wakeup, 120 microseconds to frequency ramp, 60 microseconds to migration and cold-cache effects, and 50 microseconds to an avoidable TLB disturbance, you have already lost more than many agent frameworks spend on the actual control logic for a step.
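The decomposition above can be made concrete with a tiny accounting helper. This is a sketch: the field names mirror the pseudocode, not any existing kernel API, and the example numbers are the illustrative figures from the paragraph above.

```c
#include <stdint.h>

/* Per-step boundary costs, in nanoseconds. Field names mirror the
 * decomposition above; this is illustrative accounting, not a kernel API. */
struct step_lat {
    uint64_t completion_to_runnable;
    uint64_t runnable_to_oncpu;
    uint64_t idle_exit;
    uint64_t freq_ramp;
    uint64_t cache_and_tlb_rewarm;
    uint64_t userspace_resume;
};

/* Total kernel-mediated boundary cost for one agent step. */
static inline uint64_t step_total_ns(const struct step_lat *s)
{
    return s->completion_to_runnable + s->runnable_to_oncpu +
           s->idle_exit + s->freq_ramp +
           s->cache_and_tlb_rewarm + s->userspace_resume;
}
```

Summing the example above (80 us wakeup, 120 us ramp, 110 us combined migration, cold-cache, and TLB cost) gives 310 us of pure boundary debt per step before any control logic runs.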
Key reframing: the operating system is part of the agent runtime. Linux is not a neutral substrate here; it actively shapes how quickly one reasoning step can become the next.
2. The Critical Path Through Linux
The chain below is where I would focus the first six months of serious systems work. Every stage is measurable, and every stage has architecture-specific failure modes.
| Stage | Kernel structures involved | Typical failure mode | What I would change |
|---|---|---|---|
| Completion arrival | MSI/MSI-X, IRQ descriptors, blk-mq tags, NAPI poll lists | IRQ lands on a distant or noisy CPU | Preserve a home CPU or home cluster for step-critical control threads and bias completion toward it |
| Deferred completion | softirq, task_work, workqueues, ksoftirqd | Completion falls out of the fast path into overloaded generic contexts | Stronger affinity propagation and reduced handoff count for latency-marked flows |
| Wakeup | wake_q, ttwu, rq selection, EEVDF/CFS placement | Task wakes on the wrong CPU or after unnecessary migration | Locality-preserving wakeup placement and anti-migration windows |
| Resume on CPU | cpuidle, cpufreq, schedutil, uclamp, idle governors | CPU is in the wrong idle state or at the wrong frequency | Windowed performance floors and shallow-idle bias during active control loops |
| Userspace run | TLB, page tables, slab, page allocator, page cache | Cold TLBs, remote memory, reclaim noise, mmap churn | Step-local memory stability accounting and control-plane-local allocators |
3. Scheduler: Wakeup Determinism Beats Average Fairness
The default scheduler objective is fairness and high aggregate utilization, not minimal boundary latency for bursty orchestration threads. For agents, the highest-value scheduler improvement is not a new general-purpose class for everything. It is a precise recognition of step-critical threads whose wakeup-to-run delay dominates user-visible latency even when they consume little CPU time overall.
3.1 The Wakeup Path I Care About
On Linux, the important path is not just “the task is runnable.” It is what happens inside the wakeup path: try_to_wake_up(), CPU selection, runqueue locking, task enqueue, preemption decision, and whether the destination CPU matches the locality of the completion source and the task’s prior hot state. If an agent runtime wakes on CPU 5, but the relevant socket completion arrived on CPU 17, and the task last touched hot metadata on CPU 6 in the same LLC slice, generic balancing has already lost important context.
```
idealized agent-aware wakeup policy:
    if task has active_step_window:
        prefer previous_cpu if cache-hot and not overloaded
        else prefer completion_cpu if same LLC / cluster
        else prefer task.home_cpu
        suppress load-balance migration unless latency budget violated
    else:
        use normal policy
```
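The same policy can be written as a straight-line selection function. Everything here is hypothetical: none of these struct fields or names exist in the kernel today; this just makes the decision order in the pseudocode above explicit.

```c
#include <stdbool.h>

/* Hypothetical per-task wakeup context -- a sketch of the policy above,
 * not a real kernel interface. */
struct wake_ctx {
    bool active_step_window;     /* task is inside a step-critical window */
    int  prev_cpu;               /* where the task last ran */
    int  completion_cpu;         /* where the completion arrived */
    int  home_cpu;               /* preferred home CPU */
    bool prev_cpu_cache_hot;
    bool prev_cpu_overloaded;
    bool completion_cpu_same_llc;
};

/* Returns -1 to mean "fall back to the normal placement policy". */
static int agent_select_wake_cpu(const struct wake_ctx *c)
{
    if (!c->active_step_window)
        return -1;                      /* normal policy */
    if (c->prev_cpu_cache_hot && !c->prev_cpu_overloaded)
        return c->prev_cpu;             /* warm caches win */
    if (c->completion_cpu_same_llc)
        return c->completion_cpu;       /* stay near the completion */
    return c->home_cpu;                 /* fall back to the home CPU */
}
```

The ordering is the point: prior cache state outranks completion locality, which outranks the static home, and the whole policy evaporates once the step window closes.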
3.2 What I Would Add
Latency-tagged control threads
Add a task or cgroup hint so the runtime can mark only the orchestration thread, not the whole process tree. This hint should decay in short windows rather than acting like permanent priority inflation.
Home-CPU semantics
Maintain a preferred home CPU or cluster chosen from recent completion locality, cache warmth, and memory locality. This is weaker than hard pinning and stronger than best-effort fairness.
Anti-migration windows
For tens or hundreds of microseconds after a completion, avoid migrations triggered by transient queue imbalance. That time range is exactly where many agents perform parse-plan-dispatch work.
Deterministic wakeup slots
For extremely latency-sensitive deployments, add a mode where specific CPUs reserve micro-windows for orchestration wakeups, reducing runqueue variability and preemption debt.
That last item is important because it moves the design from “please try to be local” toward “the system can bound where and when this thread will run.” It is also one of the cleanest differentiators versus GPU-centric systems thinking, which usually assumes the host scheduler is background noise.
4. IRQ, Softirq, and Completion Path: This Is Where Agent Latency Starts
In many agent traces, the real latency pathology begins long before userspace sees anything. It begins when device completion and deferred work lose locality or fall off the fast path into generic worker contexts.
4.1 Network and Storage Should Not Be Treated Symmetrically
Agent runtimes often mix network completions, local file reads, subprocess I/O, browser automation traffic, and model-server RPCs. Each one touches different kernel paths: NAPI and GRO for network, blk-mq for storage, task_work for io_uring, eventfd/futex for internal runtimes. Their common requirement is not maximum throughput. It is minimal variance between completion and resumption.
| Path | Kernel fast path | Where it goes wrong | Preferred fix |
|---|---|---|---|
| TCP/QUIC RPC reply | IRQ → NAPI poll → socket receive → wakeup | RPS/RFS, GRO batching, or ksoftirqd detours add jitter | Affinity propagation to the agent home core and bounded batching for latency-marked sockets |
| Local NVMe read | IRQ → blk-mq completion → wakeup or task_work | Completion CPU differs from consumer CPU; queue depth tuned for throughput not latency | Per-queue affinity and fast completion steering to the control thread locality island |
| io_uring tool chain | submission queue → device work → CQE → task_work / eventfd | CQE delivery coalesced too aggressively or consumed on the wrong CPU | Latency-mode CQE delivery and userspace consumption aligned to a home CPU |
4.2 The Design Principle
I would preserve completion provenance. The kernel should remember enough about where a completion came from, who is likely to consume it, and which CPU or cluster is best suited to finish the step. Today that information gets diluted across layers. A wakeup knows there is a task. The scheduler knows there are runnable queues. The power subsystem knows there is utilization. But nothing explicitly carries “this exact thread is finishing an active control loop and should resume on a warm nearby CPU now.”
My rule: the number of CPU handoffs between device completion and userspace resumption should be treated as a first-class SLO for agent platforms.
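Treating handoff count as an SLO implies actually computing it per step. A minimal sketch, assuming a trace pipeline has already recorded the ordered list of CPUs that touched one completion (IRQ CPU, softirq CPU, wakeup CPU, on-CPU CPU):

```c
#include <stddef.h>

/* Count cross-CPU handoffs along one completion's path to userspace.
 * Illustrative metric computation, not a kernel API. */
static unsigned int count_handoffs(const int *cpus, size_t n)
{
    unsigned int handoffs = 0;

    for (size_t i = 1; i < n; i++)
        if (cpus[i] != cpus[i - 1])
            handoffs++;
    return handoffs;
}
```

A perfectly local path scores zero; every nonzero increment is a candidate for affinity work.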
5. ARM: The Main Enemy Is Overeager Power Management
ARM systems often have excellent energy efficiency and very good raw performance-per-watt, but agent workloads expose the downside of policy loops built for smoother utilization curves. The mistake is to interpret a short blocked interval as meaningful quiescence when it is actually a phase boundary in a still-active control loop.
5.1 schedutil Is Usually Too Reactive for Bursty Orchestration
On ARM, schedutil frequently makes the “right” decision for the wrong workload: the task blocked, utilization dropped, so the frequency should decay. An agent runtime may then wake tens or hundreds of microseconds later and immediately pay ramp latency. The exact microarchitectural cost varies by SoC and firmware, but the pattern is consistent.
```
agent blocks on completion
  -> util_est decays
  -> schedutil reduces target frequency
  -> cpuidle picks deeper state
  -> next completion arrives
  -> IRQ handled on sibling cluster
  -> control task wakes cold
  -> latency debt repeats
```
I would add a kernel-visible active step window with an expiration timestamp. During the window, schedutil would apply a utilization floor for the task or cgroup, and cpuidle would bias away from deeper states on the relevant CPU or cluster. This is not the same as running the performance governor. It is a short-lived guardrail.
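In sketch form, the guardrail is two clamp functions consulting one expiring window. The struct fields, the 0..1024 utilization scale convention, and both thresholds are assumptions for illustration, not existing kernel state:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical short-lived guardrail: while a step window is live, floor
 * the utilization schedutil sees and cap cpuidle state depth. */
struct step_window {
    uint64_t expires_ns;     /* window end, renewed by the runtime */
    uint32_t util_floor;     /* minimum effective utilization (0..1024) */
    int      max_idle_state; /* deepest cpuidle state allowed in-window */
};

static inline bool window_live(const struct step_window *w, uint64_t now_ns)
{
    return now_ns < w->expires_ns;
}

/* schedutil side: raise the effective utilization during the window. */
static uint32_t effective_util(const struct step_window *w,
                               uint64_t now_ns, uint32_t util)
{
    if (window_live(w, now_ns) && util < w->util_floor)
        return w->util_floor;
    return util;
}

/* cpuidle side: cap idle-state depth during the window. */
static int clamp_idle_state(const struct step_window *w,
                            uint64_t now_ns, int requested_state)
{
    if (window_live(w, now_ns) && requested_state > w->max_idle_state)
        return w->max_idle_state;
    return requested_state;
}
```

The key property is that both overrides vanish automatically once `expires_ns` passes, so a runtime that stops renewing the window cannot pin the machine at high power.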
5.2 big.LITTLE Placement Needs Much Tighter Semantics
Capacity-aware scheduling is necessary but not sufficient. Agent control threads should not bounce between LITTLE and big cores mid-conversation simply because the instantaneous load picture changed. If a control thread is the sink for network and tool completions, it should stay on a home island until the active step window decays, unless there is a compelling thermal or contention reason to move it.
5.3 The Deterministic Scheduling Embodiment
For a more aggressive ARM design, I would create a deterministic scheduling mode for one or more orchestration cores:
```
per-cluster orchestration core:
    reserve a fixed micro-timeslice every T microseconds
    admit only latency-marked control threads into that slot
    align IRQ steering for marked sockets / queues to the same cluster
    hold minimum freq during slot-bearing active windows
    disallow deep idle while slot reservation is live
```
This matters because it converts an otherwise probabilistic wakeup path into a partially time-bounded one. NVIDIA-style host orchestration usually assumes the CPU side is a best-effort feeder for the GPU. A deterministic orchestration slot says the opposite: for agent systems, host-side control latency is part of the product, so we reserve scheduling structure for it.
5.4 cpuidle Should Be Completion-Cadence Aware
The idle-state question is not “how idle is the CPU in aggregate?” but “what is the recent cadence of completions for this latency-marked flow?” If the same control thread has been reactivated every 200 microseconds to 3 milliseconds for the last several steps, deep idle entry is probably a mistake even if aggregate CPU utilization remains low.
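A cadence-aware gate could look like the following sketch. The ring size, the 4x exit-latency safety margin, and the use of the minimum recent gap are all assumed policy choices, not anything the kernel implements today:

```c
#include <stdbool.h>
#include <stdint.h>

#define CADENCE_SLOTS 8

/* Recent inter-completion gaps for one latency-marked flow. */
struct cadence {
    uint64_t gaps_ns[CADENCE_SLOTS];
    unsigned int next;
    unsigned int count;
};

static void cadence_record(struct cadence *c, uint64_t gap_ns)
{
    c->gaps_ns[c->next] = gap_ns;
    c->next = (c->next + 1) % CADENCE_SLOTS;
    if (c->count < CADENCE_SLOTS)
        c->count++;
}

/* Allow deep idle only if even the shortest recent gap comfortably exceeds
 * the state's exit latency (4x margin, an assumed safety factor). */
static bool deep_idle_ok(const struct cadence *c, uint64_t exit_latency_ns)
{
    if (c->count == 0)
        return true;        /* no history: defer to the normal governor */

    uint64_t min_gap = UINT64_MAX;
    for (unsigned int i = 0; i < c->count; i++)
        if (c->gaps_ns[i] < min_gap)
            min_gap = c->gaps_ns[i];
    return min_gap > 4 * exit_latency_ns;
}
```

Using the minimum rather than the mean gap is deliberate: one early completion paying a full idle-exit penalty is exactly the tail event this section is trying to avoid.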
6. x86: The Main Enemy Is Topology and Noise
x86 systems usually do not fail because they lack raw speed. They fail because large topologies create too many ways to waste locality. Multiple NUMA nodes, large LLC slices, SMT siblings, broad interrupt spray, background daemons, remote page cache, shootdowns, and housekeeping traffic can make a monster server behave worse than a smaller machine for interactive orchestration.
6.1 NUMA Locality Must Include the Control Plane
Most NUMA guidance is built around throughput threads. Agent runtimes need a different policy: keep the orchestration thread, its frequently touched metadata, its page cache working set, its socket buffers, and its short-lived tool I/O state on the same NUMA home. The thread itself may be light; the penalty of remoteness is not.
```
control-plane NUMA policy:
    choose home node for agent runtime
    allocate hot metadata on home node
    prefer wakeup CPUs in home node
    steer IRQs for relevant queues to home node
    keep page cache readahead and socket memory local when possible
```
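The "choose home node" step can be driven by observed completion locality rather than static configuration. A sketch, assuming the caller supplies per-CPU completion counters and a CPU-to-node map (this is pure selection logic, not libnuma usage):

```c
#include <stddef.h>

#define MAX_NODES 64    /* assumed bound for the sketch */

/* Pick the NUMA home for the control plane as the node that has absorbed
 * the most recent completions. Illustrative policy, not a kernel API. */
static int pick_home_node(const unsigned long *completions_per_cpu,
                          const int *cpu_to_node,
                          size_t ncpus, int nnodes)
{
    unsigned long mass[MAX_NODES] = {0};

    if (nnodes > MAX_NODES)
        return -1;
    for (size_t cpu = 0; cpu < ncpus; cpu++)
        mass[cpu_to_node[cpu]] += completions_per_cpu[cpu];

    int best = 0;
    for (int n = 1; n < nnodes; n++)
        if (mass[n] > mass[best])
            best = n;
    return best;
}
```

Once a home is chosen, the remaining bullets in the policy are enforcement: allocate, wake, and steer toward that node until the evidence changes.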
6.2 TLB Shootdowns Are an Underappreciated Tax
Agent frameworks often create exactly the kind of memory behavior that produces shootdown noise: short-lived subprocesses, mmap-heavy tools, sandbox remaps, JITs, browser automation sessions, shared-memory buffers, and background memory pressure. On x86, a badly timed TLB shootdown or page-table invalidation can perturb the very control thread you are trying to wake quickly.
I would add step-scoped accounting for:
- tlb_flush events occurring between completion and next issue
- remote IPI traffic attributable to address-space changes in neighboring workloads
- migrations that coincide with page-table refill and cache rewarm events
This is one of those domains where average counters are misleading. You need correlation with agent step boundaries, not just system totals.
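The correlation itself is simple once events carry timestamps. A sketch of the attribution step, assuming a tracer has collected flush timestamps and per-step completion and next-issue times:

```c
#include <stddef.h>
#include <stdint.h>

/* Count TLB-flush events landing inside one step's critical interval
 * [completion_ns, next_issue_ns). Attributing flushes to step boundaries,
 * instead of reading system-wide totals, is the point of the exercise. */
static unsigned int flushes_in_step(const uint64_t *flush_ts, size_t nflush,
                                    uint64_t completion_ns,
                                    uint64_t next_issue_ns)
{
    unsigned int hits = 0;

    for (size_t i = 0; i < nflush; i++)
        if (flush_ts[i] >= completion_ns && flush_ts[i] < next_issue_ns)
            hits++;
    return hits;
}
```

A system-wide flush counter can look benign while every single flush happens to land inside a critical interval; this per-step view is what separates the two cases.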
6.3 IRQ Isolation Is More Important Than Raw Core Count
On x86, I would much rather have four cores with disciplined IRQ isolation, bounded housekeeping, and stable NUMA locality than sixty-four cores with indiscriminate interrupt delivery. Agentic responsiveness collapses when orchestration CPUs share their immediate neighborhood with unrelated packet floods, storage storms, or timer-heavy background services.
My x86 thesis: once the machine is big enough, the main performance problem is coordination complexity, not arithmetic throughput.
7. Memory Management: Keep the Control Plane Warm
Agent runtimes are not purely CPU-latency problems. They are also memory hygiene problems. The control thread often touches a surprisingly rich working set: token state, parser buffers, JSON trees, tool descriptors, socket state, page cache metadata, sandbox bookkeeping, and policy state for what to do next. None of this is individually large. Together it is fragile.
7.1 What I Would Change
Local allocators for control threads
Small hot allocations for orchestration threads should bias toward node-local, CPU-local caches and avoid incidental cross-node refill where possible.
Reclaim guardrails
Background reclaim and compaction should become less willing to perturb pages touched repeatedly inside active step windows, especially on systems with ample free memory.
Page-cache stability
Tool artifacts, prompt templates, policy files, and other repeatedly opened metadata should be protected from pointless churn caused by broader file-cache pressure.
TLB disturbance accounting
Expose per-cgroup or per-task counters for TLB flushes and page-table disruption during marked orchestration windows.
This is also where future CXL and heterogeneous-memory systems become interesting. Even without redesigning everything around new memory tiers, the kernel should know that some agent state is latency-critical control state and should stay close to CPU caches and low-latency memory.
8. I/O Runtime: io_uring Is Useful, But It Still Needs Kernel Help
Many people see io_uring and assume the problem is solved. It is not. io_uring reduces some userspace/kernel transition overhead and enables efficient async composition, but the completion path still runs through the same underlying locality and scheduling machinery. A CQE delivered after the wrong batching decision or consumed on the wrong CPU is still late.
8.1 What I Would Want from io_uring for Agents
- A latency mode that prioritizes timely CQE visibility over aggressive completion batching.
- Better affinity coupling between submission-side control threads and completion-side wakeups.
- Step-aware observability joining CQE timestamps to scheduler, cpuidle, and cpufreq events.
- Cleaner policy hooks so runtimes can say “this queue belongs to a step-critical orchestration loop.”
```
desired derived metrics:
    cqe_visible_ns - device_complete_ns
    wakeup_ns - cqe_visible_ns
    oncpu_ns - wakeup_ns
    next_issue_ns - oncpu_ns
    migrations_in_between
    freq_transitions_in_between
    tlb_flushes_in_between
```
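The timestamp differences above reduce to one small derivation step. The record layout is hypothetical, assembled from tracepoints by a hypothetical collector rather than provided by any io_uring uapi:

```c
#include <stdint.h>

/* One step's timestamps; names mirror the derived-metric list above. */
struct step_ts {
    uint64_t device_complete_ns;
    uint64_t cqe_visible_ns;
    uint64_t wakeup_ns;
    uint64_t oncpu_ns;
    uint64_t next_issue_ns;
};

struct step_metrics {
    uint64_t cqe_delay_ns;      /* cqe_visible - device_complete */
    uint64_t wake_delay_ns;     /* wakeup - cqe_visible */
    uint64_t sched_delay_ns;    /* oncpu - wakeup */
    uint64_t userspace_ns;      /* next_issue - oncpu */
};

static struct step_metrics derive(const struct step_ts *t)
{
    struct step_metrics m = {
        .cqe_delay_ns   = t->cqe_visible_ns - t->device_complete_ns,
        .wake_delay_ns  = t->wakeup_ns - t->cqe_visible_ns,
        .sched_delay_ns = t->oncpu_ns - t->wakeup_ns,
        .userspace_ns   = t->next_issue_ns - t->oncpu_ns,
    };
    return m;
}
```

Each component maps to a different owner: `cqe_delay_ns` indicts batching policy, `wake_delay_ns` the completion path, `sched_delay_ns` placement, and `userspace_ns` the runtime itself.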
9. The Measurement Model
Before changing policy, I would build a measurement pipeline that treats the agent step as the unit of analysis. Linux already exposes most of what is needed through tracepoints, BPF, PMU events, and perf infrastructure.
| Metric | Collection source | Why it matters |
|---|---|---|
| completion_to_runnable_ns | IRQ, softirq, blk-mq, wakeup tracepoints | Measures the boundary cost before the task is runnable |
| runnable_to_oncpu_ns | sched_wakeup + sched_switch | Shows scheduler contention or placement mistakes |
| migrations_per_step | sched_switch + task CPU history | Correlates locality loss with step latency |
| freq_drop_between_steps | cpufreq transitions, scheduler utilization | Captures DVFS overreaction on ARM and sometimes x86 |
| deep_idle_entries_during_active_window | cpuidle tracepoints | Explains wakeup penalties in bursty control loops |
| tlb_flushes_per_step | mm/tlb tracepoints, IPIs, perf | Makes memory-management interference visible |
| remote_irq_ratio | IRQ CPU vs task home CPU | Direct signal of affinity failure |
tracepoints I would join:

```
irq_handler_entry / irq_handler_exit
softirq_entry / softirq_exit
napi_poll
blk_mq_complete_request
sched_wakeup
sched_switch
mm_tlb_flush_*
cpu_frequency
cpu_idle
io_uring_cqe or task_work related hooks
```

BPF key:

```
{agent_id, step_id, task_pid, home_cpu, completion_cpu, wake_cpu}
```
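The BPF key above can be stated as a plain struct with an explicit join rule. The layout and the split between identity fields and attribute fields are assumptions of this sketch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Candidate map key for per-step attribution. The CPU fields are recorded
 * attributes of the step, not part of the join identity. */
struct step_key {
    uint32_t agent_id;
    uint32_t step_id;
    uint32_t task_pid;
    int32_t  home_cpu;
    int32_t  completion_cpu;
    int32_t  wake_cpu;
};

/* Two events belong to the same step iff their identity fields match. */
static bool step_key_eq(const struct step_key *a, const struct step_key *b)
{
    return a->agent_id == b->agent_id &&
           a->step_id  == b->step_id  &&
           a->task_pid == b->task_pid;
}
```

Keeping the CPU fields out of the equality test matters: an IRQ event and the resulting wakeup may legitimately record different CPUs, and that disagreement is exactly the locality failure the join is supposed to surface.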
10. A Concrete Kernel Design: Agent Latency Control Plane
The cleanest way to think about all of this is as a tiny cross-subsystem control plane in the kernel. Not a giant new scheduler. Not a magic governor. A small piece of shared state that says: this task or cgroup is in an active orchestration window, here is its home locality, and here is its latency budget.
```c
struct agent_window {
    u64 expires_ns;
    u32 agent_id;
    s16 home_cpu;
    s16 home_node;
    u32 latency_budget_us;
    u16 flags; /* anti_migrate, shallow_idle, dvfs_floor, deterministic_slot */
};
```
Subsystem interactions would look like this:
- Userspace runtime marks a control thread or cgroup and renews an active window when issuing or receiving a step-critical operation.
- IRQ or completion paths consult the active window to bias CPU selection and delivery locality.
- The scheduler prefers the home CPU, completion CPU, or same-LLC CPU while the window remains valid.
- cpufreq and cpuidle apply bounded policy overrides only for the relevant CPUs and only while the window remains active.
- Memory management accounts for TLB disturbance and optionally shields hot control-plane pages from needless perturbation.
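Two of those consumers can be sketched end to end. This restates `struct agent_window` from above in standard C types; the flag bit values and both helper names are assumptions of the sketch, not proposed uapi:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag encodings for the flags field described above. */
#define AW_ANTI_MIGRATE       (1u << 0)
#define AW_SHALLOW_IDLE       (1u << 1)
#define AW_DVFS_FLOOR         (1u << 2)
#define AW_DETERMINISTIC_SLOT (1u << 3)

struct agent_window {
    uint64_t expires_ns;
    uint32_t agent_id;
    int16_t  home_cpu;
    int16_t  home_node;
    uint32_t latency_budget_us;
    uint16_t flags;
};

/* Scheduler-side consult: veto a load-balance migration while the window
 * is live and the anti-migrate flag is set. */
static bool window_vetoes_migration(const struct agent_window *w,
                                    uint64_t now_ns)
{
    return now_ns < w->expires_ns && (w->flags & AW_ANTI_MIGRATE);
}

/* Wakeup-side consult: prefer the home CPU while the window is live. */
static int window_preferred_cpu(const struct agent_window *w,
                                uint64_t now_ns, int fallback_cpu)
{
    if (now_ns < w->expires_ns && w->home_cpu >= 0)
        return w->home_cpu;
    return fallback_cpu;
}
```

Every consumer degrades to existing behavior the moment the window expires, which is what keeps this a hint layer rather than a new scheduling class.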
11. If I Had to Prioritize the Patch Queue
I would not start by changing everything. I would stage the work in a way that makes proof easy.
| Phase | Primary subsystem | Deliverable | Proof of value |
|---|---|---|---|
| 1 | Observability | BPF step tracer joining completion, wakeup, frequency, idle, and TLB events | Tail-latency decomposition per agent step |
| 2 | Scheduler | Home CPU + anti-migration window for marked threads | Lower runnable_to_oncpu and fewer migrations per step |
| 3 | IRQ / completion | Affinity propagation for marked queues and sockets | Lower completion_to_runnable and remote_irq_ratio |
| 4 | ARM power path | Windowed uclamp / schedutil floor + shallow idle bias | Reduced freq_drop_between_steps and idle-exit debt |
| 5 | x86 topology hygiene | NUMA-home control plane + IRQ / housekeeping isolation | Lower remote memory and interrupt noise for orchestration cores |
| 6 | Memory | TLB disturbance accounting and control-plane reclaim guardrails | Reduced long-tail stalls correlated with mm activity |
12. Architecture-Specific Bottom Line
ARM
- Introduce short-lived active step windows integrated with schedutil and cpuidle.
- Keep orchestration threads on a stable big-core or stable cluster home until the window expires.
- Route IRQs and completion work to the same locality island.
- Consider deterministic orchestration slots for the most latency-sensitive deployments.
x86
- Choose a NUMA home for the control plane and keep wakeups local.
- Isolate agent-critical CPUs from indiscriminate interrupt spray and housekeeping.
- Measure and reduce TLB and page-table disturbance within step windows.
- Treat topology discipline as more important than raw core count.
13. Final View
If you ask me what part of the ARM or x86 kernel stack I would change first to boost agentic performance, the answer is not “the model driver” and not “some generic AI acceleration path.” It is the part of Linux that decides what happens between one step and the next. That means wakeup placement, completion locality, power-state continuity, and memory stability.
On ARM, the real bug is often that the machine keeps trying to save power in the middle of an active reasoning conversation. On x86, the real bug is often that the machine has so much topology and background noise that it cannot preserve locality for a tiny but critical control thread. In both cases, the winning design is the same: teach the kernel that short orchestration gaps are not dead time. They are active time with deferred visibility.
The short version: if I had to redesign only three surfaces, I would redesign scheduler wakeups, IRQ-to-userspace completion, and memory-stability accounting, then attach bounded power policy on top. That is where the real step-function gains are.