Abstract
Agentic AI workloads are a pathological case for modern operating systems: short bursts of CPU work, dense syscall activity, frequent file metadata probes, mmap-backed index loads, small random reads, and constant trace/log writes.
On ARM systems, the usual power-management strategy—aggressively saving energy through frequency scaling, big/little placement, deeper idle states, memory power-down, and NVMe low-power transitions—can turn into a latency amplifier. A 200 microsecond frequency ramp, a 500 microsecond idle exit, or a 3 millisecond NVMe wakeup looks small in isolation. Repeated hundreds or thousands of times inside a reasoning/tool loop, it becomes visible user-facing latency.
1. The Agent Loop Is a Kernel Loop
An agent step usually looks like application logic, but the hot path repeatedly crosses kernel boundaries: metadata probes (stat/openat) before tool invocations, mmap faults while loading indexes, small random reads for retrieval, and write calls for traces and logs.
The kernel sees a workload with many short-lived bursts, frequent wakeups, small I/O, and little steady-state utilization. Traditional throughput metrics can look fine while the interactive step latency is poor.
2. Why ARM Power Management Is Central
ARM platforms are attractive for agentic infrastructure because they offer strong perf-per-watt and scalable core counts. But the same features that make ARM efficient can create latency variance:
| Mechanism | What it optimizes | Agentic risk |
|---|---|---|
| DVFS / cpufreq | Average energy under varying CPU utilization | Frequency ramp delay after short idle gaps |
| cpuidle | Deeper low-power idle states | Wakeup latency inside request loops |
| Energy-aware scheduling | Energy-efficient task placement | Latency-critical thread placed on a slower core or migrated at a bad time |
| Memory power states | DRAM/LPDDR power reduction | First-touch latency after idle |
| NVMe APST | Controller/flash power reduction | Millisecond-scale wakeups before small reads |
Linux Energy Aware Scheduling explicitly uses a CPU energy model to guide task placement with minimal throughput impact. That is the right general objective, but agentic loops need a second objective: latency stability for the next small step.
3. Latency Amplification Model
Agentic execution multiplies small latencies:
step_latency = scheduler_wakeup
             + cpufreq_ramp
             + idle_exit
             + syscall_path
             + page_faults
             + storage_read
             + parse_time

total_latency = Σ step_latency over N agent steps
With 1,000 steps, a seemingly harmless 250 microsecond wake/ramp penalty contributes 250 milliseconds. Add occasional NVMe APST wakeups and mmap fault storms, and the tail becomes seconds.
4. Kernel Patch Sketches
These are intentionally framed as patch sketches, not drop-in upstream-ready code. The objective is to show where the kernel interface would need to change.
Patch A: latency-critical agent task hint
Add a sched attribute flag that marks a thread as latency-sensitive but not real-time. This avoids abusing SCHED_FIFO or SCHED_DEADLINE for application loops that simply need predictable wakeups.
/* include/uapi/linux/sched/types.h */
#define SCHED_FLAG_AGENT_LATENCY 0x00000100

struct sched_attr {
    __u32 size;
    __u32 sched_policy;
    __u64 sched_flags;
    ...
};

/* kernel/sched/core.c: sketch */
static inline bool task_agent_latency(const struct task_struct *p)
{
    return p->sched_flags & SCHED_FLAG_AGENT_LATENCY;
}

/* Prefer local wakeup, avoid unnecessary migration, bias against deep-idle CPUs. */
if (task_agent_latency(p)) {
    target = select_latency_stable_cpu(p, prev_cpu);
    if (target >= 0)
        return target;
}
This is not a replacement for real-time scheduling. It is a hint: “this loop is harmed by wakeup variance; keep it on a warm, suitable CPU when possible.”
Patch B: cpuidle governor guardrail for active agent windows
For short bursty loops, the cpuidle governor may choose deep states because the average idle period looks long enough. The patch below introduces a per-CPU latency guardrail during an active agent window.
/* drivers/cpuidle/governors/menu.c: sketch */
static int agent_limit_idle_state(struct cpuidle_driver *drv, int idx)
{
    u64 max_exit_ns = this_cpu_read(agent_idle_exit_budget_ns);

    if (!max_exit_ns)
        return idx;
    while (idx > 0 && drv->states[idx].exit_latency_ns > max_exit_ns)
        idx--;
    return idx;
}
The key is selectivity: keep the machine efficient normally, but avoid deep idle exits while the agent runtime is inside a latency-critical interaction window.
Patch C: cpufreq boost-on-loop for bursty short work
The schedutil governor already uses scheduler utilization signals. Agentic loops need an additional signal: recent repeated wakeups from the same latency-tagged control loop.
/* kernel/sched/cpufreq_schedutil.c: sketch */
if (task_agent_latency(current) && agent_loop_recent(current)) {
    next_f = max(next_f, policy->cpuinfo.max_freq * 80 / 100);
    sugov_update_commit(sg_policy, time, next_f);
}
The goal is not “run hot forever.” The goal is to pre-warm frequency during a loop that is known to issue another short step soon.
Patch D: agent-aware page cache protection
Agent runtimes often have a small hot set: prompt templates, tool schemas, vector index headers, embedding shards, repo metadata, and active working files. This hot set competes with logs and transient reads.
/* include/uapi/asm-generic/mman-common.h */
#define MADV_AGENT_HOTSET 104 /* protect against transient eviction */
#define MADV_AGENT_STREAM 105 /* do not pollute hot cache */

/* mm/madvise.c: sketch */
case MADV_AGENT_HOTSET:
    mark_vma_agent_hotset(vma);
    break;
case MADV_AGENT_STREAM:
    mark_vma_agent_stream(vma);
    break;
This creates a vocabulary for the kernel: some memory is loop-critical; some memory is transient and should not evict the loop-critical set.
Patch E: latency-first small I/O scheduling in blk-mq
blk-mq is designed to exploit parallelism in modern storage devices. Agent workloads often issue bursts of small reads where tail latency matters more than bulk throughput.
/* block/blk-mq.c: sketch */
static inline bool rq_agent_small_read(struct request *rq)
{
    return (rq->cmd_flags & REQ_AGENT_LATENCY) &&
           rq_data_dir(rq) == READ &&
           blk_rq_bytes(rq) <= 64 * 1024;
}

if (rq_agent_small_read(rq))
    blk_mq_insert_latency_head(rq);
else
    blk_mq_insert_normal(rq);
This does not mean starving background work. It means the block layer should recognize bursty, request-path small reads as different from bulk scans or log writes.
Patch F: NVMe APST latency budget for agent processes
NVMe Autonomous Power State Transitions (APST) are useful, but a drive waking from a low-power state can dominate a small read. A latency-tagged cgroup could temporarily request a tighter device power-state latency budget.
/* drivers/nvme/host/core.c: sketch */
static void nvme_apply_agent_latency_budget(struct nvme_ctrl *ctrl, u32 budget_us)
{
    /* Recompute the APST table to avoid states whose exit latency exceeds the budget. */
    nvme_configure_apst(ctrl, budget_us);
}
The important point is scope. Do not globally disable power saving. Instead, bind device power behavior to active latency windows.
5. eBPF Tooling: Measuring the Path
Before proposing policy, measure the penalties. A minimal toolkit should answer:
- How often did the agent thread wake up, and what was wakeup latency?
- How many major/minor page faults occurred during each step?
- Which files caused mmap faults?
- Were small reads delayed in blk-mq?
- Did NVMe low-power transitions correlate with read latency?
Example: bpftrace wakeup latency probe
#!/usr/bin/env bpftrace
// Filter on the wakee's comm (args->comm): in sched_wakeup context the
// builtin comm is the *waker*, which would measure the wrong thread.
tracepoint:sched:sched_wakeup
/str(args->comm) == "agent-runtime"/
{
    @wake[args->pid] = nsecs;
}

tracepoint:sched:sched_switch
/@wake[args->next_pid]/
{
    @wakeup_us = hist((nsecs - @wake[args->next_pid]) / 1000);
    delete(@wake[args->next_pid]);
}
Example: page fault attribution
#!/usr/bin/env bpftrace
// exceptions:page_fault_user exists only on x86; a kprobe on
// handle_mm_fault covers arm64 as well.
kprobe:handle_mm_fault
/comm == "agent-runtime"/
{
    @faults[pid] = count();
}

interval:s:5
{
    print(@faults);
    clear(@faults);
}
Example: block I/O latency for small reads
#!/usr/bin/env bpftrace
tracepoint:block:block_rq_issue
/args->bytes <= 65536/
{
    @start[args->dev, args->sector] = nsecs;
}

tracepoint:block:block_rq_complete
/@start[args->dev, args->sector]/
{
    $lat_us = (nsecs - @start[args->dev, args->sector]) / 1000;
    @small_read_us = hist($lat_us);
    delete(@start[args->dev, args->sector]);
}
Example: CPU frequency and idle-state trace
#!/usr/bin/env bpftrace
tracepoint:power:cpu_frequency
{
    @freq[args->cpu_id] = args->state;
}

tracepoint:power:cpu_idle
{
    @idle_state[args->cpu_id, args->state] = count();
}
6. Proposed Architecture: Agent-Aware Power Policy
The runtime should not micromanage every kernel subsystem. Instead, it should communicate intent: “for the next interaction window, this process is latency-critical; keep the execution path warm.”
/* userspace sketch */
struct agent_latency_window w = {
    .step_id = step_id,
    .duration_us = 250000,
    .max_idle_exit_us = 50,
    .min_cpu_capacity_pct = 80,
    .nvme_exit_latency_us = 100,
};

ioctl(agent_fd, AGENT_IOC_BEGIN_LATENCY_WINDOW, &w);
run_agent_step();
ioctl(agent_fd, AGENT_IOC_END_LATENCY_WINDOW, &w);
The policy must be bounded and auditable. Otherwise every application will call itself latency-critical and destroy system-wide efficiency.
7. Control Plane: Cgroup-Budgeted Agent Latency
The weakest version of an agent-aware kernel is one where every process can simply declare itself latency-critical. That design fails immediately: it invites hint abuse, burns power, and creates unfairness across tenants. A deployable design needs a budgeted control plane.
The kernel should treat agent latency hints like a scarce resource: granted per cgroup, refilled over time, automatically expired, and visible through tracepoints.
/* include/linux/agent_latency.h: sketch */
struct agent_latency_budget {
    u64 max_tokens_ns;   /* max low-latency privilege */
    u64 tokens_ns;       /* current budget */
    u64 refill_rate_ns;  /* refill per accounting period */
    u64 last_refill_ns;
};

/* kernel/sched/core.c: sketch */
static bool agent_latency_try_charge(struct task_struct *p, u64 runtime_ns)
{
    struct agent_latency_budget *b = task_agent_budget(p);

    agent_latency_refill(b, ktime_get_ns());
    if (b->tokens_ns < runtime_ns)
        return false;
    b->tokens_ns -= runtime_ns;
    trace_agent_budget_charge(task_pid_nr(p), runtime_ns, b->tokens_ns);
    return true;
}

if (task_agent_latency(p)) {
    if (!agent_latency_try_charge(p, predicted_window_ns))
        clear_agent_latency_effective(p); /* hint becomes advisory only */
}
This design also creates a natural user-space contract: an agent runtime may ask for a latency window, but the kernel decides how much of that request is honored based on cgroup policy, thermal pressure, and system fairness.
8. Memory Hierarchy: Hotsets, SLC, and Memory-Side Cache Pressure
Agent workloads are not merely CPU-bound. They are memory-hierarchy sensitive. A small active set—tool schemas, prompt templates, vector index headers, retrieval metadata, repository maps, tokenizer tables, and active documents—may be touched repeatedly across steps. But the kernel usually sees only anonymous pages, file-backed mappings, and page-cache pressure.
On ARM systems with a shared last-level cache (LLC), a System Level Cache (SLC), or a memory-side cache, preserving the agent hotset can matter as much as raw CPU frequency. The wrong eviction decision converts a tight loop into mmap faults, cache refills, and storage reads.
/* include/uapi/asm-generic/mman-common.h: sketch */
#define MADV_AGENT_HOTSET 104 /* protect loop-critical data */
#define MADV_AGENT_STREAM 105 /* transient reads; avoid cache pollution */
#define MADV_AGENT_INDEX  106 /* mmap-backed index with sparse reuse */

/* mm/vmscan.c: sketch */
static bool folio_agent_protected(struct folio *folio)
{
    struct vm_area_struct *vma = folio_mapped_vma(folio);

    return vma && (vma->vm_flags & VM_AGENT_HOTSET) &&
           memcg_agent_budget_available(vma->vm_mm);
}

if (folio_agent_protected(folio))
    scan_resistance += AGENT_HOTSET_BONUS;
The policy should remain conservative. A hotset hint should not make pages unevictable; it should bias reclaim and readahead while the cgroup has latency budget remaining.
9. Beyond the Core: NoC / Mesh Interconnect Warmth
Modern ARM server and client SoCs are not just collections of cores. They include mesh fabrics, coherent interconnects, memory controllers, LLC/SLC slices, and device paths whose own power states can add latency. A core may be awake and at frequency while the path to memory or storage has downclocked.
This means the latency window should not stop at cpufreq and cpuidle. For the most latency-sensitive windows, a platform driver may need to translate the same agent signal into interconnect QoS or minimum bandwidth votes.
/* drivers/interconnect/agent_qos.c: sketch */
static int agent_icc_apply_window(struct agent_latency_window *w)
{
    if (!w->active || thermal_pressure_high())
        return 0;
    icc_set_bw(agent_path_cpu_mem,
               w->min_interconnect_avg_bw,
               w->min_interconnect_peak_bw);
    trace_agent_interconnect_vote(w->step_id,
                                  w->min_interconnect_peak_bw);
    return 0;
}

/* Auto-expire when the latency window ends. */
static void agent_icc_end_window(struct agent_latency_window *w)
{
    icc_set_bw(agent_path_cpu_mem, 0, 0);
}
This is where an ARM-specific agentic CPU story becomes interesting: not “disable power management,” but coordinate CPU, idle, memory, storage, and interconnect policies for short, bounded windows.
10. Evaluation Plan
A credible benchmark should not report only average throughput. It should report distributions and attribution:
| Metric | Why it matters | Expected signal |
|---|---|---|
| p50/p95/p99 step latency | User-visible latency lives in the tail | Lower p99 under latency window |
| Scheduler wakeup histogram | Shows whether CPU placement improved | Fewer >500µs wakeups |
| cpuidle state residency | Verifies deep idle avoidance during active loops | Fewer deep state entries only during windows |
| cpufreq ramp events | Shows frequency stability | Reduced low-frequency first-step penalty |
| Small-read p99 | Captures vector/index/document lookup stalls | Lower tail for 4KB–64KB reads |
| Energy per completed task | Prevents performance-only policy from wasting power | Small energy increase for meaningful tail reduction |
11. Risks and Upstream Reality
There are obvious objections:
- Hint abuse: every runtime may mark itself latency-critical.
- Fairness: keeping cores warm for one cgroup can hurt others.
- Energy regression: bad policy can destroy the reason to use efficient CPUs.
- Device diversity: NVMe and ARM platform behavior varies widely.
The answer is to scope policies by cgroup, enforce budgets, expire hints automatically, expose tracepoints, and require benchmarks that include energy per task. The kernel should not blindly trust the application; it should accept intent and enforce policy.
Conclusion
Agentic AI workloads are revealing a systems problem that was easy to miss in batch-centric AI: the latency path is the operating system. On ARM CPUs, power management is a first-class part of that path.
The right goal is not to disable power saving. The right goal is latency-aware efficiency: keep the system efficient across the day, but keep the agent loop warm during the small windows where wakeups, page faults, and storage reads compound into visible delay.
References
- Linux Kernel Documentation: Energy Aware Scheduling.
- Linux Kernel Documentation: CPU Idle Time Management.
- Linux Kernel Documentation: Multi-Queue Block IO Queueing Mechanism.
- Linux manual pages: io_uring registered buffers.
- NVM Express: NVMe Technology Power Management Features.