Abstract
Agentic AI workloads are a pathological case for modern operating systems: short bursts of CPU work, dense syscall activity, frequent file metadata probes, mmap-backed index loads, small random reads, and constant trace/log writes.
On ARM systems, the usual power-management strategy—aggressively saving energy through frequency scaling, big/little placement, deeper idle states, memory power-down, and NVMe low-power transitions—can turn into a latency amplifier. A 200 microsecond frequency ramp, a 500 microsecond idle exit, or a 3 millisecond NVMe wakeup looks small in isolation. Repeated hundreds or thousands of times inside a reasoning/tool loop, it becomes visible user-facing latency.
1. The Agent Loop Is a Kernel Loop
An agent step usually looks like application logic, but the hot path repeatedly crosses kernel boundaries: metadata probes (stat/openat) before tool invocations, mmap faults while loading indexes, small random reads for retrieval, and write calls for traces and logs.
The kernel sees a workload with many short-lived bursts, frequent wakeups, small I/O, and little steady-state utilization. Traditional throughput metrics can look fine while the interactive step latency is poor.
2. Why ARM Power Management Is Central
ARM platforms are attractive for agentic infrastructure because they offer strong perf-per-watt and scalable core counts. But the same features that make ARM efficient can create latency variance:
| Mechanism | What it optimizes | Agentic risk |
|---|---|---|
| DVFS / cpufreq | Average energy under varying CPU utilization | Frequency ramp delay after short idle gaps |
| cpuidle | Deeper low-power idle states | Wakeup latency inside request loops |
| Energy-aware scheduling | Energy-efficient task placement | Latency-critical thread placed on a slower core or migrated at a bad time |
| Memory power states | DRAM/LPDDR power reduction | First-touch latency after idle |
| NVMe APST | Controller/flash power reduction | Millisecond-scale wakeups before small reads |
Linux Energy Aware Scheduling explicitly uses a CPU energy model to guide task placement with minimal throughput impact. That is the right general objective, but agentic loops need a second objective: latency stability for the next small step.
3. Latency Amplification Model
Agentic execution multiplies small latencies:
step_latency = scheduler_wakeup
             + cpufreq_ramp
             + idle_exit
             + syscall_path
             + page_faults
             + storage_read
             + parse_time

total_latency = Σ step_latency over N agent steps
With 1,000 steps, a seemingly harmless 250 microsecond wake/ramp penalty contributes 250 milliseconds. Add occasional NVMe APST wakeups and mmap fault storms, and the tail becomes seconds.
4. Kernel Patch Sketches
These are intentionally framed as patch sketches, not drop-in upstream-ready code. The objective is to show where the kernel interface would need to change.
Patch A: latency-critical agent task hint
Add a sched attribute flag that marks a thread as latency-sensitive but not real-time. This avoids abusing SCHED_FIFO or SCHED_DEADLINE for application loops that simply need predictable wakeups.
/* include/uapi/linux/sched/types.h */
#define SCHED_FLAG_AGENT_LATENCY 0x00000100

struct sched_attr {
    __u32 size;
    __u32 sched_policy;
    __u64 sched_flags;
    ...
};

/* kernel/sched/core.c: sketch */
static inline bool task_agent_latency(const struct task_struct *p)
{
    return p->sched_flags & SCHED_FLAG_AGENT_LATENCY;
}

/* Prefer local wakeup, avoid unnecessary migration, bias against deep-idle CPUs. */
if (task_agent_latency(p)) {
    target = select_latency_stable_cpu(p, prev_cpu);
    if (target >= 0)
        return target;
}
This is not a replacement for real-time scheduling. It is a hint: “this loop is harmed by wakeup variance; keep it on a warm, suitable CPU when possible.”
Patch B: cpuidle governor guardrail for active agent windows
For short bursty loops, the cpuidle governor may choose deep states because the average idle period looks long enough. The patch below introduces a per-CPU latency guardrail during an active agent window.
/* drivers/cpuidle/governors/menu.c: sketch */
static int agent_limit_idle_state(struct cpuidle_driver *drv, int idx)
{
    u64 max_exit_ns = this_cpu_read(agent_idle_exit_budget_ns);

    if (!max_exit_ns)
        return idx;
    while (idx > 0 && drv->states[idx].exit_latency_ns > max_exit_ns)
        idx--;
    return idx;
}
The key is selectivity: keep the machine efficient normally, but avoid deep idle exits while the agent runtime is inside a latency-critical interaction window.
Patch C: cpufreq boost-on-loop for bursty short work
The schedutil governor already uses scheduler utilization signals. Agentic loops need an additional signal: recent repeated wakeups from the same latency-tagged control loop.
/* kernel/sched/cpufreq_schedutil.c: sketch */
if (task_agent_latency(current) && agent_loop_recent(current)) {
    next_f = max(next_f, policy->cpuinfo.max_freq * 80 / 100);
    sugov_update_commit(sg_policy, time, next_f);
}
The goal is not “run hot forever.” The goal is to pre-warm frequency during a loop that is known to issue another short step soon.
Patch D: agent-aware page cache protection
Agent runtimes often have a small hot set: prompt templates, tool schemas, vector index headers, embedding shards, repo metadata, and active working files. This hot set competes with logs and transient reads.
/* include/uapi/asm-generic/mman-common.h */
#define MADV_AGENT_HOTSET 104 /* protect against transient eviction */
#define MADV_AGENT_STREAM 105 /* do not pollute hot cache */

/* mm/madvise.c: sketch */
case MADV_AGENT_HOTSET:
    mark_vma_agent_hotset(vma);
    break;
case MADV_AGENT_STREAM:
    mark_vma_agent_stream(vma);
    break;
This creates a vocabulary for the kernel: some memory is loop-critical; some memory is transient and should not evict the loop-critical set.
Patch E: latency-first small I/O scheduling in blk-mq
blk-mq is designed to exploit parallelism in modern storage devices. Agent workloads often issue bursts of small reads where tail latency matters more than bulk throughput.
/* block/blk-mq.c: sketch */
static inline bool rq_agent_small_read(struct request *rq)
{
    return (rq->cmd_flags & REQ_AGENT_LATENCY) &&
           rq_data_dir(rq) == READ &&
           blk_rq_bytes(rq) <= 64 * 1024;
}

if (rq_agent_small_read(rq))
    blk_mq_insert_latency_head(rq);
else
    blk_mq_insert_normal(rq);
This does not mean starving background work. It means the block layer should recognize bursty, request-path small reads as different from bulk scans or log writes.
Patch F: NVMe APST latency budget for agent processes
NVMe Autonomous Power State Transitions (APST) are useful, but a drive waking from a low-power state can dominate a small read. A latency-tagged cgroup could temporarily request a tighter device power-state latency budget.
/* drivers/nvme/host/core.c: sketch */
static void nvme_apply_agent_latency_budget(struct nvme_ctrl *ctrl, u32 budget_us)
{
    /* Recompute the APST table to avoid states whose exit latency exceeds the budget. */
    nvme_configure_apst(ctrl, budget_us);
}
The important point is scope. Do not globally disable power saving. Instead, bind device power behavior to active latency windows.
5. eBPF Tooling: Measuring the Path
Before proposing policy, measure the penalties. A minimal toolkit should answer:
- How often did the agent thread wake up, and what was wakeup latency?
- How many major/minor page faults occurred during each step?
- Which files caused mmap faults?
- Were small reads delayed in blk-mq?
- Did NVMe low-power transitions correlate with read latency?
Example: bpftrace wakeup latency probe
#!/usr/bin/env bpftrace
// Filter on the wakee's comm (args->comm): in sched_wakeup context the
// builtin comm is the *waker*, which would measure the wrong thread.
tracepoint:sched:sched_wakeup
/str(args->comm) == "agent-runtime"/
{
    @wake[args->pid] = nsecs;
}

tracepoint:sched:sched_switch
/@wake[args->next_pid]/
{
    @wakeup_us = hist((nsecs - @wake[args->next_pid]) / 1000);
    delete(@wake[args->next_pid]);
}
Example: page fault attribution
#!/usr/bin/env bpftrace
// exceptions:page_fault_user exists only on x86; a kprobe on
// handle_mm_fault covers arm64 as well.
kprobe:handle_mm_fault
/comm == "agent-runtime"/
{
    @faults[pid] = count();
}

interval:s:5
{
    print(@faults);
    clear(@faults);
}
Example: block I/O latency for small reads
#!/usr/bin/env bpftrace
tracepoint:block:block_rq_issue
/args->bytes <= 65536/
{
    @start[args->dev, args->sector] = nsecs;
}

tracepoint:block:block_rq_complete
/@start[args->dev, args->sector]/
{
    $lat_us = (nsecs - @start[args->dev, args->sector]) / 1000;
    @small_read_us = hist($lat_us);
    delete(@start[args->dev, args->sector]);
}
Example: CPU frequency and idle-state trace
#!/usr/bin/env bpftrace
tracepoint:power:cpu_frequency
{
    @freq[args->cpu_id] = args->state;
}

tracepoint:power:cpu_idle
{
    @idle_state[args->cpu_id, args->state] = count();
}
6. Proposed Architecture: Agent-Aware Power Policy
The runtime should not micromanage every kernel subsystem. Instead, it should communicate intent: “for the next interaction window, this process is latency-critical; keep the execution path warm.”
/* userspace sketch */
struct agent_latency_window w = {
    .step_id = step_id,
    .duration_us = 250000,
    .max_idle_exit_us = 50,
    .min_cpu_capacity_pct = 80,
    .nvme_exit_latency_us = 100,
};

ioctl(agent_fd, AGENT_IOC_BEGIN_LATENCY_WINDOW, &w);
run_agent_step();
ioctl(agent_fd, AGENT_IOC_END_LATENCY_WINDOW, &w);
The policy must be bounded and auditable. Otherwise every application will call itself latency-critical and destroy system-wide efficiency.
7. Control Plane: Cgroup-Budgeted Agent Latency
The weakest version of an agent-aware kernel is one where every process can simply declare itself latency-critical. That design fails immediately: it invites hint abuse, burns power, and creates unfairness across tenants. A deployable design needs a budgeted control plane.
The kernel should treat agent latency hints like a scarce resource: granted per cgroup, refilled over time, automatically expired, and visible through tracepoints.
/* include/linux/agent_latency.h: sketch */
struct agent_latency_budget {
    u64 max_tokens_ns;   /* max low-latency privilege */
    u64 tokens_ns;       /* current budget */
    u64 refill_rate_ns;  /* refill per accounting period */
    u64 last_refill_ns;
};

/* kernel/sched/core.c: sketch */
static bool agent_latency_try_charge(struct task_struct *p, u64 runtime_ns)
{
    struct agent_latency_budget *b = task_agent_budget(p);

    agent_latency_refill(b, ktime_get_ns());
    if (b->tokens_ns < runtime_ns)
        return false;
    b->tokens_ns -= runtime_ns;
    trace_agent_budget_charge(task_pid_nr(p), runtime_ns, b->tokens_ns);
    return true;
}

if (task_agent_latency(p)) {
    if (!agent_latency_try_charge(p, predicted_window_ns))
        clear_agent_latency_effective(p); /* hint becomes advisory only */
}
This design also creates a natural user-space contract: an agent runtime may ask for a latency window, but the kernel decides how much of that request is honored based on cgroup policy, thermal pressure, and system fairness.
8. Memory Hierarchy: Hotsets, SLC, and Memory-Side Cache Pressure
Agent workloads are not merely CPU-bound. They are memory-hierarchy sensitive. A small active set—tool schemas, prompt templates, vector index headers, retrieval metadata, repository maps, tokenizer tables, and active documents—may be touched repeatedly across steps. But the kernel usually sees only anonymous pages, file-backed mappings, and page-cache pressure.
On ARM systems with a shared last-level cache (LLC), a System Level Cache (SLC), or a memory-side cache, preserving the agent hotset can matter as much as raw CPU frequency. The wrong eviction decision converts a tight loop into mmap faults, cache refills, and storage reads.
/* include/uapi/asm-generic/mman-common.h: sketch */
#define MADV_AGENT_HOTSET 104 /* protect loop-critical data */
#define MADV_AGENT_STREAM 105 /* transient reads; avoid cache pollution */
#define MADV_AGENT_INDEX  106 /* mmap-backed index with sparse reuse */

/* mm/vmscan.c: sketch */
static bool folio_agent_protected(struct folio *folio)
{
    struct vm_area_struct *vma = folio_mapped_vma(folio);

    return vma && (vma->vm_flags & VM_AGENT_HOTSET) &&
           memcg_agent_budget_available(vma->vm_mm);
}

if (folio_agent_protected(folio))
    scan_resistance += AGENT_HOTSET_BONUS;
The policy should remain conservative. A hotset hint should not make pages unevictable; it should bias reclaim and readahead while the cgroup has latency budget remaining.
9. Beyond the Core: NoC / Mesh Interconnect Warmth
Modern ARM server and client SoCs are not just collections of cores. They include mesh fabrics, coherent interconnects, memory controllers, LLC/SLC slices, and device paths whose own power states can add latency. A core may be awake and at frequency while the path to memory or storage has downclocked.
This means the latency window should not stop at cpufreq and cpuidle. For the most latency-sensitive windows, a platform driver may need to translate the same agent signal into interconnect QoS or minimum bandwidth votes.
/* drivers/interconnect/agent_qos.c: sketch */
static int agent_icc_apply_window(struct agent_latency_window *w)
{
    if (!w->active || thermal_pressure_high())
        return 0;
    icc_set_bw(agent_path_cpu_mem,
               w->min_interconnect_avg_bw,
               w->min_interconnect_peak_bw);
    trace_agent_interconnect_vote(w->step_id,
                                  w->min_interconnect_peak_bw);
    return 0;
}

/* Auto-expire when the latency window ends. */
static void agent_icc_end_window(struct agent_latency_window *w)
{
    icc_set_bw(agent_path_cpu_mem, 0, 0);
}
This is where an ARM-specific agentic CPU story becomes interesting: not “disable power management,” but coordinate CPU, idle, memory, storage, and interconnect policies for short, bounded windows.
10. Evaluation Plan
A credible benchmark should not report only average throughput. It should report distributions and attribution:
| Metric | Why it matters | Expected signal |
|---|---|---|
| p50/p95/p99 step latency | User-visible latency lives in the tail | Lower p99 under latency window |
| Scheduler wakeup histogram | Shows whether CPU placement improved | Fewer >500µs wakeups |
| cpuidle state residency | Verifies deep idle avoidance during active loops | Fewer deep state entries only during windows |
| cpufreq ramp events | Shows frequency stability | Reduced low-frequency first-step penalty |
| Small-read p99 | Captures vector/index/document lookup stalls | Lower tail for 4KB–64KB reads |
| Energy per completed task | Prevents performance-only policy from wasting power | Small energy increase for meaningful tail reduction |
11. Risks and Upstream Reality
There are obvious objections:
- Hint abuse: every runtime may mark itself latency-critical.
- Fairness: keeping cores warm for one cgroup can hurt others.
- Energy regression: bad policy can destroy the reason to use efficient CPUs.
- Device diversity: NVMe and ARM platform behavior varies widely.
The answer is to scope policies by cgroup, enforce budgets, expire hints automatically, expose tracepoints, and require benchmarks that include energy per task. The kernel should not blindly trust the application; it should accept intent and enforce policy.
Conclusion
Agentic AI workloads are revealing a systems problem that was easy to miss in batch-centric AI: the latency path is the operating system. On ARM CPUs, power management is a first-class part of that path.
The right goal is not to disable power saving. The right goal is latency-aware efficiency: keep the system efficient across the day, but keep the agent loop warm during the small windows where wakeups, page faults, and storage reads compound into visible delay.
References
- Linux Kernel Documentation: Energy Aware Scheduling.
- Linux Kernel Documentation: CPU Idle Time Management.
- Linux Kernel Documentation: Multi-Queue Block IO Queueing Mechanism.
- Linux manual pages: io_uring registered buffers.
- NVM Express: NVMe Technology Power Management Features.