USENIX-style systems note · ARM Linux · Agentic workloads

Kernel Power Management for Agentic AI Workloads on ARM CPUs

Agentic systems do not just need efficient cores. They need latency-stable cores, predictable memory behavior, and storage devices that do not disappear into low-power states between tiny bursts of work.

14 min read · CPU scheduling · DVFS / cpufreq · cpuidle · page cache · NVMe APST · eBPF tracing · cgroup budgets · NoC QoS

Abstract

Agentic AI workloads are a pathological case for modern operating systems: short bursts of CPU work, dense syscall activity, frequent file metadata probes, mmap-backed index loads, small random reads, and constant trace/log writes.

On ARM systems, the usual power-management strategy—aggressively saving energy through frequency scaling, big/little placement, deeper idle states, memory power-down, and NVMe low-power transitions—can turn into a latency amplifier. A 200 microsecond frequency ramp, a 500 microsecond idle exit, or a 3 millisecond NVMe wakeup looks small in isolation. Repeated hundreds or thousands of times inside a reasoning/tool loop, it becomes visible user-facing latency.

Thesis: for agentic AI, power management is not a background efficiency feature. It is part of the request latency path.

1. The Agent Loop Is a Kernel Loop

An agent step usually looks like application logic, but the hot path repeatedly crosses kernel boundaries:

Agent step → kernel subsystems → latency returned to loop: reason (plan / tool choice) → syscalls (open/read/stat) → memory (fault/cache) → storage (small I/O) → parse (JSON/output) → repeat hundreds to thousands of times.
Figure 1. Agentic workloads repeatedly cross VFS, memory-management, scheduler, and block-layer paths. The latency of each crossing is amplified by loop count.

The kernel sees a workload with many short-lived bursts, frequent wakeups, small I/O, and little steady-state utilization. Traditional throughput metrics can look fine while the interactive step latency is poor.

2. Why ARM Power Management Is Central

ARM platforms are attractive for agentic infrastructure because they offer strong perf-per-watt and scalable core counts. But the same features that make ARM efficient can create latency variance:

Mechanism | What it optimizes | Agentic risk
DVFS / cpufreq | Average energy under varying CPU utilization | Frequency ramp delay after short idle gaps
cpuidle | Deeper low-power idle states | Wakeup latency inside request loops
Energy-aware scheduling | Energy-efficient task placement | Latency-critical thread placed on a slower core or migrated at a bad time
Memory power states | DRAM/LPDDR power reduction | First-touch latency after idle
NVMe APST | Controller/flash power reduction | Millisecond-scale wakeups before small reads

Linux Energy Aware Scheduling explicitly uses a CPU energy model to guide task placement with minimal throughput impact. That is the right general objective, but agentic loops need a second objective: latency stability for the next small step.

Design tension: low average utilization makes the system look idle, but the next agent step may be latency-critical.

3. Latency Amplification Model

Agentic execution multiplies small latencies:

step_latency = scheduler_wakeup
             + cpufreq_ramp
             + idle_exit
             + syscall_path
             + page_faults
             + storage_read
             + parse_time

total_latency = Σ step_latency over N agent steps

With 1,000 steps, a seemingly harmless 250 microsecond wake/ramp penalty contributes 250 milliseconds. Add occasional NVMe APST wakeups and mmap fault storms, and the tail becomes seconds.

Small penalties become visible latency after loop amplification: (250 µs wakeup + 300 µs DVFS + 600 µs faults + 2 ms SSD wake) × 1,000 steps turns micro-latency into user-visible delay.
Figure 2. The agentic workload converts power-management and I/O micro-latencies into visible end-to-end delay.

4. Kernel Patch Sketches

These are intentionally framed as patch sketches, not drop-in upstream-ready code. The objective is to show where the kernel interface would need to change.

Patch A: latency-critical agent task hint

Add a sched attribute flag that marks a thread as latency-sensitive but not real-time. This avoids abusing SCHED_FIFO or SCHED_DEADLINE for application loops that simply need predictable wakeups.

/* include/uapi/linux/sched/types.h */
#define SCHED_FLAG_AGENT_LATENCY   0x00000100

struct sched_attr {
        __u32 size;
        __u32 sched_policy;
        __u64 sched_flags;
        ...
};
/* kernel/sched/core.c: sketch. Upstream does not keep sched_attr flags on
 * task_struct, so this assumes the flag is copied into a new field
 * (here p->agent_sched_flags) at sched_setattr() time. */
static inline bool task_agent_latency(const struct task_struct *p)
{
        return p->agent_sched_flags & SCHED_FLAG_AGENT_LATENCY;
}

/* Prefer local wakeup, avoid unnecessary migration, bias against deep idle CPUs. */
if (task_agent_latency(p)) {
        target = select_latency_stable_cpu(p, prev_cpu);
        if (target >= 0)
                return target;
}

This is not a replacement for real-time scheduling. It is a hint: “this loop is harmed by wakeup variance; keep it on a warm, suitable CPU when possible.”

Patch B: cpuidle governor guardrail for active agent windows

For short bursty loops, the cpuidle governor may choose deep states because the average idle period looks long enough. The patch below introduces a per-CPU latency guardrail during an active agent window.

/* drivers/cpuidle/governors/menu.c: sketch */
static int agent_limit_idle_state(struct cpuidle_driver *drv, int idx)
{
        u64 max_exit_ns = this_cpu_read(agent_idle_exit_budget_ns);

        if (!max_exit_ns)
                return idx;

        while (idx > 0 && drv->states[idx].exit_latency_ns > max_exit_ns)
                idx--;

        return idx;
}

The key is selectivity: keep the machine efficient normally, but avoid deep idle exits while the agent runtime is inside a latency-critical interaction window.

Patch C: cpufreq boost-on-loop for bursty short work

The schedutil governor already uses scheduler utilization signals. Agentic loops need an additional signal: recent repeated wakeups from the same latency-tagged control loop.

/* kernel/sched/cpufreq_schedutil.c: sketch */
if (task_agent_latency(current) && agent_loop_recent(current)) {
        next_f = max(next_f, policy->cpuinfo.max_freq * 80 / 100);
        sugov_update_commit(sg_policy, time, next_f);
}

The goal is not “run hot forever.” The goal is to pre-warm frequency during a loop that is known to issue another short step soon.

Patch D: agent-aware page cache protection

Agent runtimes often have a small hot set: prompt templates, tool schemas, vector index headers, embedding shards, repo metadata, and active working files. This hot set competes with logs and transient reads.

/* include/uapi/asm-generic/mman-common.h */
#define MADV_AGENT_HOTSET  104   /* protect against transient eviction */
#define MADV_AGENT_STREAM  105   /* do not pollute hot cache */
/* mm/madvise.c: sketch */
case MADV_AGENT_HOTSET:
        mark_vma_agent_hotset(vma);
        break;
case MADV_AGENT_STREAM:
        mark_vma_agent_stream(vma);
        break;

This creates a vocabulary for the kernel: some memory is loop-critical; some memory is transient and should not evict the loop-critical set.

Patch E: latency-first small I/O scheduling in blk-mq

blk-mq is designed to exploit parallelism in modern storage devices. Agent workloads often issue bursts of small reads where tail latency matters more than bulk throughput.

/* block/blk-mq.c: sketch */
static inline bool rq_agent_small_read(struct request *rq)
{
        return (rq->cmd_flags & REQ_AGENT_LATENCY) &&
               rq_data_dir(rq) == READ &&
               blk_rq_bytes(rq) <= 64 * 1024;
}

if (rq_agent_small_read(rq))
        blk_mq_insert_latency_head(rq);
else
        blk_mq_insert_normal(rq);

This does not mean starving background work. It means the block layer should recognize bursty, request-path small reads as different from bulk scans or log writes.

Patch F: NVMe APST latency budget for agent processes

NVMe Autonomous Power State Transition (APST) is useful, but a drive waking from a low-power state can dominate a small read. A latency-tagged cgroup could temporarily request a tighter device power-state latency budget.

/* drivers/nvme/host/core.c: sketch */
static void nvme_apply_agent_latency_budget(struct nvme_ctrl *ctrl, u32 budget_us)
{
        /* Recompute APST table to avoid states whose exit latency exceeds budget. */
        nvme_configure_apst(ctrl, budget_us);
}

The important point is scope. Do not globally disable power saving. Instead, bind device power behavior to active latency windows.

5. eBPF Tooling: Measuring the Path

Before proposing policy, measure the penalties. A minimal toolkit should answer:

  • How often did the agent thread wake up, and what was wakeup latency?
  • How many major/minor page faults occurred during each step?
  • Which files caused mmap faults?
  • Were small reads delayed in blk-mq?
  • Did NVMe low-power transitions correlate with read latency?
eBPF attribution pipeline: sched (wakeup latency), mm (fault latency), block (small-I/O p99), power (freq / idle state) → per-step timeline. Attach to tracepoints: sched_switch, sched_wakeup, mm_filemap_fault, block_rq_issue, block_rq_complete, power/cpu_frequency, power/cpu_idle.
Figure 3. The practical first step is attribution: tie scheduler, memory, block, and power events to an agent step timeline.

Example: bpftrace wakeup latency probe

#!/usr/bin/env bpftrace

/* sched_wakeup's comm/pid fields describe the task being woken,
   so filter on args->comm, not the waker's comm builtin. */
tracepoint:sched:sched_wakeup
/args->comm == "agent-runtime"/
{
  @wake[args->pid] = nsecs;
}

tracepoint:sched:sched_switch
/@wake[args->next_pid]/
{
  @wakeup_us = hist((nsecs - @wake[args->next_pid]) / 1000);
  delete(@wake[args->next_pid]);
}

Example: page fault attribution

#!/usr/bin/env bpftrace

/* exceptions:page_fault_user exists only on x86; kprobing
   handle_mm_fault covers arm64 (and all fault types) as well. */
kprobe:handle_mm_fault
/comm == "agent-runtime"/
{
  @faults[pid] = count();
}

interval:s:5
{
  print(@faults);
  clear(@faults);
}

Example: block I/O latency for small reads

#!/usr/bin/env bpftrace

tracepoint:block:block_rq_issue
/args->bytes <= 65536/
{
  @start[args->dev, args->sector] = nsecs;
}

tracepoint:block:block_rq_complete
/@start[args->dev, args->sector]/
{
  $lat_us = (nsecs - @start[args->dev, args->sector]) / 1000;
  @small_read_us = hist($lat_us);
  delete(@start[args->dev, args->sector]);
}

Example: CPU frequency and idle-state trace

#!/usr/bin/env bpftrace

tracepoint:power:cpu_frequency
{
  @freq[args->cpu_id] = args->state;
}

tracepoint:power:cpu_idle
{
  @idle_state[args->cpu_id, args->state] = count();
}

6. Proposed Architecture: Agent-Aware Power Policy

The runtime should not micromanage every kernel subsystem. Instead, it should communicate intent: “for the next interaction window, this process is latency-critical; keep the execution path warm.”

Agent-aware power policy: the agent runtime supplies a step_id, a latency window, hotset hints, and small-I/O hints; a kernel policy broker validates hint scope, applies the cgroup budget, emits trace events, and expires the window automatically, fanning out to the scheduler, cpuidle, cpufreq, and NVMe / blk. Intent flows from runtime to kernel; enforcement stays in kernel policy, with automatic expiration and traceability.
Figure 4. A runtime-level latency window can become a bounded kernel policy that touches scheduler, frequency, idle, memory, and block behavior.
/* userspace sketch */
struct agent_latency_window w = {
    .step_id = step_id,
    .duration_us = 250000,
    .max_idle_exit_us = 50,
    .min_cpu_capacity_pct = 80,
    .nvme_exit_latency_us = 100,
};

ioctl(agent_fd, AGENT_IOC_BEGIN_LATENCY_WINDOW, &w);
run_agent_step();
ioctl(agent_fd, AGENT_IOC_END_LATENCY_WINDOW, &w);

The policy must be bounded and auditable. Otherwise every application will call itself latency-critical and destroy system-wide efficiency.

7. Control Plane: Cgroup-Budgeted Agent Latency

The weakest version of an agent-aware kernel is one where every process can simply declare itself latency-critical. That design fails immediately: it invites hint abuse, burns power, and creates unfairness across tenants. A deployable design needs a budgeted control plane.

The kernel should treat agent latency hints like a scarce resource: granted per cgroup, refilled over time, automatically expired, and visible through tracepoints.

Budgeted latency hints: intent is accepted, privilege is metered. The runtime's BEGIN_LATENCY_WINDOW (step_id, deadline, budget, hotset + I/O hints) draws on cgroup budget tokens that refill over time; while tokens are available, the kernel applies the sched boost, idle guardrail, and APST budget. When tokens are exhausted, hints degrade to normal policy instead of overheating the platform.
Figure 5. Agent hints need kernel enforcement. A cgroup token bucket prevents every process from declaring itself latency-critical indefinitely.
/* include/linux/agent_latency.h: sketch */
struct agent_latency_budget {
        u64 max_tokens_ns;      /* max low-latency privilege */
        u64 tokens_ns;          /* current budget */
        u64 refill_rate_ns;     /* refill per accounting period */
        u64 last_refill_ns;
};

/* kernel/sched/core.c: sketch */
static bool agent_latency_try_charge(struct task_struct *p, u64 runtime_ns)
{
        struct agent_latency_budget *b = task_agent_budget(p);

        agent_latency_refill(b, ktime_get_ns());
        if (b->tokens_ns < runtime_ns)
                return false;

        b->tokens_ns -= runtime_ns;
        trace_agent_budget_charge(task_pid_nr(p), runtime_ns, b->tokens_ns);
        return true;
}

if (task_agent_latency(p)) {
        if (!agent_latency_try_charge(p, predicted_window_ns))
                clear_agent_latency_effective(p); /* hint becomes advisory only */
}

This design also creates a natural user-space contract: an agent runtime may ask for a latency window, but the kernel decides how much of that request is honored based on cgroup policy, thermal pressure, and system fairness.

8. Memory Hierarchy: Hotsets, SLC, and Memory-Side Cache Pressure

Agent workloads are not merely CPU-bound. They are memory-hierarchy sensitive. A small active set—tool schemas, prompt templates, vector index headers, retrieval metadata, repository maps, tokenizer tables, and active documents—may be touched repeatedly across steps. But the kernel usually sees only anonymous pages, file-backed mappings, and page-cache pressure.

On ARM systems with a shared last-level cache, System Level Cache, or memory-side cache, preserving the agent hotset can matter as much as raw CPU frequency. The wrong eviction decision converts a tight loop into mmap faults, cache refills, and storage reads.

Agent hotset pressure through the ARM memory hierarchy: the agent loop's repeated small touches flow through L1/L2, the LLC, the SLC / memory-side cache, DRAM (power states / latency), and finally the SSD (mmap refill / small I/O); kernel policy (MADV_AGENT_HOTSET, MADV_AGENT_STREAM, readahead bias, eviction resistance) sits alongside that path. The point is not to expose hardware cache controls directly to applications; it is to let the kernel know which pages are loop-critical and which reads are transient.
Figure 6. Agent runtimes need a vocabulary for memory intent: protect the hotset, stream transient data, and avoid page-cache pollution.
/* include/uapi/asm-generic/mman-common.h: sketch */
#define MADV_AGENT_HOTSET        104 /* protect loop-critical data */
#define MADV_AGENT_STREAM        105 /* transient reads; avoid cache pollution */
#define MADV_AGENT_INDEX         106 /* mmap-backed index with sparse reuse */

/* mm/vmscan.c: sketch */
static bool folio_agent_protected(struct folio *folio)
{
        struct vm_area_struct *vma = folio_mapped_vma(folio);

        return vma && (vma->vm_flags & VM_AGENT_HOTSET) &&
               memcg_agent_budget_available(vma->vm_mm);
}

if (folio_agent_protected(folio))
        scan_resistance += AGENT_HOTSET_BONUS;

The policy should remain conservative. A hotset hint should not make pages unevictable; it should bias reclaim and readahead while the cgroup has latency budget remaining.

9. Beyond the Core: NoC / Mesh Interconnect Warmth

Modern ARM server and client SoCs are not just collections of cores. They include mesh fabrics, coherent interconnects, memory controllers, LLC/SLC slices, and device paths whose own power states can add latency. A core may be awake and at frequency while the path to memory or storage has downclocked.

This means the latency window should not stop at cpufreq and cpuidle. For the most latency-sensitive windows, a platform driver may need to translate the same agent signal into interconnect QoS or minimum bandwidth votes.

Agent latency window propagated beyond the CPU core: when a short burst begins, the window touches CPU frequency, cpuidle, the NoC / CMN, and the memory controller; a platform driver translates it into a QoS vote, a minimum bandwidth request, an exit-latency cap, a thermal override, and auto-expiry. The result is a warm path, lower p99, and bounded watts. The kernel API remains generic; platform drivers decide whether the hint maps to interconnect bandwidth votes, memory-controller residency, or no-op behavior.
Figure 7. A complete latency window may need to keep the interconnect and memory path warm, not only the CPU core.
/* drivers/interconnect/agent_qos.c: sketch */
static int agent_icc_apply_window(struct agent_latency_window *w)
{
        if (!w->active || thermal_pressure_high())
                return 0;

        icc_set_bw(agent_path_cpu_mem,
                   w->min_interconnect_avg_bw,
                   w->min_interconnect_peak_bw);

        trace_agent_interconnect_vote(w->step_id,
                                      w->min_interconnect_peak_bw);
        return 0;
}

/* Auto-expire when the latency window ends. */
static void agent_icc_end_window(struct agent_latency_window *w)
{
        icc_set_bw(agent_path_cpu_mem, 0, 0);
}

This is where an ARM-specific agentic CPU story becomes interesting: not “disable power management,” but coordinate CPU, idle, memory, storage, and interconnect policies for short, bounded windows.

10. Evaluation Plan

A credible benchmark should not report only average throughput. It should report distributions and attribution:

Metric | Why it matters | Expected signal
p50/p95/p99 step latency | User-visible latency lives in the tail | Lower p99 under latency window
Scheduler wakeup histogram | Shows whether CPU placement improved | Fewer >500 µs wakeups
cpuidle state residency | Verifies deep-idle avoidance during active loops | Deep-state entries reduced only during windows
cpufreq ramp events | Shows frequency stability | Reduced low-frequency first-step penalty
Small-read p99 | Captures vector/index/document lookup stalls | Lower tail for 4 KB–64 KB reads
Energy per completed task | Prevents performance-only policy from wasting power | Small energy increase for meaningful tail reduction
Success criterion: improve p99/p999 step latency without globally disabling power management.

11. Risks and Upstream Reality

There are obvious objections:

  • Hint abuse: every runtime may mark itself latency-critical.
  • Fairness: keeping cores warm for one cgroup can hurt others.
  • Energy regression: bad policy can destroy the reason to use efficient CPUs.
  • Device diversity: NVMe and ARM platform behavior varies widely.

The answer is to scope policies by cgroup, enforce budgets, expire hints automatically, expose tracepoints, and require benchmarks that include energy per task. The kernel should not blindly trust the application; it should accept intent and enforce policy.

Conclusion

Agentic AI workloads are revealing a systems problem that was easy to miss in batch-centric AI: the latency path is the operating system. On ARM CPUs, power management is a first-class part of that path.

The right goal is not to disable power saving. The right goal is latency-aware efficiency: keep the system efficient across the day, but keep the agent loop warm during the small windows where wakeups, page faults, and storage reads compound into visible delay.

One-line takeaway: ARM agentic CPUs need power policies that optimize not just watts per operation, but variance per agent step.

References

  1. Linux Kernel Documentation: Energy Aware Scheduling.
  2. Linux Kernel Documentation: CPU Idle Time Management.
  3. Linux Kernel Documentation: Multi-Queue Block IO Queueing Mechanism.
  4. Linux manual pages: io_uring registered buffers.
  5. NVM Express: NVMe Technology Power Management Features.
© 2026 MANISH AI. All rights reserved.