Agentic AI on ARM: Rethinking DVFS, IRQs, and Scheduler Wakeups
Why CPUFreq must become control-loop-aware: integrating schedutil, cpuidle, IRQ completions, and eBPF measurement for latency-amplified agent workloads.
Abstract. Agentic AI workloads execute tight control loops: wait for a tool, network reply, disk read, timer, or io_uring completion; wake up; parse; decide; repeat. On ARM systems, this pattern interacts badly with aggressive power management. During short waits, CPUFreq may downscale frequency, cpuidle may select deeper states, and IRQ/softirq completions may wake the agent onto a cold or noisy CPU. The result is latency amplification: frequency ramp latency, wakeup delay, and completion-path jitter repeated hundreds or thousands of times. This note proposes an agent-aware DVFS window integrated with the IRQ completion path and scheduler wakeup placement. The design extends, rather than replaces, Linux CPUFreq/schedutil, uclamp, cpuidle, NAPI, blk-mq, and scheduler traceability.
1. The New Slowpath: The CPU Goes Cold Between Agent Steps
Classic throughput tuning asks whether the CPU is busy enough. Agentic workloads ask something different: when the next completion arrives, is the CPU ready immediately?
Agent loop:
issue tool call / disk read / local RPC
wait in epoll_wait() or io_uring_enter()
completion arrives through IRQ / softirq / blk-mq / timer
agent wakes
parse result
decide next step
repeat × 1000
On ARM, the idle gap between steps can be long enough for the platform to save power, but short enough that the next step immediately pays the exit cost.
agent blocks
↓
schedutil observes low utilization
↓
CPU frequency decays
↓
cpuidle may choose deeper idle
↓
IRQ / completion arrives
↓
agent wakes on a cold CPU
↓
frequency ramps + idle exit + scheduler wakeup
↓
latency paid again next step
Core thesis: ARM Linux DVFS should not react only to average utilization. For agentic workloads, it should preserve performance across short control-loop gaps.
2. Current Linux Building Blocks
Linux CPUFreq provides a common core, governors, and platform scaling drivers. The schedutil governor uses scheduler utilization signals to select CPU frequency. Utilization clamping (uclamp) can bias that selection by bounding the utilization a task reports, giving it a performance floor or ceiling. CPU idle management separately selects idle states to save power when CPUs have no runnable work.
| Kernel mechanism | What it already does | Agentic gap |
|---|---|---|
| CPUFreq core | Provides frequency scaling framework and userspace interfaces. | Does not know that a short wait is part of an active reasoning loop. |
| schedutil | Maps scheduler utilization to frequency. | Can be reactive; a bursty agent may downscale between steps. |
| uclamp | Hints minimum/maximum utilization to scheduler and schedutil. | Useful but usually static; agent loops need time-windowed boost. |
| cpuidle | Selects deeper idle states for energy efficiency. | Deep idle exit latency can become visible at every agent completion. |
| IRQ/NAPI/blk-mq | Completes network/storage/timer events. | Completion does not carry agent-step criticality into DVFS and wakeup policy. |
3. Latency Amplification Model
Small costs become large when the loop repeats. Even if each stage is only tens or hundreds of microseconds, an agent can execute thousands of waits and completions.
per-step delay =
IRQ / completion handling
+ softirq or blk-mq completion
+ wakeup-to-run latency
+ cpuidle exit latency
+ CPUFreq ramp latency
+ userspace gate return
Total task delay = per-step delay × number_of_steps
4. Design: Agent-Aware DVFS Windows
The design is intentionally incremental. It does not replace CPUFreq, schedutil, uclamp, or EEVDF. Instead, it adds a short-lived agent-active window that raises the effective utilization seen by schedutil and constrains cpuidle during active reasoning loops.
4.1 Task Flag
/* include/uapi/linux/sched.h */
#define SCHED_FLAG_AGENT_LATENCY 0x10000000
4.2 Per-task and Per-runqueue State
/* include/linux/sched.h */
struct task_struct {
#ifdef CONFIG_AGENT_LATENCY
u64 agent_active_until_ns;
unsigned int agent_min_util;
u32 agent_id;
u64 agent_step_id;
#endif
};
/* kernel/sched/sched.h */
struct rq {
#ifdef CONFIG_AGENT_LATENCY
u64 agent_active_until_ns;
unsigned int agent_min_util;
int agent_home_cpu;
#endif
};
4.3 Userspace API
A runtime can mark only the control-loop thread as agent-latency sensitive. This should not be applied blindly to every worker.
#include <linux/sched/types.h>
#include <sys/syscall.h>
#include <unistd.h>
#ifndef SCHED_FLAG_AGENT_LATENCY
#define SCHED_FLAG_AGENT_LATENCY 0x10000000
#endif
static int mark_agent_control_thread(void)
{
struct sched_attr attr = {
.size = sizeof(attr),
.sched_policy = SCHED_NORMAL,
.sched_flags = SCHED_FLAG_AGENT_LATENCY,
.sched_nice = -5,        /* negative nice requires CAP_SYS_NICE */
};
return syscall(SYS_sched_setattr, 0, &attr, 0);
}
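On kernels without the proposed flag, a runtime could approximate the hint with uclamp instead. A sketch of building that request (the struct is declared locally to keep the sketch self-contained and mirrors the uapi layout; the ~40% floor is an illustrative choice):

```c
#include <stdint.h>
#include <string.h>

/* Mirrors the uapi sched_attr layout from include/uapi/linux/sched/types.h,
 * declared locally so the sketch is self-contained. */
struct sched_attr_compat {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;
    uint64_t sched_deadline;
    uint64_t sched_period;
    uint32_t sched_util_min;
    uint32_t sched_util_max;
};

#define SCHED_FLAG_UTIL_CLAMP_MIN 0x20  /* upstream since v5.3 */

/* Fallback for kernels without SCHED_FLAG_AGENT_LATENCY: request a static
 * utilization floor via uclamp. The 410/1024 (~40%) floor is illustrative. */
static void build_uclamp_fallback(struct sched_attr_compat *attr,
                                  uint32_t util_min)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->sched_policy = 0;                      /* SCHED_NORMAL */
    attr->sched_flags = SCHED_FLAG_UTIL_CLAMP_MIN;
    attr->sched_util_min = util_min;             /* 0..1024 scale */
}
```

The filled attribute would then be passed to sched_setattr(2) exactly as in the example above. The difference from the proposed flag is that this floor is static rather than time-windowed.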
5. schedutil Hook: Time-Windowed Minimum Utilization
The key change is in the frequency selection path. During an active agent window, schedutil should see a minimum effective utilization for the CPU or performance domain.
/* kernel/sched/cpufreq_schedutil.c -- conceptual */
static unsigned long
agent_boost_util(struct rq *rq, unsigned long util, unsigned long max)
{
#ifdef CONFIG_AGENT_LATENCY
u64 now = ktime_get_ns();
if (rq->agent_active_until_ns > now)
util = max(util, (unsigned long)rq->agent_min_util);
#endif
return min(util, max);
}
/* Called before selecting next frequency. */
util = agent_boost_util(rq, util, max_cap);
next_f = get_next_freq(sg_policy, util, max_cap);
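The boost semantics can be modeled outside the kernel. A sketch (the type and function names are illustrative) of the computation the hook performs:

```c
#include <stdint.h>

/* Illustrative userspace model of the agent window state kept on each rq. */
struct agent_window {
    uint64_t active_until_ns;
    unsigned int min_util;      /* 0..1024 utilization scale */
};

/* Mirrors agent_boost_util(): inside an active window, raise utilization
 * to the configured floor; never exceed the CPU's capacity. */
static unsigned long window_boost_util(const struct agent_window *w,
                                       unsigned long util,
                                       unsigned long max_cap,
                                       uint64_t now_ns)
{
    if (w->active_until_ns > now_ns && util < w->min_util)
        util = w->min_util;
    return util < max_cap ? util : max_cap;
}
```

Outside the window the function is a no-op, which is exactly what distinguishes it from a static uclamp.min: idle agents decay back to normal schedutil behavior.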
| Mechanism | Semantics | Agentic limitation |
|---|---|---|
| performance governor | Always high frequency. | Power-heavy and not multi-tenant friendly. |
| uclamp.min | Clamp task utilization to a minimum performance point. | Useful but often static; does not encode active step windows. |
| Agent DVFS window | Keep frequency warm for short gaps after agent activity/completion. | Needs careful budget enforcement. |
6. cpuidle Coordination: Avoid Deep Sleep During Active Windows
DVFS alone is not enough. If the CPU enters a deep idle state between steps, the agent can still pay idle-exit latency. The agent window should also provide a latency constraint to cpuidle.
/* drivers/cpuidle/governors/menu.c or teo.c -- conceptual */
static int agent_limit_idle_state(struct cpuidle_driver *drv,
struct cpuidle_device *dev,
int candidate)
{
#ifdef CONFIG_AGENT_LATENCY
struct rq *rq = cpu_rq(dev->cpu);
u64 now = ktime_get_ns();
if (rq->agent_active_until_ns > now) {
/*
* Avoid states whose target residency or exit latency exceeds
* the control-loop budget.
*/
while (candidate > 0 &&
drv->states[candidate].exit_latency > AGENT_IDLE_EXIT_LIMIT_US)
candidate--;
}
#endif
return candidate;
}
Guardrail: This must be budgeted. Otherwise every process will claim to be latency critical and the platform will run hot. The control plane should be tied to cgroups, runtime limits, or explicit admin policy.
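One way to enforce that budget is a per-cgroup token bucket on window refreshes. A sketch with illustrative parameters (one token per refresh; tokens are scaled by 1e9 to stay in integer arithmetic):

```c
#include <stdint.h>

/* Illustrative token bucket: each window refresh costs one token; tokens
 * accrue at rate_per_sec up to burst. A cgroup that refreshes faster than
 * its budget simply stops being boosted, rather than running the SoC hot. */
struct window_budget {
    uint64_t tokens;         /* scaled by 1e9 to avoid floating point */
    uint64_t last_ns;
    uint64_t rate_per_sec;   /* refreshes allowed per second */
    uint64_t burst;          /* max stored refreshes */
};

static int budget_try_refresh(struct window_budget *b, uint64_t now_ns)
{
    uint64_t elapsed = now_ns - b->last_ns;

    b->tokens += elapsed * b->rate_per_sec;
    if (b->tokens > b->burst * 1000000000ULL)
        b->tokens = b->burst * 1000000000ULL;
    b->last_ns = now_ns;

    if (b->tokens < 1000000000ULL)
        return 0;            /* over budget: no boost this time */
    b->tokens -= 1000000000ULL;
    return 1;
}
```

Denied refreshes degrade gracefully: the task still runs, it just sees default DVFS and cpuidle policy.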
7. Integration with IRQ + Scheduler Completion Path
The DVFS window should not be triggered only when userspace runs. It should also be refreshed when an agent-critical completion arrives: NIC packet, NVMe read completion, timer event, or io_uring CQE.
Network:
NIC MSI-X
→ generic IRQ
→ NAPI / NET_RX softirq
→ socket ready
→ agent wakeup
→ refresh agent DVFS window
Disk:
NVMe MSI-X
→ blk_mq_complete_request()
→ io_uring CQE / read wakeup
→ agent wakeup
→ refresh agent DVFS window
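The refresh itself is simple monotonic bookkeeping. A userspace model (the 2 ms window length and the names are illustrative):

```c
#include <stdint.h>

#define AGENT_WINDOW_NS 2000000ULL   /* 2 ms window -- illustrative */

struct rq_window {
    uint64_t agent_active_until_ns;
};

/* Called from the completion path when the woken task carries
 * SCHED_FLAG_AGENT_LATENCY: extend the window, never shorten it. */
static void agent_window_refresh(struct rq_window *rq, uint64_t now_ns)
{
    uint64_t until = now_ns + AGENT_WINDOW_NS;

    if (until > rq->agent_active_until_ns)
        rq->agent_active_until_ns = until;
}

/* Queried by schedutil and cpuidle before choosing frequency/idle state. */
static int agent_window_active(const struct rq_window *rq, uint64_t now_ns)
{
    return rq->agent_active_until_ns > now_ns;
}
```

Because completions refresh the deadline rather than toggling a flag, back-to-back steps keep the window open while a genuinely idle agent lets it lapse after one window length.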
8. Proposed Patch Series
[PATCH 0/7] sched/cpufreq: agent-aware DVFS windows for latency-amplified workloads
[PATCH 1/7] sched: add SCHED_FLAG_AGENT_LATENCY for control-loop tasks
[PATCH 2/7] sched: track agent_active_until_ns and agent_min_util on rq
[PATCH 3/7] schedutil: apply time-windowed agent min-util boost
[PATCH 4/7] cpuidle: bound deep idle selection during active agent windows
[PATCH 5/7] irq/sched: refresh agent DVFS window on tagged completions
[PATCH 6/7] trace: add agent_dvfs_window and agent_freq_transition tracepoints
[PATCH 7/7] selftests: add agent loop DVFS latency microbenchmark
9. eBPF Measurement: DVFS Ramp Tracing
The measurement goal is to timestamp four events and derive the delays between them:
1. completion arrives
2. task becomes runnable
3. task actually runs
4. CPU frequency reaches target
Linux exposes power tracepoints for CPU frequency and idle transitions, and scheduler tracepoints for wakeup/switch behavior. On many ARM systems, power:cpu_frequency is available and records frequency transitions.
9.1 CPU Frequency Transition Trace
sudo bpftrace -e '
tracepoint:power:cpu_frequency
{
printf("cpu=%d freq=%d kHz ts=%llu\n",
args->cpu_id, args->state, nsecs);
}'
9.2 CPU Idle Entry/Exit Trace
sudo bpftrace -e '
tracepoint:power:cpu_idle
{
printf("cpu=%d idle_state=%d ts=%llu\n",
args->cpu_id, args->state, nsecs);
}'
9.3 Wakeup-to-Run Latency
sudo bpftrace -e '
tracepoint:sched:sched_wakeup
/args->comm == "agent_loop"/
{
@wake[args->pid] = nsecs;
}
tracepoint:sched:sched_switch
/@wake[args->next_pid]/
{
printf("pid=%d wake_to_run=%dus cpu=%d\n",
args->next_pid,
(nsecs - @wake[args->next_pid]) / 1000,
cpu);
delete(@wake[args->next_pid]);
}'
9.4 epoll Gate Latency
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_epoll_wait
/comm == "agent_loop"/
{
@epoll[tid] = nsecs;
}
tracepoint:syscalls:sys_exit_epoll_wait
/@epoll[tid]/
{
printf("tid=%d epoll_wait=%dus ret=%d\n",
tid, (nsecs - @epoll[tid]) / 1000, args->ret);
delete(@epoll[tid]);
}'
9.5 io_uring Gate Latency
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_io_uring_enter
/comm == "agent_loop"/
{
@uring[tid] = nsecs;
}
tracepoint:syscalls:sys_exit_io_uring_enter
/@uring[tid]/
{
printf("tid=%d io_uring_enter=%dus ret=%d\n",
tid, (nsecs - @uring[tid]) / 1000, args->ret);
delete(@uring[tid]);
}'
9.6 Correlating Wakeup and Frequency Ramp
The useful derived metric is:
freq_ramp_after_wakeup =
first cpu_frequency event after sched_wakeup
- sched_wakeup timestamp
If this value appears repeatedly in the p99 tail, the workload is paying DVFS ramp latency at every agent step.
# Pseudocode for post-processing trace logs
for each agent_step:
t_complete = completion_timestamp(step)
t_wakeup = sched_wakeup_timestamp(step)
t_schedin = sched_switch_to_agent(step)
t_freq = first_cpu_frequency_after(t_wakeup)
print({
"step": step,
"completion_to_wakeup_us": (t_wakeup - t_complete) / 1000,
"wakeup_to_run_us": (t_schedin - t_wakeup) / 1000,
"freq_ramp_us": (t_freq - t_wakeup) / 1000,
})
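For the tail analysis, the collected freq_ramp_us samples can be reduced to a nearest-rank p99. A small sketch:

```c
#include <stdlib.h>

static int cmp_ul(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;

    return (x > y) - (x < y);
}

/* Nearest-rank p99 over freq_ramp_us samples collected per agent step.
 * Sorts in place; n must be >= 1. */
static unsigned long p99_us(unsigned long *samples, size_t n)
{
    size_t idx;

    qsort(samples, n, sizeof(*samples), cmp_ul);
    idx = (n * 99 + 99) / 100 - 1;   /* ceil(0.99 * n) - 1 */
    return samples[idx];
}
```

If the p99 of freq_ramp_us sits near the platform's known DVFS transition latency, the loop is paying a frequency ramp on nearly every step.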
10. Minimal Agent Loop Microbenchmark
This benchmark intentionally models an agent loop: wait, wake, do short CPU work, repeat. Replace the eventfd with a socket, local DB, or io_uring completion to test specific paths.
// agent_loop.c -- conceptual microbenchmark
#define _GNU_SOURCE
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <pthread.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
static int efd;
static void busy_parse_work(void)
{
volatile uint64_t x = 0;
for (int i = 0; i < 20000; i++)
x += i;
}
void *producer(void *arg)
{
uint64_t one = 1;
for (;;) {
usleep(500); // simulate tool/network/disk gap
write(efd, &one, sizeof(one));
}
return NULL;
}
int main(void)
{
efd = eventfd(0, EFD_NONBLOCK);
int ep = epoll_create1(0);
struct epoll_event ev = {
.events = EPOLLIN,
.data.fd = efd,
};
epoll_ctl(ep, EPOLL_CTL_ADD, efd, &ev);
pthread_t t;
pthread_create(&t, NULL, producer, NULL);
for (uint64_t step = 0; ; step++) {
struct epoll_event out;
epoll_wait(ep, &out, 1, -1);
uint64_t val;
read(efd, &val, sizeof(val));
busy_parse_work();
if ((step % 1000) == 0)
printf("step=%lu\n", step);
}
}
11. Evaluation Matrix
| Configuration | Description | Expected signal |
|---|---|---|
| Baseline | Default schedutil, default cpuidle, no pinning. | Frequency drops between short gaps; p99 step latency elevated. |
| performance governor | Force max frequency. | Improves latency but wastes power; upper-bound result. |
| uclamp.min | Use userspace clamp for agent control thread. | Shows whether existing scheduler hints are enough. |
| agent DVFS window | Time-windowed boost after each agent step/completion. | Lower p99 with less energy than always-performance. |
| agent DVFS + IRQ wakeup | Refresh DVFS window from tagged completions and wake on warm CPU. | Best control-loop tail latency. |
12. Why This Belongs With IRQ + Scheduler Work
The IRQ/scheduler work and ARM DVFS work are not separate projects. They are one control-path problem.
Completion path:
IRQ / softirq / blk-mq
↓
task becomes runnable
↓
scheduler chooses CPU
↓
CPUFreq / cpuidle determine how ready that CPU is
↓
userspace agent resumes
If the kernel improves wakeup placement but wakes the agent on a CPU that just downclocked or entered a deep idle state, latency remains. If DVFS keeps the CPU warm but completions are delayed behind softirq/ksoftirqd, latency remains. The performance win comes from coordinating all three:
agent-aware completion path
+ agent-aware wakeup placement
+ agent-aware DVFS/cpuidle window
= predictable agent step latency
13. Risks and Guardrails
Risk: Hint Abuse
Every workload may claim to be latency critical. The solution is cgroup-level budgeting, admin opt-in, and rate-limited windows.
Risk: Thermal Pressure
Keeping ARM cores warm too long can hurt power and thermals. The boost must decay quickly and respect thermal caps.
Risk: Fairness
The goal is not RT priority. The goal is bounded control-loop latency without starving normal tasks.
Risk: Platform Variance
Different ARM SoCs expose frequency and idle behavior differently. Tracepoints and policy must be optional and capability-aware.
14. Final Thesis
Agentic AI exposes a new ARM kernel slowpath: the CPU cooling down between reasoning steps. The fix is not simply max frequency. It is a budgeted, time-windowed DVFS policy integrated with IRQ completions, scheduler wakeups, and cpuidle.
The next useful kernel primitive is not only “task utilization.” It is the agent step window: a short, causally connected period during which completions, wakeups, CPU placement, and power state transitions should cooperate to minimize tail latency.
References
- Linux kernel CPUFreq documentation: CPU performance scaling core, governors, and drivers.
- Linux scheduler utilization clamping documentation: uclamp influences schedutil CPU frequency selection.
- Linux power tracepoint documentation: CPU frequency and CPU idle tracepoints.
- Linux generic IRQ handling documentation: architecture-independent IRQ handling layer.
- Linux NAPI documentation: interrupt notification followed by scheduled event processing.
- Linux blk-mq documentation: multi-queue block I/O for high-IOPS storage.