MANISH AI
Systems Note / ARM Kernel Proposal

Agentic AI on ARM: Rethinking DVFS, IRQs, and Scheduler Wakeups

Why CPUFreq must become control-loop-aware: integrating schedutil, cpuidle, IRQ completions, and eBPF measurement for latency-amplified agent workloads.

12 min read
Focus: ARM Linux
Subsystems: CPUFreq / schedutil / cpuidle / IRQ / scheduler
Thesis: keep the CPU warm across agent steps

Abstract. Agentic AI workloads execute tight control loops: wait for a tool, network reply, disk read, timer, or io_uring completion; wake up; parse; decide; repeat. On ARM systems, this pattern interacts badly with aggressive power management. During short waits, CPUFreq may downscale frequency, cpuidle may select deeper states, and IRQ/softirq completions may wake the agent onto a cold or noisy CPU. The result is latency amplification: frequency ramp latency, wakeup delay, and completion-path jitter repeated hundreds or thousands of times. This note proposes an agent-aware DVFS window integrated with the IRQ completion path and scheduler wakeup placement. The design extends, rather than replaces, Linux CPUFreq/schedutil, uclamp, cpuidle, NAPI, blk-mq, and the existing scheduler tracepoints.

1. The New Slowpath: The CPU Goes Cold Between Agent Steps

Classic throughput tuning asks whether the CPU is busy enough. Agentic workloads ask something different: when the next completion arrives, is the CPU ready immediately?

Agent loop:
  issue tool call / disk read / local RPC
  wait in epoll_wait() or io_uring_enter()
  completion arrives through IRQ / softirq / blk-mq / timer
  agent wakes
  parse result
  decide next step
  repeat × 1000

On ARM, the idle gap between steps can be long enough for the platform to save power, but short enough that the next step immediately pays the exit cost.

agent blocks
  ↓
schedutil observes low utilization
  ↓
CPU frequency decays
  ↓
cpuidle may choose deeper idle
  ↓
IRQ / completion arrives
  ↓
agent wakes on a cold CPU
  ↓
frequency ramps + idle exit + scheduler wakeup
  ↓
latency paid again next step

Core thesis: ARM Linux DVFS should not react only to average utilization. For agentic workloads, it should preserve performance across short control-loop gaps.

2. Current Linux Building Blocks

Linux CPUFreq provides a common core, governors, and platform scaling drivers. The schedutil governor uses scheduler utilization signals to select CPU frequency. Utilization clamping (uclamp) can influence schedutil's frequency decision by imposing minimum and maximum bounds on a task's apparent utilization. cpuidle separately selects idle states to save power when CPUs have no runnable work.

CPUFreq core
  Already does: provides the frequency scaling framework and userspace interfaces.
  Agentic gap: does not know that a short wait is part of an active reasoning loop.

schedutil
  Already does: maps scheduler utilization to frequency.
  Agentic gap: can be reactive; a bursty agent may downscale between steps.

uclamp
  Already does: hints minimum/maximum utilization to the scheduler and schedutil.
  Agentic gap: useful but usually static; agent loops need a time-windowed boost.

cpuidle
  Already does: selects deeper idle states for energy efficiency.
  Agentic gap: deep idle exit latency can become visible at every agent completion.

IRQ / NAPI / blk-mq
  Already does: completes network/storage/timer events.
  Agentic gap: completion does not carry agent-step criticality into DVFS and wakeup policy.
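
The static form of the uclamp hint is usable today: on a cgroup-v2 system, a utilization floor can be applied to every task in an agent's cgroup. A hedged example (the agents cgroup path is illustrative):

# Floor the effective utilization seen by schedutil for this cgroup
# at 50% of CPU capacity.
echo "50.00" > /sys/fs/cgroup/agents/cpu.uclamp.min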

3. Latency Amplification Model

Small costs become large when the loop repeats. Even if each stage is only tens or hundreds of microseconds, an agent can execute thousands of waits and completions.

per-step delay =
  IRQ / completion handling
+ softirq or blk-mq completion
+ wakeup-to-run latency
+ cpuidle exit latency
+ CPUFreq ramp latency
+ userspace gate return

Total task delay = per-step delay × number_of_steps
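
For scale (illustrative numbers only): 20 µs IRQ handling + 30 µs softirq completion + 60 µs wakeup-to-run + 100 µs cpuidle exit + 250 µs CPUFreq ramp + 40 µs gate return ≈ 500 µs per step. Over 2,000 steps, that is roughly one full second of overhead that never appears as useful CPU time.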

[Figure: "Agent Step Meets ARM Power Management" -- Wait (epoll/io_uring) → Downscale (CPUFreq/cpuidle) → Completion (IRQ / blk / NAPI) → Wakeup (scheduler) → Ramp (frequency rises) → Run (parse/decide). The wait gap looks idle to the kernel, but it is active time in the agent control loop. Proposed fix: time-windowed agent DVFS boost + shallow idle guard + IRQ-aware wakeup placement; keep the CPU warm during short gaps, decaying only after the agent window expires.]
Figure 1. Agentic workloads turn DVFS and idle exit latency into repeated control-loop overhead.

4. Design: Agent-Aware DVFS Windows

The design is intentionally incremental. It does not replace CPUFreq, schedutil, uclamp, or EEVDF. Instead, it adds a short-lived agent-active window that raises the effective utilization seen by schedutil and constrains cpuidle during active reasoning loops.

4.1 Task Flag

/* include/uapi/linux/sched.h */

#define SCHED_FLAG_AGENT_LATENCY 0x10000000

4.2 Per-task and Per-runqueue State

/* include/linux/sched.h */

struct task_struct {
#ifdef CONFIG_AGENT_LATENCY
    u64 agent_active_until_ns;    /* window deadline; 0 = no active window */
    unsigned int agent_min_util;  /* requested utilization floor */
    u32 agent_id;                 /* owning agent, matched by tagged completions */
    u64 agent_step_id;            /* monotonic step counter for tracing */
#endif
};

/* kernel/sched/sched.h -- mirrored per-CPU so schedutil and cpuidle
 * can consult it without a task lookup. */

struct rq {
#ifdef CONFIG_AGENT_LATENCY
    u64 agent_active_until_ns;
    unsigned int agent_min_util;
    int agent_home_cpu;           /* preferred warm CPU for wakeup placement */
#endif
};

4.3 Userspace API

A runtime can mark only the control-loop thread as agent-latency sensitive. This should not be applied blindly to every worker.

#include <linux/sched/types.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SCHED_FLAG_AGENT_LATENCY
#define SCHED_FLAG_AGENT_LATENCY 0x10000000
#endif

static int mark_agent_control_thread(void)
{
    struct sched_attr attr = {
        .size = sizeof(attr),
        .sched_policy = SCHED_NORMAL,
        .sched_flags = SCHED_FLAG_AGENT_LATENCY,
        .sched_nice = -5,   /* negative nice needs CAP_SYS_NICE or RLIMIT_NICE */
    };

    /* Raw syscall: glibc provides no sched_setattr() wrapper. */
    return syscall(SYS_sched_setattr, 0, &attr, 0);
}

5. schedutil Hook: Time-Windowed Minimum Utilization

The key change is in the frequency selection path. During an active agent window, schedutil should see a minimum effective utilization for the CPU or performance domain.

/* kernel/sched/cpufreq_schedutil.c -- conceptual */

static unsigned long
agent_boost_util(struct rq *rq, unsigned long util, unsigned long max_cap)
{
#ifdef CONFIG_AGENT_LATENCY
    u64 now = ktime_get_ns();

    /* Inside an active agent window, enforce the utilization floor. */
    if (rq->agent_active_until_ns > now)
        util = max(util, (unsigned long)rq->agent_min_util);
#endif
    return min(util, max_cap);
}
/* Called before selecting next frequency. */

util = agent_boost_util(rq, util, max_cap);
next_f = get_next_freq(sg_policy, util, max_cap);

performance governor
  Semantics: always high frequency.
  Agentic limitation: power-heavy and not multi-tenant friendly.

uclamp.min
  Semantics: clamps task utilization to a minimum performance point.
  Agentic limitation: useful but often static; does not encode active step windows.

Agent DVFS window
  Semantics: keeps frequency warm for short gaps after agent activity/completion.
  Agentic limitation: needs careful budget enforcement.

6. cpuidle Coordination: Avoid Deep Sleep During Active Windows

DVFS alone is not enough. If the CPU enters a deep idle state between steps, the agent can still pay idle-exit latency. The agent window should also provide a latency constraint to cpuidle.

/* drivers/cpuidle/governors/menu.c or teo.c -- conceptual */

static int agent_limit_idle_state(struct cpuidle_driver *drv,
                                  struct cpuidle_device *dev,
                                  int candidate)
{
#ifdef CONFIG_AGENT_LATENCY
    struct rq *rq = cpu_rq(dev->cpu);
    u64 now = ktime_get_ns();

    if (rq->agent_active_until_ns > now) {
        /*
         * Avoid states whose target residency or exit latency exceeds
         * the control-loop budget.
         */
        while (candidate > 0 &&
               drv->states[candidate].exit_latency > AGENT_IDLE_EXIT_LIMIT_US)
            candidate--;
    }
#endif
    return candidate;
}
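
Until such a hook exists, userspace can approximate the idle guard with the existing PM QoS interface: writing a 32-bit latency bound (in microseconds) to /dev/cpu_dma_latency constrains idle-state selection for as long as the file stays open. A minimal sketch; note the constraint is system-wide, unlike the per-window policy proposed here:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Hold a global idle exit-latency cap; close(fd) releases it. */
static int hold_idle_exit_cap(int32_t max_exit_latency_us)
{
    int fd = open("/dev/cpu_dma_latency", O_WRONLY);

    if (fd < 0)
        return -1;
    if (write(fd, &max_exit_latency_us,
              sizeof(max_exit_latency_us)) != sizeof(max_exit_latency_us)) {
        close(fd);
        return -1;
    }
    return fd;   /* keep open while the agent loop is active */
}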

Guardrail: This must be budgeted. Otherwise every process will claim to be latency critical and the platform will run hot. The control plane should be tied to cgroups, runtime limits, or explicit admin policy.
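
One possible shape for that budget is a token bucket on boost time, charged per window refresh. A minimal sketch; agent_budget, agent_budget_charge(), and both AGENT_BUDGET_* constants are hypothetical, not existing kernel API:

/* Conceptual: each group may spend at most AGENT_BUDGET_PER_PERIOD_NS
 * of boost-window time per accounting period. */
struct agent_budget {
    u64 period_start_ns;   /* start of the current accounting period */
    u64 tokens_ns;         /* boost time remaining in this period    */
};

static bool agent_budget_charge(struct agent_budget *b, u64 now, u64 window_ns)
{
    if (now - b->period_start_ns > AGENT_BUDGET_PERIOD_NS) {
        b->period_start_ns = now;          /* refill once per period */
        b->tokens_ns = AGENT_BUDGET_PER_PERIOD_NS;
    }
    if (b->tokens_ns < window_ns)
        return false;                      /* exhausted: no boost this step */
    b->tokens_ns -= window_ns;
    return true;
}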

7. Integration with IRQ + Scheduler Completion Path

The DVFS window should not be triggered only when userspace runs. It should also be refreshed when an agent-critical completion arrives: NIC packet, NVMe read completion, timer event, or io_uring CQE.

Network:
  NIC MSI-X
    → generic IRQ
    → NAPI / NET_RX softirq
    → socket ready
    → agent wakeup
    → refresh agent DVFS window

Disk:
  NVMe MSI-X
    → blk_mq_complete_request()
    → io_uring CQE / read wakeup
    → agent wakeup
    → refresh agent DVFS window
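
A minimal sketch of that refresh, reusing the per-task and per-rq state from Section 4.2. The hook point, AGENT_WINDOW_NS, and the helper name are illustrative; a real series would integrate with try_to_wake_up() and the io_uring/blk-mq completion paths:

/* kernel/sched/core.c -- conceptual */

static void agent_refresh_window(struct task_struct *p, int cpu)
{
#ifdef CONFIG_AGENT_LATENCY
    struct rq *rq = cpu_rq(cpu);
    u64 now = ktime_get_ns();

    if (!p->agent_min_util)    /* task was never tagged via sched_setattr() */
        return;

    /* Arm (or extend) the window; it decays by expiring unused. */
    p->agent_active_until_ns  = now + AGENT_WINDOW_NS;
    rq->agent_active_until_ns = p->agent_active_until_ns;
    rq->agent_min_util = max(rq->agent_min_util, p->agent_min_util);

    /* Nudge schedutil so the floor applies before the task runs. */
    cpufreq_update_util(rq, 0);
#endif
}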

[Figure: "Completion-Aware DVFS" -- IRQ / Completion (NIC / NVMe / timer) → Agent Tag (agent_id / step_id) → DVFS Window (min util until T) → Wakeup Placement (warm, IRQ-quiet CPU). Agent-critical completions refresh a short performance window before userspace resumes; the CPU is already at an acceptable performance point when the agent thread runs, because the completion path and the power-management path cooperate instead of acting independently.]
Figure 2. The IRQ completion path should refresh a short DVFS window before scheduler wakeup completes.

8. Proposed Patch Series

[PATCH 0/7] sched/cpufreq: agent-aware DVFS windows for latency-amplified workloads

[PATCH 1/7] sched: add SCHED_FLAG_AGENT_LATENCY for control-loop tasks
[PATCH 2/7] sched: track agent_active_until_ns and agent_min_util on rq
[PATCH 3/7] schedutil: apply time-windowed agent min-util boost
[PATCH 4/7] cpuidle: bound deep idle selection during active agent windows
[PATCH 5/7] irq/sched: refresh agent DVFS window on tagged completions
[PATCH 6/7] trace: add agent_dvfs_window and agent_freq_transition tracepoints
[PATCH 7/7] selftests: add agent loop DVFS latency microbenchmark
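
A sketch of what the PATCH 6/7 agent_dvfs_window tracepoint might expose; the field set is illustrative:

/* include/trace/events/agent.h -- conceptual */

TRACE_EVENT(agent_dvfs_window,
    TP_PROTO(int cpu, u32 agent_id, u64 step_id, u64 until_ns,
             unsigned int min_util),
    TP_ARGS(cpu, agent_id, step_id, until_ns, min_util),
    TP_STRUCT__entry(
        __field(int,          cpu)
        __field(u32,          agent_id)
        __field(u64,          step_id)
        __field(u64,          until_ns)
        __field(unsigned int, min_util)
    ),
    TP_fast_assign(
        __entry->cpu      = cpu;
        __entry->agent_id = agent_id;
        __entry->step_id  = step_id;
        __entry->until_ns = until_ns;
        __entry->min_util = min_util;
    ),
    TP_printk("cpu=%d agent=%u step=%llu until=%llu min_util=%u",
              __entry->cpu, __entry->agent_id,
              (unsigned long long)__entry->step_id,
              (unsigned long long)__entry->until_ns,
              __entry->min_util)
);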

9. eBPF Measurement: DVFS Ramp Tracing

The measurement goal is to separate four delays:

1. completion arrives
2. task becomes runnable
3. task actually runs
4. CPU frequency reaches target

Linux exposes power tracepoints for CPU frequency and idle transitions, and scheduler tracepoints for wakeup/switch behavior. On many ARM systems, power:cpu_frequency is available and records frequency transitions.

9.1 CPU Frequency Transition Trace

sudo bpftrace -e '
tracepoint:power:cpu_frequency
{
  printf("cpu=%d freq=%d kHz ts=%llu\n",
    args->cpu_id, args->state, nsecs);
}'

9.2 CPU Idle Entry/Exit Trace

sudo bpftrace -e '
tracepoint:power:cpu_idle
{
  printf("cpu=%d idle_state=%d ts=%llu\n",
    args->cpu_id, args->state, nsecs);
}'

9.3 Wakeup-to-Run Latency

sudo bpftrace -e '
tracepoint:sched:sched_wakeup
/comm == "agent_loop"/
{
  @wake[args->pid] = nsecs;
}

tracepoint:sched:sched_switch
/@wake[args->next_pid]/
{
  printf("pid=%d wake_to_run=%dus cpu=%d\n",
    args->next_pid,
    (nsecs - @wake[args->next_pid]) / 1000,
    cpu);
  delete(@wake[args->next_pid]);
}'
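
For tail analysis, the same two probes can feed a histogram instead of printing every event:

sudo bpftrace -e '
tracepoint:sched:sched_wakeup
/comm == "agent_loop"/
{
  @wake[args->pid] = nsecs;
}

tracepoint:sched:sched_switch
/@wake[args->next_pid]/
{
  @wake_to_run_us = hist((nsecs - @wake[args->next_pid]) / 1000);
  delete(@wake[args->next_pid]);
}'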

9.4 epoll Gate Latency

sudo bpftrace -e '
tracepoint:syscalls:sys_enter_epoll_wait
/comm == "agent_loop"/
{
  @epoll[tid] = nsecs;
}

tracepoint:syscalls:sys_exit_epoll_wait
/@epoll[tid]/
{
  printf("tid=%d epoll_wait=%dus ret=%d\n",
    tid, (nsecs - @epoll[tid]) / 1000, args->ret);
  delete(@epoll[tid]);
}'

9.5 io_uring Gate Latency

sudo bpftrace -e '
tracepoint:syscalls:sys_enter_io_uring_enter
/comm == "agent_loop"/
{
  @uring[tid] = nsecs;
}

tracepoint:syscalls:sys_exit_io_uring_enter
/@uring[tid]/
{
  printf("tid=%d io_uring_enter=%dus ret=%d\n",
    tid, (nsecs - @uring[tid]) / 1000, args->ret);
  delete(@uring[tid]);
}'

9.6 Correlating Wakeup and Frequency Ramp

The useful derived metric is:

freq_ramp_after_wakeup =
  first cpu_frequency event after sched_wakeup
  - sched_wakeup timestamp

If this value appears repeatedly in the p99 tail, the workload is paying DVFS ramp latency at every agent step.

# Post-processing sketch (Python): correlate per-step timestamps (ns)
# extracted from the trace logs above. Input format is assumed:
# one dict of timestamps per agent step.

def summarize(steps):
    for i, step in enumerate(steps):
        t_complete = step["completion_ns"]        # IRQ / blk-mq completion
        t_wakeup   = step["sched_wakeup_ns"]      # task became runnable
        t_schedin  = step["sched_switch_ns"]      # task actually ran
        t_freq     = step["first_freq_after_wakeup_ns"]

        print({
            "step": i,
            "completion_to_wakeup_us": (t_wakeup - t_complete) / 1000,
            "wakeup_to_run_us":        (t_schedin - t_wakeup) / 1000,
            "freq_ramp_us":            (t_freq - t_wakeup) / 1000,
        })

10. Minimal Agent Loop Microbenchmark

This benchmark intentionally models an agent loop: wait, wake, do short CPU work, repeat. Replace the eventfd with a socket, local DB, or io_uring completion to test specific paths.

// agent_loop.c -- conceptual microbenchmark
#define _GNU_SOURCE
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <inttypes.h>
#include <pthread.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>

static int efd;

/* Stand-in for the parse/decide phase: a short burst of CPU work. */
static void busy_parse_work(void)
{
    volatile uint64_t x = 0;
    for (int i = 0; i < 20000; i++)
        x += i;
}

/* Simulates tool/network/disk completions arriving every ~500 us. */
static void *producer(void *arg)
{
    uint64_t one = 1;

    (void)arg;
    for (;;) {
        usleep(500);              /* the gap where the CPU may go cold */
        (void)write(efd, &one, sizeof(one));
    }
    return NULL;
}

int main(void)
{
    efd = eventfd(0, EFD_NONBLOCK);
    int ep = epoll_create1(0);

    struct epoll_event ev = {
        .events = EPOLLIN,
        .data.fd = efd,
    };

    if (efd < 0 || ep < 0 || epoll_ctl(ep, EPOLL_CTL_ADD, efd, &ev) < 0) {
        perror("setup");
        return 1;
    }

    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);

    for (uint64_t step = 0; ; step++) {
        struct epoll_event out;
        uint64_t val;

        /* The gate: epoll_wait() is where the kernel sees "idle". */
        epoll_wait(ep, &out, 1, -1);
        (void)read(efd, &val, sizeof(val));

        busy_parse_work();

        if ((step % 1000) == 0)
            printf("step=%" PRIu64 "\n", step);
    }
}
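
The Section 9 bpftrace filters match comm == "agent_loop", so keep that binary name when building:

gcc -O2 -pthread agent_loop.c -o agent_loop
./agent_loop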

11. Evaluation Matrix

Baseline
  Description: default schedutil, default cpuidle, no pinning.
  Expected signal: frequency drops between short gaps; p99 step latency elevated.

performance governor
  Description: force maximum frequency.
  Expected signal: improves latency but wastes power; upper-bound result.

uclamp.min
  Description: userspace clamp for the agent control thread.
  Expected signal: shows whether existing scheduler hints are enough.

agent DVFS window
  Description: time-windowed boost after each agent step/completion.
  Expected signal: lower p99 with less energy than always-performance.

agent DVFS + IRQ wakeup
  Description: refresh the DVFS window from tagged completions and wake on a warm CPU.
  Expected signal: best control-loop tail latency.

12. Why This Belongs With IRQ + Scheduler Work

The IRQ/scheduler work and ARM DVFS work are not separate projects. They are one control-path problem.

Completion path:
  IRQ / softirq / blk-mq
      ↓
  task becomes runnable
      ↓
  scheduler chooses CPU
      ↓
  CPUFreq / cpuidle determine how ready that CPU is
      ↓
  userspace agent resumes

If the kernel improves wakeup placement but wakes the agent on a CPU that just downclocked or entered a deep idle state, latency remains. If DVFS keeps the CPU warm but completions are delayed behind softirq/ksoftirqd, latency remains. The performance win comes from coordinating all three:

agent-aware completion path
+ agent-aware wakeup placement
+ agent-aware DVFS/cpuidle window
= predictable agent step latency

13. Risks and Guardrails

Risk: Hint Abuse

Every workload may claim to be latency critical. The solution is cgroup-level budgeting, admin opt-in, and rate-limited windows.

Risk: Thermal Pressure

Keeping ARM cores warm too long can hurt power and thermals. The boost must decay quickly and respect thermal caps.

Risk: Fairness

The goal is not RT priority. The goal is bounded control-loop latency without starving normal tasks.

Risk: Platform Variance

Different ARM SoCs expose frequency and idle behavior differently. Tracepoints and policy must be optional and capability-aware.

14. Final Thesis

Agentic AI exposes a new ARM kernel slowpath: the CPU cooling down between reasoning steps. The fix is not simply max frequency. It is a budgeted, time-windowed DVFS policy integrated with IRQ completions, scheduler wakeups, and cpuidle.

The next useful kernel primitive is not only “task utilization.” It is the agent step window: a short, causally connected period during which completions, wakeups, CPU placement, and power state transitions should cooperate to minimize tail latency.
