A New Kind of Workload Shape
Modern AI agents don't behave like web servers. They don't run continuously, and they don't idle for long periods like batch jobs. They run in tight sequential control loops — a form of I/O-bound compute that aggressively triggers every power-management heuristic in the Linux kernel simultaneously.
[Figure: the agent control loop — submit async work → block on async request → IRQ fires and the entire wake path executes → a compute burst of often only 50–500 µs → repeat, for N = thousands of steps.]
The compute phase is so short that the kernel's power management treats each wait as an invitation to save power. The CPU drops frequency, enters deep C-states, and may even migrate the task to a cold core. When the next IRQ arrives, every subsystem pays a re-entry penalty simultaneously.
"Agentic loops expose the gap between reactive DVFS and static uclamp; we need a moving control-plane window."
— github.com/manishklach/kernel-dvfs-agentic-latency

Latency Amplification
Each subsystem penalty is individually small. The catastrophe is multiplication — compounded across every step of a reasoning chain:
For a 2000-step agent chain with a conservative combined penalty of 300 µs per step, that's 600 ms of pure kernel overhead — invisible in any single trace, but dominating user-perceived latency at scale.
| Latency Source | Mechanism | Without patch | With patch |
|---|---|---|---|
| Idle exit (C6→C0) | cpuidle deep C-state wake | 200–500 µs | < 10 µs (C1) |
| DVFS ramp | schedutil PELT low-pass filter | 50–150 µs | ~0 (floored) |
| Scheduler migration | Load balancer to cold CPU | 30–80 µs (cache miss) | ~0 (affinity hint) |
| Wakeup-to-run | try_to_wake_up → context switch | 15–40 µs | 5–12 µs |
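Summing the "Without patch" column gives roughly 295–770 µs per step, so the 300 µs figure used above sits at the optimistic end of the measured range.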
The Complete Wake Path
Before writing a single line of kernel code, you must understand the precise sequence of events that occurs between "IRQ fires" and "agent code resumes." Every subsystem in this chain is a potential contributor to latency amplification.
Why existing solutions fail
uclamp sets a static utilization floor — it cannot adapt to the "open/closed" nature of the control loop window. If you set uclamp_min high enough to prevent DVFS sagging, you waste power during long waits. schedutil is purely reactive: PELT decays exponentially, so a short burst after a long sleep produces near-zero utilization, resulting in minimum frequency at the exact moment compute is most critical. The menu/teo cpuidle governors correctly predict that the task will sleep for a while (it's waiting on an API call), and correctly select deep C-states — which then produce multi-hundred-microsecond exit latencies for the next step.
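To make the decay concrete, here is a back-of-envelope model of PELT's exponential decay. This is a user-space sketch, not kernel code; the only fact it assumes is PELT's half-life of roughly 32 ms (decay factor y per ~1 ms period, with y^32 = 1/2):

```c
/* Back-of-envelope PELT decay model (user-space sketch, not kernel code).
 * PELT roughly halves a sleeping task's utilization every 32 ms. */
#include <math.h>
#include <stdio.h>

int main(void)
{
	double util = 1024.0;     /* fully busy before the wait */
	double sleep_ms = 100.0;  /* a typical API round-trip */

	double decayed = util * pow(0.5, sleep_ms / 32.0);

	/* Prints ~117 / 1024 (~11%): schedutil sees a near-idle task and
	 * picks minimum frequency at the exact moment the next step starts. */
	printf("util after %.0f ms sleep: %.0f / 1024\n", sleep_ms, decayed);
	return 0;
}
```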
The Active Window Model
The core insight is to model the agentic control loop as a moving time-bounded window. When an IRQ wakes an agent task, we arm a window of duration agent_window_ns on the runqueue. While the window is open, all four power management subsystems operate in a "stay warm" mode. When the loop pauses or the window expires without renewal, power management resumes immediately — so we don't waste energy.
task_struct additions
We add two new fields to struct task_struct. They are per-task configuration, set once by userspace via sched_setattr(2); the hot path never reads them directly — it reads the copies mirrored onto the runqueue below:
```c
/* include/linux/sched.h — added inside struct task_struct */

/*
 * agent_window_ns: duration in nanoseconds the "active window"
 * stays armed after each wakeup. Set by userspace via sched_setattr().
 * 0 means feature disabled (default for all tasks).
 *
 * agent_min_util: minimum utilization signal (0–1024, SCHED_CAPACITY_SCALE)
 * injected into schedutil while the window is open.
 */
u64 agent_window_ns;	/* e.g. 5000000 = 5 ms window */
u32 agent_min_util;	/* e.g. 512 = 50% of max freq */
```
rq additions (hot path)
The runqueue is the per-CPU synchronization point that the scheduler, schedutil, and cpuidle all already touch on every scheduling event. Adding two fields here enables O(1) reads from all subsystems without any locking:
```c
/* kernel/sched/sched.h — added inside struct rq */
u64 agent_active_until_ns;	/* ktime_get() deadline */
u32 agent_min_util;		/* currently asserted util floor */

/* Helper: is the agent window currently open?
 * Called from schedutil and cpuidle — must be lockless and O(1).
 * READ_ONCE pairs with the WRITE_ONCE in the wakeup path. */
static inline bool rq_agent_window_active(const struct rq *rq, u64 now)
{
	u64 until = READ_ONCE(rq->agent_active_until_ns);

	return until != 0 && now < until;
}
```
Cross-Subsystem Integration
The patch touches exactly four files in the Linux kernel. Each change is minimal and self-contained — the design philosophy is to inject into the existing decision point rather than bypass it:
In try_to_wake_up(), when the waking task has agent_window_ns > 0, call task_agent_refresh_window() to arm the rq deadline. The hook also sets a wakeup-affinity hint to prefer the task's previous CPU, preserving L1/L2 cache state.
In sugov_update_single_freq(), intercept the raw PELT utilization before it's sent to the cpufreq driver. If the agent window is active, enforce util = max(util, rq->agent_min_util). This bypasses the PELT decay filter for the duration of the window.
In cpuidle_enter_state(), after the governor selects a C-state, check the agent window. If it is active, walk down from the selected C-state index until exit_latency ≤ AGENT_LATENCY_IDLE_EXIT_LIMIT_US. This bounds worst-case idle exit latency to ~10 µs (C1).
Window refresh happens inside try_to_wake_up(), which executes in IRQ or softirq context. This is the earliest possible moment — the CPU state is being warmed before the context switch completes, not after. This eliminates the "too late to help" problem of post-switch hooks.
The Actual Kernel Diff
Userspace opt-in via sched_setattr()
No agent changes its behavior without explicitly opting in. The flag is namespaced under SCHED_FLAG_AGENT_LATENCY in the existing sched_attr struct:
```c
/* Opt the current thread into agent latency mode.
 * agent_window_ns: how long to keep the CPU warm after each wakeup.
 * agent_min_util:  minimum utilization floor (0–1024).
 */
static int set_agent_latency(u64 window_ns, u32 min_util)
{
	struct sched_attr attr = {
		.size                  = sizeof(attr),
		.sched_flags           = SCHED_FLAG_AGENT_LATENCY, /* new flag */
		.sched_agent_window_ns = window_ns,
		.sched_agent_min_util  = min_util,
	};

	return syscall(SYS_sched_setattr, 0, &attr, 0);
}

int main(void)
{
	struct io_uring_cqe *cqe;

	/* io_uring setup and SQE preparation elided for brevity */

	/* 5 ms window, 50% minimum frequency floor */
	set_agent_latency(5000000ULL, 512);

	while (1) {
		/* submit async work (io_uring, GPU dispatch, etc.) */
		io_uring_submit(&ring);

		/* block until completion — kernel stays warm for us */
		io_uring_wait_cqe(&ring, &cqe);

		/* fast compute: kernel CPU is already at target freq */
		process_result(cqe);
	}
}
```
Window refresh in try_to_wake_up()
```c
/* In try_to_wake_up(), after the task is placed on the runqueue
 * and before the IRQ handler returns.
 * Executes in IRQ / softirq context — must be lock-free and fast.
 */
static inline void task_agent_refresh_window(struct rq *rq,
					     struct task_struct *p)
{
	u64 now, deadline;

	if (!p->agent_window_ns)	/* not an agent task: fast-exit */
		return;

	now = ktime_get_ns();
	deadline = now + p->agent_window_ns;

	/* Extend or arm the window — always push out the deadline. */
	WRITE_ONCE(rq->agent_active_until_ns, deadline);
	WRITE_ONCE(rq->agent_min_util, p->agent_min_util);

	/* Hint: prefer to wake on the task's last CPU to keep cache hot. */
	set_task_cpu(p, p->wake_cpu);
}

/* In the main try_to_wake_up() body — add a single call: */
task_agent_refresh_window(this_rq(), p);
```
schedutil DVFS boost hook
```c
/* sugov_update_single_freq() — called by schedutil every DVFS tick.
 * 'util' is the raw PELT utilization signal.
 * We intercept it before it reaches the cpufreq driver.
 */
static void sugov_update_single_freq(struct update_util_data *hook,
				     u64 time, unsigned int flags)
{
	struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu,
						update_util);
	struct rq *rq = cpu_rq(sg_cpu->cpu);
	unsigned long util = cpu_util_cfs(rq);	/* raw PELT */
	unsigned long freq;

	/* ── AGENT LATENCY PATCH ── */
	if (rq_agent_window_active(rq, time))
		util = max(util, (unsigned long)rq->agent_min_util);
	/* ── END PATCH ── */

	freq = map_util_freq(util, sg_cpu->max, sg_cpu->max_cap);
	cpufreq_driver_fast_switch(sg_cpu->sg_policy->policy, freq);
}
```
cpuidle C-state cap
```c
/* cpuidle_enter_state() — called just before the CPU halts.
 * 'index' is the C-state the governor selected (e.g. C6).
 * We veto deep states when the agent window is still open.
 */
int cpuidle_enter_state(struct cpuidle_device *dev,
			struct cpuidle_driver *drv, int index)
{
	struct rq *rq = this_rq();
	u64 now = ktime_get_ns();

	/* ── AGENT LATENCY PATCH ── */
	if (rq_agent_window_active(rq, now)) {
		while (index > 0 &&
		       drv->states[index].exit_latency >
				AGENT_LATENCY_IDLE_EXIT_LIMIT_US)
			index--;
		/* index now points to C1 or C0 — exit latency < 10 µs */
	}
	/* ── END PATCH ── */

	return drv->states[index].enter(dev, drv, index);
}
```
The kworker Problem
Agent latency isn't only caused by direct hardware IRQs. In modern kernels, many completions are deferred through workqueues or hrtimer SoftIRQ callbacks. A kworker thread may execute the agent's completion on behalf of an I/O subsystem — meaning the wakeup path is longer and the window can be armed at the wrong moment.
The proposed fix tags the work_struct with the originating task's agent context at queue time. When the kworker picks up the work item, it checks for an inherited agent tag and arms a window on its own runqueue before executing the work function. This ensures the kworker itself runs fast — and the subsequent agent wakeup it triggers also finds the CPU already warm.
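A minimal sketch of that inheritance path is below. The field and helper names (agent_window_ns on work_struct, work_inherit_agent(), worker_arm_inherited_window()) are hypothetical; the actual patch may wire this differently:

```c
/* Sketch only — names below are illustrative, not the patch's code. */

/* include/linux/workqueue.h — added inside struct work_struct */
u64 agent_window_ns;	/* inherited from the queueing task; 0 = none */
u32 agent_min_util;

/* At queue time (queue_work() path): capture the agent context of
 * the task that queued the work. */
static inline void work_inherit_agent(struct work_struct *work)
{
	work->agent_window_ns = current->agent_window_ns;
	work->agent_min_util  = current->agent_min_util;
}

/* In the kworker, just before work->func(work) runs: arm a window on
 * this CPU's runqueue so the completion executes warm, and the agent
 * wakeup it triggers finds the CPU already at frequency. */
static inline void worker_arm_inherited_window(struct rq *rq,
					       struct work_struct *work)
{
	if (!work->agent_window_ns)
		return;

	WRITE_ONCE(rq->agent_active_until_ns,
		   ktime_get_ns() + work->agent_window_ns);
	WRITE_ONCE(rq->agent_min_util, work->agent_min_util);
}
```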
Measuring with eBPF
Kernel patches without observability are hypotheses, not engineering. The repository ships three bpftrace scripts that measure every stage of the wake path independently, allowing precise attribution of which subsystem is consuming the latency budget.
Script 1 — wakeup_to_run latency histogram
```
#!/usr/bin/bpftrace
/* Measures wakeup-to-run latency for agent-flagged tasks.
 * Attach: sudo bpftrace wakeup_to_run.bt */

kprobe:try_to_wake_up
{
	$task = (struct task_struct *)arg0;
	if ($task->sched_flags & SCHED_FLAG_AGENT_LATENCY) {
		@wake_ts[$task->pid] = nsecs;
	}
}

kprobe:finish_task_switch
{
	$ts = @wake_ts[pid];	/* pid is the incoming task here */
	if ($ts) {
		@wakeup_latency_us = hist((nsecs - $ts) / 1000);
		delete(@wake_ts[pid]);
	}
}

END
{
	print(@wakeup_latency_us);
	clear(@wake_ts);
}
```
Script 2 — DVFS ramp time (freq vs. time)
```
#!/usr/bin/bpftrace
/* Tracks CPU frequency on the agent's CPU vs. elapsed wall time.
 * Run alongside the agent loop to visualise DVFS sagging. */

kprobe:cpufreq_driver_fast_switch
{
	$policy = (struct cpufreq_policy *)arg0;	/* arg0 is the policy, not a CPU id */
	$freq = arg1;					/* target frequency, kHz */
	if ($policy->cpu == cpu) {
		printf("t=%llu freq=%llu\n", nsecs, $freq);
		@freq_khz = hist($freq);
	}
}

END
{
	print(@freq_khz);
}
```
Script 3 — cpuidle C-state audit
```
#!/usr/bin/bpftrace
/* Records which C-state is entered before each agent wakeup.
 * Deep C-states (C6+) with exit_latency > 100 µs are flagged. */

kprobe:cpuidle_enter_state
{
	$dev = (struct cpuidle_device *)arg0;
	$drv = (struct cpuidle_driver *)arg1;
	$index = arg2;
	$lat = $drv->states[$index].exit_latency;	/* µs */

	@cstate_hist = hist($index);
	@exit_lat_us = hist($lat);

	if ($lat > 100) {
		@deep_sleeps = count();
		printf("DEEP: cpu=%d C-state=%d exit_lat=%d µs\n",
		       $dev->cpu, $index, $lat);
	}
}

END
{
	print(@cstate_hist);
	print(@exit_lat_us);
	print(@deep_sleeps);	/* total deep-sleep entries */
}
```
Benchmark Protocol
Subjective before/after comparisons are insufficient for kernel patches. The measurement plan specifies exact metrics, collection methods, and statistical requirements to produce publishable results:
Running the selftest
```bash
# 1. Build the benchmark
cd selftests && make

# 2. Baseline: no patch
./agent_loop --steps 1000 --interval-us 500 | tee baseline.txt

# 3. Apply kernel patch and rebuild
git apply ../patches/0000-kernel-agent-latency-control-plane-rfc.patch
make -C /lib/modules/$(uname -r)/build M=$(pwd) modules
sudo insmod agent_latency.ko

# 4. With patch: 5 ms window, 50% util floor
./agent_loop --agent-latency --window-ns 5000000 --min-util 512 \
    --steps 1000 --interval-us 500 | tee patched.txt

# 5. Attach eBPF tracing (in a second terminal)
sudo bpftrace tools/bpftrace/wakeup_to_run.bt
sudo bpftrace tools/bpftrace/dvfs_ramp.bt
sudo bpftrace tools/bpftrace/cstate_audit.bt

# 6. Compare p99 latencies
awk '/p99/{print}' baseline.txt patched.txt
```
Status and future work
This patch series is RFC — experimental. It is not upstream-ready. Known gaps before production consideration: cgroup-level token buckets to prevent power abuse by unprivileged agents; dynamic window sizing based on historical loop duration (exponential moving average of last N step times); and integration with the sched-ext framework for eBPF-accelerated scheduler policies that subsume the window logic entirely.
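As a sketch of what dynamic window sizing could look like, the helper below sizes the next window from an integer exponential moving average of recent step durations. Every name and constant here is hypothetical; nothing like it exists in the current series:

```c
/* Hypothetical sketch of dynamic window sizing — not part of this series. */

#define AGENT_EMA_SHIFT		3	/* alpha = 1/8, cheap integer EMA */

static inline u64 agent_next_window(u64 *ema_ns, u64 last_step_ns)
{
	/* ema += (sample - ema) / 8, in pure integer arithmetic */
	*ema_ns += ((s64)(last_step_ns - *ema_ns)) >> AGENT_EMA_SHIFT;

	/* Arm for 2x the typical step, clamped so one pathological step
	 * can't pin the CPU warm indefinitely. */
	return clamp(2 * *ema_ns, (u64)NSEC_PER_MSEC,
		     (u64)(20 * NSEC_PER_MSEC));
}
```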
Agentic workloads need kernels that stay fast between steps — and stop punishing fast code for the crime of waiting.
— Patch 0000 cover letter