Abstract. Part 1 instrumented the agentic slowpath across DVFS, cpuidle, scheduler wakeups, deferred completions, io_uring, blk-mq, MM/page cache, VFS metadata, and cgroup budgets. This second part moves from visibility to action. The central argument: the kernel should not optimize agent loops as independent wakeups. It should treat them as a repeated temporal control path — event submitted, event completed, agent resumed, next event submitted. That shift changes how we think about every subsystem in the chain, and crucially, how we prevent well-intentioned latency hints from becoming system-wide power overrides.
## What Part 1 Established
Part 1 introduced the agentic slowpath and showed that latency amplification isn't a single-subsystem problem. The complete kernel control path a wakeup traverses is:
```
IRQ / timer / workqueue / I/O / MM / VFS
        ↓
completion / wakeup
        ↓
scheduler placement
        ↓
DVFS / CPUFreq
        ↓
cpuidle
        ↓
userspace resumes
```
The key observation: no single hop is catastrophic. The problem is multiplication across steps. An agent repeating this path thousands of times turns sub-millisecond overhead into user-visible latency:
"Part 1: make the slowpath visible. Part 2: decide how the kernel should optimize it."
— github.com/manishklach/kernel-dvfs-agentic-latency

## Reactive is Wrong for Control Loops
All current Linux power management is built on a reactive model: observe utilization, idle time, runnable state, queue pressure, cache state — then respond. That is correct for batch jobs and stable-load services. For agentic control loops it is systematically wrong.
| Subsystem | Reactive signal today | Why it fails for agents |
|---|---|---|
| Scheduler | runnable tasks, fairness, load | doesn't know a wakeup is step N of a control loop |
| schedutil / DVFS | PELT utilization signal | short waits look like reduced demand → frequency sags |
| cpuidle | predicted idle duration | correctly selects deep C-states; pays 200–500 µs exit |
| IRQ / softirq | device completions and budgets | completion criticality is invisible to the kernel |
| blk-mq / io_uring | request issue/completion | step-gating reads look identical to background scans |
| MM / VFS | faults, cache hits, metadata lookups | hot agent working set and cold traversals look the same |
Agentic workloads have a property no reactive heuristic can infer: short-term recurrence. If the kernel just woke a control-loop agent, the probability of another wait/wake/compute cycle within the next few milliseconds is near-certain. That predictability is the lever the patch series exploits.
Thesis. The kernel needs a short-lived temporal signal — an agent latency window — that coordinates power, scheduling, completion, and memory policy across all relevant subsystems simultaneously, and expires immediately when the loop pauses.
## The Same Loop, Two Kernels
The optimization target is not a single wakeup. It is the repeated gap between wakeups — the window where the reactive kernel tears down warmth that the next step will immediately need.
## Seven Subsystems, One Control Signal
The design principle is consistent across all seven targets: inject the agent window into the existing decision point, rather than bypassing it. Each subsystem already has the right hook — the patch merely adds a time-bounded condition check at that hook.
Completions that gate the next agent step should refresh the window before userspace resumes — at the earliest possible moment in the wake path, not after the context switch completes. The patch adds a single call to task_agent_refresh_window() inside try_to_wake_up(), which runs in IRQ or softirq context.
Migrating the agent to a cold CPU on every wakeup blows the L1/L2 cache and may land on a different DVFS domain. When the agent window is active, a conservative affinity hint prefers the previous CPU if it is not IRQ-hot, not deeply idle, and in the same performance domain. This is a hint — EEVDF fairness is not bypassed.
```c
if (task_agent_latency(p) && rq_agent_window_active(rq, now))
        prefer p->wake_cpu      /* if not irq-hot, not deep-idle */
```
The PELT decay filter produces near-zero utilization after a long wait — exactly when the next step needs full frequency. The patch intercepts the raw utilization in sugov_update_single_freq() and enforces util = max(util, rq→agent_min_util) while the window is active. Unlike the performance governor, this is time-bounded and per-task.
```c
if (rq_agent_window_active(rq, time))
        util = max(util, rq->agent_min_util);
```
Frequency is only half the story. A CPU at max frequency in C6 still pays 200–500 µs on the next IRQ. The patch adds a post-governor veto: after the governor selects a C-state, walk down to the shallowest state whose exit_latency ≤ AGENT_LATENCY_IDLE_EXIT_LIMIT_US. Deep idle is not disabled globally — only during the bounded window.
```c
while (index > 0 && drv->states[index].exit_latency > LIMIT_US)
        index--;        /* walk down to C1 or C0 */
```
Agents often depend on local storage: vector indexes, local DBs, tokenizer files, cached documents. Step-gating reads — a small random read that determines when the next step starts — are currently indistinguishable from background I/O scans. The policy direction: when a request carries an agent-critical tag, prefer latency over batching and refresh the agent window on completion. Phase 1 is attribution (making the read visible); policy follows once the data is collected.
Agents repeatedly touch mmap'd indexes, embedding tables, and document caches — but also do large cold scans during retrieval. These look identical to the reclaim subsystem, which will evict the hot working set to make room for the scan. The proposed future hints separate the two cases:
- Small active set — resist eviction; likely touched in next step.
- Large cold traversal — don't evict the hotset to accommodate this.
Coding and retrieval agents are metadata-heavy: openat(), statx(), getdents64(), path lookup, dentry lookup, inode fetch. Repeated cold dentry lookups across a deep repository tree add step-by-step overhead invisible in any single trace. The optimization direction is identifying metadata hotsets and reducing repeated cold lookups — not bypassing the VFS. Phase 1 is observability; the current repo makes the path visible.
## The Budget Layer: Safety Without Compromise
Any mechanism that keeps CPUs warm, limits deep idle, or prioritizes completions can be abused. If every process claims "I am latency-critical," the system loses fairness, power efficiency, and thermal headroom. The patch series addresses this directly with a cgroup token-bucket model — the last and most important piece of the design.
```c
/* Called at each agent window arm — before granting the window. */
static bool agent_budget_check_and_debit(struct cgroup_agent_budget *bgt)
{
	u64 now = ktime_get_ns();
	u64 refill;

	/* Refill tokens based on elapsed time */
	refill = min((now - bgt->last_refill_ns) * bgt->refill_rate_per_ns,
		     bgt->max_tokens - bgt->tokens);
	bgt->tokens += refill;
	bgt->last_refill_ns = now;

	if (bgt->tokens < bgt->cost_per_window)
		return false;	/* budget exhausted: degrade gracefully */

	bgt->tokens -= bgt->cost_per_window;
	return true;		/* budget available: grant full window */
}
```
This transforms the agent window from a hint into a bounded control system. A well-behaved agent with a moderate step rate maintains its full budget indefinitely. A misbehaving agent that tries to hold the CPU warm continuously depletes its tokens and is throttled to standard reactive policy — without affecting any other cgroup.
## What Not to Do
The easy solutions are blunt and dangerous. They improve a benchmark while degrading the system for everyone else, and they don't survive production deployment.
- Run the agent thread as SCHED_FIFO 99 — starves other tasks, produces priority inversion
- Use the performance governor globally — wastes 15–40% system power continuously
- Disable deep idle system-wide — destroys laptop battery life and server rack efficiency
- Mark every I/O request from the agent process as high-priority — background indexing becomes latency-critical
- Bypass Linux policy entirely (VFIO, bare-metal busy-poll) — loses all multi-tenant fairness
The correct model is the opposite of all of these:
- Specific signal — only agent threads that opt in via sched_setattr() with SCHED_FLAG_AGENT_LATENCY
- Narrow window — expires automatically; no manual disarming required
- Measurable effect — every decision point is instrumented with a tracepoint
- Budgeted enforcement — cgroup token bucket prevents sustained power override
Design invariant. The agent window must be strictly time-bounded. Any mechanism that can be held open indefinitely by userspace becomes a global power governor override — which is exactly the problem we set out to fix.
## A Practical Six-Phase Plan
The patch series follows a deliberate order: measurement before policy. No optimization is proposed without first making the slowpath visible and attributable. This is the sequence:
| Phase | Goal | Kernel mechanism | Status |
|---|---|---|---|
| 1. Observe | Measure every slowpath in isolation | tracepoints + eBPF bpftrace scripts | done |
| 2. Attribute | Connect events to agent steps | agent_step_id / user_data io_uring convention | done |
| 3. Preserve warmth | Avoid repeated cold starts | DVFS window + cpuidle guard (Patch 0000) | RFC |
| 4. Place intelligently | Wake on the right CPU | scheduler locality hint + IRQ-hot avoidance | planned |
| 5. Prioritize narrowly | Protect only step-gating work | blk-mq request class · MM hotset hints · VFS dentry | planned |
| 6. Enforce budgets | Prevent abuse; maintain fairness | cgroup token bucket per-agent | planned |
## Part 3 Needs Data, Not Architecture
A credible Part 3 should not add more design. It should add numbers — answering the questions the design raises but cannot answer without empirical measurement:
```c
/* How often does DVFS ramp occur after agent wakeups? */
/* How much wakeup-to-run latency is due to scheduler placement? */
/* How much deferred-completion delay comes from workqueues/timers? */
/* How often do page faults or VFS metadata lookups gate agent steps? */
/* How much does the cgroup budget reduce abuse while preserving p99? */
/* What is the energy cost of the agent window (Watts vs. baseline)? */
```
That is the line between a design note and a systems paper. The repo's docs/measurement-plan.md defines the experimental setup, eBPF collection methodology, and statistical requirements for each question. The selftest in selftests/agent_loop.c provides the controlled workload to answer them.
"The bottleneck is not just the model. It is the system between iterations."
— Patch 0000 cover letter, kernel-dvfs-agentic-latency

## The New Signal
The old optimization signal for kernel scheduling and power management was a single dimension: utilization. The signal the agentic kernel needs is three-dimensional:
```c
/* old */
signal = utilization

/* new */
signal = {
        utilization,        /* what PELT measures today */
        temporal_intent,    /* "I will wake again within window_ns" */
        budget,             /* "I have N tokens of latency credit" */
}
```
Agentic workloads are the first mainstream workload class where temporal intent — the knowledge that a wakeup is not isolated but part of a recurring sequence — is both predictable and actionable. The kernel has always had utilization. What it has never had is a way to say: this CPU will be needed again soon, keep it ready, but stop when the loop does.