MANISH AI
Part 2 of N · Linux Kernel Deep Dive · May 2026

From Observability to Optimization

Part 1 mapped the slowpaths. Part 2 asks the harder question: what should the kernel actually do differently — and how do we prevent those changes from becoming global power overrides?

RFC Patch Series · reactive → predictive · 7 optimization targets · cgroup budget model · blk-mq · MM · VFS
← Part 1: The Latency Control Plane

Abstract. Part 1 instrumented the agentic slowpath across DVFS, cpuidle, scheduler wakeups, deferred completions, io_uring, blk-mq, MM/page cache, VFS metadata, and cgroup budgets. This second part moves from visibility to action. The central argument: the kernel should not optimize agent loops as independent wakeups. It should treat them as a repeated temporal control path — event submitted, event completed, agent resumed, next event submitted. That shift changes how we think about every subsystem in the chain, and crucially, how we prevent well-intentioned latency hints from becoming system-wide power overrides.

01 — Recap

What Part 1 Established

Part 1 introduced the agentic slowpath and showed that latency amplification isn't a single-subsystem problem. The complete kernel control path a wakeup traverses is:

control path — every agent step
IRQ / timer / workqueue / I/O / MM / VFS
        ↓
completion / wakeup
        ↓
scheduler placement
        ↓
DVFS / CPUFreq
        ↓
cpuidle
        ↓
userspace resumes

The key observation: no single hop is catastrophic. The problem is multiplication across steps. An agent repeating this path thousands of times turns sub-millisecond overhead into user-visible latency:

Latency_total = (T_irq + T_sched + T_dvfs + T_idle_exit + T_io + T_mm + T_vfs) × N_steps

"Part 1: make the slowpath visible. Part 2: decide how the kernel should optimize it."

— github.com/manishklach/kernel-dvfs-agentic-latency
02 — The Core Insight

Reactive is Wrong for Control Loops

All current Linux power management is built on a reactive model: observe utilization, idle time, runnable state, queue pressure, cache state — then respond. That is correct for batch jobs and stable-load services. For agentic control loops it is systematically wrong.

Subsystem         | Reactive signal today                 | Why it fails for agents
Scheduler         | runnable tasks, fairness, load        | doesn't know a wakeup is step N of a control loop
schedutil / DVFS  | PELT utilization signal               | short waits look like reduced demand → frequency sags
cpuidle           | predicted idle duration               | correctly selects deep C-states; pays 200–500 µs exit
IRQ / softirq     | device completions and budgets        | completion criticality is invisible to the kernel
blk-mq / io_uring | request issue/completion              | step-gating reads look identical to background scans
MM / VFS          | faults, cache hits, metadata lookups  | hot agent working set and cold traversals look the same

Agentic workloads have a property no reactive heuristic can infer: short-term recurrence. If the kernel just woke a control-loop agent, the probability of another wait/wake/compute cycle within the next few milliseconds is near-certain. That predictability is the lever the patch series exploits.

Thesis. The kernel needs a short-lived temporal signal — an agent latency window — that coordinates power, scheduling, completion, and memory policy across all relevant subsystems simultaneously, and expires immediately when the loop pauses.

03 — Before / After

The Same Loop, Two Kernels

The optimization target is not a single wakeup. It is the repeated gap between wakeups — the window where the reactive kernel tears down warmth that the next step will immediately need.

Fig 1 — Reactive kernel vs. agent-aware kernel (same workload)
BEFORE: REACTIVE KERNEL
  wait (epoll / io_uring) → downclock (PELT decays → min freq) → deep C-state entered
  IRQ fires → completion → ramp penalty (C-state exit: 200–500 µs; freq ramp: 50–150 µs)
  run, finally warm. Paid every step × N.

AFTER: AGENT-AWARE KERNEL (active window)
  wait (epoll / io_uring) → stay warm (window open: util ≥ min; cpuidle ≤ C1)
  IRQ fires → window refresh → warm wakeup (C1 exit: < 10 µs; no freq ramp)
  run, already at frequency. agent_active_until_ns armed; expires if the loop pauses → PM resumes.

IF LOOP PAUSES → WINDOW EXPIRES → POWER MANAGEMENT RESUMES IMMEDIATELY. Zero energy wasted when the agent is genuinely idle.
▲ The reactive kernel tears down warmth during every wait. The agent-aware kernel holds it for a bounded window — then releases it immediately if the loop stops.
04 — Optimization Targets

Seven Subsystems, One Control Signal

The design principle is consistent across all seven targets: inject the agent window into the existing decision point, rather than bypassing it. Each subsystem already has the right hook — the patch merely adds a time-bounded condition check at that hook.

TARGET 01 / 07
Completion-Aware Wakeups
kernel/sched/core.c → try_to_wake_up()

Completions that gate the next agent step should refresh the window before userspace resumes — at the earliest possible moment in the wake path, not after the context switch completes. The patch adds a single call to task_agent_refresh_window() inside try_to_wake_up(), which runs in IRQ or softirq context.

pre-context-switch
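In the same conceptual style as the snippets for the other targets, the hook might look like this. The helper names come from the text; the exact placement inside try_to_wake_up() is an assumption of this sketch, not the literal patch:

```c
/* kernel/sched/core.c (conceptual) */
static int try_to_wake_up(struct task_struct *p, unsigned int state,
                          int wake_flags)
{
    ...
    /* Earliest moment the kernel knows p is about to run again:
     * refresh the window before placement, DVFS, and cpuidle
     * consult it. Runs in IRQ/softirq context, so it must be
     * O(1) and lock-free. */
    if (task_agent_latency(p))
        task_agent_refresh_window(p);
    ...
}
```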
TARGET 02 / 07
Scheduler Wakeup Placement
kernel/sched/core.c → select_task_rq()

Migrating the agent to a cold CPU on every wakeup blows the L1/L2 cache and may land on a different DVFS domain. When the agent window is active, a conservative affinity hint prefers the previous CPU if it is not IRQ-hot, not deeply idle, and in the same performance domain. This is a hint — EEVDF fairness is not bypassed.

placement policy
if (task_agent_latency(p) && rq_agent_window_active(rq, now))
    prefer p->wake_cpu /* if not irq-hot, not deep-idle */
cache locality
TARGET 03 / 07
DVFS Stickiness (schedutil)
kernel/sched/cpufreq_schedutil.c → sugov_update_single_freq()

The PELT decay filter produces near-zero utilization after a long wait — exactly when the next step needs full frequency. The patch intercepts the raw utilization in sugov_update_single_freq() and enforces util = max(util, rq->agent_min_util) while the window is active. Unlike the performance governor, this is time-bounded and per-task.

schedutil intercept
if (rq_agent_window_active(rq, time))
    util = max(util, rq->agent_min_util);
util floor
TARGET 04 / 07
cpuidle Exit-Latency Guard
drivers/cpuidle/cpuidle.c → cpuidle_enter_state()

Frequency is only half the story. A CPU at max frequency in C6 still pays 200–500 µs on the next IRQ. The patch adds a post-governor veto: after the governor selects a C-state, walk down to the shallowest state whose exit_latency ≤ AGENT_LATENCY_IDLE_EXIT_LIMIT_US. Deep idle is not disabled globally — only during the bounded window.

C-state veto
while (index > 0 &&
       drv->states[index].exit_latency > LIMIT_US)
    index--;  /* walk down to C1 or C0 */
C-state cap
TARGET 05 / 07
blk-mq and io_uring
block/blk-mq.c · io_uring/io_uring.c

Agents often depend on local storage: vector indexes, local DBs, tokenizer files, cached documents. Step-gating reads — a small random read that determines when the next step starts — are currently indistinguishable from background I/O scans. The policy direction: when a request carries an agent-critical tag, prefer latency over batching and refresh the agent window on completion. Phase 1 is attribution (making the read visible); policy follows once the data is collected.

step-gating reads
TARGET 06 / 07
MM and Page Cache
mm/filemap.c · mm/vmscan.c

Agents repeatedly touch mmap'd indexes, embedding tables, and document caches — but also do large cold scans during retrieval. These look identical to the reclaim subsystem, which will evict the hot working set to make room for the scan. The proposed future hints separate the two cases:

hot
MADV_AGENT_HOTSET

Small active set — resist eviction; likely touched in next step.

cold
MADV_AGENT_SCAN

Large cold traversal — don't evict the hotset to accommodate this.

measurement first
TARGET 07 / 07
VFS Metadata
fs/namei.c · fs/dcache.c

Coding and retrieval agents are metadata-heavy: openat(), statx(), getdents64(), path lookup, dentry lookup, inode fetch. Repeated cold dentry lookups across a deep repository tree add step-by-step overhead invisible in any single trace. The optimization direction is identifying metadata hotsets and reducing repeated cold lookups — not bypassing the VFS. Phase 1 is observability; the current repo makes the path visible.

dentry hotset
05 — Enforcement

The Budget Layer: Safety Without Compromise

Any mechanism that keeps CPUs warm, limits deep idle, or prioritizes completions can be abused. If every process claims "I am latency-critical," the system loses fairness, power efficiency, and thermal headroom. The patch series addresses this directly with a cgroup token-bucket model — the last and most important piece of the design.

Fig 2 — Cgroup token bucket: bounded latency control
cgroup token bucket:
  tokens += refill_rate × time   (budget per cgroup / per agent; controlled by cgroup config)
  tokens -= cost_per_window      (debited per agent step)
  tokens available → full agent latency window granted (DVFS floor + cpuidle cap active)
  tokens exhausted → degrade gracefully: fall back to reactive PM
▲ The token bucket prevents privilege escalation: sustained high-frequency agents deplete their budget and gracefully fall back to standard power management.
C — cgroup budget accounting (conceptual)
/* Called at each agent window arm — before granting the window. */
static bool agent_budget_check_and_debit(struct cgroup_agent_budget *bgt)
{
    u64 now = ktime_get_ns();
    u64 refill;

    /* Refill tokens based on elapsed time */
    refill = min((now - bgt->last_refill_ns) * bgt->refill_rate_per_ns,
                 bgt->max_tokens - bgt->tokens);
    bgt->tokens        += refill;
    bgt->last_refill_ns = now;

    if (bgt->tokens < bgt->cost_per_window)
        return false;  /* budget exhausted: degrade gracefully */

    bgt->tokens -= bgt->cost_per_window;
    return true;  /* budget available: grant full window */
}

This transforms the agent window from a hint into a bounded control system. A well-behaved agent with a moderate step rate maintains its full budget indefinitely. A misbehaving agent that tries to hold the CPU warm continuously depletes its tokens and is throttled to standard reactive policy — without affecting any other cgroup.

06 — Anti-Patterns

What Not to Do

The easy solutions are blunt and dangerous: pin the performance governor system-wide, disable deep C-states outright, or expose an unbounded "latency-critical" flag that any process can claim and hold forever. Each improves a benchmark while degrading the system for everyone else, and none of them survives production deployment.

The correct model is the opposite of all of these:

Design invariant. The agent window must be strictly time-bounded. Any mechanism that can be held open indefinitely by userspace becomes a global power governor override — which is exactly the problem we set out to fix.

07 — Roadmap

A Practical Six-Phase Plan

The patch series follows a deliberate order: measurement before policy. No optimization is proposed without first making the slowpath visible and attributable. This is the sequence:

Phase | Goal | Kernel mechanism | Status
1 Observe | Measure every slowpath in isolation | tracepoints + eBPF bpftrace scripts | done
2 Attribute | Connect events to agent steps | agent_step_id / io_uring user_data convention | done
3 Preserve warmth | Avoid repeated cold starts | DVFS window + cpuidle guard (Patch 0000) | RFC
4 Place intelligently | Wake on the right CPU | scheduler locality hint + IRQ-hot avoidance | planned
5 Prioritize narrowly | Protect only step-gating work | blk-mq request class · MM hotset hints · VFS dentry hotset | planned
6 Enforce budgets | Prevent abuse; maintain fairness | per-agent cgroup token bucket | planned
Fig 3 — The agentic slowpath control plane: bounded, coordinated, expiring
Event (IRQ / timer / I/O) → Completion (softirq / blk / work) → Scheduler (warm placement) → DVFS (min-util window) → cpuidle (exit-latency cap)
  rq shared state, O(1) read by all subsystems: agent_active_until_ns · agent_min_util
  armed at wakeup · expires automatically · budgeted by cgroup
  CGROUP BUDGET LAYER: prevents abuse; all four subsystems obey the same token bucket
▲ All subsystems read the same two rq fields in O(1). The budget layer sits beneath all of them — a single enforcement point for the entire control plane.
08 — What's Next

Part 3 Needs Data, Not Architecture

A credible Part 3 should not add more design. It should add numbers — answering the questions the design raises but cannot answer without empirical measurement:

open questions — to be answered with data
/* How often does DVFS ramp occur after agent wakeups? */
/* How much wakeup-to-run latency is due to scheduler placement? */
/* How much deferred-completion delay comes from workqueues/timers? */
/* How often do page faults or VFS metadata lookups gate agent steps? */
/* How much does the cgroup budget reduce abuse while preserving p99? */
/* What is the energy cost of the agent window (Watts vs. baseline)? */

That is the line between a design note and a systems paper. The repo's docs/measurement-plan.md defines the experimental setup, eBPF collection methodology, and statistical requirements for each question. The selftest in selftests/agent_loop.c provides the controlled workload to answer them.

"The bottleneck is not just the model. It is the system between iterations."

— Patch 0000 cover letter, kernel-dvfs-agentic-latency

The new signal

The old optimization signal for kernel scheduling and power management was a single dimension: utilization. The signal the agentic kernel needs is three-dimensional:

new kernel signal — three dimensions
/* old */
signal = utilization

/* new */
signal = {
    utilization,         /* what PELT measures today            */
    temporal_intent,     /* "I will wake again within window_ns" */
    budget,              /* "I have N tokens of latency credit"  */
}

Agentic workloads are the first mainstream workload class where temporal intent — the knowledge that a wakeup is not isolated but part of a recurring sequence — is both predictable and actionable. The kernel has always had utilization. What it has never had is a way to say: this CPU will be needed again soon, keep it ready, but stop when the loop does.