MANISH AI
Part 2 of N · Linux Kernel Deep Dive · May 2026

From Observability to Optimization

Part 1 mapped the slowpaths. Part 2 asks the harder question: what should the kernel actually do differently — and how do we prevent those changes from becoming global power overrides?

RFC Patch Series · reactive → predictive · 7 optimization targets · cgroup budget model · blk-mq · MM · VFS
← Part 1: The Latency Control Plane

Abstract. Part 1 instrumented the agentic slowpath across DVFS, cpuidle, scheduler wakeups, deferred completions, io_uring, blk-mq, MM/page cache, VFS metadata, and cgroup budgets. This second part moves from visibility to action. The central argument: the kernel should not optimize agent loops as independent wakeups. It should treat them as a repeated temporal control path — event submitted, event completed, agent resumed, next event submitted. That shift changes how we think about every subsystem in the chain, and crucially, how we prevent well-intentioned latency hints from becoming system-wide power overrides.

01 — Recap

What Part 1 Established

Part 1 introduced the agentic slowpath and showed that latency amplification isn't a single-subsystem problem. The complete kernel control path a wakeup traverses is:

control path — every agent step
IRQ / timer / workqueue / I/O / MM / VFS
        ↓
completion / wakeup
        ↓
scheduler placement
        ↓
DVFS / CPUFreq
        ↓
cpuidle
        ↓
userspace resumes

The key observation: no single hop is catastrophic. The problem is multiplication across steps. An agent repeating this path thousands of times turns sub-millisecond overhead into user-visible latency:

Latency_total = (T_irq + T_sched + T_dvfs + T_idle_exit + T_io + T_mm + T_vfs) × N_steps

"Part 1: make the slowpath visible. Part 2: decide how the kernel should optimize it."

— github.com/manishklach/kernel-dvfs-agentic-latency
02 — The Core Insight

Reactive is Wrong for Control Loops

All current Linux power management is built on a reactive model: observe utilization, idle time, runnable state, queue pressure, cache state — then respond. That is correct for batch jobs and stable-load services. For agentic control loops it is systematically wrong.

Subsystem         | Reactive signal today                 | Why it fails for agents
Scheduler         | runnable tasks, fairness, load        | doesn't know a wakeup is step N of a control loop
schedutil / DVFS  | PELT utilization signal               | short waits look like reduced demand → frequency sags
cpuidle           | predicted idle duration               | correctly selects deep C-states; pays 200–500 µs exit
IRQ / softirq     | device completions and budgets        | completion criticality is invisible to the kernel
blk-mq / io_uring | request issue/completion              | step-gating reads look identical to background scans
MM / VFS          | faults, cache hits, metadata lookups  | hot agent working set and cold traversals look the same

Agentic workloads have a property no reactive heuristic can infer: short-term recurrence. If the kernel just woke a control-loop agent, the probability of another wait/wake/compute cycle within the next few milliseconds is near-certain. That predictability is the lever the patch series exploits.

Thesis. The kernel needs a short-lived temporal signal — an agent latency window — that coordinates power, scheduling, completion, and memory policy across all relevant subsystems simultaneously, and expires immediately when the loop pauses.

03 — Before / After

The Same Loop, Two Kernels

The optimization target is not a single wakeup. It is the repeated gap between wakeups — the window where the reactive kernel tears down warmth that the next step will immediately need.

Fig 1 — Reactive kernel vs. agent-aware kernel (same workload)
BEFORE: REACTIVE KERNEL
  wait (epoll / io_uring) → downclock (PELT decays → min freq) → deep C-state entered
  IRQ fires → completion → ramp penalty (C-state exit: 200–500 µs; freq ramp: 50–150 µs)
  run, finally warm. Paid every step × N.

AFTER: AGENT-AWARE KERNEL (active window)
  wait (epoll / io_uring) → stay warm (window open: util ≥ min; cpuidle ≤ C1)
  IRQ fires → window refresh → warm wakeup (C1 exit: < 10 µs; no freq ramp)
  run, already at frequency. agent_active_until_ns armed; expires if the loop pauses → PM resumes.

IF LOOP PAUSES → WINDOW EXPIRES → POWER MANAGEMENT RESUMES IMMEDIATELY. Zero energy wasted when the agent is genuinely idle.
▲ The reactive kernel tears down warmth during every wait. The agent-aware kernel holds it for a bounded window — then releases it immediately if the loop stops.
04 — Optimization Targets

Seven Subsystems, One Control Signal

The design principle is consistent across all seven targets: inject the agent window into the existing decision point, rather than bypassing it. Each subsystem already has the right hook — the patch merely adds a time-bounded condition check at that hook.

TARGET 01 / 07
Completion-Aware Wakeups
kernel/sched/core.c → try_to_wake_up()

Completions that gate the next agent step should refresh the window before userspace resumes — at the earliest possible moment in the wake path, not after the context switch completes. The patch adds a single call to task_agent_refresh_window() inside try_to_wake_up(), which runs in IRQ or softirq context.

pre-context-switch
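In the same conceptual style as the snippets for the other targets, the hook might look like this. The helper names come from the text; the exact placement inside try_to_wake_up() is an assumption of this sketch, not the literal patch:

```c
/* kernel/sched/core.c (conceptual) */
static int try_to_wake_up(struct task_struct *p, unsigned int state,
                          int wake_flags)
{
    ...
    /* Earliest moment the kernel knows p is about to run again:
     * refresh the window before placement, DVFS, and cpuidle
     * consult it. Runs in IRQ/softirq context, so it must be
     * O(1) and lock-free. */
    if (task_agent_latency(p))
        task_agent_refresh_window(p);
    ...
}
```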
TARGET 02 / 07
Scheduler Wakeup Placement
kernel/sched/core.c → select_task_rq()

Migrating the agent to a cold CPU on every wakeup blows the L1/L2 cache and may land on a different DVFS domain. When the agent window is active, a conservative affinity hint prefers the previous CPU if it is not IRQ-hot, not deeply idle, and in the same performance domain. This is a hint — EEVDF fairness is not bypassed.

placement policy
if (task_agent_latency(p) && rq_agent_window_active(rq, now))
    prefer p->wake_cpu /* if not irq-hot, not deep-idle */
cache locality
TARGET 03 / 07
DVFS Stickiness (schedutil)
kernel/sched/cpufreq_schedutil.c → sugov_update_single_freq()

The PELT decay filter produces near-zero utilization after a long wait — exactly when the next step needs full frequency. The patch intercepts the raw utilization in sugov_update_single_freq() and enforces util = max(util, rq->agent_min_util) while the window is active. Unlike the performance governor, this is time-bounded and per-task.

schedutil intercept
if (rq_agent_window_active(rq, time))
    util = max(util, rq->agent_min_util);
util floor
TARGET 04 / 07
cpuidle Exit-Latency Guard
drivers/cpuidle/cpuidle.c → cpuidle_enter_state()

Frequency is only half the story. A CPU at max frequency in C6 still pays 200–500 µs on the next IRQ. The patch adds a post-governor veto: after the governor selects a C-state, walk down to the shallowest state whose exit_latency ≤ AGENT_LATENCY_IDLE_EXIT_LIMIT_US. Deep idle is not disabled globally — only during the bounded window.

C-state veto
while (index > 0 &&
       drv->states[index].exit_latency > LIMIT_US)
    index--;  /* walk down to C1 or C0 */
C-state cap
TARGET 05 / 07
blk-mq and io_uring
block/blk-mq.c · io_uring/io_uring.c

Agents often depend on local storage: vector indexes, local DBs, tokenizer files, cached documents. Step-gating reads — a small random read that determines when the next step starts — are currently indistinguishable from background I/O scans. The policy direction: when a request carries an agent-critical tag, prefer latency over batching and refresh the agent window on completion. Phase 1 is attribution (making the read visible); policy follows once the data is collected.

step-gating reads
TARGET 06 / 07
MM and Page Cache
mm/filemap.c · mm/vmscan.c

Agents repeatedly touch mmap'd indexes, embedding tables, and document caches — but also do large cold scans during retrieval. These look identical to the reclaim subsystem, which will evict the hot working set to make room for the scan. The proposed future hints separate the two cases:

hot
MADV_AGENT_HOTSET

Small active set — resist eviction; likely touched in next step.

cold
MADV_AGENT_SCAN

Large cold traversal — don't evict the hotset to accommodate this.

measurement first
TARGET 07 / 07
VFS Metadata
fs/namei.c · fs/dcache.c

Coding and retrieval agents are metadata-heavy: openat(), statx(), getdents64(), path lookup, dentry lookup, inode fetch. Repeated cold dentry lookups across a deep repository tree add step-by-step overhead invisible in any single trace. The optimization direction is identifying metadata hotsets and reducing repeated cold lookups — not bypassing the VFS. Phase 1 is observability; the current repo makes the path visible.

dentry hotset
05 — Enforcement

The Budget Layer: Safety Without Compromise

Any mechanism that keeps CPUs warm, limits deep idle, or prioritizes completions can be abused. If every process claims "I am latency-critical," the system loses fairness, power efficiency, and thermal headroom. The patch series addresses this directly with a cgroup token-bucket model — the last and most important piece of the design.

Fig 2 — Cgroup token bucket: bounded latency control
cgroup token bucket:
  tokens += refill_rate × time   (budget per cgroup / per agent; controlled by cgroup config)
  tokens -= cost_per_window      (debited per agent step)
  tokens available → full agent latency window granted (DVFS floor + cpuidle cap active)
  tokens exhausted → degrade gracefully: fall back to reactive PM
▲ The token bucket prevents privilege escalation: sustained high-frequency agents deplete their budget and gracefully fall back to standard power management.
C — cgroup budget accounting (conceptual)
/* Called at each agent window arm — before granting the window. */
static bool agent_budget_check_and_debit(struct cgroup_agent_budget *bgt)
{
    u64 now = ktime_get_ns();
    u64 refill;

    /* Refill tokens based on elapsed time */
    refill = min((now - bgt->last_refill_ns) * bgt->refill_rate_per_ns,
                 bgt->max_tokens - bgt->tokens);
    bgt->tokens        += refill;
    bgt->last_refill_ns = now;

    if (bgt->tokens < bgt->cost_per_window)
        return false;  /* budget exhausted: degrade gracefully */

    bgt->tokens -= bgt->cost_per_window;
    return true;  /* budget available: grant full window */
}

This transforms the agent window from a hint into a bounded control system. A well-behaved agent with a moderate step rate maintains its full budget indefinitely. A misbehaving agent that tries to hold the CPU warm continuously depletes its tokens and is throttled to standard reactive policy — without affecting any other cgroup.

06 — Anti-Patterns

What Not to Do

The easy solutions are blunt and dangerous: pin the performance governor system-wide, disable deep C-states outright, or expose an unbounded "latency-critical" flag that any process can claim and hold forever. Each improves a benchmark while degrading the system for everyone else, and none of them survives production deployment.

The correct model is the opposite of all of these:

Design invariant. The agent window must be strictly time-bounded. Any mechanism that can be held open indefinitely by userspace becomes a global power governor override — which is exactly the problem we set out to fix.

07 — Roadmap

A Practical Six-Phase Plan

The patch series follows a deliberate order: measurement before policy. No optimization is proposed without first making the slowpath visible and attributable. This is the sequence:

Phase | Goal | Kernel mechanism | Status
1 Observe | Measure every slowpath in isolation | tracepoints + eBPF bpftrace scripts | done
2 Attribute | Connect events to agent steps | agent_step_id / io_uring user_data convention | done
3 Preserve warmth | Avoid repeated cold starts | DVFS window + cpuidle guard (Patch 0000) | RFC
4 Place intelligently | Wake on the right CPU | scheduler locality hint + IRQ-hot avoidance | planned
5 Prioritize narrowly | Protect only step-gating work | blk-mq request class · MM hotset hints · VFS dentry hotset | planned
6 Enforce budgets | Prevent abuse; maintain fairness | per-agent cgroup token bucket | planned
Fig 3 — The agentic slowpath control plane: bounded, coordinated, expiring
Event (IRQ / timer / I/O) → Completion (softirq / blk / work) → Scheduler (warm placement) → DVFS (min-util window) → cpuidle (exit-latency cap)
  rq shared state, O(1) read by all subsystems: agent_active_until_ns · agent_min_util
  armed at wakeup · expires automatically · budgeted by cgroup
  CGROUP BUDGET LAYER: prevents abuse; all four subsystems obey the same token bucket
▲ All subsystems read the same two rq fields in O(1). The budget layer sits beneath all of them — a single enforcement point for the entire control plane.
08 — What's Next

Part 3 Needs Data, Not Architecture

A credible Part 3 should not add more design. It should add numbers — answering the questions the design raises but cannot answer without empirical measurement:

open questions — to be answered with data
/* How often does DVFS ramp occur after agent wakeups? */
/* How much wakeup-to-run latency is due to scheduler placement? */
/* How much deferred-completion delay comes from workqueues/timers? */
/* How often do page faults or VFS metadata lookups gate agent steps? */
/* How much does the cgroup budget reduce abuse while preserving p99? */
/* What is the energy cost of the agent window (Watts vs. baseline)? */

That is the line between a design note and a systems paper. The repo's docs/measurement-plan.md defines the experimental setup, eBPF collection methodology, and statistical requirements for each question. The selftest in selftests/agent_loop.c provides the controlled workload to answer them.

"The bottleneck is not just the model. It is the system between iterations."

— Patch 0000 cover letter, kernel-dvfs-agentic-latency

The new signal

The old optimization signal for kernel scheduling and power management was a single dimension: utilization. The signal the agentic kernel needs is three-dimensional:

new kernel signal — three dimensions
/* old */
signal = utilization

/* new */
signal = {
    utilization,         /* what PELT measures today            */
    temporal_intent,     /* "I will wake again within window_ns" */
    budget,              /* "I have N tokens of latency credit"  */
}

Agentic workloads are the first mainstream workload class where temporal intent — the knowledge that a wakeup is not isolated but part of a recurring sequence — is both predictable and actionable. The kernel has always had utilization. What it has never had is a way to say: this CPU will be needed again soon, keep it ready, but stop when the loop does.