When hardware coherency steps back, software scheduling steps forward. As AI machines move from transparent coherence to explicit data movement, the memory scheduler quietly becomes the most performance-critical component in the stack — and most systems aren't treating it that way yet.
In classical CPU systems, the hardware coherency protocol handled the hard problem: when thread A writes to address X, every other thread that holds a copy of X gets invalidated automatically. No scheduling required. The programming model was simple because hardware did the work invisibly.
That invisibility came with a cost — snoop traffic, invalidation messages, ownership churn — but for CPU working sets in the tens of megabytes, the economics were fine. The programmer got simplicity; the hardware paid the coordination tax.
AI machines are breaking this contract. As we established, large-model inference involves working sets in the hundreds of gigabytes, access patterns that are mostly read-heavy and often predictable, and economics where every byte of interconnect bandwidth spent on coherency is a byte not spent on tensors. So the hardware is stepping back. And when hardware steps back, software has to step up.
That software is the memory scheduler. And unlike CPU schedulers, which mostly manage compute time slices, AI memory schedulers must manage placement, movement, eviction, and prefetch across a multi-tier memory hierarchy in real time, while the inference engine is mid-compute and starvation costs you tokens per second.
Let's be concrete. A production AI memory scheduler is responsible for:
Where does each tensor live at each point in time? Which layers are in HBM? Which are staged in DRAM? Which are cold on NVMe? And crucially — how does that change dynamically as requests arrive, context lengths grow, and batch composition shifts?
What should be loaded next, and when should the DMA start? The scheduler must predict future compute needs far enough in advance to hide the latency of the slowest tier (NVMe: ~100μs, DRAM→HBM: ~10μs), without over-prefetching and wasting precious HBM capacity.
What gets kicked out of HBM when it's full? LRU is a start, but in AI workloads, recency is often a bad predictor of future need. Smarter schedulers use layer ordering, request lifetime estimates, and priority to make eviction decisions that are much more accurate than cache-style policies.
How do movement operations share the transfer fabric? DMA engines, NVLink transfers, PCIe copies — these are all resources with contention. The scheduler must sequence and prioritize them, overlap them with compute where possible, and avoid oversubscribing transfer bandwidth that is already the primary bottleneck.
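Taken together, these responsibilities amount to a placement table plus a prioritized movement queue. The following is a minimal sketch with hypothetical names, assuming a three-tier HBM/DRAM/NVMe hierarchy; it is an illustration of the bookkeeping involved, not a real runtime's API:

```python
from enum import Enum

class Tier(Enum):
    HBM = 0    # ~3 TB/s, tens of GB
    DRAM = 1   # ~100 GB/s over PCIe/CXL, hundreds of GB
    NVME = 2   # ~10 GB/s, terabytes

class PlacementTable:
    """Tracks where each tensor lives and what movement the scheduler owes it."""
    def __init__(self):
        self.location = {}   # tensor_id -> Tier
        self.pending = []    # queued movement ops: (priority, tensor_id, dst_tier)

    def place(self, tensor_id, tier):
        self.location[tensor_id] = tier

    def request_move(self, tensor_id, dst_tier, priority):
        # Movement is asynchronous: the scheduler sequences transfers by
        # priority rather than issuing them eagerly, so it never
        # oversubscribes the interconnect with low-value copies.
        self.pending.append((priority, tensor_id, dst_tier.value))
        self.pending.sort()

    def next_transfer(self):
        return self.pending.pop(0) if self.pending else None

table = PlacementTable()
table.place("layer_0.weights", Tier.HBM)
table.place("layer_79.weights", Tier.NVME)
table.request_move("layer_79.weights", Tier.DRAM, priority=5)   # background staging
table.request_move("layer_1.kv_block_3", Tier.HBM, priority=1)  # needed soon
print(table.next_transfer())  # lowest priority number (most urgent) goes first
```

The point of the sketch is the separation: placement state is a lookup, but movement is a scheduled resource, exactly the distinction that does not exist in a hardware-managed cache hierarchy.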
Notice that none of this exists in a CPU system. The L1/L2/L3 cache hierarchy manages itself. DRAM is demand-fetched. The programmer never thinks about it. In AI systems, all of this is now your problem — or more precisely, the runtime's problem.
The scheduler must prefetch data before it's needed, which means it must predict what will be needed and when. For a single static model serving single-length requests, this is easy — layer ordering is fixed, compute time per layer is deterministic. But in production, batch composition shifts continuously, output lengths are unknown in advance, and data-dependent paths like MoE routing make the access sequence only partially predictable.
State-of-the-art schedulers handle this with a combination of conservative prefetch horizons, priority-based preemption when predictions are wrong, and request metadata (estimated output length, task type) to guide placement decisions.
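A conservative prefetch horizon can be expressed as a simple timing rule: issue a transfer only when the remaining compute time is about to fall below the source tier's load latency plus slack. The sketch below is illustrative, using the latency figures from the text (~100μs NVMe, ~10μs DRAM→HBM) and a made-up safety factor:

```python
# Illustrative tier load latencies, in microseconds (figures from the text).
TIER_LATENCY_US = {"nvme": 100.0, "dram": 10.0}
SAFETY_FACTOR = 2.0  # slack for mispredicted per-layer compute times

def should_prefetch_now(src_tier, compute_us_until_needed):
    """Start the DMA just early enough to hide src_tier latency, with slack.

    Issuing earlier than this wastes HBM capacity on data that is not yet
    needed; issuing later risks stalling compute on the transfer.
    """
    return compute_us_until_needed <= TIER_LATENCY_US[src_tier] * SAFETY_FACTOR

# A layer on NVMe needed in ~150us of compute: start the transfer now.
print(should_prefetch_now("nvme", 150.0))  # True
# A layer in DRAM needed in 150us: issuing now would pin HBM too early.
print(should_prefetch_now("dram", 150.0))  # False
```

The safety factor is where request metadata earns its keep: tighter compute-time estimates allow a smaller factor, which means less HBM held hostage by in-flight prefetches.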
HBM is fast but finite. On an H100, 80GB sounds like a lot — until you're serving a 70B-parameter model (70GB of weights alone at 8-bit precision) plus KV cache for a batch of concurrent 128K-context requests (easily another 40–80GB). You're immediately over capacity. The scheduler must decide what to evict, and those decisions directly translate to latency spikes or throughput drops.
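The KV cache side of that arithmetic follows from the standard sizing formula: two tensors (K and V) per layer, per token, per KV head. The model configuration below is hypothetical but representative of a 70B-class model with grouped-query attention:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # K and V each store n_kv_heads * head_dim values per token, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical 70B-class config: 80 layers, 8 KV heads (grouped-query
# attention), head_dim 128, fp16 cache.
per_request = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                             seq_len=128 * 1024)
print(f"{per_request / 2**30:.1f} GiB per 128K-context request")  # 40.0 GiB
```

One full-length request already consumes half an H100's HBM under these assumptions; without grouped-query attention (e.g. 64 KV heads) the same context would need 8× more, which is why aggregate KV cache so easily dwarfs the weights.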
Pure LRU eviction is dangerous here because it doesn't account for reuse distance. A layer that was accessed five seconds ago might be needed again in 200ms. An activation that was accessed 50ms ago might never be needed again. The scheduler needs reuse-distance awareness, not just recency.
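A reuse-distance-aware victim selection can be sketched as an approximation of Belady's optimal policy: evict whatever is predicted to be needed farthest in the future, regardless of when it was last touched. Names and the prediction source are hypothetical:

```python
class ReuseAwareEvictor:
    """Pick eviction victims by predicted next use, not recency.

    Predictions might come from layer ordering (weights cycle every forward
    pass) or request lifetime estimates (activations of finished steps are
    dead). This sketch just stores them.
    """
    def __init__(self):
        self.next_use_ms = {}  # tensor_id -> predicted ms until next access

    def record_prediction(self, tensor_id, next_use_ms):
        self.next_use_ms[tensor_id] = next_use_ms

    def pick_victim(self):
        # Farthest predicted reuse is the safest eviction; "never again"
        # is represented as infinity.
        return max(self.next_use_ms, key=self.next_use_ms.get)

ev = ReuseAwareEvictor()
ev.record_prediction("layer_12.weights", 200.0)          # needed again in 200ms
ev.record_prediction("req_41.activation", float("inf"))  # dead after last use
print(ev.pick_victim())  # req_41.activation (LRU would likely have kept it)
```

This is exactly the inversion the text describes: the activation touched 50ms ago is the right victim, and the layer untouched for five seconds is the wrong one.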
Every DMA transfer competes with the GPU's own memory access patterns on the HBM bus. If the scheduler issues a large prefetch exactly when the GPU is in a bandwidth-intensive attention pass, they compete for the same memory controllers. The best schedulers model the GPU's bandwidth consumption curve and schedule movement during compute-bound phases (like MLP layers with high arithmetic intensity) rather than memory-bound phases (like attention over long sequences).
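One way to model this is to classify the kernel timeline into compute-bound and memory-bound windows and assign pending transfers only to the former. The classification and kernel names below are illustrative; a real system would derive the phase from its launch stream or profiler counters:

```python
def schedule_transfers(kernel_timeline, transfers):
    """Assign each pending transfer to a compute-bound kernel window.

    kernel_timeline: list of (kernel_name, duration_ms) in execution order.
    transfers: list of transfer ids, in priority order.
    Assumption: MLP/GEMM kernels have high arithmetic intensity and leave
    HBM bandwidth headroom; long-sequence attention does not.
    """
    COMPUTE_BOUND = {"mlp", "gemm"}
    windows = [i for i, (name, _) in enumerate(kernel_timeline)
               if name in COMPUTE_BOUND]
    # Greedily map transfers onto windows; leftovers wait for the next pass.
    return {t: (kernel_timeline[w][0], w) for t, w in zip(transfers, windows)}

timeline = [("attention", 3.0), ("mlp", 1.5), ("attention", 3.2), ("gemm", 1.4)]
plan = schedule_transfers(timeline, ["expert_7.shard", "layer_30.weights"])
print(plan)  # both transfers land in compute-bound windows (indices 1 and 3)
```

The greedy mapping is deliberately naive; the point is the phase classification, which keeps bulk DMA traffic off the memory controllers while attention is saturating them.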
The KV cache is where memory scheduling gets genuinely difficult, and where the wins from doing it well are largest.
Unlike model weights (which are static during inference), the KV cache is dynamic. It grows with every generated token. It's private to a request but shared across a request's decoding steps. Its lifetime is the request's lifetime — unpredictable, varying from milliseconds to minutes. And in a serving system handling thousands of concurrent requests, the aggregate KV cache can easily dwarf the model weights.
PagedAttention, vLLM's application of OS-style paging to the KV cache (fixed-size blocks instead of contiguous per-request buffers), unlocked 3–10× throughput improvements in practice, not because attention got faster, but because the memory scheduler got smarter about placement and fragmentation. This is the template for what AI memory scheduling needs to become more broadly.
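A minimal sketch of the paging idea, in the spirit of PagedAttention but with illustrative names and a toy pool size: each request holds a block table rather than one contiguous region, so the cache can grow token by token without reallocation, and freed blocks return to the pool with no external fragmentation.

```python
class PagedKVAllocator:
    """Toy paged KV-cache allocator: fixed-size blocks plus per-request
    block tables (hypothetical sketch, not vLLM's actual implementation)."""
    def __init__(self, num_blocks, block_tokens=16):
        self.block_tokens = block_tokens
        self.free = list(range(num_blocks))
        self.tables = {}   # request_id -> list of physical block ids
        self.lengths = {}  # request_id -> tokens stored so far

    def append_token(self, request_id):
        n = self.lengths.get(request_id, 0)
        if n % self.block_tokens == 0:  # last block full: map one more
            if not self.free:
                raise MemoryError("KV pool exhausted; evict or preempt")
            self.tables.setdefault(request_id, []).append(self.free.pop())
        self.lengths[request_id] = n + 1

    def release(self, request_id):
        # Request finished: every block returns to the pool instantly,
        # with zero external fragmentation.
        self.free.extend(self.tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

alloc = PagedKVAllocator(num_blocks=4, block_tokens=16)
for _ in range(17):                 # 17 tokens span two 16-token blocks
    alloc.append_token("req_a")
print(len(alloc.tables["req_a"]))   # 2
alloc.release("req_a")
print(len(alloc.free))              # 4: every block reclaimed
```

Contrast this with contiguous allocation, where a 128K-context request must reserve its worst-case footprint up front and leaves unusable holes when it completes.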
Good software schedulers are necessary but not sufficient. The hardware also has to change to make schedulers effective. Three things matter most:
Current DMA engines are mostly FIFO. For a memory scheduler to be effective, it needs to be able to preempt lower-priority transfers in favor of higher-priority ones — for example, abort a speculative prefetch when the model's MoE routing takes an unexpected path and a different expert shard is needed immediately.
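The scheduler-side half of that capability can be sketched as a priority queue with lazy cancellation (a standard heap idiom); the hardware-side half, actually aborting an in-flight transfer, is what current FIFO engines lack. Names are illustrative:

```python
import heapq
import itertools

class PreemptableDMAQueue:
    """Priority DMA queue with cancellation. Cancelled entries are lazily
    skipped at pop time rather than removed from the heap."""
    def __init__(self):
        self.heap = []
        self.cancelled = set()
        self.seq = itertools.count()  # tie-breaker for equal priorities

    def submit(self, transfer_id, priority):
        heapq.heappush(self.heap, (priority, next(self.seq), transfer_id))

    def cancel(self, transfer_id):
        # e.g. a speculative expert prefetch whose MoE route was not taken
        self.cancelled.add(transfer_id)

    def next(self):
        while self.heap:
            _, _, tid = heapq.heappop(self.heap)
            if tid not in self.cancelled:
                return tid
        return None

q = PreemptableDMAQueue()
q.submit("prefetch.expert_3", priority=5)  # speculative, low priority
q.submit("demand.expert_9", priority=0)    # routing picked expert 9 instead
q.cancel("prefetch.expert_3")
print(q.next())  # demand.expert_9
print(q.next())  # None: the speculative transfer never runs
```

Without hardware preemption, the cancelled prefetch still occupies the engine and the link; the software queue can only stop future waste, not in-flight waste.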
The scheduler needs real-time visibility into HBM bandwidth utilization to make good prefetch timing decisions. If it can't see that the GPU's attention kernel is currently saturating memory bandwidth, it can't make the right call about when to schedule DMA transfers to avoid contention.
The hardware should expose the multi-tier memory hierarchy in a way that lets software manage placement explicitly, without forcing everything through a coherency protocol. CXL.mem (without mandating CXL.cache semantics for bulk tensors) is the right model here: give the scheduler visibility into multiple memory tiers with explicit movement primitives, but don't impose coherency on data that doesn't need it.
The long-term direction is a tighter coupling between the model compiler, the runtime scheduler, and the hardware.
Today, most of this is ad-hoc: the runtime observes what the model is doing and reacts. A more sophisticated approach is for the model compiler to emit movement hints alongside compute operations — essentially a "memory program" that the scheduler executes in parallel with the compute program. This is how CUDA's cooperative memory management features are trending, and it's what future AI-specific compiler backends will likely need to produce.
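What a compiler-emitted memory program might look like, in the simplest possible form: a list of movement directives keyed to compute steps, executed by the runtime in lockstep with the compute program. The instruction format and names below are hypothetical, not a real CUDA or compiler API:

```python
# Hypothetical compiler output: movement directives paired with the
# compute step at which the runtime should issue them.
MEMORY_PROGRAM = [
    # (issue_at_step, op,       tensor,            dst_tier)
    (0, "prefetch", "layer_2.weights", "hbm"),
    (1, "prefetch", "layer_3.weights", "hbm"),
    (1, "evict",    "layer_0.weights", "dram"),  # compute is done with it
    (2, "prefetch", "layer_4.weights", "hbm"),
]

def run_step(step, issue_dma):
    """Called by the runtime at the start of compute step `step`; fires
    every directive the compiler scheduled for this step."""
    for at, op, tensor, dst in MEMORY_PROGRAM:
        if at == step:
            issue_dma(op, tensor, dst)

issued = []
for step in range(3):
    run_step(step, lambda op, t, d: issued.append((op, t, d)))
print(len(issued))  # 4 directives issued across 3 compute steps
```

Because the compiler knows the layer ordering statically, the eviction at step 1 needs no reuse prediction at all; runtime heuristics only have to cover what the compiler cannot see, like MoE routing and request lifetimes.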
In this model, the scheduler is not just a reactive component that manages demand misses. It is a co-equal runtime that executes a memory plan in parallel with compute, with full visibility into the model's structure, the hardware's topology, and the current request mix. That's a fundamentally different programming model than anything we inherited from CPUs — and it's where the highest-performing inference systems of the next few years will live.
Cache coherency was a gift: it made the hard problem of memory consistency disappear into hardware. AI machines are unwrapping that gift and finding the bill inside. When hardware stops managing consistency for you, you need a real memory scheduler — one that handles placement, prefetch, eviction, movement, and contention as first-class concerns.
The memory scheduler is not a new component. It's an old component that finally needs to grow up. PagedAttention was its first major milestone. Hardware-software co-design for tiered memory is its next. Compiler-generated memory programs are its future.
The systems that get this right first will have a durable inference efficiency advantage — not because they found a new algorithm, but because they treated memory scheduling as the performance-critical primitive it has always been.