
The Memory Scheduler Is the New Critical Path in AI Inference

When hardware coherency steps back, software scheduling steps forward. As AI machines move from transparent coherence to explicit data movement, the memory scheduler quietly becomes the most performance-critical component in the stack — and most systems aren't treating it that way yet.

Context: This post builds on Why Cache Coherency Is the Wrong Default for AI Machines. If you haven't read that, the short version: AI machines are moving from hardware-maintained coherency toward explicit, software-scheduled data movement. This post explores what that shift demands from the software side.
~60% — inference time spent on memory ops in memory-bound LLM serving
3–10× — throughput improvement from good KV cache scheduling vs naive
<100μs — scheduling decision budget before HBM stalls start hurting latency
~4 — memory tiers a modern scheduler must manage: HBM · DRAM · CXL · NVMe

Part 01 The Premise: Coherency Made Scheduling Optional

In classical CPU systems, the hardware coherency protocol handled the hard problem: when thread A writes to address X, every other thread that holds a copy of X gets invalidated automatically. No scheduling required. The programming model was simple because hardware did the work invisibly.

That invisibility came with a cost — snoop traffic, invalidation messages, ownership churn — but for CPU working sets in the tens of megabytes, the economics were fine. The programmer got simplicity; the hardware paid the coordination tax.

AI machines are breaking this contract. As we established, large-model inference involves working sets in the hundreds of gigabytes, access patterns that are mostly read-heavy and often predictable, and economics where every byte of interconnect bandwidth spent on coherency is a byte not spent on tensors. So the hardware is stepping back. And when hardware steps back, software has to step up.

Hardware coherency was free scheduling. When it goes away, you need a real scheduler.

The memory scheduler is that real scheduler. And unlike CPU schedulers — which mostly manage compute time slices — AI memory schedulers must manage placement, movement, eviction, and prefetch across a multi-tier memory hierarchy in real time, while the inference engine is mid-compute and starvation costs you tokens-per-second.

Part 02 What an AI Memory Scheduler Actually Does

Let's be concrete. A production AI memory scheduler is responsible for:

// Placement decisions

Where does each tensor live at each point in time? Which layers are in HBM? Which are staged in DRAM? Which are cold on NVMe? And crucially — how does that change dynamically as requests arrive, context lengths grow, and batch composition shifts?
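The placement problem above can be sketched as a greedy pass — a toy model, not a real runtime API. It assumes the scheduler has per-tensor access-frequency estimates and simply packs the hottest tensors into the fastest tier with remaining capacity, spilling the rest down the hierarchy:

```python
# Greedy tier placement (illustrative sketch; tier names and capacities
# mirror the hierarchy discussed in the text).
TIERS = [("HBM", 80), ("DRAM", 1024), ("CXL", 4096), ("NVMe", 65536)]  # (name, GB)

def place(tensors):
    """tensors: list of (name, size_gb, est_accesses_per_sec) -> {name: tier}."""
    free = {name: cap for name, cap in TIERS}
    placement = {}
    # Hottest tensors claim the fastest memory first.
    for name, size, _freq in sorted(tensors, key=lambda t: -t[2]):
        for tier, _cap in TIERS:
            if free[tier] >= size:
                free[tier] -= size
                placement[name] = tier
                break
    return placement

plan = place([("attn.w", 30, 900), ("mlp.w", 40, 900), ("cold_expert", 50, 2)])
# attn.w and mlp.w fit in HBM (30 + 40 <= 80 GB); cold_expert spills to DRAM.
```

A production scheduler would re-run a pass like this continuously as requests arrive and batch composition shifts, rather than once up front.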

// Prefetch decisions

What should be loaded next, and when should the DMA start? The scheduler must predict future compute needs far enough in advance to hide the latency of the slowest tier (NVMe: ~100μs, DRAM→HBM: ~10μs), without over-prefetching and wasting precious HBM capacity.
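The prefetch horizon follows directly from transfer size and tier bandwidth: a copy must be issued at least size / bandwidth ahead of the moment the data is needed. A back-of-envelope sketch, using illustrative sustained-bandwidth figures for the tiers named in the text:

```python
# Prefetch lead time: how early must a DMA start to hide the transfer?
BW_GBPS = {"NVMe": 10, "DRAM": 50}  # sustained GB/s into HBM (illustrative)

def prefetch_lead_us(size_gb, src_tier):
    """Microseconds of lead time needed to fully hide a transfer from src_tier."""
    seconds = size_gb / BW_GBPS[src_tier]
    return seconds * 1e6

# A 1 GB expert shard staged from DRAM needs ~20,000 us of lead time;
# the same shard from NVMe needs ~100,000 us -- a 5x longer horizon,
# which is why the slowest tier dictates how far ahead prediction must reach.
```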

// Eviction decisions

What gets kicked out of HBM when it's full? LRU is a start, but in AI workloads, recency is often a bad predictor of future need. Smarter schedulers use layer ordering, request lifetime estimates, and priority to make eviction decisions that are much more accurate than cache-style policies.

// Movement orchestration

DMA engines, NVLink transfers, PCIe copies — these are all resources with contention. The scheduler must sequence and prioritize movement operations, overlap them with compute where possible, and avoid oversubscribing transfer bandwidth that is already the primary bottleneck.
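One way to sketch the sequencing problem is a priority-ordered issue queue (hypothetical structure, not a real driver API): demand misses jump ahead of speculative prefetches, and within a priority class smaller transfers go first so urgent work is not stuck behind bulk copies:

```python
# Priority-ordered DMA issue queue (illustrative sketch).
import heapq

DEMAND, PREFETCH = 0, 1  # lower value = higher priority

class DmaQueue:
    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker: preserve submission order on exact ties

    def submit(self, priority, size_gb, desc):
        # Heap order: priority class first, then transfer size, then arrival.
        heapq.heappush(self._heap, (priority, size_gb, self._seq, desc))
        self._seq += 1

    def next_transfer(self):
        return heapq.heappop(self._heap)[3]

q = DmaQueue()
q.submit(PREFETCH, 4.0, "speculative: expert_7 -> HBM")
q.submit(DEMAND, 0.5, "miss: layer_31.kv_page -> HBM")
first = q.next_transfer()  # the demand miss, despite being submitted later
```

A real orchestrator would additionally track per-link bandwidth budgets and in-flight transfers; this sketch shows only the prioritization core.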

Notice that none of this exists in a CPU system. The L1/L2/L3 cache hierarchy manages itself. DRAM is demand-fetched. The programmer never thinks about it. In AI systems, all of this is now your problem — or more precisely, the runtime's problem.

Fig 1 — AI Memory Scheduler: Decision Points and Data Flows
[Diagram: the inference engine (attention · MLP · sampling · decoding) exchanges layer requests and completions with the memory scheduler (placement · prefetch · eviction · movement orchestration; layer order oracle, KV lifetime estimator, DMA sequencer, bandwidth budget tracker, tier capacity monitor). The scheduler issues transfers via the DMA engine's copy engines over NVLink/PCIe across four tiers — HBM (80GB, 3.35TB/s, active working set), host DRAM (1–2TB, ~50GB/s, staging + prefetch), CXL.mem (2–8TB, ~40GB/s, expanded capacity), NVMe (10–100TB, ~10GB/s, cold weight storage) — guided by the request queue's context lengths, batch priorities, and hints.]
The memory scheduler sits between the inference engine and the multi-tier memory hierarchy. It receives layer requests from the engine, issues DMA commands to the copy engines, and must balance placement, prefetch, eviction, and bandwidth budget in real time — all under sub-100μs latency constraints.

Part 03 Three Hard Problems the Scheduler Must Solve

// 01. Predicting future memory need with incomplete information

The scheduler must prefetch data before it's needed, which means it must predict what will be needed and when. For a single static model serving fixed-length requests, this is easy — layer ordering is fixed, and compute time per layer is deterministic. In production, none of that holds: requests arrive unpredictably, context lengths vary, batch composition shifts between steps, and MoE routing can change which weights are needed at the last moment.

State-of-the-art schedulers handle this with a combination of conservative prefetch horizons, priority-based preemption when predictions are wrong, and request metadata (estimated output length, task type) to guide placement decisions.

// 02. Contention for limited HBM capacity

HBM is fast but finite. On an H100, 80GB sounds like a lot — until you're serving a 70B-parameter model (70GB of weights alone at 8-bit precision) plus KV cache for a batch of concurrent 128K-context requests (easily another 40–80GB). You're immediately over capacity. The scheduler must decide what to evict, and those decisions translate directly into latency spikes or throughput drops.

Pure LRU eviction is dangerous here because it doesn't account for reuse distance. A layer that was accessed five seconds ago might be needed again in 200ms. An activation that was accessed 50ms ago might never be needed again. The scheduler needs reuse-distance awareness, not just recency.
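The contrast can be made concrete. Below, each resident object carries both its recency (what LRU sees) and a predicted time until next access (from layer ordering or request-lifetime estimates); a reuse-aware policy evicts the object with the farthest predicted reuse, in the spirit of Belady's optimal policy. The numbers mirror the examples in the paragraph above:

```python
# LRU vs. reuse-distance-aware eviction (illustrative sketch).

def pick_victim_lru(residents):
    """residents: {name: (last_access_ms_ago, predicted_next_use_ms)}.
    LRU evicts whatever was touched longest ago."""
    return max(residents, key=lambda n: residents[n][0])

def pick_victim_reuse(residents):
    """Reuse-aware: evict whatever is predicted to be needed furthest
    in the future (Belady-style, using the scheduler's estimates)."""
    return max(residents, key=lambda n: residents[n][1])

residents = {
    "layer_12.weights": (5000, 200),   # idle for 5 s, but needed again in 200 ms
    "req42.activation": (50, 10**9),   # touched 50 ms ago, never needed again
}
lru_victim = pick_victim_lru(residents)      # evicts the layer we still need
smart_victim = pick_victim_reuse(residents)  # evicts the dead activation
```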

// 03. Bandwidth contention between movement and compute

Every DMA transfer competes with the GPU's own memory access patterns on the HBM bus. If the scheduler issues a large prefetch exactly when the GPU is in a bandwidth-intensive attention pass, they compete for the same memory controllers. The best schedulers model the GPU's bandwidth consumption curve and schedule movement during compute-bound phases (like MLP layers with high arithmetic intensity) rather than memory-bound phases (like attention over long sequences).
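A minimal sketch of that gating logic, assuming the scheduler can see the current kernel phase and an HBM utilization figure (the phase labels and threshold are illustrative assumptions, not a real telemetry API):

```python
# Gate speculative DMA issue on the GPU's current phase: hold prefetches
# while a memory-bound kernel saturates HBM, release them during
# compute-bound phases with high arithmetic intensity.

HBM_UTIL_THRESHOLD = 0.7  # above this fraction, defer speculative movement

def should_issue_prefetch(phase, hbm_util):
    if phase == "attention" and hbm_util > HBM_UTIL_THRESHOLD:
        return False  # long-sequence attention owns the bandwidth; wait
    return True       # MLP and other compute-bound phases leave headroom

ok_mlp = should_issue_prefetch("mlp", 0.4)        # issue: compute-bound phase
ok_attn = should_issue_prefetch("attention", 0.9) # defer: HBM is saturated
```

Real schedulers would use a measured bandwidth-consumption curve rather than a fixed threshold, but the decision structure is the same.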

The dangerous assumption: many existing systems treat memory management as a solved problem inherited from operating systems — LRU, demand paging, transparent. For AI inference at scale, this assumption can cost 40–60% of potential throughput.

Part 04 KV Cache Management: The Scheduler's Hardest Case

The KV cache is where memory scheduling gets genuinely difficult, and where the wins from doing it well are largest.

Unlike model weights (which are static during inference), the KV cache is dynamic. It grows with every generated token. It's private to a request but shared across a request's decoding steps. Its lifetime is the request's lifetime — unpredictable, varying from milliseconds to minutes. And in a serving system handling thousands of concurrent requests, the aggregate KV cache can easily dwarf the model weights.

PagedAttention (vLLM's key insight) was fundamentally a scheduling insight, not an attention insight: stop treating KV cache as contiguous, allocate it in pages, and let the scheduler manage fragmentation.

The paging insight unlocked 3–10× throughput improvements in practice, not because attention got faster, but because the memory scheduler got smarter about placement and fragmentation. This is the template for what AI memory scheduling needs to become more broadly.
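The paging idea can be shown in a few lines. This is a minimal sketch in the spirit of PagedAttention, not vLLM's actual implementation (which adds prefix sharing, copy-on-write, and swapping): the cache is a pool of fixed-size pages, each request holds a page table, and freeing returns whole pages to the pool, so there is no fragmentation to compact:

```python
# Page-granular KV cache allocation (illustrative sketch).
PAGE_TOKENS = 16  # tokens of KV state per page (block size; illustrative)

class PagedKVCache:
    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.page_tables = {}   # request_id -> list of physical page ids
        self.token_counts = {}  # request_id -> tokens written so far

    def append_token(self, req_id):
        """Reserve KV space for one more generated token."""
        table = self.page_tables.setdefault(req_id, [])
        count = self.token_counts.get(req_id, 0) + 1
        needed = -(-count // PAGE_TOKENS)  # ceil division: pages required
        if needed > len(table):
            table.append(self.free_pages.pop())  # allocate on page boundary
        self.token_counts[req_id] = count

    def free(self, req_id):
        # Pages return to the pool whole: no external fragmentation.
        self.free_pages.extend(self.page_tables.pop(req_id, []))
        self.token_counts.pop(req_id, None)

cache = PagedKVCache(num_pages=4)
for _ in range(20):                            # 20 tokens -> ceil(20/16) = 2 pages
    cache.append_token("req_a")
pages_used = len(cache.page_tables["req_a"])   # 2
cache.free("req_a")
pages_free = len(cache.free_pages)             # all 4 pages back in the pool
```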

Fig 2 — KV Cache Lifecycle Under Scheduler Control
[Diagram: a request timeline from t=0 (arrive) through prefill, mid-decode, and completion. The KV cache grows with each generated token toward the HBM capacity line, with scheduler actions marked along the way: ALLOCATE pages at arrival, PREFIX HIT (reuse), OFFLOAD → DRAM (keep alive), PREEMPT + SWAP (capacity pressure), FREE pages at completion.]
KV cache management is fundamentally a scheduling problem. The scheduler allocates pages at request arrival, tracks prefix reuse opportunities, decides when to offload to DRAM under capacity pressure, and frees pages at completion. Each decision has direct throughput consequences.

Part 05 What Hardware Needs to Provide

Good software schedulers are necessary but not sufficient. The hardware also has to change to make schedulers effective. Three things matter most:

// Priority-aware DMA engines

Current DMA engines are mostly FIFO. For a memory scheduler to be effective, it needs to be able to preempt lower-priority transfers in favor of higher-priority ones — for example, abort a speculative prefetch when the model's MoE routing takes an unexpected path and a different expert shard is needed immediately.

// Bandwidth visibility

The scheduler needs real-time visibility into HBM bandwidth utilization to make good prefetch timing decisions. If it can't see that the GPU's attention kernel is currently saturating memory bandwidth, it can't make the right call about when to schedule DMA transfers to avoid contention.

// Tiered addressing without coherency

The hardware should expose the multi-tier memory hierarchy in a way that lets software manage placement explicitly, without forcing everything through a coherency protocol. CXL.mem (without mandating CXL.cache semantics for bulk tensors) is the right model here: give the scheduler visibility into multiple memory tiers with explicit movement primitives, but don't impose coherency on data that doesn't need it.

Hardware feature | Scheduler benefit | Current state
Priority-aware DMA | Preempt wrong-path prefetches in MoE | Mostly unavailable; workarounds exist
HBM BW telemetry | Schedule movement during compute-bound phases | Partial via PMU; needs richer API
CXL.mem without CXL.cache | Expand capacity without coherency overhead | Available in CXL 3.0+; adoption growing
Async memory operations | Overlap compute with staging for upcoming layers | Present in CUDA/ROCm; improving
Page-granular eviction hints | Let scheduler inform hardware about future access | Limited; mainly via madvise-style APIs

Part 06 The Future: Compilers and Schedulers Co-designing Memory

The long-term direction is a tighter coupling between the model compiler, the runtime scheduler, and the hardware.

Today, most of this is ad-hoc: the runtime observes what the model is doing and reacts. A more sophisticated approach is for the model compiler to emit movement hints alongside compute operations — essentially a "memory program" that the scheduler executes in parallel with the compute program. This is how CUDA's cooperative memory management features are trending, and it's what future AI-specific compiler backends will likely need to produce.
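What a compiler-emitted memory program might look like, sketched as data (all op names, fields, and step numbers are hypothetical): a list of movement operations interleaved with the compute plan, each tagged with the compute step at which it should be issued and the step by which it must complete. The scheduler walks this plan alongside compute instead of reacting to demand misses:

```python
# A toy "memory program" emitted alongside the compute program.
memory_program = [
    # (issue_at_step, op, tensor, src, dst, deadline_step)
    (0, "PREFETCH", "layer_2.weights", "DRAM", "HBM", 2),
    (1, "PREFETCH", "layer_3.weights", "NVMe", "HBM", 3),
    (2, "EVICT",    "layer_0.weights", "HBM",  None,  3),
]

def ops_due(step):
    """Movement ops the scheduler should have in flight at a compute step."""
    return [op for op in memory_program if op[0] <= step < op[5]]

in_flight = [op[2] for op in ops_due(1)]
# At compute step 1, both prefetches are in flight: layer_2 from DRAM
# (due by step 2) and layer_3 from NVMe (due by step 3).
```

The point of the representation is that deadlines are known at compile time from layer ordering, so the runtime's job reduces to executing the plan and handling the cases where predictions break.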

The winning inference stack will treat memory scheduling as a first-class compilation target, not a runtime afterthought.

In this model, the scheduler is not just a reactive component that manages demand misses. It is a co-equal runtime that executes a memory plan in parallel with compute, with full visibility into the model's structure, the hardware's topology, and the current request mix. That's a fundamentally different programming model than anything we inherited from CPUs — and it's where the highest-performing inference systems of the next few years will live.

The parallel to classical systems: when CPUs became fast and complex enough that hand-scheduling instructions was impractical, we handed the job to optimizing compilers and OS schedulers. AI memory is at the same inflection point. The hand-crafted DMA sequences of today's inference kernels will be replaced by principled, compiler-generated memory programs.

Conclusion

Cache coherency was a gift: it made the hard problem of memory consistency disappear into hardware. AI machines are unwrapping that gift and finding the bill inside. When hardware stops managing consistency for you, you need a real memory scheduler — one that handles placement, prefetch, eviction, movement, and contention as first-class concerns.

The memory scheduler is not a new component. It's an old component that finally needs to grow up. PagedAttention was its first major milestone. Hardware-software co-design for tiered memory is its next. Compiler-generated memory programs are its future.

The systems that get this right first will have a durable inference efficiency advantage — not because they found a new algorithm, but because they treated memory scheduling as the performance-critical primitive it has always been.

Core takeaway: As AI machines abandon universal coherency, software must pick up the slack. The memory scheduler is the most under-appreciated critical path in modern inference infrastructure — and the teams that invest in it early will compound the advantage.

// Related reading

Previous in series

Why Cache Coherency Is the Wrong Default for AI Machines

The foundational argument for why AI machines are moving away from universal coherency.