When hardware coherency steps back, software scheduling steps forward. As AI machines move from transparent coherence to explicit data movement, the memory scheduler quietly becomes the most performance-critical component in the stack — and most systems aren't treating it that way yet.
In classical CPU systems, the hardware coherency protocol handled the hard problem: when thread A writes to address X, every other thread that holds a copy of X gets invalidated automatically. No scheduling required. The programming model was simple because hardware did the work invisibly.
That invisibility came with a cost — snoop traffic, invalidation messages, ownership churn — but for CPU working sets in the tens of megabytes, the economics were fine. The programmer got simplicity; the hardware paid the coordination tax.
AI machines are breaking this contract. As we established, large-model inference involves working sets in the hundreds of gigabytes, access patterns that are mostly read-heavy and often predictable, and economics where every byte of interconnect bandwidth spent on coherency is a byte not spent on tensors. So the hardware is stepping back. And when hardware steps back, software has to step up.
That software is the memory scheduler. And unlike CPU schedulers, which mostly manage compute time slices, AI memory schedulers must manage placement, movement, eviction, and prefetch across a multi-tier memory hierarchy in real time, while the inference engine is mid-compute and starvation costs you tokens per second.
Let's be concrete. A production AI memory scheduler is responsible for:
Where does each tensor live at each point in time? Which layers are in HBM? Which are staged in DRAM? Which are cold on NVMe? And crucially — how does that change dynamically as requests arrive, context lengths grow, and batch composition shifts?
What should be loaded next, and when should the DMA start? The scheduler must predict future compute needs far enough in advance to hide the latency of the slowest tier (NVMe: ~100μs, DRAM→HBM: ~10μs), without over-prefetching and wasting precious HBM capacity.
What gets kicked out of HBM when it's full? LRU is a start, but in AI workloads, recency is often a bad predictor of future need. Smarter schedulers use layer ordering, request lifetime estimates, and priority to make eviction decisions that are much more accurate than cache-style policies.
How do movement operations share the transfer fabric? DMA engines, NVLink transfers, PCIe copies — these are all resources with contention. The scheduler must sequence and prioritize them, overlap them with compute where possible, and avoid oversubscribing transfer bandwidth that is already the primary bottleneck.
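Taken together, these responsibilities amount to a placement table plus a prioritized movement queue. The following is a minimal sketch with hypothetical names, assuming a three-tier HBM/DRAM/NVMe hierarchy; it is an illustration of the bookkeeping involved, not a real runtime's API:

```python
from enum import Enum

class Tier(Enum):
    HBM = 0    # ~3 TB/s, tens of GB
    DRAM = 1   # ~100 GB/s over PCIe/CXL, hundreds of GB
    NVME = 2   # ~10 GB/s, terabytes

class PlacementTable:
    """Tracks where each tensor lives and what movement the scheduler owes it."""
    def __init__(self):
        self.location = {}   # tensor_id -> Tier
        self.pending = []    # queued movement ops: (priority, tensor_id, dst_tier)

    def place(self, tensor_id, tier):
        self.location[tensor_id] = tier

    def request_move(self, tensor_id, dst_tier, priority):
        # Movement is asynchronous: the scheduler sequences transfers by
        # priority rather than issuing them eagerly, so it never
        # oversubscribes the interconnect with low-value copies.
        self.pending.append((priority, tensor_id, dst_tier.value))
        self.pending.sort()

    def next_transfer(self):
        return self.pending.pop(0) if self.pending else None

table = PlacementTable()
table.place("layer_0.weights", Tier.HBM)
table.place("layer_79.weights", Tier.NVME)
table.request_move("layer_79.weights", Tier.DRAM, priority=5)   # background staging
table.request_move("layer_1.kv_block_3", Tier.HBM, priority=1)  # needed soon
print(table.next_transfer())  # lowest priority number (most urgent) goes first
```

The point of the sketch is the separation: placement state is a lookup, but movement is a scheduled resource, exactly the distinction that does not exist in a hardware-managed cache hierarchy.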
Notice that none of this exists in a CPU system. The L1/L2/L3 cache hierarchy manages itself. DRAM is demand-fetched. The programmer never thinks about it. In AI systems, all of this is now your problem — or more precisely, the runtime's problem.
The scheduler must prefetch data before it's needed, which means it must predict what will be needed and when. For a single static model serving single-length requests, this is easy — layer ordering is fixed, compute time per layer is deterministic. But in production, batch composition shifts continuously, output lengths are unknown in advance, and data-dependent paths like MoE routing make the access sequence only partially predictable.
State-of-the-art schedulers handle this with a combination of conservative prefetch horizons, priority-based preemption when predictions are wrong, and request metadata (estimated output length, task type) to guide placement decisions.
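A conservative prefetch horizon can be expressed as a simple timing rule: issue a transfer only when the remaining compute time is about to fall below the source tier's load latency plus slack. The sketch below is illustrative, using the latency figures from the text (~100μs NVMe, ~10μs DRAM→HBM) and a made-up safety factor:

```python
# Illustrative tier load latencies, in microseconds (figures from the text).
TIER_LATENCY_US = {"nvme": 100.0, "dram": 10.0}
SAFETY_FACTOR = 2.0  # slack for mispredicted per-layer compute times

def should_prefetch_now(src_tier, compute_us_until_needed):
    """Start the DMA just early enough to hide src_tier latency, with slack.

    Issuing earlier than this wastes HBM capacity on data that is not yet
    needed; issuing later risks stalling compute on the transfer.
    """
    return compute_us_until_needed <= TIER_LATENCY_US[src_tier] * SAFETY_FACTOR

# A layer on NVMe needed in ~150us of compute: start the transfer now.
print(should_prefetch_now("nvme", 150.0))  # True
# A layer in DRAM needed in 150us: issuing now would pin HBM too early.
print(should_prefetch_now("dram", 150.0))  # False
```

The safety factor is where request metadata earns its keep: tighter compute-time estimates allow a smaller factor, which means less HBM held hostage by in-flight prefetches.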
HBM is fast but finite. On an H100, 80GB sounds like a lot — until you're serving a 70B-parameter model (70GB of weights alone at 8-bit precision) plus KV cache for a batch of concurrent 128K-context requests (easily another 40–80GB). You're immediately over capacity. The scheduler must decide what to evict, and those decisions directly translate to latency spikes or throughput drops.
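The KV cache side of that arithmetic follows from the standard sizing formula: two tensors (K and V) per layer, per token, per KV head. The model configuration below is hypothetical but representative of a 70B-class model with grouped-query attention:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # K and V each store n_kv_heads * head_dim values per token, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical 70B-class config: 80 layers, 8 KV heads (grouped-query
# attention), head_dim 128, fp16 cache.
per_request = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                             seq_len=128 * 1024)
print(f"{per_request / 2**30:.1f} GiB per 128K-context request")  # 40.0 GiB
```

One full-length request already consumes half an H100's HBM under these assumptions; without grouped-query attention (e.g. 64 KV heads) the same context would need 8× more, which is why aggregate KV cache so easily dwarfs the weights.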
Pure LRU eviction is dangerous here because it doesn't account for reuse distance. A layer that was accessed five seconds ago might be needed again in 200ms. An activation that was accessed 50ms ago might never be needed again. The scheduler needs reuse-distance awareness, not just recency.
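A reuse-distance-aware victim selection can be sketched as an approximation of Belady's optimal policy: evict whatever is predicted to be needed farthest in the future, regardless of when it was last touched. Names and the prediction source are hypothetical:

```python
class ReuseAwareEvictor:
    """Pick eviction victims by predicted next use, not recency.

    Predictions might come from layer ordering (weights cycle every forward
    pass) or request lifetime estimates (activations of finished steps are
    dead). This sketch just stores them.
    """
    def __init__(self):
        self.next_use_ms = {}  # tensor_id -> predicted ms until next access

    def record_prediction(self, tensor_id, next_use_ms):
        self.next_use_ms[tensor_id] = next_use_ms

    def pick_victim(self):
        # Farthest predicted reuse is the safest eviction; "never again"
        # is represented as infinity.
        return max(self.next_use_ms, key=self.next_use_ms.get)

ev = ReuseAwareEvictor()
ev.record_prediction("layer_12.weights", 200.0)          # needed again in 200ms
ev.record_prediction("req_41.activation", float("inf"))  # dead after last use
print(ev.pick_victim())  # req_41.activation (LRU would likely have kept it)
```

This is exactly the inversion the text describes: the activation touched 50ms ago is the right victim, and the layer untouched for five seconds is the wrong one.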
Every DMA transfer competes with the GPU's own memory access patterns on the HBM bus. If the scheduler issues a large prefetch exactly when the GPU is in a bandwidth-intensive attention pass, they compete for the same memory controllers. The best schedulers model the GPU's bandwidth consumption curve and schedule movement during compute-bound phases (like MLP layers with high arithmetic intensity) rather than memory-bound phases (like attention over long sequences).
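One way to model this is to classify the kernel timeline into compute-bound and memory-bound windows and assign pending transfers only to the former. The classification and kernel names below are illustrative; a real system would derive the phase from its launch stream or profiler counters:

```python
def schedule_transfers(kernel_timeline, transfers):
    """Assign each pending transfer to a compute-bound kernel window.

    kernel_timeline: list of (kernel_name, duration_ms) in execution order.
    transfers: list of transfer ids, in priority order.
    Assumption: MLP/GEMM kernels have high arithmetic intensity and leave
    HBM bandwidth headroom; long-sequence attention does not.
    """
    COMPUTE_BOUND = {"mlp", "gemm"}
    windows = [i for i, (name, _) in enumerate(kernel_timeline)
               if name in COMPUTE_BOUND]
    # Greedily map transfers onto windows; leftovers wait for the next pass.
    return {t: (kernel_timeline[w][0], w) for t, w in zip(transfers, windows)}

timeline = [("attention", 3.0), ("mlp", 1.5), ("attention", 3.2), ("gemm", 1.4)]
plan = schedule_transfers(timeline, ["expert_7.shard", "layer_30.weights"])
print(plan)  # both transfers land in compute-bound windows (indices 1 and 3)
```

The greedy mapping is deliberately naive; the point is the phase classification, which keeps bulk DMA traffic off the memory controllers while attention is saturating them.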
The KV cache is where memory scheduling gets genuinely difficult, and where the wins from doing it well are largest.
Unlike model weights (which are static during inference), the KV cache is dynamic. It grows with every generated token. It's private to a request but shared across a request's decoding steps. Its lifetime is the request's lifetime — unpredictable, varying from milliseconds to minutes. And in a serving system handling thousands of concurrent requests, the aggregate KV cache can easily dwarf the model weights.
PagedAttention, vLLM's application of OS-style paging to the KV cache (fixed-size blocks instead of contiguous per-request buffers), unlocked 3–10× throughput improvements in practice, not because attention got faster, but because the memory scheduler got smarter about placement and fragmentation. This is the template for what AI memory scheduling needs to become more broadly.
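A minimal sketch of the paging idea, in the spirit of PagedAttention but with illustrative names and a toy pool size: each request holds a block table rather than one contiguous region, so the cache can grow token by token without reallocation, and freed blocks return to the pool with no external fragmentation.

```python
class PagedKVAllocator:
    """Toy paged KV-cache allocator: fixed-size blocks plus per-request
    block tables (hypothetical sketch, not vLLM's actual implementation)."""
    def __init__(self, num_blocks, block_tokens=16):
        self.block_tokens = block_tokens
        self.free = list(range(num_blocks))
        self.tables = {}   # request_id -> list of physical block ids
        self.lengths = {}  # request_id -> tokens stored so far

    def append_token(self, request_id):
        n = self.lengths.get(request_id, 0)
        if n % self.block_tokens == 0:  # last block full: map one more
            if not self.free:
                raise MemoryError("KV pool exhausted; evict or preempt")
            self.tables.setdefault(request_id, []).append(self.free.pop())
        self.lengths[request_id] = n + 1

    def release(self, request_id):
        # Request finished: every block returns to the pool instantly,
        # with zero external fragmentation.
        self.free.extend(self.tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

alloc = PagedKVAllocator(num_blocks=4, block_tokens=16)
for _ in range(17):                 # 17 tokens span two 16-token blocks
    alloc.append_token("req_a")
print(len(alloc.tables["req_a"]))   # 2
alloc.release("req_a")
print(len(alloc.free))              # 4: every block reclaimed
```

Contrast this with contiguous allocation, where a 128K-context request must reserve its worst-case footprint up front and leaves unusable holes when it completes.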
Good software schedulers are necessary but not sufficient. The hardware also has to change to make schedulers effective. Three things matter most:
Current DMA engines are mostly FIFO. For a memory scheduler to be effective, it needs to be able to preempt lower-priority transfers in favor of higher-priority ones — for example, abort a speculative prefetch when the model's MoE routing takes an unexpected path and a different expert shard is needed immediately.
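The scheduler-side half of that capability can be sketched as a priority queue with lazy cancellation (a standard heap idiom); the hardware-side half, actually aborting an in-flight transfer, is what current FIFO engines lack. Names are illustrative:

```python
import heapq
import itertools

class PreemptableDMAQueue:
    """Priority DMA queue with cancellation. Cancelled entries are lazily
    skipped at pop time rather than removed from the heap."""
    def __init__(self):
        self.heap = []
        self.cancelled = set()
        self.seq = itertools.count()  # tie-breaker for equal priorities

    def submit(self, transfer_id, priority):
        heapq.heappush(self.heap, (priority, next(self.seq), transfer_id))

    def cancel(self, transfer_id):
        # e.g. a speculative expert prefetch whose MoE route was not taken
        self.cancelled.add(transfer_id)

    def next(self):
        while self.heap:
            _, _, tid = heapq.heappop(self.heap)
            if tid not in self.cancelled:
                return tid
        return None

q = PreemptableDMAQueue()
q.submit("prefetch.expert_3", priority=5)  # speculative, low priority
q.submit("demand.expert_9", priority=0)    # routing picked expert 9 instead
q.cancel("prefetch.expert_3")
print(q.next())  # demand.expert_9
print(q.next())  # None: the speculative transfer never runs
```

Without hardware preemption, the cancelled prefetch still occupies the engine and the link; the software queue can only stop future waste, not in-flight waste.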
The scheduler needs real-time visibility into HBM bandwidth utilization to make good prefetch timing decisions. If it can't see that the GPU's attention kernel is currently saturating memory bandwidth, it can't make the right call about when to schedule DMA transfers to avoid contention.
The hardware should expose the multi-tier memory hierarchy in a way that lets software manage placement explicitly, without forcing everything through a coherency protocol. CXL.mem (without mandating CXL.cache semantics for bulk tensors) is the right model here: give the scheduler visibility into multiple memory tiers with explicit movement primitives, but don't impose coherency on data that doesn't need it.
The long-term direction is a tighter coupling between the model compiler, the runtime scheduler, and the hardware.
Today, most of this is ad-hoc: the runtime observes what the model is doing and reacts. A more sophisticated approach is for the model compiler to emit movement hints alongside compute operations — essentially a "memory program" that the scheduler executes in parallel with the compute program. This is how CUDA's cooperative memory management features are trending, and it's what future AI-specific compiler backends will likely need to produce.
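What a compiler-emitted memory program might look like, in the simplest possible form: a list of movement directives keyed to compute steps, executed by the runtime in lockstep with the compute program. The instruction format and names below are hypothetical, not a real CUDA or compiler API:

```python
# Hypothetical compiler output: movement directives paired with the
# compute step at which the runtime should issue them.
MEMORY_PROGRAM = [
    # (issue_at_step, op,       tensor,            dst_tier)
    (0, "prefetch", "layer_2.weights", "hbm"),
    (1, "prefetch", "layer_3.weights", "hbm"),
    (1, "evict",    "layer_0.weights", "dram"),  # compute is done with it
    (2, "prefetch", "layer_4.weights", "hbm"),
]

def run_step(step, issue_dma):
    """Called by the runtime at the start of compute step `step`; fires
    every directive the compiler scheduled for this step."""
    for at, op, tensor, dst in MEMORY_PROGRAM:
        if at == step:
            issue_dma(op, tensor, dst)

issued = []
for step in range(3):
    run_step(step, lambda op, t, d: issued.append((op, t, d)))
print(len(issued))  # 4 directives issued across 3 compute steps
```

Because the compiler knows the layer ordering statically, the eviction at step 1 needs no reuse prediction at all; runtime heuristics only have to cover what the compiler cannot see, like MoE routing and request lifetimes.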
In this model, the scheduler is not just a reactive component that manages demand misses. It is a co-equal runtime that executes a memory plan in parallel with compute, with full visibility into the model's structure, the hardware's topology, and the current request mix. That's a fundamentally different programming model than anything we inherited from CPUs — and it's where the highest-performing inference systems of the next few years will live.
Cache coherency was a gift: it made the hard problem of memory consistency disappear into hardware. AI machines are unwrapping that gift and finding the bill inside. When hardware stops managing consistency for you, you need a real memory scheduler — one that handles placement, prefetch, eviction, movement, and contention as first-class concerns.
The memory scheduler is not a new component. It's an old component that finally needs to grow up. PagedAttention was its first major milestone. Hardware-software co-design for tiered memory is its next. Compiler-generated memory programs are its future.
The systems that get this right first will have a durable inference efficiency advantage — not because they found a new algorithm, but because they treated memory scheduling as the performance-critical primitive it has always been.