The Compiler Becomes the Memory Scheduler
Runtime dynamism is becoming too expensive for AI inference at scale. The next generation of AI compilers will not just optimize computation — they will pre-plan every byte's journey through the memory hierarchy before execution begins.
The old assumption is breaking
For decades, the compute stack was built on a comfortable foundation: the runtime can decide things dynamically, and that's fine. Dynamic allocation, dynamic scheduling, dynamic dispatch, dynamic paging.
This worked because CPUs were balanced systems — memory access was slow relative to compute, but predictably so. Workloads were irregular and bursty, so static planning was often impossible. Latency budgets were forgiving. The runtime had time to think.
AI inference changes all three assumptions simultaneously. Memory bandwidth is now the binding constraint, not compute. Workloads run billions of near-identical iterations. Latency budgets have collapsed to sub-millisecond SLAs. The runtime no longer has time to think.
The result is a quiet revolution in how AI systems are compiled, scheduled, and executed. Not in the math — that's still GEMM on tensors. But in the orchestration layer underneath. The compiler is evolving from a code generator into a memory traffic planner, and the implications reach from kernel scheduling all the way to cluster topology.
The real problem is memory traffic
The industry still measures AI hardware in teraFLOPs. The actual bottleneck in production inference systems is increasingly something different: how fast bytes move, not how fast multiplications happen.
Step through a single transformer decode iteration on a large model and list what must happen before the matrix multiply can start:
```
# One transformer decode step — the unglamorous full picture

1. Prefetch weights for layer N+1         # from HBM or CXL
2. DMA KV cache block for this request    # from DRAM or CXL pool
3. Stage activations from previous layer  # HBM → compute cache
4. Wait on synchronization fences         # stream ordering
5. ⚡ GEMM / attention compute             # the part everyone counts
6. Write new KV entries back              # to HBM / CXL
7. Stream output tokens                   # to decode buffer

# Steps 1–4 and 6–7 are memory traffic. Step 5 is compute.
```
A single layer can be entirely arithmetic-bound in isolation, and the model can still be memory-bound system-wide, because the coordination between layers costs more than the computation inside them.
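A back-of-envelope calculation shows how lopsided the byte economics are. The figures below are hypothetical round numbers, not any vendor's spec; the setup is a 70B-parameter fp16 model decoding at batch size 1, where every weight byte must be streamed from HBM for each generated token.

```python
# Back-of-envelope: batch-1 decode on hypothetical round-number hardware.
params = 70e9                 # model parameters
bytes_per_param = 2           # fp16 weights
flops_per_token = 2 * params  # one multiply-add per parameter per token

PEAK_FLOPS = 1e15  # ~1000 dense TFLOP/s (hypothetical accelerator)
HBM_BW = 3e12      # ~3 TB/s of HBM bandwidth (hypothetical)

t_compute = flops_per_token / PEAK_FLOPS      # ≈ 0.14 ms per token
t_memory = params * bytes_per_param / HBM_BW  # ≈ 46.7 ms per token

print(f"compute {t_compute * 1e3:.2f} ms vs memory {t_memory * 1e3:.1f} ms")
print(f"weight traffic alone outweighs compute ~{t_memory / t_compute:.0f}x")
```

Even if the arithmetic ran infinitely fast, the step would barely get shorter: the byte movement is the step.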
Why runtime dynamism gets expensive
Traditional runtimes make decisions just-in-time. This is fine when decisions are cheap and errors are recoverable. In AI inference at scale, both conditions fail.
The pipeline stalls, then recovers
```
kernel starts
  → dependency missing
  → runtime detects miss
  → launch DMA
  → stall waiting
  → DMA completes
  → resume

# detection + dispatch = 10–50μs
```
The pipeline blocks on discovering the miss. The cost is not just the DMA time — it includes detection latency and dispatch latency.
Traffic is always ahead of compute
```
# compiler emitted at build time:
t=N-2: DMA weights_L9 → HBM_slot4
t=N-2: DMA kv_block42 → HBM_kv0
t=N-1: fence(DMA_9, DMA_kv)
t=N:   exec attention_L9

# GPU never sees an empty pipeline
```
DMA is issued 2 steps early. By the time compute arrives at L9, the data is already there. Zero stall time.
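A toy simulation makes the difference measurable. Everything here is invented for illustration: a DMA takes two time steps, compute takes one, the prefetch distance is two, and the first two layers are assumed preloaded before the window starts.

```python
# Toy timeline: reactive DMA-on-miss vs. DMA issued DISTANCE steps early.
# All step counts are invented for illustration, not measured anywhere.
DMA_STEPS, N_LAYERS, DISTANCE = 2, 8, 2

def reactive() -> int:
    # every layer discovers its miss, waits out the transfer, then computes
    return N_LAYERS * (DMA_STEPS + 1)

def prefetch_ahead() -> int:
    clock, ready_at = 0, {}  # layers 0..DISTANCE-1 preloaded, ready at t=0
    for layer in range(N_LAYERS):
        if layer + DISTANCE < N_LAYERS:
            ready_at[layer + DISTANCE] = clock + DMA_STEPS  # issue early
        clock = max(clock, ready_at.get(layer, 0))  # stall only if data is late
        clock += 1                                  # compute the layer
    return clock

print(f"reactive: {reactive()} steps, prefetched: {prefetch_ahead()} steps")
# reactive: 24 steps, prefetched: 8 steps (zero stall)
```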
Beyond individual stalls, runtime dynamism has systemic costs that compound:
Allocator fragmentation
Dynamic malloc/free fragments the HBM pool over time, making allocation searches increasingly expensive and eventually triggering out-of-memory failures even when enough total space is free, because no contiguous block is large enough.
Synchronization overhead
Reactive fences and stream waits add overhead on every handoff. At thousands of micro-operations per forward pass, those constant per-handoff costs compound to milliseconds.
CPU-side scheduling
Each kernel launch requires a CPU call (~10μs host overhead). For models with thousands of operations per forward pass, this alone adds ~10ms — entirely avoidable with graph capture.
Static execution planning
The key insight is simple: if the computation graph is mostly known before execution begins, memory movement can be planned before execution begins.
This is not a new idea in computing. It's the same principle that separates ahead-of-time compilation from interpretation. The question is what "known ahead of time" means for AI inference — and the answer is: more than most people assume.
At inference time, the model architecture is fixed. The layer sequence is fixed. The tensor shapes are fixed (for a given batch/context size). What changes is the data — the token values, the KV cache contents. But the movement pattern of that data through the memory hierarchy is almost entirely static. The compiler can plan it.
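What "plan it" means mechanically can be sketched in a few lines, assuming only the fixed layer sequence and a prefetch distance; the instruction tuples below are illustrative, not a real instruction set.

```python
# Walk the fixed layer sequence once at build time and emit the whole
# traffic timeline as plain data. The tuples are purely illustrative.
DISTANCE = 2  # issue traffic two steps before the consumer runs

def emit_schedule(layers: list) -> list:
    plan = []
    for t, layer in enumerate(layers):
        if t + DISTANCE < len(layers):
            target = layers[t + DISTANCE]
            plan.append((t, "dma",   f"weights_{target} -> HBM"))
            plan.append((t, "dma",   f"kv_{target} -> HBM"))
        plan.append((t, "fence", f"deps_{layer}"))
        plan.append((t, "exec",  layer))
    return plan

for instr in emit_schedule([f"L{i}" for i in range(4)]):
    print(instr)  # the entire movement pattern exists before execution
```

Row by row, that compile-time knowledge replaces a runtime mechanism: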
| Mechanism | Old runtime behavior | Compiler-planned behavior | Savings |
|---|---|---|---|
| Memory allocation | Dynamic malloc/free per tensor | Static placement via liveness analysis | no fragmentation |
| Tensor movement | DMA triggered on cache miss | DMA issued N steps before needed | zero stall |
| Synchronization | Runtime fences on every handoff | Precomputed producer–consumer barriers | minimal fence overhead |
| Kernel launches | CPU-driven per launch (~10 μs each) | Captured graph, replayed from GPU | 10–100× lower launch latency |
| Memory tiering | Page on demand | Compile-time residency plan per tier | no page faults |
| Buffer reuse | Independent allocations per op | Overlapping lifetime → shared slots | lower peak HBM usage |
Tensor liveness analysis: register allocation for memory
The deepest idea in compiler-controlled memory management is that it is structurally identical to register allocation — one of the most studied problems in compiler theory. Registers are a finite, fast resource. HBM slots are a finite, fast resource. The algorithm for assigning variables to registers is the same algorithm for assigning tensors to HBM slots.
Liveness analysis determines the exact time interval during which each tensor must be in memory — its live range. Two tensors with non-overlapping live ranges can share the same physical memory slot, even if they're logically different buffers.
```python
# Liveness analysis: the core compiler primitive
from dataclasses import dataclass

MB = 1024 * 1024

@dataclass
class TensorLifetime:
    name: str
    size: int  # bytes
    born: int  # step when first written
    dies: int  # step after last read
    tier: str  # HBM | DRAM | CXL | SSD

def can_alias(a: TensorLifetime, b: TensorLifetime) -> bool:
    # one dies before the other is born → the same slot is safe
    return a.dies <= b.born or b.dies <= a.born

# Example graph: 32-layer transformer decode step
lifetimes = [
    TensorLifetime("weights_L7",  size=512 * MB, born=0, dies=2, tier="HBM"),
    TensorLifetime("weights_L8",  size=512 * MB, born=2, dies=4, tier="HBM"),
    TensorLifetime("attn_tmp_L7", size=64 * MB,  born=1, dies=2, tier="HBM"),
    TensorLifetime("attn_tmp_L8", size=64 * MB,  born=3, dies=4, tier="HBM"),
    TensorLifetime("mlp_tmp_L7",  size=64 * MB,  born=2, dies=3, tier="HBM"),
]

# Result: attn_tmp_L7, attn_tmp_L8, mlp_tmp_L7 can share one 64 MB slot;
#         weights_L7 and weights_L8 can share one 512 MB slot.
# Peak HBM: 512 + 64 MB instead of 512 + 512 + 64 + 64 + 64 MB
```
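To make the register-allocation analogy concrete, here is a greedy pass over those live ranges, in the spirit of linear-scan register allocation with HBM slots standing in for registers. This is a minimal sketch reusing the `TensorLifetime` records above, not how any production compiler packs buffers.

```python
# Greedy slot assignment over live ranges: linear-scan register
# allocation, with HBM slots standing in for registers.

def assign_slots(lifetimes: list) -> tuple:
    slots = []       # each entry: (capacity_bytes, tensors placed there)
    assignment = {}  # tensor name -> slot index
    for t in sorted(lifetimes, key=lambda t: t.born):
        for i, (cap, placed) in enumerate(slots):
            # reuse a slot only if the tensor fits and overlaps nothing in it
            if t.size <= cap and all(can_alias(t, p) for p in placed):
                placed.append(t)
                assignment[t.name] = i
                break
        else:
            slots.append((t.size, [t]))
            assignment[t.name] = len(slots) - 1
    peak = sum(cap for cap, _ in slots)
    return assignment, peak

assignment, peak = assign_slots(lifetimes)
print(assignment)                    # both weight tensors land in slot 0
print(f"peak HBM: {peak // MB} MB")  # 576 MB, matching the comment above
```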
This technique is not theoretical — XLA (Google's compiler for TPU/GPU) has used buffer assignment via liveness analysis since 2017. TVM applies it across heterogeneous memory tiers. MLIR's buffer placement passes generalize it to the multi-level IR ecosystem. What's changing is the scope: these techniques are now being extended to multi-device, multi-tier memory hierarchies including HBM, DRAM, CXL pools, and NVMe.
Compiler-scheduled DMA
DMA scheduling is where memory planning becomes a hardware–software contract. The compiler doesn't just decide where data lives — it decides when the DMA engine should move it, relative to the compute timeline.
Modern DMA controllers accept descriptor chains: pre-built lists of transfer operations with fence dependencies. The compiler can emit these chains as a static artifact, requiring minimal runtime intervention to execute.
```c
// Compiler-emitted DMA descriptor chain (C struct style)
struct dma_desc {
    uint64_t src_addr;    // physical or CXL address
    uint64_t dst_addr;    // HBM slot address
    uint32_t bytes;
    uint16_t stream_id;   // which DMA engine handles this
    uint16_t dep_fence;   // wait for this fence before starting
    uint16_t comp_fence;  // signal this fence on completion
};

// Emitted by compiler for one decode window:
desc[0] = { CXL_WEIGHTS_L9,  HBM_SLOT_A, 512_MB, S0, NONE, F1 }
desc[1] = { DRAM_KV_BLOCK42, HBM_KV_0,    64_MB, S1, NONE, F2 }
desc[2] = { EXEC: ATTENTION_L8, wait=[],      sig=F3 }
desc[3] = { EXEC: ATTENTION_L9, wait=[F1,F2]         }
desc[4] = { RELEASE: HBM_SLOT_A, after=F3 }
```
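The runtime side of that contract can be remarkably thin. Below is a toy replay loop over an equivalent chain; `Desc` and the synchronous `submit` print are stand-ins for real hardware queues, and a fence counts as done at submission because this toy executes strictly in order.

```python
# Toy replay of a compiler-emitted descriptor chain. The runtime makes
# no decisions: it submits in order and checks the compiler's fences.
from dataclasses import dataclass, field

@dataclass
class Desc:
    op: str                  # "dma" | "exec" | "release"
    target: str
    deps: list = field(default_factory=list)  # fences that must be done
    signals: str = ""        # fence raised by this descriptor, if any

def replay(chain: list) -> None:
    done = set()
    for d in chain:
        assert all(f in done for f in d.deps), f"fence miss before {d.target}"
        print(f"submit {d.op:<8} {d.target}")  # stand-in for a hardware queue
        if d.signals:
            done.add(d.signals)

replay([
    Desc("dma",     "weights_L9 -> HBM_SLOT_A", signals="F1"),
    Desc("dma",     "kv_block42 -> HBM_KV_0",   signals="F2"),
    Desc("exec",    "attention_L8",             signals="F3"),
    Desc("exec",    "attention_L9", deps=["F1", "F2"]),
    Desc("release", "HBM_SLOT_A",   deps=["F3"]),
])
```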
This principle is directly implemented in production systems. NVIDIA's cudaMemcpyAsync with explicit stream ordering is the low-level primitive, and CUDA Graphs capture the dependency graph. Kernel-level systems such as Triton and FlashAttention use software pipelining to overlap loads with compute inside a single kernel. The frontier is pushing the same idea from single-kernel scope to whole-graph, multi-device scope.
Graph capture: the first visible step
CUDA Graphs are the clearest production sign that this transition is already underway. Instead of launching kernels one by one from the CPU — incurring ~10 μs host overhead per launch — a workload is captured once and replayed from GPU memory.
```python
# CPU launches every kernel individually
for token in tokens:
    launch(q_proj, x)      # ~10μs
    launch(k_proj, x)      # ~10μs
    launch(v_proj, x)      # ~10μs
    launch(attn, q, k, v)  # ~10μs
    launch(mlp, y)         # ~10μs

# 50μs CPU overhead per token
# 100 tokens → 5ms wasted CPU time
```
```python
# Capture once:
graph = cuda.capture(decode_step)

# Replay per token (GPU-side, no CPU):
for token in tokens:
    graph.replay(input_ptr, output_ptr)  # ~0.5μs total overhead

# 100 tokens → 50μs
# 100× lower launch latency
```
But graph capture is a necessary stepping stone, not the destination. The deeper shift is graph capture + tensor placement + DMA scheduling + bounded buffer reuse. CUDA Graphs give you the replay primitive. Liveness analysis gives you the placement plan. The compiler's job is to produce the replay graph, the placement plan, and the DMA schedule together, as one artifact.
This is what XLA does end-to-end for TPU workloads — the compilation step produces buffer assignments, DMA schedules, and kernel graphs as a single artifact. The runtime is nearly vestigial: it just executes the plan. The compiler made all the decisions.
Bounded buffers: inference as a ring pipeline
One of the most elegant properties of compiler-planned execution is that it enables bounded memory use. Instead of a runtime that grows, fragments, and occasionally panics with OOM, the compiler computes the maximum peak memory needed — provably — and allocates it up front as a fixed pool with named slots.
This is structurally identical to how network interface cards manage packet queues: a fixed ring buffer, where slots cycle through empty → owned → complete → empty with no dynamic allocation.
```python
# What the compiler produces
HBM_POOL = [
    Slot(id=0, size=512_MB, tier="HBM"),  # weights double-buffer A
    Slot(id=1, size=512_MB, tier="HBM"),  # weights double-buffer B
    Slot(id=2, size=64_MB,  tier="HBM"),  # attention temp
    Slot(id=3, size=256_MB, tier="HBM"),  # KV active window
    Slot(id=4, size=256_MB, tier="HBM"),  # KV prefetch window
]

# total HBM committed: 1600 MB — known at compile time, guaranteed sufficient
# runtime never calls malloc()
```
This has a second-order benefit beyond avoiding fragmentation: it makes the memory footprint statically verifiable. A safety-critical deployment can prove at compile time that the model will never exceed its memory budget — no runtime surprises.
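The proof itself is a few lines once live ranges exist. A minimal sketch, reusing the `TensorLifetime` records and `MB` constant from the liveness example earlier; `HBM_BUDGET` is an arbitrary illustrative figure.

```python
# Compile-time budget check: worst-case concurrent residency per step,
# computed from live ranges alone. No runtime measurement involved.

def peak_residency(lifetimes: list) -> int:
    last = max(t.dies for t in lifetimes)
    return max(
        sum(t.size for t in lifetimes if t.born <= step < t.dies)
        for step in range(last + 1)
    )

HBM_BUDGET = 1600 * MB  # illustrative budget, as in the pool above
peak = peak_residency(lifetimes)
assert peak <= HBM_BUDGET, f"plan needs {peak} B, budget is {HBM_BUDGET} B"
print(f"verified: peak {peak // MB} MB within {HBM_BUDGET // MB} MB budget")
```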
Deterministic execution windows
Determinism is not an academic nicety — it is an operational requirement for production AI systems. Non-deterministic memory movement creates tail latency, and tail latency in AI inference compounds across batches, agent loops, and service tiers in ways that are difficult to debug and expensive to provision for.
| Property | Dynamic runtime | Compiler-planned determinism |
|---|---|---|
| Latency profile | p50 good, p99 painful — driven by allocator and page-fault jitter | Bounded by compile-time worst-case analysis |
| Memory footprint | Grows over time due to fragmentation | Fixed by compiler allocation plan |
| DMA timing | Reactive — triggered by misses | Scheduled — issued N steps early |
| Debuggability | Latency spikes are hard to reproduce | Execution is replayable from descriptor log |
| Safety certification | Runtime behaviour cannot be fully pre-verified | Static analysis possible over compiled plan |
| SLA provisioning | Must over-provision for p99 headroom | Provision for known max, not statistical estimate |
In a 32-step agentic loop where each step has p99 latency 2× above p50, the probability of hitting at least one high-latency step is 1 - 0.99^32 ≈ 27%. More than one in four agent sessions hits a tail event. Compiler-planned determinism attacks this by compressing the gap between p50 and p99.
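The same arithmetic extends to any loop depth, under the simplifying assumption that steps are independent and a tail event means exceeding p99.

```python
# Probability that an n-step agent loop hits at least one p99 tail event,
# assuming independent steps that each exceed p99 with probability 1%.
for n in (8, 32, 128):
    print(f"{n:>3} steps -> {1 - 0.99 ** n:.1%} of sessions hit the tail")
```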
The full compiler pipeline
Putting the pieces together: a memory-aware AI compiler takes a model graph and produces not one but three artifacts — a compute graph, a memory traffic plan, and a synchronization schedule. These three together constitute the "deterministic execution plan" that a thin runtime can execute with minimal improvisation.
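As a sketch, that bundle might look like the structure below; every field name is illustrative rather than taken from any real compiler's artifact format.

```python
# One compiled artifact, three coordinated plans. A thin runtime consumes
# this and improvises nothing. All field names are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class ExecutionPlan:
    kernel_graph: tuple     # (step, kernel) pairs in issue order
    slot_assignment: dict   # tensor name -> fixed memory slot id
    dma_schedule: tuple     # (step, src, dst) transfers, issued early
    fence_table: dict       # fence name -> operations that wait on it
    peak_hbm_bytes: int     # provably sufficient, checked at compile time
```

The frozen dataclass is the point: the plan is immutable data, produced once and replayed verbatim.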
This is the architecture of XLA on TPUs today. The TPU runtime is deliberately thin — it executes the compiled plan and does almost nothing else. The insight is that a rich runtime is not a feature; it is deferred compilation, with all the associated unpredictability. Moving those decisions into the compiler makes them auditable, optimizable, and deterministic.
Hardware must cooperate
Compiler-controlled execution is only possible when hardware exposes the right interfaces. If the compiler emits a DMA schedule but the hardware has no mechanism to accept and hold that schedule, the optimization is unreachable. This is why the compiler story is inseparable from the hardware roadmap.
Explicit control surfaces
- Programmable DMA engines that accept descriptor chains
- Memory-tier visibility (HBM vs DRAM vs CXL)
- Compiler-addressable prefetch units
- Named synchronization fences
- Graph replay primitives
- Residency hint interfaces
Predictable workloads
If the compiler tells the DMA engine what it will need two steps in advance, the DMA engine can optimize for throughput rather than latency. Hardware power management improves, interconnect utilization smooths out, and memory-controller queue depths shrink.
Memory pooling
CXL.mem gives the compiler a named, coherent third tier between HBM and network storage. Placement decisions become concrete.
Graph replay
The GPU executes a pre-built kernel dependency graph. The CPU is removed from the hot path entirely.
DMA offload
A SmartNIC that accepts DMA descriptor chains from a compiler artifact can move tensors between nodes without CPU mediation.
SSD → HBM
GPUDirect Storage closes the gap between the cold storage tier and HBM, enabling compiler-planned overflow to NVMe.
CXL, SmartNICs, GPUDirect Storage, NVLink, and CUDA Graphs are not independent hardware stories. They are the hardware primitives that compiler-controlled memory scheduling requires. Each one exposes a control surface the compiler can target. The compiler is the integrating layer above them.
Cluster-level distributed traffic control
The most ambitious extension of compiler-planned execution is not one GPU — it is one compiler plan across an entire inference cluster, treating the multi-node system as a single scheduled pipeline.
A cluster-level compiler would produce a plan that answers questions no single-device compiler ever needs to ask:
```
# Conceptual cluster execution plan — window 17
window_17:
  gpu0.exec:    attention_layer_21
  gpu1.dma:     kv_block_104 → gpu0.hbm.slot7     # NVLink
  nic0.route:   activations_rank3 → gpu2          # IB/RoCE
  cxl.prefetch: weights_layer_22 → gpu0.near_mem  # CXL.mem
  fence:        wait(kv_block_104, weights_layer_22)
  release:      gpu0.hbm.slot2

window_18:
  gpu0.exec:    attention_layer_22  # all deps guaranteed ready
  gpu2.dma:     kv_block_105 → gpu2.hbm.slot3
  # ...
```
This is not science fiction — it is a natural extension of what XLA already does for TPU pods, where a single compilation pass produces a globally coordinated execution plan for hundreds of chips. The open question is how far this extends: to heterogeneous clusters with GPUs, CXL memory, SmartNICs, and multiple storage tiers, all coordinated by a single compiler artifact.
The AI compiler of 2020 was a graph optimizer. The AI compiler of 2025 is a memory traffic planner. The AI compiler of 2027 may be a distributed traffic controller — producing deterministic execution plans for the entire inference fleet, not just one chip.
The new objective function
The old compiler objective was to minimize instruction count: reduce branches, pack computations, vectorize loops. These goals are still relevant, but they are no longer the binding constraint in AI inference.
The new objective is: minimize unnecessary memory movement while guaranteeing throughput and latency bounds. That means the compiler must reason about bytes, not just operations. It must reason about time, not just transformation. It must reason across multiple memory tiers, multiple devices, and multiple concurrent sessions simultaneously.
The future AI stack will not be defined by who computes fastest — it will be defined by who moves memory most intelligently. And the compiler is becoming the system that makes that intelligence possible.
Code generator
Transforms ops → kernels. Optimizes instruction sequences. Output: executable binary.
Memory planner
Liveness + placement + DMA schedule + graph capture. Output: deterministic execution plan.
Traffic controller
Cluster-wide orchestration across GPUs, CXL, SmartNICs, storage. Output: distributed plan for all devices.