compiler theory · memory systems · deterministic inference

The Compiler Becomes
the Memory Scheduler

Runtime dynamism is becoming too expensive for AI inference at scale. The next generation of AI compilers will not just optimize computation — they will pre-plan every byte's journey through the memory hierarchy before execution begins.

~10 μs · CUDA kernel launch overhead (CPU path, per launch)
<1 μs · CUDA graph replay latency (after capture)
60–80% · wall-clock time that can be memory traffic, not GEMM
32–48 · transformer layers each needing weight + KV staging
O(n²) · how synchronization overhead scales with reactive allocation
Section 01

The old assumption is breaking

For decades, the compute stack was built on a comfortable foundation: the runtime can decide things dynamically, and that's fine. Dynamic allocation, dynamic scheduling, dynamic dispatch, dynamic paging.

This worked because CPUs were balanced systems — memory access was slow relative to compute, but predictably so. Workloads were irregular and bursty, so static planning was often impossible. Latency budgets were forgiving. The runtime had time to think.

The breaking point

AI inference changes all three assumptions simultaneously. Memory bandwidth is now the binding constraint, not compute. Workloads run billions of near-identical iterations. Latency budgets have collapsed to sub-millisecond SLAs. The runtime no longer has time to think.

The result is a quiet revolution in how AI systems are compiled, scheduled, and executed. Not in the math — that's still GEMM on tensors. But in the orchestration layer underneath. The compiler is evolving from a code generator into a memory traffic planner, and the implications reach from kernel scheduling all the way to cluster topology.

[Figure: the shift from runtime-dynamic to compiler-planned execution. Runtime-dynamic world: allocate_now(), copy_when_missing(), page_on_fault(), fence_when_needed() → discovers stalls after blocking. Compiler-planned world: static_placement(tensor, tier), schedule_dma(t=N-2, src, dst), precomputed_fence(F1, F2), bounded_slot(id, release=t) → eliminates stalls before they occur.]
// The core shift: from reactive runtime decisions to precomputed memory traffic plans emitted by the compiler.
Section 02

The real problem is memory traffic

The industry still measures AI hardware in teraFLOPs. The actual bottleneck in production inference systems is increasingly something different: how fast bytes move, not how fast multiplications happen.

Step through a single transformer decode iteration on a large model and list what must happen before the matrix multiply can start:

# One transformer decode step — the unglamorous full picture

1. Prefetch weights for layer N+1         # from HBM or CXL
2. DMA KV cache block for this request     # from DRAM or CXL pool
3. Stage activations from previous layer   # HBM → compute cache
4. Wait on synchronization fences          # stream ordering
5. ⚡ GEMM / attention compute             # the part everyone counts
6. Write new KV entries back               # to HBM / CXL
7. Stream output tokens                    # to decode buffer
# Steps 1–4 and 6–7 are memory traffic. Step 5 is compute.
[Figure: inference wall-time breakdown per decode step, t=0…T. DMA-engine lanes (prefetch of weights and KV blocks) often take 60–80% of step time; the GPU compute lane (GEMM/attention), sync fences, and KV writeback fill the rest.]
// The GPU compute window (GEMM/attention) is often a minority of wall time. The rest is orchestration: DMA, prefetch, sync, writeback.
The arithmetic roofline is not the problem

A model can be entirely arithmetic-bound in isolation on one layer — and still be memory-bound system-wide across all layers, because the coordination between layers costs more than the computation inside them.
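A back-of-envelope illustration of that claim, in Python with purely illustrative numbers (no real device is being modeled):

# Illustrative only: one batch-1 decode step on a hypothetical accelerator.
TFLOPS = 400e12      # sustained compute, FLOP/s (assumed)
HBM_BW = 3e12        # effective staging bandwidth, bytes/s (assumed)

flops_per_layer = 2 * 8192 * 8192        # one 8192x8192 GEMV at batch 1
bytes_per_layer = 8192 * 8192 * 2        # fp16 weights streamed in per layer

compute_us = flops_per_layer / TFLOPS * 1e6   # ~0.34 us: arithmetic-bound in isolation
traffic_us = bytes_per_layer / HBM_BW * 1e6   # ~44.7 us: staging dominates the step
print(f"compute {compute_us:.2f} us vs traffic {traffic_us:.1f} us")

At batch 1 the layer math is near-instant next to the staging traffic; only overlap (or batching, which raises arithmetic intensity) closes the gap.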

Section 03

Why runtime dynamism gets expensive

Traditional runtimes make decisions just-in-time. This is fine when decisions are cheap and errors are recoverable. In AI inference at scale, both conditions fail.

Reactive runtime

The pipeline stalls, then recovers

kernel starts
dependency missing
runtime detects miss
launch DMA
stall waiting
DMA completes
resume
# detection + dispatch = 10–50μs

The pipeline blocks on discovering the miss. The cost is not just the DMA time — it includes detection latency and dispatch latency.

Compiler-planned

Traffic is always ahead of compute

# compiler emitted at build time:
t=N-2: DMA weights_L9 → HBM_slot4
t=N-2: DMA kv_block42 → HBM_kv0
t=N-1: fence(DMA_9, DMA_kv)
t=N:   exec attention_L9
# GPU never sees an empty pipeline

DMA is issued 2 steps early. By the time compute arrives at L9, the data is already there. Zero stall time.

Beyond individual stalls, runtime dynamism has systemic costs that compound:

Cost 1

Allocator fragmentation

Dynamic malloc/free fragments the HBM pool over time, causing increasingly expensive allocation searches and potential out-of-memory despite available space.

Cost 2

Synchronization overhead

Reactive fences and stream waits add overhead on every producer–consumer handoff. At thousands of micro-operations per second, that per-handoff cost compounds into milliseconds of pure synchronization.

Cost 3

CPU-side scheduling

Each kernel launch requires a CPU call (~10μs host overhead). For models with thousands of operations per forward pass, this alone adds ~10ms — entirely avoidable with graph capture.

Section 04

Static execution planning

The key insight is simple: if the computation graph is mostly known before execution begins, memory movement can be planned before execution begins.

This is not a new idea in computing. It's the same principle that transformed compilers from interpreters to ahead-of-time code generators. The question is what "known ahead of time" means for AI inference — and the answer is: more than most people assume.

At inference time, the model architecture is fixed. The layer sequence is fixed. The tensor shapes are fixed (for a given batch/context size). What changes is the data — the token values, the KV cache contents. But the movement pattern of that data through the memory hierarchy is almost entirely static. The compiler can plan it.
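To make "the movement pattern is static" concrete, here is a minimal sketch (the schedule format and helper are ours, not any real compiler's) that unrolls a fixed layer sequence into a traffic plan at build time:

# Sketch: a fixed layer sequence unrolls into a fixed traffic schedule.
LOOKAHEAD = 2  # issue each transfer two steps before its consumer

def plan_decode_traffic(num_layers: int) -> list[str]:
    # prime the pipeline for the first LOOKAHEAD layers
    plan = [f"t=-1: dma weights_L{i} + kv_block_L{i} -> hbm"
            for i in range(min(LOOKAHEAD, num_layers))]
    for t in range(num_layers):
        if t + LOOKAHEAD < num_layers:
            plan.append(f"t={t}: dma weights_L{t+LOOKAHEAD} + kv_block_L{t+LOOKAHEAD} -> hbm")
        plan.append(f"t={t}: fence weights_L{t}, kv_block_L{t}")
        plan.append(f"t={t}: exec layer_L{t}")
    return plan

# Identical for every token of every request: the data changes, the plan does not.
print("\n".join(plan_decode_traffic(4)))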

Layer             | Old runtime behavior                | Compiler-planned behavior                 | Savings
Memory allocation | Dynamic malloc/free per tensor      | Static placement via liveness analysis   | no fragmentation
Tensor movement   | DMA triggered on cache miss         | DMA issued N steps before needed         | zero stall
Synchronization   | Runtime fences on every handoff     | Precomputed producer–consumer barriers   | minimal fence overhead
Kernel launches   | CPU-driven per launch (~10 μs each) | Captured graph, replayed from GPU        | 10–100× lower launch latency
Memory tiering    | Page on demand                      | Compile-time residency plan per tier     | no page faults
Buffer reuse      | Independent allocations per op      | Non-overlapping lifetimes → shared slots | lower peak HBM usage
Section 05

Tensor liveness analysis: register allocation for memory

The deepest idea in compiler-controlled memory management is that it is structurally identical to register allocation — one of the most studied problems in compiler theory. Registers are a finite, fast resource. HBM slots are a finite, fast resource. The algorithm for assigning variables to registers is the same algorithm for assigning tensors to HBM slots.

Liveness analysis determines the exact time interval during which each tensor must be in memory — its live range. Two tensors with non-overlapping live ranges can share the same physical memory slot, even if they're logically different buffers.

# Liveness analysis: the core compiler primitive
from dataclasses import dataclass

MB = 1 << 20  # bytes per MB

@dataclass
class TensorLifetime:
    name: str
    size: int    # bytes
    born: int    # step when first written
    dies: int    # step after last read
    tier: str    # HBM | DRAM | CXL | SSD

def can_alias(a: TensorLifetime, b: TensorLifetime) -> bool:
    # a dies before b is born (or vice versa) → same slot is safe
    return a.dies <= b.born or b.dies <= a.born

# Example: one window of a 32-layer transformer decode step (layers 7–8)
lifetimes = [
    TensorLifetime("weights_L7",  size=512 * MB, born=0, dies=2, tier="HBM"),
    TensorLifetime("weights_L8",  size=512 * MB, born=2, dies=4, tier="HBM"),
    TensorLifetime("attn_tmp_L7", size=64 * MB,  born=1, dies=2, tier="HBM"),
    TensorLifetime("attn_tmp_L8", size=64 * MB,  born=3, dies=4, tier="HBM"),
    TensorLifetime("mlp_tmp_L7",  size=64 * MB,  born=2, dies=3, tier="HBM"),
]

# Result: attn_tmp_L7, attn_tmp_L8, mlp_tmp_L7 can share one 64 MB slot;
#         weights_L7 and weights_L8 can share one 512 MB slot.
# Peak HBM: 512+64 MB instead of 512+512+64+64+64 MB.
[Figure: tensor lifetime chart over t0…t5. weights_L7 and weights_L8 (512 MB, HBM) share one 512 MB slot; attn_tmp_L7, mlp_tmp_L7, and attn_tmp_L8 all share one 64 MB slot (192 MB → 64 MB).]
// Liveness analysis: non-overlapping lifetimes → shared physical slots. This is register allocation, applied to HBM. Peak memory drops from 1216 MB to 576 MB (52% reduction) for this example.
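One way to compute that sharing is a greedy linear-scan pass over the lifetimes, the same shape as linear-scan register allocation. A minimal sketch using the TensorLifetime records above (illustrative, not XLA's actual buffer-assignment pass):

from collections import defaultdict

def assign_slots(lifetimes: list[TensorLifetime]) -> dict[str, int]:
    slots: dict[str, int] = {}              # tensor name -> slot id
    free_at = defaultdict(list)             # size -> [(release_step, slot_id)]
    next_id = 0
    for tl in sorted(lifetimes, key=lambda tl: tl.born):
        pool = free_at[tl.size]
        reusable = next((s for s in pool if s[0] <= tl.born), None)
        if reusable is not None:
            pool.remove(reusable)
            slot = reusable[1]              # reuse a dead tensor's slot, same size
        else:
            slot, next_id = next_id, next_id + 1   # open a fresh slot
        slots[tl.name] = slot
        pool.append((tl.dies, slot))        # slot frees when this tensor dies
    return slots

# assign_slots(lifetimes) maps weights_L7/L8 to one slot and the three 64 MB
# temporaries to another: the 1216 MB → 576 MB reduction from the figure.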

This technique is not theoretical — XLA (Google's compiler for TPU/GPU) has used buffer assignment via liveness analysis since 2017. TVM applies it across heterogeneous memory tiers. MLIR's buffer placement passes generalize it to the multi-level IR ecosystem. What's changing is the scope: these techniques are now being extended to multi-device, multi-tier memory hierarchies including HBM, DRAM, CXL pools, and NVMe.

Section 06

Compiler-scheduled DMA

DMA scheduling is where memory planning becomes a hardware–software contract. The compiler doesn't just decide where data lives — it decides when the DMA engine should move it, relative to the compute timeline.

Modern DMA controllers accept descriptor chains: pre-built lists of transfer operations with fence dependencies. The compiler can emit these chains as a static artifact, requiring minimal runtime intervention to execute.

// Compiler-emitted DMA descriptor chain (C struct style; names such as
// CXL_WEIGHTS_L9, S0, F1, and the 512_MB size stand in for generated values)
#include <stdint.h>

struct dma_desc {
    uint64_t src_addr;          // physical or CXL address
    uint64_t dst_addr;          // HBM slot address
    uint32_t bytes;
    uint16_t stream_id;         // which DMA engine handles this
    uint16_t dep_fence;         // wait for this fence before starting
    uint16_t comp_fence;        // signal this fence on completion
};

// Emitted by the compiler for one decode window: two DMA descriptors,
// interleaved with compute and release nodes in the same static schedule.
desc[0] = { CXL_WEIGHTS_L9,  HBM_SLOT_A, 512_MB, S0, NONE, F1 };
desc[1] = { DRAM_KV_BLOCK42, HBM_KV_0,    64_MB, S1, NONE, F2 };
//   exec    ATTENTION_L8    wait=[]         signal=F3
//   exec    ATTENTION_L9    wait=[F1, F2]
//   release HBM_SLOT_A      after=F3
[Figure: DMA/compute overlap timeline. Without DMA scheduling, total = DMA_L9 + attn_L8 + DMA_KV + attn_L9 (sequential). With compiler-scheduled DMA, the weight and KV transfers run alongside attn_L8, gated by fences F1 (weights ready) and F2 (KV ready), and total ≈ max(DMA, compute), not the sum.]
// Overlap converts sequential time (sum) into parallel time (max). The compiler makes this possible by issuing DMA before the compute that needs it.

This is directly implemented in production systems. NVIDIA's cudaMemcpyAsync with explicit stream ordering is the low-level primitive. CUDA Graphs capture the dependency graph. Higher-level frameworks like Triton and FlashAttention use pipelining primitives to overlap loads and compute within a single kernel. The frontier is pushing this principle from single-kernel scope to whole-graph, multi-device scope.
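To ground the primitive, here is the double-buffered prefetch pattern expressed with PyTorch's stream and event APIs. The stream/event calls are real PyTorch; the sizes, buffers, and schedule around them are illustrative:

import torch

N_LAYERS, D = 8, 4096
host_w = [torch.randn(D, D).pin_memory() for _ in range(N_LAYERS)]
bufs = [torch.empty(D, D, device="cuda") for _ in range(2)]   # double buffer
ready = [torch.cuda.Event() for _ in range(2)]   # copy finished (read-after-write fence)
freed = [torch.cuda.Event() for _ in range(2)]   # compute done reading (write-after-read fence)
copy_stream = torch.cuda.Stream()

bufs[0].copy_(host_w[0])                  # prime the first buffer synchronously
x = torch.randn(1, D, device="cuda")

for layer in range(N_LAYERS):
    cur, nxt = layer % 2, (layer + 1) % 2
    if layer + 1 < N_LAYERS:
        with torch.cuda.stream(copy_stream):
            copy_stream.wait_event(freed[nxt])                # buffer no longer being read
            bufs[nxt].copy_(host_w[layer + 1], non_blocking=True)
            ready[nxt].record(copy_stream)
    if layer > 0:
        torch.cuda.current_stream().wait_event(ready[cur])    # fence: weights ready
    x = x @ bufs[cur]            # compute overlaps the in-flight copy
    freed[cur].record()

Compiler-planned execution generalizes exactly this: the fences and issue points, hand-written here, become artifacts the compiler emits for the whole graph.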

Section 07

Graph capture: the first visible step

CUDA Graphs are the clearest production evidence that this transition is already underway. Instead of launching kernels one by one from the CPU — incurring ~10 μs host overhead per launch — a workload is captured once and replayed from GPU memory.

Dynamic kernel launch (old)
# CPU launches every kernel individually
for token in tokens:
  launch(q_proj, x)        # ~10μs
  launch(k_proj, x)        # ~10μs
  launch(v_proj, x)        # ~10μs
  launch(attn, q, k, v)   # ~10μs
  launch(mlp, y)           # ~10μs
  # 50μs CPU overhead per token
  # 100 tokens → 5ms wasted CPU time
CUDA Graph replay (new)
# Capture once:
graph = cuda.capture(decode_step)

# Replay per token (GPU-side, no CPU):
for token in tokens:
  graph.replay(input_ptr,
               output_ptr)
  # ~0.5μs total overhead
  # 100 tokens → 50μs
  # 100× lower launch latency
[Figure: dynamic launch vs CUDA graph replay. Five per-token kernels each pay a CPU→GPU dispatch, ≈50 μs of overhead; a single replay node has the GPU execute the same graph internally, ≈0.5 μs. A 100× reduction in launch overhead, and this is just the first step.]
// CUDA Graphs eliminate the CPU-to-GPU dispatch round-trip for every kernel. Replay issues the full graph from GPU memory in ~0.5μs vs ~10μs per launch.
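PyTorch exposes this capture/replay pattern directly. A minimal inference-only sketch (the model and shapes are illustrative):

import torch

model = torch.nn.Linear(4096, 4096).cuda().eval()
static_in = torch.zeros(1, 4096, device="cuda")

with torch.no_grad():
    # Warm up on a side stream (required before capture).
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            static_out = model(static_in)
    torch.cuda.current_stream().wait_stream(s)

    # Capture once: every kernel launched inside the context joins the graph.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = model(static_in)

# Replay per token: refill the captured input buffer, then launch the whole
# graph with one call, with no per-kernel CPU dispatch.
for token_input in torch.randn(100, 1, 4096, device="cuda"):
    static_in.copy_(token_input)
    g.replay()
    # static_out now holds this token's result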

But graph capture is a necessary stepping stone, not the destination. The deeper shift is: graph capture + tensor placement + DMA scheduling + bounded buffer reuse. CUDA Graphs give you the replay primitive. Liveness analysis gives you the placement plan. DMA scheduling gives you the traffic plan. The compiler's job is to produce all three together.

This is what XLA does end-to-end for TPU workloads — the compilation step produces buffer assignments, DMA schedules, and kernel graphs as a single artifact. The runtime is nearly vestigial: it just executes the plan. The compiler made all the decisions.

Section 08

Bounded buffers: inference as a ring pipeline

One of the most elegant properties of compiler-planned execution is that it enables bounded memory use. Instead of a runtime that grows, fragments, and occasionally panics with OOM, the compiler computes the maximum peak memory needed — provably — and allocates it up front as a fixed pool with named slots.

This is structurally identical to how network interface cards manage packet queues: a fixed ring buffer, where slots cycle through empty → owned → complete → empty with no dynamic allocation.

[Figure: bounded buffer ring of N fixed-size HBM slots cycling FREE → DMA active → COMPUTE → writeback → FREE. The compiler determines N at build time: no runtime allocator, no fragmentation, no OOM surprises.]
// Bounded buffer ring: N slots, each cycling through FREE → DMA_ACTIVE → COMPUTE → WRITEBACK → FREE. The compiler knows N statically. No dynamic allocator is needed.
# What the compiler produces
from dataclasses import dataclass

MB = 1 << 20

@dataclass(frozen=True)
class Slot:
    id: int
    size: int    # bytes
    tier: str

HBM_POOL = [
    Slot(id=0, size=512 * MB, tier="HBM"),   # weights double-buffer A
    Slot(id=1, size=512 * MB, tier="HBM"),   # weights double-buffer B
    Slot(id=2, size=64 * MB,  tier="HBM"),   # attention temp
    Slot(id=3, size=256 * MB, tier="HBM"),   # KV active window
    Slot(id=4, size=256 * MB, tier="HBM"),   # KV prefetch window
]
# total HBM committed: 1600 MB — known at compile time, guaranteed sufficient
# runtime never calls malloc()
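The slot life cycle from the ring diagram, sketched as a state machine (names and states are illustrative):

from enum import Enum, auto
from collections import deque

class SlotState(Enum):
    FREE = auto()
    DMA_ACTIVE = auto()
    COMPUTE = auto()
    WRITEBACK = auto()

class SlotRing:
    """Fixed pool of N slots; nothing is allocated after construction."""
    def __init__(self, n_slots: int):
        self.state = [SlotState.FREE] * n_slots
        self.free = deque(range(n_slots))

    def acquire_for_dma(self) -> int:
        slot = self.free.popleft()     # the compile-time plan guarantees one is free
        self.state[slot] = SlotState.DMA_ACTIVE
        return slot

    def advance(self, slot: int, to: SlotState) -> None:
        self.state[slot] = to          # DMA_ACTIVE -> COMPUTE -> WRITEBACK -> FREE
        if to is SlotState.FREE:
            self.free.append(slot)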

This has a second-order benefit beyond avoiding fragmentation: it makes the memory footprint statically verifiable. A safety-critical deployment can prove at compile time that the model will never exceed its memory budget — no runtime surprises.
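That verification can be a one-line compile-time check over the pool above (the budget figure is an assumption for illustration):

# Compile-time budget proof: fails the build, never the deployment.
HBM_BUDGET_MB = 2048                               # illustrative device budget
committed_mb = sum(s.size for s in HBM_POOL) // MB
assert committed_mb <= HBM_BUDGET_MB, f"plan needs {committed_mb} MB, budget {HBM_BUDGET_MB} MB"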

Section 09

Deterministic execution windows

Determinism is not an academic nicety — it is an operational requirement for production AI systems. Non-deterministic memory movement creates tail latency, and tail latency in AI inference compounds across batches, agent loops, and service tiers in ways that are difficult to debug and expensive to provision for.

Property             | Dynamic runtime                                                    | Compiler-planned determinism
Latency profile      | p50 good, p99 painful — driven by allocator and page-fault jitter | Bounded by compile-time worst-case analysis
Memory footprint     | Grows over time due to fragmentation                               | Fixed by compiler allocation plan
DMA timing           | Reactive — triggered by misses                                     | Scheduled — issued N steps early
Debuggability        | Latency spikes are hard to reproduce                               | Execution is replayable from descriptor log
Safety certification | Runtime behaviour cannot be fully pre-verified                     | Static analysis possible over compiled plan
SLA provisioning     | Must over-provision for p99 headroom                               | Provision for known max, not statistical estimate
Why tail latency matters more than average latency

In a 32-step agentic loop where each step's p99 latency is 2× its p50, the probability of hitting at least one p99-scale step is 1 - 0.99^32 ≈ 27%. More than one in four agent sessions hits a tail event. Compiler-planned determinism attacks this by compressing the gap between p50 and p99.
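The compounding arithmetic, spelled out:

# Per-step tail probability compounds across a multi-step agent loop.
p_tail, steps = 0.01, 32                       # p99 event: 1% chance per step
p_any = 1 - (1 - p_tail) ** steps
print(f"P(>=1 tail event in {steps} steps) = {p_any:.1%}")   # 27.5%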

Section 10

The full compiler pipeline

Putting the pieces together: a memory-aware AI compiler takes a model graph and produces not one but three artifacts — a compute graph, a memory traffic plan, and a synchronization schedule. These three together constitute the "deterministic execution plan" that a thin runtime can execute with minimal improvisation.

[Figure: compiler pipeline from model graph (ops + shapes) through liveness (tensor lifetimes, alias candidates), placement (HBM/DRAM/CXL slot assignment), DMA scheduling (descriptor chain, overlap windows), and graph capture. The compiler produces three artifacts: a compute graph (kernels + dependency graph, CUDA graph artifact), a memory plan (slot assignments, tier residency map), and a traffic schedule (DMA descriptor chain, fence dependency graph). A thin deterministic runtime executes the plan and makes no decisions.]
// The compiler is no longer a kernel emitter. It is a system planner: liveness analysis → placement → DMA schedule → graph capture → one deterministic artifact the runtime just executes.

This is the architecture of XLA on TPUs today. The TPU runtime is deliberately thin — it executes the compiled plan and does almost nothing else. The insight is that a rich runtime is not a feature; it is deferred compilation, with all the associated unpredictability. Moving those decisions into the compiler makes them auditable, optimizable, and deterministic.

Section 11

Hardware must cooperate

Compiler-controlled execution is only possible when hardware exposes the right interfaces. If the compiler emits a DMA schedule but the hardware has no mechanism to accept and hold that schedule, the optimization is unreachable. This is why the compiler story is inseparable from the hardware roadmap.

What compilers need from hardware

Explicit control surfaces

Programmable DMA engines that accept descriptor chains. Memory-tier visibility (HBM vs DRAM vs CXL). Compiler-addressable prefetch units. Named synchronization fences. Graph replay primitives. Residency hint interfaces.

What hardware gains in return

Predictable workloads

If the compiler tells the DMA engine what it will need 2 steps in advance, the DMA engine can optimize for throughput rather than latency. Hardware power management improves. Interconnect utilization smooths. Memory controller queuing reduces.

CXL

Memory pooling

CXL.mem gives the compiler a named, coherent third tier between HBM and network storage. Placement decisions become concrete.

CUDA Graphs

Graph replay

The GPU executes a pre-built kernel dependency graph. The CPU is removed from the hot path entirely.

SmartNICs

DMA offload

A SmartNIC that accepts DMA descriptor chains from a compiler artifact can move tensors between nodes without CPU mediation.

GPUDirect

SSD → HBM

GPUDirect Storage closes the gap between the cold storage tier and HBM, enabling compiler-planned overflow to NVMe.

The convergence

CXL, SmartNICs, GPUDirect Storage, NVLink, and CUDA Graphs are not independent hardware stories. They are the hardware primitives that compiler-controlled memory scheduling requires. Each one exposes a control surface the compiler can target. The compiler is the integrating layer above them.

Section 12 — The Frontier

Cluster-level distributed traffic control

The most ambitious extension of compiler-planned execution is not one GPU — it is one compiler plan across an entire inference cluster, treating the multi-node system as a single scheduled pipeline.

A cluster-level compiler would produce a plan that answers questions no single-device compiler ever needs to ask:

# Conceptual cluster execution plan — window 17
window_17:
  gpu0.exec:    attention_layer_21
  gpu1.dma:     kv_block_104 → gpu0.hbm.slot7   # NVLink
  nic0.route:   activations_rank3 → gpu2          # IB/RoCE
  cxl.prefetch: weights_layer_22 → gpu0.near_mem  # CXL.mem
  fence:        wait(kv_block_104, weights_layer_22)
  release:      gpu0.hbm.slot2

window_18:
  gpu0.exec:    attention_layer_22              # all deps guaranteed ready
  gpu2.dma:     kv_block_105 → gpu2.hbm.slot3
  # ...
[Figure: the cluster compiler as global traffic controller, emitting per-device plan fragments: a GPU 0 compute plan, a GPU 1 DMA plan, a SmartNIC routing plan, a CXL pool prefetch plan, and a fence schedule. All devices execute their fragment independently; fences are the only coordination point.]
// The cluster-level compiler produces a per-device fragment of a global execution plan. Each device executes its fragment independently; the fence schedule coordinates the seams.

This is not science fiction — it is a natural extension of what XLA already does for TPU pods, where a single compilation pass produces a globally coordinated execution plan for hundreds of chips. The open question is how far this extends: to heterogeneous clusters with GPUs, CXL memory, SmartNICs, and multiple storage tiers, all coordinated by a single compiler artifact.

The AI compiler of 2020 was a graph optimizer. The AI compiler of 2025 is a memory traffic planner. The AI compiler of 2027 may be a distributed traffic controller — producing deterministic execution plans for the entire inference fleet, not just one chip.

Final Thought

The new objective function

The old compiler objective was to minimize instruction count. Reduce branches, pack computations, vectorize loops. These are still relevant but they are no longer the binding constraint in AI inference.

The new objective is: minimize unnecessary memory movement while guaranteeing throughput and latency bounds. That means the compiler must reason about bytes, not just operations. It must reason about time, not just transformation. It must reason across multiple memory tiers, multiple devices, and multiple concurrent sessions simultaneously.
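Stated as an optimization problem (the notation here is ours, not a standard formulation):

$$
\min_{\text{plan}} \sum_{t \,\in\, \text{tensors}} \mathrm{bytes\_moved}(t, \text{plan})
\quad \text{s.t.} \quad
L_{p99}(\text{plan}) \le L_{\mathrm{SLA}}, \quad
T(\text{plan}) \ge T_{\mathrm{target}}, \quad
\max_{\tau} M_{\mathrm{HBM}}(\tau, \text{plan}) \le M_{\mathrm{budget}}
$$

Each pass in the pipeline serves one term: liveness and placement bound the memory constraint, while DMA scheduling and graph capture bound the latency and throughput constraints.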

The thesis in one sentence

The future AI stack will not be defined by who computes fastest — it will be defined by who moves memory most intelligently. And the compiler is becoming the system that makes that intelligence possible.

Then

Code generator

Transforms ops → kernels. Optimizes instruction sequences. Output: executable binary.

Now

Memory planner

Liveness + placement + DMA schedule + graph capture. Output: deterministic execution plan.

Next

Traffic controller

Cluster-wide orchestration across GPUs, CXL, SmartNICs, storage. Output: distributed plan for all devices.