The Compiler Is the New Kernel: Why MLIR/Triton/XLA Are the Most Underrated Layer in AI Infrastructure
- Why the compiler layer is becoming critical path
- MLIR: multi-level IR as an architectural stance
- Triton: the democratization layer and its real limits
- XLA and HLO: the scheduling IR nobody reads carefully enough
- Where compiler meets memory policy
- Memory intent emission: what compilers must start doing
- Hardware-compiler co-design is no longer optional
- The compiler is not a build step
1. Why the compiler layer is becoming critical path
In the GPU-first era of AI training, the compiler was invisible. PyTorch's eager execution mode ran operations directly against CUDA libraries. cuBLAS handled matrix multiplications. cuDNN handled convolutions. The compiler's job was to translate Python to something that could call those libraries. It was a thin translation layer, not a systems component.
That model breaks down under inference pressure.
Autoregressive decode is not a library problem. It is a system-scheduling problem. Every decode step reads weights from HBM, reads the KV cache from HBM, performs attention, updates the cache, and produces one token. The arithmetic intensity of a decode step on a 70B model is approximately 1.6 FLOP/byte, roughly ninety times below the hardware ridge point of an H200. No library call fixes this. The only thing that can approach the bandwidth limit is a compiler that understands the full memory access pattern of the workload and plans data movement explicitly rather than letting the hardware speculate.
This is the structural reason the compiler has moved from build tooling to systems component: the hardware is fast enough that the binding constraint is data movement orchestration, and orchestration requires a layer that can see the whole computation graph before it executes.
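The arithmetic is worth making concrete. A back-of-envelope roofline check, with the hardware numbers as stated assumptions (H200-class: ~4.8 TB/s HBM, ~989 dense FP16 TFLOP/s), shows why no kernel can rescue single-request decode. Counting only weight traffic gives roughly 1 FLOP/byte; the precise figure depends on batch size and KV traffic, but any reasonable accounting lands two orders of magnitude below the ridge point:

```python
# Back-of-envelope roofline for one decode step of a 70B-parameter model.
# Hardware figures are assumptions (H200-class part), not measurements.

PARAMS = 70e9                 # model parameters
BYTES_PER_PARAM = 2           # FP16 weights
HBM_BW = 4.8e12               # bytes/s, assumed HBM bandwidth
PEAK_FLOPS = 989e12           # FLOP/s, assumed dense FP16 peak

weight_bytes = PARAMS * BYTES_PER_PARAM   # every step streams all weights
step_flops = 2 * PARAMS                   # ~2 FLOPs per parameter per token

intensity = step_flops / weight_bytes     # FLOP per byte moved
ridge = PEAK_FLOPS / HBM_BW               # intensity needed to be compute-bound

# Below the ridge point, the floor on step latency is set by bandwidth alone.
min_step_s = weight_bytes / HBM_BW

print(f"intensity ~ {intensity:.1f} FLOP/byte, ridge ~ {ridge:.0f} FLOP/byte")
print(f"bandwidth-bound floor ~ {min_step_s * 1e3:.1f} ms per token")
```

Whatever the exact intensity, the bandwidth-bound floor of roughly 29 ms per token under these assumptions is set entirely by how many bytes move, which is why the rest of this essay is about who controls data movement.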
The core argument this essay makes: MLIR, Triton, and XLA are not compiler technologies competing for mindshare. They are three different answers to the question of where in the software stack you want to enforce memory layout, tiling strategy, and kernel fusion decisions. Each answer has structural consequences for what optimizations are possible and which workloads you can serve efficiently.
2. MLIR: multi-level IR as an architectural stance
MLIR stands for Multi-Level Intermediate Representation. The name describes the mechanism but not the insight. The insight is this: there is no single level of abstraction at which you can express all the information needed to generate optimal machine code for a tensor workload. You need to reason about tensor algebra, memory layout, hardware parallelism, and instruction selection — and these are not just different passes on the same representation. They are structurally different concerns that interact in complex ways.
MLIR's answer is to define a tower of dialects: structured representations at each level of abstraction, with explicit lowering passes between them. A typical lowering path looks like:
MLIR dialect lowering path:

```text
Linalg dialect   — tensor algebra: matmul, conv, reduction
      ↓ tiling + fusion
Vector dialect   — explicit SIMD: vector.contract, vector.transfer
      ↓ bufferization
MemRef dialect   — explicit memory: alloc, load, store, view
      ↓ SCF lowering
SCF dialect      — structured control flow: for, if, while
      ↓ LLVM lowering
LLVM IR          — target-independent machine representation
      ↓ backend codegen
PTX / AMDGPU / SPIR-V — device-specific instruction stream
```
At each level, the compiler knows different things. At the Linalg level, it knows that an operation is a matrix multiplication — it can apply tile-size selection, loop reordering, and producer-consumer fusion before layouts are committed. At the MemRef level, it knows the exact memory addresses being accessed — it can insert prefetch instructions, plan DMA transfers, or reason about cache line utilization. At the LLVM level, it knows the register file — it can schedule instructions to hide memory latency.
This matters enormously for inference. The key optimization decisions in autoregressive decode — which weight matrices to keep resident in L2, how to tile the KV cache access, how to fuse attention with the subsequent projection — cannot be made at any single level of abstraction. They require the compiler to carry information from the tensor algebra level (what operation is this, what is its reuse pattern) down to the memory level (how should these bytes be laid out and when should they be fetched).
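The cross-level dependence can be made concrete with a toy tile-size search. This is a sketch, not an actual MLIR pass, and the memory capacities below are illustrative assumptions; the point is that tiling is only well-posed once the compiler knows both what the operation is (tensor level) and which memory it is tiling for (hardware level):

```python
# Toy tile-size search for C[M,N] = A[M,K] @ B[K,N] with FP16 elements.
# Illustrative only: real tiling passes also weigh vector widths,
# bank conflicts, and software pipeline depth.

def pick_tile(mem_bytes, elem_bytes=2, candidates=(32, 64, 128, 256)):
    """Largest square tile T such that the working set
    A-tile (T x T) + B-tile (T x T) + C-tile (T x T) fits in mem_bytes."""
    best = None
    for t in candidates:
        working_set = 3 * t * t * elem_bytes
        if working_set <= mem_bytes:
            best = t
    return best

# The same matmul, tiled for two different levels of the hierarchy
# (capacities are assumed, roughly SM shared memory vs. an L2 slice):
smem_tile = pick_tile(mem_bytes=228 * 1024)
l2_tile = pick_tile(mem_bytes=50 * 1024**2)
```

Flatten the program to LLVM IR first and `pick_tile` has nothing to work with: the matmul structure that defines the working set is gone, which is the failure mode the next subsection describes.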
2.1 Why monolithic compilers cannot do this
The alternative to MLIR's multi-level approach is a monolithic compiler with a single fixed IR, which is what traditional compilers (LLVM itself, nvcc, older versions of XLA) use. Monolithic IRs pick one level of abstraction and flatten everything into it. The problem is that information gets lost in the flattening.
If you flatten a tensor workload into LLVM IR before you've made tiling decisions, you've lost the high-level structure needed to decide tile sizes. If you make tiling decisions before you know the target memory hierarchy (e.g., HBM vs. L2 vs. SRAM), you'll tile for the wrong level. MLIR lets you preserve structure until the moment you have enough information to make each decision correctly — which is the right engineering answer and the reason it has become the dominant IR infrastructure for ML compilers.
2.2 MLIR in practice: who uses it and what it enables
| Compiler/Framework | MLIR Dialects Used | Key Benefit |
|---|---|---|
| IREE | Linalg, Flow, HAL, VM | End-to-end deployment across CPU/GPU/accelerators from a single IR tower |
| MLIR-based XLA (new) | MHLO, Linalg, MemRef | Cleaner fusion decisions, better layout propagation than HLO-only XLA |
| Torch-MLIR | Torch dialect → Linalg | PyTorch front-end that preserves tensor semantics through lowering |
| Shark / Turbine | Linalg + custom dispatch | Dynamic shape handling with static-shape kernel specialization |
| Modular / Mojo | Custom dialects | Language-level memory layout control in dialect tower |
The convergence on MLIR across competing compiler stacks is not coincidence — it reflects that the multi-level abstraction tower is the correct structural answer to the problem of targeting heterogeneous AI hardware from high-level Python tensor operations.
3. Triton: the democratization layer and its real limits
Triton occupies a different position in the stack. It is not a general compiler infrastructure. It is a domain-specific language and compiler targeting GPU tensor programs at the block level. Its design premise is: CUDA programming is too low-level for most ML practitioners, but PyTorch is too high-level for anyone who needs control over memory layout and tiling. Triton targets the gap.
A Triton kernel expresses computation in terms of blocks — rectangular tiles of a tensor — and the programmer controls tiling strategy, shared memory usage, and pipeline depth through block-size parameters. The Triton compiler handles the mapping from block operations to PTX instructions, including shared memory allocation, thread-block scheduling, and warp-level synchronization.
Triton matmul kernel (simplified):

```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(A, B, C, M, N, K,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    # Tile offsets
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    # Accumulator
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    # Tiled K-loop: load from HBM in BLOCK_K tiles, accumulate
    for k in range(0, K, BLOCK_K):
        a = tl.load(A + offs_m[:, None] * K + (k + offs_k)[None, :])
        b = tl.load(B + (k + offs_k)[:, None] * N + offs_n[None, :])
        acc += tl.dot(a, b)
    tl.store(C + offs_m[:, None] * N + offs_n[None, :], acc)
```
A kernel this compact, with well-chosen block sizes, can approach cuBLAS performance on modern GPUs. The same operation in CUDA requires managing shared memory explicitly, writing warp-level synchronization barriers, handling bank conflicts, and tuning occupancy: typically 200-400 lines of code requiring deep hardware expertise.
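One detail the kernel body leaves implicit is the launch grid: each `program_id` pair owns one `(BLOCK_M, BLOCK_N)` output tile, so the grid must cover the full output matrix. A sketch of the grid computation (the launch call itself needs a GPU and is shown only as a comment):

```python
import math

# Each Triton program instance computes one BLOCK_M x BLOCK_N tile of C,
# so the 2-D grid must tile the entire M x N output.
def matmul_grid(M, N, BLOCK_M, BLOCK_N):
    return (math.ceil(M / BLOCK_M), math.ceil(N / BLOCK_N))

grid = matmul_grid(M=4096, N=4096, BLOCK_M=128, BLOCK_N=128)
# On a GPU, the launch would look like:
#   matmul_kernel[grid](A, B, C, M, N, K,
#                       BLOCK_M=128, BLOCK_N=128, BLOCK_K=64)
```

The ceiling division matters when dimensions are not multiples of the block size; a production kernel would also mask the out-of-bounds lanes in its loads and stores.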
3.1 What Triton actually democratizes
Triton does not replace CUDA. It replaces the need to write custom CUDA kernels for new operator patterns. This is a critical distinction for inference infrastructure. The long tail of ML operator patterns — custom attention variants, new activation functions, fused quantization operations, custom MoE routing — previously required CUDA expertise to implement efficiently. With Triton, a practitioner who understands memory access patterns and tiling strategy can write efficient GPU kernels without knowing PTX scheduling or warp-level synchronization.
The practical consequence is visible in vLLM, FlashAttention, and SGLang: all of them rely on Triton kernels for core attention operations. The tiled attention algorithm at the heart of FlashAttention, which keeps working tiles in SRAM, has a widely used Triton implementation alongside the original CUDA one. Without Triton, adoption of such algorithms would have been far slower, because porting them to every GPU target would have required device-specific CUDA expertise.
3.2 Where Triton hits limits
Triton's abstraction level creates two important constraints. First, it operates at the within-kernel level — it cannot reason about across-kernel data movement, multi-kernel fusion opportunities, or the interaction between different kernels sharing HBM. For inference workloads where the bottleneck is the cumulative HBM traffic across an entire forward pass, within-kernel optimization is necessary but not sufficient.
Second, Triton's block-level abstraction assumes the programmer can specify block sizes statically or through constexpr parameters. Dynamic shapes — the batch dimension changing per request, the context length varying across requests — require either kernel specialization (compiling multiple versions) or block-size choices that are conservative for all cases. Neither option is free: specialization increases compilation overhead, and conservative block sizes leave performance on the table.
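A common middle ground between full specialization and one conservative kernel is shape bucketing: round each dynamic dimension up to, say, the next power of two, which bounds the number of compiled variants at the cost of some padding waste. The bucketing rule below is an assumption for illustration, not something Triton does for you:

```python
# Shape bucketing: bound the number of kernel specializations for
# dynamic sequence lengths by rounding up to the next power of two.

def bucket(n: int) -> int:
    """Smallest power of two >= n."""
    return 1 << (n - 1).bit_length()

# Five distinct runtime lengths collapse to four compiled variants:
lengths = [37, 212, 1500, 3000, 4096]
buckets = sorted({bucket(n) for n in lengths})

# Worst-case padding overhead per request stays below 2x by construction.
waste = max(bucket(n) / n for n in lengths)
```

Power-of-two bucketing trades at most a factor of two in padded work for a logarithmic bound on the number of kernel variants; coarser or finer bucket ladders shift that trade-off.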
Triton is the right tool for writing efficient individual GPU kernels. It is not the right tool for scheduling data movement across the full inference forward pass. That requires a higher-level compiler with a whole-graph view — which is where MLIR and XLA operate.
4. XLA and HLO: the scheduling IR nobody reads carefully enough
XLA (Accelerated Linear Algebra) was Google's answer to the question: how do we make TensorFlow computations run efficiently on TPUs and GPUs at scale? Its core innovation was not any particular optimization pass. It was the decision to compile the entire computation graph ahead of time, rather than dispatching operations individually at runtime.
XLA's IR is HLO — High-Level Operations. HLO is a functional IR that represents tensor operations with explicit shape and layout information. Every operation in HLO has a statically known output shape. Every tensor in HLO has an explicit layout descriptor specifying the memory order of its dimensions (row-major, column-major, or arbitrary permutations).
The layout annotation is the underappreciated power of HLO. It means that XLA can reason about whether two operations can be fused without a layout conversion. Layout conversions are expensive — they touch every element of a tensor and can consume significant bandwidth. A compiler that cannot reason about layouts will generate unnecessary conversions. XLA's layout propagation pass eliminates them.
HLO layout representation (conceptual):

```text
// HLO tensors carry an explicit layout: the list gives dimensions from
// minor (fastest-varying) to major, so {1,0} is row-major for 2-D
// and {0,1} is column-major.
tensor<128x1024xf16>  layout={1,0}  -- row-major: batch × hidden
tensor<1024x4096xf16> layout={0,1}  -- col-major: hidden × ffn_dim

// Fusion is legal only when the output layout of the producer
// matches the input layout expected by the consumer.
// XLA's layout propagation assigns layouts to minimize conversions.
```
4.1 Fusion: the real value of whole-graph compilation
XLA's most important optimization is kernel fusion — merging multiple operations into a single kernel that reads input once, performs multiple computations, and writes output once. The performance impact is substantial. Consider the LayerNorm → Linear → GeLU → Linear pattern that appears in every transformer FFN block:
| Execution Model | HBM Reads | HBM Writes | Kernel Launches |
|---|---|---|---|
| Unfused (eager) | 4 full tensors | 4 full tensors | 4 |
| XLA fused | 2 full tensors (input + weights) | 1 full tensor (output) | 1 |

For this pattern, fusion cuts HBM traffic by roughly 55%.
The reason unfused execution is expensive is precisely the memory-bandwidth bound we established earlier. Each intermediate tensor must be written to HBM and read back. At an HBM bandwidth of 3.35 TB/s, that write-read round trip costs real time during which no useful compute happens. XLA's fusion pass eliminates most of these intermediate materializations.
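The saving can be sanity-checked by byte counting. The dimensions below (4096 tokens, hidden 8192, FFN 28672, FP16, roughly Llama-70B FFN shapes at prefill) are illustrative assumptions; the exact percentage moves with the activation-to-weight ratio:

```python
# Byte counting for LayerNorm -> Linear -> GeLU -> Linear, FP16.
# Dimensions are assumed for illustration (Llama-70B-like FFN at prefill).

T, H, F, ELEM = 4096, 8192, 28672, 2   # tokens, hidden, ffn_dim, bytes/elem

act_h = T * H * ELEM                    # hidden-sized activation tensor
act_f = T * F * ELEM                    # ffn-sized activation tensor
weights = 2 * H * F * ELEM              # the two projection matrices

# Unfused: every op round-trips its inputs and outputs through HBM.
unfused = ((act_h + act_h)        # LayerNorm: read x, write ln_out
           + (act_h + act_f)      # Linear1:  read ln_out, write hidden->ffn
           + (act_f + act_f)      # GeLU:     read, write
           + (act_f + act_h)      # Linear2:  read, write ffn->hidden
           + weights)             # weight reads

# Fused: read input and weights once, write the final output once.
fused = act_h + weights + act_h

saving = 1 - fused / unfused
print(f"HBM traffic saving ~ {saving:.0%}")
```

At these shapes the saving comes out near half; counting only the activation tensors, as the table does, it is higher, and at small decode batches, where weight traffic dominates, it shrinks. That dependence on batch shape is itself an argument for whole-graph compilation.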
4.2 What XLA cannot do: the dynamic shape problem
XLA's whole-graph compilation requires statically known shapes. This is fine for training (fixed batch sizes, fixed sequence lengths) and was fine for early inference (fixed-length prompts). It breaks for modern inference with variable-length contexts, dynamic batch sizes, and speculative decoding with variable acceptance rates.
The XLA team's response has been dynamic-shape support in the newer OpenXLA stack (the MLIR-based lineage integrated with JAX). But dynamic shapes add compilation overhead and can degrade performance relative to static-shape specialized kernels. This tension between static-shape performance and dynamic-shape flexibility is currently the central compiler engineering challenge for production inference systems.
5. Where compiler meets memory policy
This blog has argued extensively that memory policy — decisions about where tensors live in the memory hierarchy, when they move, and when they are evicted — is the core determinant of inference system performance. The compiler is the layer where that policy must be expressed.
The problem with current inference systems is that memory policy and compilation are two separate concerns managed by two separate systems that do not communicate. The compiler (XLA, Triton, TensorRT) makes kernel-level decisions about memory layout and fusion. The runtime (vLLM's PagedAttention, the KV cache manager, the batching scheduler) makes request-level decisions about memory allocation. Neither system knows what the other is doing.
The consequence is predictable: the compiler generates kernels that assume tensors live in HBM at fixed addresses. The runtime manages KV pages at the granularity of page tables. When a KV page gets evicted to CPU memory and later fetched back, the kernel doesn't know — it just sees a cache miss that the hardware handles speculatively. The hardware speculation is the wrong mechanism: it adds latency unpredictably, wastes HBM bandwidth on speculative prefetch, and has no visibility into the application-level importance of the data being moved.
The structural gap: Today's compilers generate kernels that are semantically unaware of memory tier. They emit load/store instructions that tell the hardware what to move. They should be emitting structured memory intent that tells the runtime why this data is needed, how important it is, and what the cost of missing it would be — so that the memory orchestration layer can make a policy decision rather than relying on hardware speculation.
6. Memory intent emission: what compilers must start doing
Memory intent emission is the mechanism by which a compiler annotates its output with structured metadata about the memory access patterns of the generated kernels. This is distinct from performance hints (which exist today in some form) because it expresses semantics, not just access frequency.
A minimal memory intent annotation for a KV cache access might look like:
Memory intent IR annotation (illustrative):

```text
// Compiler-emitted memory intent annotation
@kv_fetch {
  tensor_class:    KV_CACHE,
  request_id:      req_847,
  layer_range:     [0, 31],    // transformer layers this fetch serves
  token_range:     [0, 4096],  // context token range
  reuse_horizon:   1,          // expected reuse within N decode steps
  eviction_cost:   HIGH,       // cost if this is evicted mid-sequence
  tier_preference: HBM,        // preferred residency tier
  fallback_policy: RECOMPUTE   // action if tier_preference unavailable
}
load v128i8, ptr %kv_cache_ptr, align 128
```
This annotation is emitted alongside the generated PTX. The memory orchestration layer — whether that's a hardware fabric controller, a DPU-resident scheduler, or a software KV manager — can consume these annotations and make admission, residency, and eviction decisions that are semantically informed rather than heuristic.
The compiler is the right place to emit this because the compiler is the only layer with visibility into both the computation structure (which operations consume this tensor, in what order, with what parallelism) and the hardware constraints (bandwidth, capacity per tier, access latency). Neither the application layer nor the hardware has this combined view.
6.1 What this enables downstream
| System Component | Without Memory Intent | With Memory Intent |
|---|---|---|
| KV cache manager | Evicts by LRU / recency heuristic | Evicts by compiler-annotated cost, preserving high-eviction-cost pages |
| Prefetch engine | Hardware speculative prefetch based on access patterns | Explicit scheduled prefetch driven by compiler's reuse horizon annotations |
| Tier placement | Static: HBM for active model, DRAM for overflow | Dynamic: compiler-annotated tier preference drives runtime placement |
| Recompute policy | Never recompute; always reload from DRAM if evicted | Compiler identifies low-reuse tensors where recompute is cheaper than reload |
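A minimal sketch of the first row, assuming the annotation fields from the illustrative example in section 6. The field names mirror that example, but the scoring rule and everything else here are hypothetical placeholders, not an existing API:

```python
from dataclasses import dataclass

# Relative cost weights for the compiler-annotated eviction_cost field
# (values are arbitrary placeholders).
COST = {"LOW": 1, "MEDIUM": 4, "HIGH": 16}

@dataclass
class KVPage:
    page_id: int
    last_use: int          # decode step of last access (the LRU signal)
    eviction_cost: str     # compiler-annotated: LOW / MEDIUM / HIGH
    reuse_horizon: int     # expected reuse within N decode steps

def lru_victim(pages):
    # Plain recency heuristic: evict the least recently used page.
    return min(pages, key=lambda p: p.last_use)

def intent_victim(pages, now):
    # Evict the page that is stale, cheap to lose, and not needed soon.
    def score(p):
        staleness = now - p.last_use
        return staleness * p.reuse_horizon / COST[p.eviction_cost]
    return max(pages, key=score)

pages = [
    KVPage(0, last_use=70, eviction_cost="HIGH", reuse_horizon=1),   # mid-sequence
    KVPage(1, last_use=80, eviction_cost="LOW",  reuse_horizon=64),  # idle prefix
]
```

On this input, LRU evicts page 0, the mid-sequence page whose loss forces a stall or recompute; the intent-aware policy evicts the idle prefix instead, because the annotations tell it which loss is cheap.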
7. Hardware-compiler co-design is no longer optional
The reason this matters beyond compiler engineering is that the next generation of AI accelerators — Rubin, Blackwell Ultra, and their successors — are being designed with programmable memory hierarchies. Blackwell's NVLink Switch fabric can route data between GPUs in a rack without host CPU involvement. Future CXL-attached memory devices will expose programmable tiering APIs. TPU Pods already have optical circuit switching that can reshape the network topology for different computation phases.
None of these hardware capabilities can be exploited by a compiler that treats memory as a flat address space. They require a compiler that understands memory as a hierarchy of tiers with different bandwidth, latency, and cost profiles — and that emits programs that explicitly orchestrate movement through those tiers.
This is the co-design requirement: hardware teams are building increasingly programmable memory fabrics, but those fabrics are only useful if the compiler layer emits programs that can drive them. The hardware roadmap and the compiler roadmap are converging on the same technical requirement from two directions.
Google understood this earlier than anyone: the TPU's scratchpad SRAM is only useful because XLA's tiling algorithm explicitly manages SRAM residency. The TPU forgoes hardware-managed caching for tensor data because the compiler provides the caching logic. The rest of the industry is arriving at the same conclusion through the inference performance crisis.
8. The compiler is not a build step
The conventional model of an AI compiler is: training engineers write a model in PyTorch, a compiler converts it to optimized kernels, infrastructure teams deploy those kernels. The compiler is a one-time build step, not an operational component.
This model is wrong for modern inference and will become more wrong as systems scale. The correct model is: the compiler is a continuous operational component that produces programs with structured memory annotations, which are consumed by a runtime memory orchestration layer that makes placement and movement decisions at serving time.
MLIR provides the multi-level structure to preserve semantic information through the compilation process. Triton provides the mechanism to write efficient device kernels without per-device CUDA expertise. XLA provides the whole-graph view needed for fusion and layout optimization. What none of them yet provides adequately is the structured memory intent emission that bridges the compiler and the memory orchestration runtime.
That is the frontier. The teams that close it — whether through compiler extensions, new IR annotations, or hardware-compiler contracts — will determine the efficiency ceiling of the next generation of AI inference infrastructure.
The hardware is waiting. The memory is the bottleneck. The compiler is the only layer that can fix it.