The modern AI stack still behaves as if compute is the scarce resource and memory is a secondary implementation detail. That assumption is breaking. Performance is increasingly determined by whether the right object arrives in the right place at the right time — without avoidable copies, fabric congestion, spill storms, or eviction regret.

Fig. 01 Memory Intent IR — Three-Plane Architecture
[Diagram: the Software/Compiler plane (graph lowering, liveness analysis, reuse estimation, phase annotation) emits the Memory Intent IR — objects, phases, deadlines, tier set, reuse, spill class, anti-evict, recompute — which the Runtime/Hardware Orchestration plane (placement engine, prefetch + admission, eviction policy, line-rate decisions) ingests, with a feedback loop comparing observed vs. declared behavior.]
The Memory Intent IR sits between the compiler and the runtime/hardware controller — a structured behavioral contract for AI objects that neither plane can infer on its own. The feedback loop allows observed behavior to refine compiler estimates over time.

The missing artifact in the AI stack

Today, most AI compilers emit some combination of a lowered graph, fused kernels, local buffer allocation plans, communication schedules, and device placement hints. Those artifacts are necessary. They are no longer sufficient.

They tell the system what must run. They do not tell the system how the objects involved in that execution are expected to behave over time — which tensors are truly latency-critical, which objects have short but intense reuse windows, which buffers are safe to spill, which groups of objects are usually needed together, and which data should be pinned because eviction would create a disproportionate stall.

The runtime is left to infer this, after the fact, from access history alone. That is a structurally bad position. The compiler already had most of this information. It just never wrote it down.

2 — compiler outputs needed: a compute plan and a memory plan
6+ — memory tiers a controller may span: SRAM → NVMe
0 — good reasons to leave object semantics implicit
The gap: The compiler knows graph structure, tensor liveness, phase boundaries, recomputation cost, and communication windows. The controller knows nothing except access history. That asymmetry is the root cause of bad placement, premature eviction, and avoidable movement.

The core idea

A Memory Intent IR is a compiler-emitted intermediate representation that describes expected memory behavior — not just initial allocation. It is a behavioral contract for objects: how they live, how they expire, how hot they are, what deadlines matter, what tiers are acceptable, what spill policies are permitted, and what future reuse the system should preserve.

Object Semantics

What the object is

Weight tile, KV block, activation, optimizer state, gradient shard, expert weight family, checkpoint fragment, metadata page. Each type implies a different reuse pattern and eviction cost.

Lifecycle Semantics

How it lives and dies

Creation time, earliest use, latest use, persistence scope, reuse window, recompute cost, sharing rules, and retirement conditions — the temporal contract of the object.

Placement Semantics

Where it wants to live

Preferred tiers, allowed fallback tiers, prefetch horizon, replication eligibility, coalition hints (objects usually needed together), and anti-eviction rules.

Pressure Semantics

What to do under stress

What to protect under congestion, what can spill without consequence, what can be dropped and cheaply recomputed, and what is deadline-sensitive and must not stall.
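The four semantic families above can be made concrete as a typed descriptor. The following is a minimal sketch in Python; the field names and enum values are illustrative, loosely echoing the IR examples later in the piece, not a normative schema.

```python
from dataclasses import dataclass
from enum import Enum, auto

class ObjClass(Enum):            # object semantics: what the object is
    WEIGHT_TILE = auto()
    KV_BLOCK = auto()
    ACTIVATION_TEMP = auto()

class SpillPolicy(Enum):         # pressure semantics: what may move under stress
    ALLOW = auto()
    SEVERE_PRESSURE_ONLY = auto()
    NEVER = auto()

@dataclass
class IntentDescriptor:
    """One object's behavioral contract, grouped by the four semantic families."""
    # object semantics
    obj_class: ObjClass
    size_bytes: int
    # lifecycle semantics
    persistence_scope: str          # e.g. "SESSION", "STEP"
    recompute_cheap: bool
    # placement semantics
    preferred_tiers: list
    fallback_tiers: list
    # pressure semantics
    spill_policy: SpillPolicy
    deadline_hard: bool

# A 128 MB weight tile: session-lived, HBM-preferred, spill only under severe pressure.
w = IntentDescriptor(ObjClass.WEIGHT_TILE, 128 << 20, "SESSION", False,
                     ["SRAM", "HBM"], ["DRAM", "CXL"],
                     SpillPolicy.SEVERE_PRESSURE_ONLY, True)
```

The point of the grouping is that a controller can consume each family independently: placement semantics at load time, pressure semantics only when congestion hits.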

§

How this differs from existing approaches

The easiest way to misread Memory Intent IR is to treat it as a prettier buffer allocation table. That is not what it is. Existing approaches each capture a slice of the problem. None assemble it into a first-class transferable artifact.

| Existing approach | What it captures | What it misses |
|---|---|---|
| Buffer allocation (XLA, TVM) | Size and initial placement per tensor | Reuse patterns, phase transitions, tier preferences, spill policy |
| OS page policies (LRU, LFU) | Recency/frequency of access | Object semantics, deadlines, recompute cost, shared-prefix value |
| vLLM PagedAttention | KV block address management, fragmentation reduction | Phase-aware urgency, anti-eviction rules, prefetch horizon, hardware signaling |
| DeepSpeed ZeRO offload hints | Coarse CPU/GPU offload strategy | Per-object typed semantics, line-rate hardware enforcement, multi-tier placement |
| Compiler liveness analysis | Buffer liveness windows for register/memory allocation | Cross-tier residency, reuse estimation, pressure classes, multi-phase behavior |
The point is not to record where memory is allocated. The point is to expose the expected behavior of objects across phases so the rest of the system can make better decisions — at hardware line rate, without repeated software intervention.

Why this matters now

1. Models have outgrown single-device memory

Once a model spills across HBM, host DRAM, CXL pools, and peer accelerators, the number of placement decisions that can go wrong grows rapidly. A recency policy that worked at single-GPU scale becomes a liability at rack scale.

2. KV cache has become a first-order memory management problem

Long contexts, prefix sharing, speculative decoding, and retrieval-augmented generation all push KV cache management into a domain that generic page replacement cannot handle correctly. A prefix-shared KV block may look cold to LRU yet be one of the highest-value objects in the system because dozens of decode continuations are about to depend on it.
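The LRU failure mode described here is easy to see in miniature. A minimal sketch (field names invented for illustration): pure recency evicts the shared prefix block, while a fanout-aware policy protects it and evicts the recomputable temporary instead.

```python
def lru_victim(objects):
    """Pure recency: evict the least recently touched object."""
    return min(objects, key=lambda o: o["last_access"])

def intent_victim(objects):
    """Recency-based, but objects with live fanout (e.g. a shared KV prefix
    that pending decode continuations depend on) are protected."""
    evictable = [o for o in objects if o.get("fanout", 0) <= 1]
    return min(evictable, key=lambda o: o["last_access"]) if evictable else None

objs = [
    {"name": "kv_prefix", "last_access": 10, "fanout": 12},  # cold by recency; 12 seqs depend on it
    {"name": "decode_tmp", "last_access": 95, "fanout": 0},  # hot by recency; trivially recomputable
]
# LRU picks kv_prefix (oldest access); the intent-aware policy picks decode_tmp.
```

This is the entire argument in two functions: the information that distinguishes the right victim from the wrong one is fanout, and fanout is not visible in an access-history-only world.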

3. Expert routing creates dynamic hotness

In MoE models, which expert weights are "hot" changes across decode steps based on routing decisions. The compiler can observe this structure statically and emit probability-weighted prefetch hints. The hardware has no access to this information without an intent channel.

4. Hardware controllers are getting smarter — but need structured input

If hardware-resident or near-hardware orchestration exists (as in MCOS-HFC), it needs structured input. Otherwise it becomes a faster blind heuristic engine — powerful machinery with no map.

"A reactive runtime alone is always late. The compiler saw the future first." — Memory Intent IR design thesis

What a Memory Intent IR could look like

Conceptually the IR is organized around three levels: object descriptors, phase descriptors, and policy overrides. Object descriptors define type, size, lifecycle, and base placement preferences. Phase descriptors capture execution stages that reshape urgency and locality. Policy overrides let phase-specific behavior modify base rules without redundancy.

// ── Object descriptors ──────────────────────────────────────

object layer_18_weight_group_A {
  class              = WEIGHT_TILE
  size_bytes         = 134_217_728          // 128 MB
  persistence_scope  = SESSION
  reuse_window       = HIGH                 // reused every layer pass
  preferred_tiers    = [SRAM, HBM]
  fallback_tiers     = [DRAM, CXL]
  spill_policy       = SEVERE_PRESSURE_ONLY
  multicast_eligible = true                // can fan-out to multiple accelerators
  deadline_class     = HARD
}

object kv_prefix_seg_P {
  class              = KV_BLOCK
  sharing_scope      = SEQUENCE_GROUP(17, 18, 21)  // shared by 3 seqs
  persistence_scope  = TOKEN_PROGRESS
  anti_evict_rule    = WHILE_FANOUT_GT_1   // protect while shared
  preferred_tiers    = [HBM, DRAM]
  deadline_class     = HARD
}

object decode_buf_D {
  class              = ACTIVATION_TEMP
  persistence_scope  = STEP               // expires after one step
  recompute_class    = CHEAP              // safe to recompute
  spill_policy       = ALLOW
  deadline_class     = SOFT
}

// ── Phase descriptors with overrides ────────────────────────

phase decode_step {
  override kv_prefix_seg_P {
    anti_evict_rule      = WHILE_FANOUT_GT_1
    prefetch_horizon_us  = 12
    preferred_tiers      = [HBM, DRAM]
  }
  override expert_weight_group_E3 {
    route_probability    = HIGH           // router hints this is likely
    replication_allowed  = true
    prefetch_horizon_us  = 20
  }
}

phase optimizer_update {
  override gradient_shard_G {
    persistence_scope    = UPDATE_WINDOW
    preferred_tiers      = [DRAM, CXL]   // keep off HBM during fwd pass
    anti_evict_rule      = UNTIL_ALLREDUCE_COMPLETE
  }
}
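One plausible way a controller could resolve the phase-override mechanism shown above (the function and field names here are a sketch, not part of the IR itself): keep the base object descriptor as a record, and layer the active phase's overrides on top at phase boundaries.

```python
def effective_policy(base, phase_overrides, phase):
    """Resolve an object's policy for the active phase: the base
    descriptor's fields with that phase's overrides layered on top."""
    merged = dict(base)
    merged.update(phase_overrides.get(phase, {}))
    return merged

# Base descriptor for a shared KV prefix segment (illustrative values).
base = {"preferred_tiers": ["HBM", "DRAM"],
        "anti_evict_rule": "WHILE_FANOUT_GT_1",
        "prefetch_horizon_us": None}

# Per-phase overrides, as in the decode_step descriptor above.
overrides = {"decode_step": {"prefetch_horizon_us": 12}}

decode = effective_policy(base, overrides, "decode_step")   # horizon = 12
prefill = effective_policy(base, overrides, "prefill")      # horizon = None
```

The merge happens once per phase transition, not per access — which is what keeps the per-access path at line rate.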
Fig. 02 IR Structure — Objects × Phases × Policies
[Diagram, three columns: OBJECTS (weight tiles: HIGH reuse, large, multicastable; KV blocks: sequence affinity, prefix sharing, growth; activations: short-lived, recompute-cheap; optimizer state: long-lived, update-phase only; expert shards: dynamic hotness, route-probability) × PHASES (prefill, decode step, expert dispatch, AllReduce window, optimizer update, checkpoint/retrieval) × POLICIES (preferred_tiers + fallbacks, prefetch_horizon_us, spill_policy + recompute_class, anti_evict_rule, deadline_class HARD/SOFT, coalition_hints, security_domain). The controller ingests all three and makes placement, prefetch, and eviction decisions at line rate.]
The IR is organized as object descriptors (what the object is and how it lives) × phase descriptors (when urgency changes) × policy descriptors (how to handle placement and pressure). The controller ingests all three at load time and operates at line rate thereafter.

What new compiler passes this requires

Memory Intent IR is not just a runtime improvement. It implies new compiler passes that elevate existing analyses into a formal, transferable artifact. The compiler already performs many of the underlying analyses — it just does not emit the results in a consumable form.

Fig. 03 Compiler Pipeline — From Graph to Memory Intent IR
[Pipeline diagram: model graph (ops, tensors, dependencies) → ANALYSIS (liveness, reuse estimation, phase boundaries, recompute cost, coalition detection, expert routing) → ANNOTATION (classify types, assign deadlines, tier preferences, spill classes, coalition groups, phase overrides) → EMIT the Memory Intent IR as a structured artifact alongside the existing kernel binary, buffer plan, and comms schedule (unchanged). The hardware controller ingests it at load time, the software runtime consumes it as policy guidance, and an observed-vs-intent feedback loop closes the circuit.]
The compiler gains two new pass groups: analysis (elevating existing liveness, reuse, and phase analyses) and annotation (classifying objects, assigning policies, detecting coalitions). The IR is emitted alongside the existing binary — it does not replace it. A feedback loop allows observed mismatch to refine future compilations.
New Pass: Tier Liveness

Cross-tier lifetime analysis

Extends traditional register-allocation liveness to span memory tiers. Instead of "this buffer is live from op 4 to op 12," the analysis determines "this object should remain in HBM from layer 4 through layer 18, then is eligible for demotion."
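A minimal sketch of the core computation (layer indices and object names invented for illustration): collapse per-layer access lists into a first-use/last-use residency window per object; past its last use, the object becomes demotion-eligible.

```python
def tier_liveness(per_layer_accesses):
    """Derive a (first_use, last_use) layer window per object from
    per-layer access lists; past last_use the object may be demoted."""
    windows = {}
    for layer, objs in enumerate(per_layer_accesses):
        for o in objs:
            first, _ = windows.get(o, (layer, layer))
            windows[o] = (first, layer)
    return windows

# w4 touched at layers 0, 2, 4; kv touched at layers 3 and 4.
accesses = [["w4"], [], ["w4"], ["kv"], ["kv", "w4"]]
windows = tier_liveness(accesses)   # w4 → (0, 4), kv → (3, 4)
```

A real pass would run over the lowered graph rather than a trace, but the output shape — a residency window per object per tier — is the same.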

New Pass: Coalition Detection

Co-access group identification

Identifies sets of objects that are frequently accessed together and should be co-placed. Weight tiles for a given layer family, KV blocks for sequences in the same prefix group, expert weights likely to be co-routed.
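One simple realization of co-access detection, sketched here over a static schedule of per-step access sets (the object names are placeholders): count how often each pair appears in the same step and keep pairs above a threshold as coalition candidates.

```python
from collections import Counter
from itertools import combinations

def detect_coalitions(steps, min_count=2):
    """Pairs of objects accessed in the same step at least `min_count`
    times are coalition candidates for co-placement."""
    pairs = Counter()
    for step in steps:
        for a, b in combinations(sorted(set(step)), 2):
            pairs[(a, b)] += 1
    return {pair for pair, count in pairs.items() if count >= min_count}

# wA and wB co-occur in all three steps; tmp appears with them only once.
steps = [["wA", "wB", "tmp"], ["wA", "wB"], ["wB", "wA"]]
coalitions = detect_coalitions(steps)   # {("wA", "wB")}
```

In a compiler this would be driven by graph structure (same layer family, same prefix group) rather than a trace, which is cheaper and available before the first run.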

New Pass: Recompute Classifier

Spill vs. recompute cost

For each object, estimates the FLOPs cost to recompute vs. the bandwidth cost to spill and refetch. Emits a recompute_class that the path selector can use to choose between movement and arithmetic regeneration.
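The classification itself is a two-term comparison. A sketch, with deliberately round placeholder throughputs rather than measured numbers: an object is CHEAP to recompute when regenerating it is faster than spilling it out and fetching it back.

```python
def recompute_class(recompute_flops, size_bytes,
                    flops_per_s=1e14, refetch_bytes_per_s=5e9):
    """CHEAP if arithmetic regeneration beats spill + refetch.
    Throughput figures are illustrative placeholders, not measurements."""
    recompute_s = recompute_flops / flops_per_s
    movement_s = 2 * size_bytes / refetch_bytes_per_s   # spill out + fetch back
    return "CHEAP" if recompute_s < movement_s else "EXPENSIVE"

# A 64 MB activation costing ~1 GFLOP to rebuild: recompute wins easily.
act = recompute_class(1e9, 64e6)        # "CHEAP"
# A small but arithmetically expensive object: movement wins.
hvy = recompute_class(1e13, 1e6)        # "EXPENSIVE"
```

The emitted `recompute_class` is exactly this verdict, precomputed per object so the runtime path selector never has to do the division itself.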

New Pass: Phase Annotator

Phase-sensitive urgency

Marks execution boundaries where object urgency changes — prefill → decode transition, collective windows, expert dispatch gates — and emits per-phase policy overrides that let the controller adapt without re-querying software.

In other words: the compiler stops thinking only like a kernel scheduler and starts thinking like a memory strategist — one that describes not just what to compute but what the data will need to do while that computation runs.

§

A concrete inference example

In LLM serving, the compiler and serving stack often know several useful things before runtime pressure appears. Weight groups for active layers will be reused. Prefix-shared KV segments have disproportionate value. Some activation buffers are short-lived and cheap to recompute. Expert routing may make a subset of weights hotter over the next few decode steps.

That means the system does not need clairvoyance. It needs a disciplined way to express what is already structurally known.

Fig. 04 Autoregressive Inference — Intent-Guided Object Behavior
[Timeline diagram: prefill → decode steps N+1 … N+K, one lane per object class. Weights: PROTECTED in HBM throughout, HIGH reuse, multicast to peers. KV prefix: ANTI-EVICT while fanout > 1, eviction-eligible once fanout reaches 0. Activations: step-scoped, CHEAP recompute, spill ALLOWED. Experts: cold in NVMe until route_prob goes HIGH, then IR-guided prefetch to HBM.]
Each object class follows a different timeline: weight tiles stay protected throughout; KV prefix blocks are guarded until sequence fanout drops to zero; activations are step-scoped and spill-eligible; expert weights are cold until the router signals high route probability, at which point the IR-guided prefetch pulls them to HBM ahead of demand.
Memory Intent IR is not about predicting everything. It is about refusing to throw away the structural knowledge the compiler already has — and surfacing it in a form that controllers can act on, at hardware speed, without repeated software intervention.

The example policy split is straightforward: protect shared prefix KV (hard anti-evict); keep high-reuse weights in HBM (hard deadline); allow cheap decode temporaries to spill or recompute (soft deadline, cheap recompute); and pre-stage likely-hot expert weights a few steps ahead (route-probability override in the decode phase descriptor).
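That policy split maps naturally onto an eviction ordering. A minimal sketch (the dict keys are invented stand-ins for the IR fields): under pressure, the controller sorts candidates so that anti-evict and hard-deadline objects are last out the door.

```python
def eviction_priority(obj):
    """Lower value = evicted first. Encodes the example's policy split."""
    if obj.get("anti_evict_active"):       # shared prefix KV: hard anti-evict
        return 3
    if obj.get("deadline") == "HARD":      # high-reuse weights: keep in HBM
        return 2
    if obj.get("recompute") == "CHEAP":    # decode temporaries: spill freely
        return 0
    return 1                               # everything else

pool = [
    {"name": "weights",    "deadline": "HARD"},
    {"name": "kv_prefix",  "deadline": "HARD", "anti_evict_active": True},
    {"name": "decode_tmp", "deadline": "SOFT", "recompute": "CHEAP"},
]
victims = sorted(pool, key=eviction_priority)
# decode_tmp goes first; kv_prefix goes last, if ever.
```

A hardware implementation would bake these classes into the residency map as a few bits per object, so the ordering is a comparator, not a policy engine.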

§

Why hardware people should care

Without a structured intent channel, hardware memory orchestration is forced into two unsatisfying modes: fixed heuristics that only work for some workloads, or software-driven micromanagement that drags the control path back into microsecond territory and defeats the purpose of the hardware accelerator.

Memory Intent IR offers a cleaner architecture. The controller remains generic and fabric-agnostic. But it is no longer blind. It receives normalized object semantics at model load time and combines them with live telemetry — queue depth, bandwidth pressure, residency fill, hit history — to make placement and eviction decisions that are simultaneously fast and informed.

The hardware contract: The IR is emitted once per model load. The controller ingests it, builds its residency map, and then operates at line rate for the rest of the run. No repeated software calls. No control-path round-trips. The intent is already there.

What this means for the AI engineer

This is not only for compiler researchers or hardware architects. If it becomes real in production systems, the downstream effect for AI engineers is practical and immediate.

Reliability

More predictable latency tails

Better retention and prefetch policy means fewer surprise refetches, fewer stall cascades, and more stable p95/p99 behavior under load — especially under shared multi-tenant pressure.

Stability

Fewer policy-shaped OOM events

Some "out of memory" failures are really bad retention and spill decisions. Intent-aware policy can eliminate that class of failure — the controller knows what must stay and what can go.

Ergonomics

Less manual memory tuning

Engineers spend too much time working around opaque memory behavior with one-off knobs, cache flushes, and fragile heuristics. Intent makes the system legible and self-documenting.

Scale

Cleaner scaling behavior

As workloads grow larger, more multi-tenant, or more disaggregated, intent-aware orchestration degrades more gracefully than blind paging — the controller knows which objects matter even under pressure.

Objections worth taking seriously

"Runtime observation is enough."

Runtime observation matters and will always be part of the picture — live telemetry catches what the compiler couldn't predict. But observation is inherently lagging. By the time the system sees a pattern, the damage is already done: the wrong object was evicted, the prefetch arrived too late, the recomputation wasn't triggered in time. The best systems combine declared compiler intent with runtime correction — neither alone is sufficient.

"This will be too brittle — compiler predictions won't match runtime reality."

Only if treated as a rigid contract. In practice, it should be a weighted guidance layer with confidence levels, escape hatches, and runtime override capability. The IR says "I think this object will be needed with HIGH probability in the next 20 microseconds." The controller treats that as a strong prior, not a hard constraint. When the prior is wrong, the controller corrects from telemetry. The gain comes from getting it right the majority of the time.
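The "strong prior, not hard constraint" stance can be expressed as a one-line blend. A sketch, with a made-up function name and an arbitrary confidence weight: the declared probability dominates when the compiler is confident, and observed telemetry dominates as confidence drops.

```python
def placement_score(declared_prob, observed_hit_rate, confidence=0.7):
    """Blend the compiler's declared reuse probability (a prior, weighted
    by its confidence) with the controller's observed hit rate."""
    return confidence * declared_prob + (1 - confidence) * observed_hit_rate

# Compiler says 0.9 reuse probability; telemetry so far says 0.1.
score = placement_score(0.9, 0.1)   # prior-dominated, but not prior-only
```

Real systems would likely use something richer than a fixed linear blend (for instance, decaying the prior's weight as evidence accumulates), but the structure — declared intent as a weighted prior, corrected by observation — is the point.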

"The compiler cannot know dynamic behavior."

It doesn't need omniscience. Partial structured knowledge — knowing that weight tiles for transformer layers have HIGH reuse, that shared prefix KV blocks should never be evicted while fanout is greater than one, that transient activations are CHEAP to recompute — is vastly better than treating all objects as anonymous pages. Partial knowledge with confidence levels beats pretending all bytes are equivalent.

"This sounds like metadata bloat."

A compact intent descriptor is approximately 141 bits per object (as in the MCOS-HFC residency map). For 64,000 objects, that is 1.1 MB of SRAM. Compare that with the cost of repeated misplacement: a single NVMe refetch that could have been avoided costs ~100 microseconds. A single avoidable copy across a congested RDMA link wastes bandwidth that could have been serving other requests. The metadata is not the cost. The misplacement is.
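The arithmetic behind that claim is worth making explicit. Using the figures from the text (141 bits per object, 64,000 objects):

```python
def residency_map_bytes(num_objects, bits_per_object=141):
    """Total footprint of the intent residency map in bytes."""
    return num_objects * bits_per_object / 8

total = residency_map_bytes(64_000)   # 1,128,000 bytes ≈ 1.1 MB of SRAM
```

Against a ~100-microsecond NVMe refetch, a megabyte of on-chip metadata that prevents even a handful of such refetches per second pays for itself immediately.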

What this unlocks next

Once a Memory Intent IR exists as a real artifact, several practical directions open up immediately and naturally.

Portability

Portable memory policies

Different vendors and runtimes can consume the same intent vocabulary even if their controllers and memory fabrics differ underneath. One IR format, multiple controllers.

Hardware Integration

Model-load-time hardware ingest

Controllers like MCOS-HFC can ingest intent once at load time and operate at line rate for the entire run, instead of depending on repeated software intervention per token step.

Observability

Intent vs. observed behavior analysis

Production systems can compare predicted reuse, deadlines, and pressure classes against observed reality and automatically refine compiler passes over time — a closed feedback loop.

Economics

Explicit compute-vs-movement tradeoffs

Compilers can reason more cleanly about whether to recompute, retain, replicate, or spill based on declared object value — not crude fallback policy triggered when something has already gone wrong.

Compute graphs are incomplete

A compute graph tells you what must be computed. A Memory Intent IR tells you how the objects involved in that computation are expected to behave under pressure — which ones are fragile, which are resilient, which are time-sensitive, which are cheap to reproduce, and which ones, if evicted at the wrong moment, will cascade into a stall that propagates to everything downstream.

As long as AI systems emit only the first artifact, they leave one of the central determinants of performance underspecified. And underspecified systems get patched with heuristics — fragile, opaque heuristics that work until they don't.

"The compiler should emit a second artifact: one that tells the system what data matters, how long it matters, how urgently it matters, where it would rather live, and what tradeoffs are acceptable when contention arrives. Once you see that gap, it becomes hard to unsee it." — Memory Intent IR, closing argument

The gap is real. The compiler already has the information. The question is only whether the ecosystem decides that writing it down is worth the effort. Given what bad memory policy costs at scale, the answer seems clear.

© 2026 Manish KL