Detailed Technical Write-up

Not static quantization. Not simple weight offload. A runtime controller for multi-tier weight residency.

Predictive Multi-Tier Weight Residency and Precision Orchestration reframes inference under HBM pressure as a runtime control problem. Instead of treating weights as static assets, it treats them as managed operating state whose precision, placement, and transfer priority must be decided continuously.

Runtime orchestration map
  • Signals · Layer cadence: execution timing reveals what will be needed soon.
  • Signals · MoE router history: expert hotness becomes a live placement and prefetch input.
  • Control · Scoring + policy engine: computes a target state per block using demand, cost, and quality sensitivity.
  • Execution · Transfer scheduler: allocates bandwidth and aligns movement with the prefetch window.
  • Guardrail · Accuracy veto: rejects unsafe demotions before they land.

Fast tier: HBM · Expansion tier: CXL / DRAM · Key differentiator: joint control · Sparse-model hook: MoE aware
The underlying problem

HBM scarcity has made inference a placement problem as much as a compute problem

At modern model sizes, throughput is not decided only by arithmetic intensity. It is increasingly decided by whether the right weight blocks are in the right place, in the right precision, at the right time. Once HBM becomes the limiting resource, serving starts to resemble an operating-system problem for parameter state.

That is why the core thesis here is useful. Existing approaches often optimize one dimension in isolation. Static quantization focuses on footprint. Offload focuses on movement. Model parallelism focuses on topology. This invention tries to unify placement, precision, and movement under a single runtime controller.

The most interesting claim is not that memory tiers exist. It is that the controller should reason over them as one managed state space while enforcing quality boundaries.

Static quantization

Useful for reducing footprint, but fundamentally offline. It does not dynamically reassign precision based on active context, reuse horizon, routing likelihood, or live transfer pressure.

Weight offload

Moves bytes across tiers, but often without a joint policy for precision, decompression cost, safety margins, and output-quality guardrails.

Model parallelism

Spreads capacity and compute across devices, but does not itself answer the dynamic runtime question of which block deserves the fastest tier right now.

Core invention

Each managed weight block gets both a precision state and a residency state

The architecture operates on first-class managed units such as weight shards, layer blocks, or MoE experts. For each unit, the controller computes a target state using predicted reuse, next-use timing, routing likelihood, decompression cost, transfer cost, bandwidth pressure, and quality sensitivity. The result is a system that does not merely quantize and does not merely offload; it coordinates both under one policy loop.

Control input | What it represents | Why it changes placement
Predicted reuse horizon | How soon a block is likely needed again | Hot blocks deserve premium residency or in-flight promotion.
Next-use timing | Distance to the next compute touch | Defines whether there is enough slack for staging or prefetch.
MoE router history | Observed expert popularity and routing trends | Lets sparse models drive speculative promotion intelligently.
Transfer and decompression cost | Recovery overhead from lower tiers or compressed states | Changes whether a block should remain warm versus cold.
Bandwidth pressure | Competing demand on the movement path | Prevents low-value traffic from blocking imminent promotions.
Quality sensitivity | Risk of accuracy degradation under demotion or reduced precision | Acts as a veto boundary against unsafe efficiency optimizations.
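As a rough sketch of how the policy engine might fold these inputs into a single placement priority (the field names, weighting, and formula below are illustrative assumptions, not taken from the source):

```python
from dataclasses import dataclass

@dataclass
class BlockSignals:
    """Illustrative control inputs for one managed weight block."""
    reuse_horizon_s: float      # predicted time until next use (lower = hotter)
    routing_likelihood: float   # 0..1, e.g. derived from MoE router history
    recovery_cost_s: float      # transfer + decompression cost from current tier
    quality_sensitivity: float  # 0..1 risk of degradation under demotion

def placement_score(sig: BlockSignals) -> float:
    """Higher score = stronger claim on the fast tier.

    Hypothetical weighting: imminent, likely-routed blocks that are
    expensive to recover score highest. The quality term only biases
    placement here; hard safety limits belong to the accuracy veto.
    """
    demand = sig.routing_likelihood / (1.0 + sig.reuse_horizon_s)
    return demand * (1.0 + sig.recovery_cost_s) * (1.0 + sig.quality_sensitivity)

hot = BlockSignals(reuse_horizon_s=0.01, routing_likelihood=0.9,
                   recovery_cost_s=0.5, quality_sensitivity=0.8)
cold = BlockSignals(reuse_horizon_s=5.0, routing_likelihood=0.1,
                    recovery_cost_s=0.1, quality_sensitivity=0.2)
assert placement_score(hot) > placement_score(cold)
```

A real controller would map this score to a target state through thresholds with hysteresis; the point of the sketch is only that every input in the table above appears as an explicit term.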
Hook 1

Closed-loop cross-layer optimization

The controller listens to model execution itself. Layer timing, concurrency, and routing behavior are not postmortem stats; they are live control inputs.

Hook 2

Accuracy-veto supervisory guardrail

Efficiency wins are allowed only when predicted quality impact remains inside configured bounds. Unsafe demotions are blocked before they happen.

Hook 3

Tier-transparent residency model

HBM, DRAM, CXL pools, peer memory, and storage are treated as one continuum rather than as ad hoc special cases.

Seven-state model

A concrete operating vocabulary across precision and residency

A strength of the concept is that it does not stay abstract. It defines an explicit state machine, which is exactly what a runtime controller needs if it is going to issue transitions in a disciplined, inspectable way.

  • S1 · Full-precision HBM: premium placement for highly sensitive or imminently needed blocks.
  • S2 · Reduced-precision HBM: fast-tier placement with lower footprint when quality sensitivity allows compression.
  • S3 · Compressed HBM: near-compute holding state that shrinks footprint without full eviction.
  • S4 · Host DRAM staged: warm spill state for reasonably fast recovery when direct HBM is too expensive.
  • S5 · Storage / cold staged: low-priority state for blocks with weak near-term demand and acceptable retrieval overhead.
  • S6 · In-flight prefetch: transient promotion state aligned to an upcoming execution window.
  • S7 · Recomputable fallback: optional embodiment where hot storage is replaced by regeneration.

Why this matters: inspectable control. A state machine allows explicit thresholds, hysteresis, and auditability.
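The state vocabulary can be made literal in code. The enum below mirrors the seven states; the transition table is a hypothetical subset (a demotion chain plus promotion via prefetch) chosen for illustration, since the source does not enumerate which transitions are legal:

```python
from enum import Enum

class WeightState(Enum):
    """The seven-state operating vocabulary from the write-up."""
    S1_FULL_HBM = "full-precision HBM"
    S2_REDUCED_HBM = "reduced-precision HBM"
    S3_COMPRESSED_HBM = "compressed HBM"
    S4_DRAM_STAGED = "host DRAM staged"
    S5_COLD_STAGED = "storage / cold staged"
    S6_INFLIGHT_PREFETCH = "in-flight prefetch"
    S7_RECOMPUTABLE = "recomputable fallback"

S = WeightState  # shorthand

# Hypothetical transition table: the controller only issues moves listed
# here, which is what makes transitions disciplined and auditable.
ALLOWED = {
    (S.S1_FULL_HBM, S.S2_REDUCED_HBM),       # demote precision in place
    (S.S2_REDUCED_HBM, S.S3_COMPRESSED_HBM), # compress near compute
    (S.S3_COMPRESSED_HBM, S.S4_DRAM_STAGED), # spill warm
    (S.S4_DRAM_STAGED, S.S5_COLD_STAGED),    # spill cold
    (S.S4_DRAM_STAGED, S.S6_INFLIGHT_PREFETCH),  # stage for promotion
    (S.S5_COLD_STAGED, S.S6_INFLIGHT_PREFETCH),
    (S.S6_INFLIGHT_PREFETCH, S.S1_FULL_HBM), # land in the fast tier
    (S.S5_COLD_STAGED, S.S7_RECOMPUTABLE),   # optional regeneration path
}

def can_transition(src: WeightState, dst: WeightState) -> bool:
    return (src, dst) in ALLOWED
```

Notably, cold blocks cannot jump straight to S1 in this sketch: promotion always passes through S6, which is where the scheduler aligns the move with a prefetch window.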
Closed-loop control

The loop is predictive, asynchronous, and cross-layer

The control loop is conceptually clean: observe, score, veto, schedule, execute. That sequencing matters because it avoids collapsing telemetry, policy, and movement into one opaque heuristic blob.

  • Observe: collect layer timing, router statistics, concurrency, and live memory pressure.
  • Score: compute a target state from demand, cost, and sensitivity signals.
  • Veto: apply precision floors and risk checks.
  • Schedule: reserve bandwidth and align promotions with the prefetch window.
  • Execute: overlap transfers and decompression with compute.
Observe → Score → Veto → Schedule → Execute

Telemetry examples: layer execution timing, request concurrency, KV / HBM pressure, MoE router history.

Policy ingredients: reuse horizon, next-use timing, routing likelihood, transfer cost, decompression cost, quality sensitivity, hysteresis thresholds.
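One tick of that loop can be sketched as a pipeline over five pluggable stages. All stage interfaces and the toy data here are assumptions for illustration; the point is the sequencing, which keeps telemetry, policy, and movement as separate concerns:

```python
def control_tick(observe, score, permits, schedule, execute):
    """One observe -> score -> veto -> schedule -> execute iteration."""
    signals = observe()                            # observe: live telemetry
    proposals = score(signals)                     # score: (block, target state)
    safe = [p for p in proposals if permits(p)]    # veto: drop unsafe transitions
    plan = schedule(safe)                          # schedule: reserve bandwidth
    execute(plan)                                  # execute: overlap with compute
    return plan

# Toy usage with stub stages (names like "expert_7" are illustrative).
plan = control_tick(
    observe=lambda: [("expert_7", 0.9), ("expert_2", 0.1)],
    score=lambda sigs: [(n, "S1" if p > 0.5 else "S4") for n, p in sigs],
    permits=lambda prop: prop[1] != "S4",   # pretend the veto rejects this demotion
    schedule=lambda safe: safe,
    execute=lambda plan: None,
)
assert plan == [("expert_7", "S1")]
```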
Guardrails

An explicit accuracy veto is what makes the design more than an aggressive memory optimizer

Some blocks are disproportionately sensitive to precision loss or aggressive eviction. The architecture therefore keeps a supervisory guardrail path that maintains per-block quality sensitivity and minimum precision floors. When the policy engine proposes a transition that would violate those boundaries, the guardrail rejects it.

This is important because the locally optimal footprint decision is not always the globally optimal inference decision. Especially during peak HBM pressure, naive controllers tend to chase memory relief too aggressively. The veto mechanism is the antidote.

Predicted gain

Higher HBM relief, lower hot-tier pressure, more room for imminent promotions.

Predicted cost

Unsafe output drift, destabilized expert behavior, or degraded model quality on sensitive layers.
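A minimal sketch of the veto check, assuming per-block precision floors and a configurable sensitivity bound (the precision names, ranking, and threshold are illustrative, not from the source):

```python
# Illustrative precision ladder: lower rank = lower precision.
PRECISION_RANK = {"fp16": 3, "fp8": 2, "int4": 1}

def veto_permits(block_id, proposed, floors, sensitivity, max_risk=0.7):
    """Return False if a proposed precision state would violate the
    block's precision floor or its quality-sensitivity bound.

    `floors` maps block -> minimum allowed precision; `sensitivity`
    maps block -> 0..1 degradation risk. All names are hypothetical.
    """
    floor = floors.get(block_id, "int4")
    if PRECISION_RANK[proposed] < PRECISION_RANK[floor]:
        return False  # below the per-block precision floor
    if sensitivity.get(block_id, 0.0) > max_risk:
        return False  # too quality-sensitive to demote at all
    return True

floors = {"layer_31": "fp16"}              # e.g. a known-sensitive layer
sens = {"layer_31": 0.9, "layer_5": 0.2}
assert veto_permits("layer_5", "int4", floors, sens) is True
assert veto_permits("layer_31", "fp8", floors, sens) is False
```

The veto deliberately ignores the predicted memory gain: it is a boundary check, not another scoring term, which is what keeps footprint pressure from eroding quality during peaks.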

Tier-transparent hierarchy

The memory stack is expanding, and the controller is designed for that future

The residency model spans HBM, compressed near-compute states, host DRAM or peer memory, CXL-attached pooled memory, and storage-backed cold extension. That matters because the future of serving is unlikely to be “everything fits in HBM.” The design is shaped for a world where pooled and staged capacity become first-class runtime tools.

  • Tier 0 · HBM: highest bandwidth, smallest capacity, reserved for imminently needed blocks.
  • Tier 1 · Compressed near-compute: shrinks footprint while preserving rapid reactivation.
  • Tier 2 · DRAM / peer memory: warm staging for shorter-latency recovery and distributed node strategies.
  • Tier 3 · CXL pool: a natural future embodiment for pooled disaggregated memory.
  • Tier 4 · Storage extension: cold capacity extension for oversized deployments.
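Treating the tiers as one continuum means the controller can cost every tier with the same formula. The sketch below uses rough, order-of-magnitude bandwidth numbers chosen purely for illustration; real values depend entirely on the hardware:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    read_gbps: float  # illustrative sustained bandwidth toward HBM

# Order-of-magnitude placeholders, not measurements.
TIERS = [
    Tier("HBM", 3000.0),
    Tier("compressed near-compute", 3000.0),
    Tier("DRAM / peer memory", 100.0),
    Tier("CXL pool", 40.0),
    Tier("storage extension", 8.0),
]

def promotion_latency_s(block_bytes: int, src: Tier) -> float:
    """Time to move a block from `src` into HBM, ignoring queueing
    and decompression; one cost model shared by every tier."""
    return block_bytes / (src.read_gbps * 1e9)

# A 400 MB block recovers faster from DRAM than from the CXL pool.
assert promotion_latency_s(400_000_000, TIERS[2]) < promotion_latency_s(400_000_000, TIERS[3])
```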
Bandwidth-aware scheduling

Not all transfers are equal

The scheduler distinguishes between promotions that unblock imminent compute and lower-value demotion traffic. It estimates transfer latency, decompression latency, and safety margin to define a per-block prefetch window. High-value promotions get reserved budget first; background spill traffic can be throttled or deferred.
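A compressed sketch of both ideas, with illustrative names and a made-up safety margin: the prefetch window is the latest-start budget before a block's next compute touch, and ordering puts promotions ahead of spill traffic.

```python
def prefetch_window_s(transfer_s: float, decompress_s: float,
                      safety_margin_s: float = 0.002) -> float:
    """Latest-start budget before the block's next compute touch."""
    return transfer_s + decompress_s + safety_margin_s

def order_transfers(requests):
    """Promotions that unblock imminent compute go first (earliest
    deadline among them); demotion / background spill traffic last.
    `requests` are (kind, deadline_s) tuples, kinds 'promote'/'demote'."""
    return sorted(requests, key=lambda r: (r[0] != "promote", r[1]))

q = [("demote", 0.001), ("promote", 0.050), ("promote", 0.005)]
assert order_transfers(q)[0] == ("promote", 0.005)
assert order_transfers(q)[-1] == ("demote", 0.001)
```

Note that the demotion has the earliest deadline yet still runs last; deadlines only break ties within a class, which is how low-value traffic is kept from blocking imminent promotions.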

MoE-specific advantage

Router history becomes a control signal

This is one of the most interesting hooks in the concept. In sparse models, expert popularity is not treated as a passive monitoring metric. It directly influences promotion priority and speculative prefetch. That closes the loop between routing behavior and memory placement.
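One simple way to turn router history into a placement signal (the decay constant and class shape are assumptions for illustration) is an exponential moving average of expert hits that directly orders speculative prefetch:

```python
class ExpertHotness:
    """Exponentially decayed hit counts per expert, consumed by the
    controller as a speculative-prefetch priority."""

    def __init__(self, num_experts: int, decay: float = 0.9):
        self.decay = decay
        self.score = [0.0] * num_experts

    def record(self, chosen_experts):
        """Fold one routing decision into the running averages."""
        self.score = [s * self.decay for s in self.score]
        for e in chosen_experts:
            self.score[e] += 1.0

    def prefetch_order(self):
        """Experts sorted hottest-first for promotion / prefetch."""
        return sorted(range(len(self.score)), key=lambda e: -self.score[e])

h = ExpertHotness(4)
for _ in range(10):
    h.record([2])        # expert 2 keeps winning the router
h.record([0])            # a single hit elsewhere does not flip priority
assert h.prefetch_order()[0] == 2
```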

Representative deployment variants

The same control model scales from a single accelerator to pooled systems

Deployment | How the controller fits | Why it is relevant
Single accelerator | Manages HBM, compressed HBM, host DRAM, and storage spill for one device. | Useful as the simplest embodiment and likely the easiest validation path.
Multi-accelerator node | Uses peer memory or staged replicas to cut fetch latency across the node fabric. | Relevant when local device memory is fragmented across several accelerators.
CXL-enabled memory pool | Treats pooled memory as an active intermediate tier rather than a passive extension. | Aligns directly with where AI infrastructure is headed.
Storage-extended serving | Uses colder staging for oversized models while preserving hot-path quality via guarded promotion. | Important for cost-sensitive or over-capacity deployments.
Diffusion / multimodal pipelines | Applies the same managed-unit abstraction beyond decoder-only LLMs. | Shows the concept is not necessarily tied to one graph family.
Closing take

The bigger idea is that model weights are becoming schedulable runtime state

The conceptual leap here is subtle but important. Traditional thinking treats parameters as mostly static assets that are loaded, partitioned, or compressed. This design treats them as managed operating state whose placement and precision can change continuously under workload feedback and quality supervision. That makes the controller feel less like a one-off optimization and more like an inference-time memory operating system.

For readers thinking about future AI systems, that framing may be the most durable part of the project. As HBM pressure rises and tiered memory becomes normal, runtime orchestration of weight state is likely to matter more, not less.