Hardware-Resident MCOS · AI Memory Fabrics

MCOS-HFC: A Hardware Fabric Controller for Memory-Centric AI

Software defines the intent. Hardware enforces the residency, movement, admission, eviction, and handoff — near the fabric and the accelerator, without touching the CPU hot path.

Manish KL · April 2026 · ~16 min read · Architecture Essay
Patent Pending

The central claim behind MCOS-HFC is straightforward but consequential: in large-scale AI systems, memory movement is no longer an implementation detail. It is the system. Once weights, KV cache, activations, expert shards, and optimizer state spill across multiple tiers, performance is defined entirely by how intelligently those bytes are staged, retained, evicted, moved, or recomputed.

Fig. 01 MCOS-HFC System Architecture — Three-Plane Overview
[Diagram: Software Plane (runtime/compiler supplies intent, lifetime, deadline) → MCOS-HFC Controller (Policy Ingest for typed descriptors; Telemetry Engine for tier pressure and congestion; Residency Map in on-chip SRAM; Admit/Evict with regret counters; Path Selector choosing transfer vs. recompute; Descriptor Gen issuing signed commands) → Execution & Tiers (DPU execution agent; SRAM/HBM, Host DRAM/CXL, NVMe/Fabric), with backpressure feedback.]
Three-plane architecture: software intent flows into the hardware controller, which enforces residency and issues signed movement descriptors to fabric execution agents and memory tiers.

The real problem: AI pays a byte movement tax

Most AI infrastructure discussions still start with compute. How many FLOPs are available? How many accelerators are provisioned? How wide is the interconnect? Those questions matter, but they often miss the actual source of inefficiency in modern large-scale systems.

The dominant tax in modern AI systems is the repeated movement of state across tiers that do not share the same latency, bandwidth, or semantics. A weight tile that is logically "hot" may still sit one hop too far away. A KV-cache segment may be needed one token-step from now, yet remain stranded on the wrong side of a congested path. A transient activation may occupy valuable HBM while something with much higher reuse value gets evicted.

~100× · NVMe stall penalty vs. an HBM refetch
64K+ · AI objects tracked in a 1.1 MB residency map
0 · CPU instructions in the movement hot path
The premise of MCOS-HFC: AI performance is increasingly shaped by data residency and movement policy, not by raw compute throughput alone. The system that controls the bytes controls the machine.

Why caches and software-only schedulers are not enough

Generic caches

Traditional caches infer future need from past accesses. That works well for general-purpose workloads, but it is a weak abstraction for AI objects with known type, lifetime, deadline, and phase semantics. A weight tile that will be reused 50 times looks the same as a transient activation that will never be touched again.

Software-only control loops

Software can reason about the problem, but it often sits too far from the hot path. By the time the CPU sees the signal, computes the policy, and issues the command, the window for an optimal action may already be gone. Context switches are counted in microseconds; memory decisions need nanoseconds.

AI systems already know much more than generic memory controllers assume. Compilers know graph structure. Runtimes know the phase of execution. Serving systems know deadlines, batch shape, token progression, and tenant policy. The problem is that this knowledge is not expressed as a first-class hardware-enforced contract.

That gap is where MCOS-HFC lives.

Fig. 02 Memory Tier Hierarchy — Latency & Bandwidth Profile
[Diagram: tier ladder from hot to cold: Accelerator SRAM ~1 ns, highest bandwidth; accelerator-attached HBM ~10 ns, ~3 TB/s; Host DRAM / CXL-attached ~100 ns, ~600 GB/s; DPU-attached / network-attached ~1 µs; NVMe / fabric-attached storage ~100 µs. MCOS-HFC governs all tiers.]
The memory hierarchy MCOS-HFC governs — from nanosecond SRAM at the hot end to 100-microsecond NVMe at the cold end. Wrong placement is the tax; the controller eliminates it.
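The tier profile in Fig. 02 can be captured as a small lookup table that a cost model could consult. A minimal Python sketch, assuming the figure's approximate latencies; the table, tier names, and `stall_penalty` helper are illustrative, not part of MCOS-HFC:

```python
# Hypothetical sketch: the Fig. 02 tier profile as a lookup table.
# Latencies are the figure's approximations, expressed in nanoseconds.

TIER_PROFILE = {
    # tier: (approx. access latency in ns, hotness rank; lower = hotter)
    "SRAM": (1, 0),          # accelerator SRAM
    "HBM":  (10, 1),         # accelerator-attached HBM
    "HOST": (100, 2),        # host DRAM / CXL-attached
    "DPU":  (1_000, 3),      # DPU- / network-attached
    "NVME": (100_000, 4),    # NVMe / fabric-attached storage
}

def stall_penalty(current_tier: str, target_tier: str = "HBM") -> float:
    """Relative cost of serving an access from current_tier instead of
    target_tier: the penalty grows with every hop down the ladder."""
    return TIER_PROFILE[current_tier][0] / TIER_PROFILE[target_tier][0]
```

For example, `stall_penalty("HOST")` reports a host-DRAM access as roughly an order of magnitude more expensive than an HBM hit under these latency assumptions.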

What MCOS actually is

MCOS stands for Memory-Centric Operating System. In practice, it should not be read as "yet another OS kernel." It is better understood as a memory-centric policy plane for AI systems — the layer that speaks for intent.

MCOS is the layer that tells the system what an object is, how hot it is, how long it should live, how urgently it will be needed, where it would ideally reside, what fallback tiers are acceptable, whether recomputation is allowed, and what security domain governs it. That is the policy half.

The enforcement half is the HFC.

The key split: MCOS declares intent in software. The HFC enforces it in hardware. Neither works without the other — the intent without enforcement is a wish; the enforcement without intent is a blind heuristic.

What the Hardware Fabric Controller does

HFC stands for Hardware Fabric Controller. This is the hardware-resident engine that receives memory intent from software and turns it into real movement, admission, eviction, and execution decisions at line rate.

The HFC is not just a transport primitive, and it is not just a cache controller. It is a control system for multi-tier AI memory — purpose-built for the kind of structured, typed, phase-aware access patterns that large models produce.

The Five Core Operations

  1. Ingest typed memory intent descriptors from software (type, lifetime, deadline, recompute flag, security domain).
  2. Track current residency — a hardware-readable map of where every AI object lives and what state it is in.
  3. Decide what to promote, retain, demote, or deny — using regret-aware utility scoring without CPU involvement.
  4. Arbitrate among physical transfer paths and arithmetic recomputation via a shared cost function.
  5. Emit signed execution descriptors to distributed agents that physically perform the movement.
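The five operations above can be sketched as a single pass through one control loop. This is a hypothetical Python sketch, not the spec: the `IntentDescriptor` fields mirror the descriptor contents listed in step 1, while `hfc_step` and the callback names are illustrative stand-ins:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class IntentDescriptor:       # 1. typed memory intent from software
    obj_id: int
    obj_type: str             # "weight", "kv_cache", "activation", ...
    lifetime_us: int
    deadline_us: int
    recompute_ok: bool
    security_domain: int

def hfc_step(desc: IntentDescriptor,
             residency: dict,                                  # 2. residency map
             utility: Callable[[IntentDescriptor], float],     # 3. admit/evict score
             pick_path: Callable[[IntentDescriptor], str],     # 4. path arbitration
             sign: Callable[[dict], dict]) -> Optional[dict]:
    """One pass through ingest -> track -> decide -> arbitrate -> emit."""
    tier = residency.get(desc.obj_id, "NVME")   # unknown objects start cold
    if utility(desc) <= 0:
        return None                             # deny: not worth promoting
    cmd = {"obj": desc.obj_id, "from": tier, "path": pick_path(desc)}
    residency[desc.obj_id] = "HBM"              # optimistic: promotion in flight
    return sign(cmd)                            # 5. signed execution descriptor
```

Wiring in trivial scoring, path, and signing callbacks shows the shape of the flow; in hardware each callback is a dedicated engine rather than a function call.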

That is the architecture in full. Each of those five steps is worth examining closely.

The residency map: hardware truth about where state lives

One of the most important ideas in MCOS-HFC is the multi-tier residency map. This is not a vague software table or a cache tag array. It is a hardware-resident map, stored in dedicated on-controller SRAM, that tracks precisely where AI objects currently live and what state they are in — at line rate, without CPU involvement.

Instead of guessing from address history, the controller directly knows:

// Example residency map entry (128 bits per object)
Object ID       →  tensor:layer:expert identity hash     (64b)
Current Tier    →  SRAM | HBM | HOST | CXL | DPU | NVMe  (4b)
Hotness Score   →  compiler-supplied + runtime-adjusted  (16b)
Reuse Window    →  expected remaining accesses           (16b)
Regret Counter  →  saturating hw counter (0–65535)       (16b)
Transfer State  →  idle | inflight | completing | failed (4b)
Security Domain →  tenant isolation domain ID            (8b)

At 128 bits (16 bytes) per entry, a 1.1 MB SRAM budget tracks over 64,000 distinct AI objects. That is well within the on-chip budget of a modern DPU — and it means the controller can make placement decisions in a single clock cycle without touching external memory.
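Packing the listed fields (64 + 4 + 16 + 16 + 16 + 4 + 8 = 128 bits) into a single word can be sketched directly. The field order and enum encodings below are illustrative assumptions, not the controller's actual layout:

```python
# Hypothetical sketch: one residency-map entry packed into a 128-bit word.
# Field order and enum encodings are assumptions for illustration.

TIERS  = ["SRAM", "HBM", "HOST", "CXL", "DPU", "NVME"]
STATES = ["idle", "inflight", "completing", "failed"]

def pack_entry(obj_id, tier, hotness, reuse, regret, state, domain):
    assert obj_id < 2**64 and hotness < 2**16 and reuse < 2**16
    assert regret < 2**16 and domain < 2**8
    word = obj_id                               # 64b identity hash
    word = (word << 4)  | TIERS.index(tier)     # 4b current tier
    word = (word << 16) | hotness               # 16b hotness score
    word = (word << 16) | reuse                 # 16b reuse window
    word = (word << 16) | regret                # 16b regret counter
    word = (word << 4)  | STATES.index(state)   # 4b transfer state
    word = (word << 8)  | domain                # 8b security domain
    return word

def unpack_entry(word):
    domain = word & 0xFF;          word >>= 8
    state  = STATES[word & 0xF];   word >>= 4
    regret = word & 0xFFFF;        word >>= 16
    reuse  = word & 0xFFFF;        word >>= 16
    hotness = word & 0xFFFF;       word >>= 16
    tier   = TIERS[word & 0xF];    word >>= 4
    return word, tier, hotness, reuse, regret, state, domain
```

The round trip through `pack_entry` and `unpack_entry` is lossless, which is the property a hardware tag array needs: every field is readable in place with a shift and a mask.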

This is what makes the system feel like a real controller rather than a dressed-up heuristic. It operates from explicit object-aware state, not inference alone.

Regret-aware eviction instead of generic replacement

A normal memory hierarchy asks: what was used most recently, or most frequently? That is LRU. That is LFU. Those are reasonable heuristics for general-purpose computing, where access patterns are unknown and object semantics are opaque.

MCOS-HFC asks a more useful question: what will we regret evicting?

R(i) = P(reuse_i) × StallPenalty(tier_distance_i) × RemainingLifetime(i)

This metric captures three things simultaneously that LRU and LFU cannot: how likely the object is to be reused, how expensive a refetch across the tier distance would be, and how long the object will remain live.
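The regret score can be transcribed directly; in this sketch the stall-penalty scale (roughly an order of magnitude per tier hop, in the spirit of Fig. 02) is an assumption, not MCOS-HFC's actual units:

```python
# Direct transcription of the regret formula; the per-hop penalty scale
# is a hypothetical assumption for illustration.

def regret(p_reuse: float, tier_distance: int, remaining_lifetime: float) -> float:
    """R(i) = P(reuse_i) * StallPenalty(tier_distance_i) * RemainingLifetime(i)."""
    stall_penalty = 10 ** tier_distance   # assumed ~10x cost per tier hop
    return p_reuse * stall_penalty * remaining_lifetime

# A hot weight tile far from compute accrues far more regret than a
# transient activation that is about to die:
weight_tile = regret(p_reuse=0.9, tier_distance=3, remaining_lifetime=50.0)
activation  = regret(p_reuse=0.1, tier_distance=1, remaining_lifetime=1.0)
```

Under these numbers the weight tile's regret dwarfs the activation's, so the activation is the eviction victim even if it was touched more recently, which is exactly the inversion LRU cannot express.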

Fig. 03 Regret Counter Lifecycle — Saturating Hardware Counter
[Diagram: saturating 16-bit counter (example: regret = 38,000 / 65,535); incremented when the predictor anticipates an access, decremented on each observed accelerator access; the eviction engine reads it from Residency Map SRAM at line rate and selects min(R) as the victim, with no CPU instruction and nanosecond rather than microsecond latency.]
The regret counter increments when the predictor anticipates a future access, and decrements on each observed accelerator access. The eviction engine reads it directly from SRAM — no CPU, no interrupt, no context switch.
The architectural jump: software expresses semantics, but hardware carries out regret-aware displacement at the speed of the fabric. The counter is the bridge between declared intent and real-time hardware decision.
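The counter semantics in Fig. 03 reduce to three tiny operations. A hypothetical model, with function names of my choosing:

```python
# Hypothetical model of the saturating 16-bit regret counter from Fig. 03.

SAT_MAX = 65_535   # 16-bit saturating ceiling

def predicted_access(counter: int, weight: int = 1) -> int:
    """Predictor anticipates a future access: increment, saturating at the max."""
    return min(counter + weight, SAT_MAX)

def observed_access(counter: int, weight: int = 1) -> int:
    """Accelerator actually touched the object: decrement, floor at zero."""
    return max(counter - weight, 0)

def pick_victim(counters: dict) -> int:
    """Eviction engine: evict the object we will regret losing the least."""
    return min(counters, key=counters.get)
```

Because both updates are bounded single-word operations and victim selection is a running minimum, all three fit naturally in counter-per-entry SRAM logic rather than software.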

Recomputation vs. transfer as a first-class decision

Most systems compare one transfer path against another: direct path, staged path, peer path, host bounce path, storage path. MCOS-HFC adds a more interesting competitor: recomputation.

The controller evaluates whether it is cheaper to fetch a piece of state or regenerate it locally — treating arithmetic work as a legitimate alternative to moving bytes across a congested fabric.

argmin( C_transfer(path₁), C_transfer(path₂), …, C_recompute )
Fig. 04 Movement Path Selector — Cost-Function Arbitration
[Diagram: five candidate paths feeding an argmin: GPUDirect NVMe→HBM (depth 1), DPU-staged NVMe→DPU→HBM (depth 2), RDMA peer HBM→HBM (depth 1–2), host bounce via DRAM (depth 3), and recompute from local FLOPs (depth 0); the winning path is executed. Recompute as a first-class path is the non-obvious entry.]
The Path Selector evaluates five options — including recomputation — using a shared cost function. When fabric congestion makes a layer re-execution cheaper than a fetch, the hardware routes that way automatically.

This matters because congested AI clusters regularly hit situations where the time to transfer exceeds the time to recompute. Once that becomes true, recomputation is not a fallback or a failure mode. It is a legitimate, first-class path decision evaluated by the same cost function as every physical route.
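The arbitration itself is a small argmin once every option is priced in the same unit (seconds of latency). A hypothetical sketch, where the bandwidth and congestion numbers are illustrative inputs that real hardware would draw from telemetry:

```python
# Hypothetical cost-function arbitration over transfer paths plus recompute.
# Bandwidths (bytes/s), congestion multipliers, and FLOP rates are inputs
# a real controller would source from its telemetry engine.

def select_path(size_bytes: int, bw: dict, congestion: dict,
                recompute_flops: float, flops_per_s: float) -> str:
    """argmin over physical paths and recomputation, priced in seconds."""
    costs = {
        path: congestion.get(path, 1.0) * size_bytes / bw[path]
        for path in bw
    }
    costs["recompute"] = recompute_flops / flops_per_s   # arithmetic as a path
    return min(costs, key=costs.get)
```

With a 1 GB fetch over a 25 GB/s link under 20× congestion (0.8 s) against 10¹² FLOPs of re-execution at 100 TFLOP/s (0.01 s), recomputation wins; lift the congestion and the physical path wins again. That crossover is the whole argument for making recompute first-class.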

"When an NVMe-oF link is congested, it may be faster to re-execute a layer than to wait for weight transfer. MCOS-HFC makes this decision automatically — no software loop, no human tuning." — Architecture decision rationale, MCOS-HFC provisional specification

That framing is important. It moves the architecture beyond transport optimization and into full movement economics: a system that treats computation, bandwidth, and latency as fungible resources to be arbitrated over in a unified cost function.

Atomic doorbell handoff and zero-stall compute

Movement is only half the story. Coordination overhead can still destroy the benefit if the consumer must poll memory, trap into software, or wait on a slow control-plane notification. This is a commonly overlooked source of latency in otherwise well-optimized data pipelines.

MCOS-HFC addresses that with an atomic doorbell handoff. Once an execution agent completes a transfer into the target memory region, it performs a single atomic write to a doorbell register visible to the accelerator — a hardware-level completion signal that lets compute resume cleanly, without CPU involvement.

  1. Execution agent completes DMA/RDMA placement into target HBM region.
  2. Agent performs an atomic PCIe MMIO or CXL.io write to the accelerator's Doorbell Register.
  3. Accelerator observes the doorbell event — via interrupt or firmware poll — immediately.
  4. Compute kernel proceeds. Data was staged before demand. Stall is zero.
Why it matters: Without hardware signaling, the accelerator must poll memory or rely on software notification — both add latency in the tens to hundreds of microseconds. The atomic doorbell collapses that to a single-cycle hardware event. The seam between movement and compute gets tightened to the point where it effectively disappears.
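The four-step handoff can be modeled with a shared one-word register. In this sketch a `threading.Event` stands in for the single atomic MMIO write; real hardware uses a PCIe/CXL.io store observed by the accelerator, and all class and function names here are illustrative:

```python
# Hypothetical model of the atomic doorbell handoff. threading.Event is a
# stand-in for the doorbell register; the buffer stands in for target HBM.

import threading

class DoorbellRegion:
    def __init__(self):
        self.buffer = None                 # target HBM region (stand-in)
        self.doorbell = threading.Event()  # doorbell register (stand-in)

def execution_agent(region: DoorbellRegion, payload: bytes):
    region.buffer = payload    # step 1: DMA/RDMA placement completes
    region.doorbell.set()      # step 2: single atomic doorbell write

def accelerator_kernel(region: DoorbellRegion) -> bytes:
    region.doorbell.wait()     # step 3: observe the doorbell event
    return region.buffer       # step 4: compute proceeds on staged data

region = DoorbellRegion()
agent = threading.Thread(target=execution_agent, args=(region, b"weight-tile"))
agent.start()
data = accelerator_kernel(region)
agent.join()
```

The ordering matters: the data write strictly precedes the doorbell write, so the consumer that observes the doorbell is guaranteed to see complete data, which is what makes the handoff safe without any further synchronization.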

Security, signing, and multi-tenant control

MCOS-HFC is not only a performance story. It is also a control and isolation story. In multi-tenant cloud AI systems, a memory movement command is not just a scheduling action — it is a privileged capability that can cross tenant protection boundaries and touch another tenant's state if misused.

That is why the architecture supports cryptographically signed descriptors. Each movement command can be authenticated, tied to a hardware-bound key, and restricted by policy so that unauthorized movement or memory injection is structurally impossible. An immutable hardware audit log records every state transition for compliance verification.

Descriptor signing

Each data movement command is signed by the Security Engine using a hardware-bound tenant key before dispatch to execution agents. Unsigned descriptors are rejected at the fabric boundary.

Tenant isolation

The Residency Map SRAM encodes a security domain per object entry. Promotion, demotion, and movement decisions are isolated per domain — cross-tenant movement is architecturally prevented.
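The signing and boundary-check flow can be sketched with HMAC-SHA256 as a stand-in for whatever MAC the Security Engine actually implements; the key, descriptor fields, and function names below are illustrative:

```python
# Hypothetical sketch of descriptor signing with a hardware-bound tenant key.
# HMAC-SHA256 is a stand-in for the Security Engine's actual MAC.

import hashlib
import hmac
import json

def sign_descriptor(desc: dict, tenant_key: bytes) -> dict:
    payload = json.dumps(desc, sort_keys=True).encode()
    tag = hmac.new(tenant_key, payload, hashlib.sha256).hexdigest()
    return {**desc, "sig": tag}

def verify_descriptor(signed: dict, tenant_key: bytes) -> bool:
    """Fabric-boundary check: unsigned or tampered descriptors are rejected."""
    claimed = signed.get("sig", "")
    body = {k: v for k, v in signed.items() if k != "sig"}
    payload = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(tenant_key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(claimed, expected)

key = b"hw-bound-tenant-key"   # stand-in for a key fused into hardware
signed = sign_descriptor({"obj": 42, "path": "gpudirect", "domain": 7}, key)
```

A descriptor whose security domain (or any other field) is altered after signing fails verification, so a forged cross-tenant move is rejected at the boundary rather than detected after the fact.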

This broadens the appeal of the architecture considerably. A hyperscaler is not just buying faster movement. It is buying a safer, more governable movement plane — one that can be operated as a shared infrastructure service across many tenants with strong isolation guarantees.

Why the DPU embodiment matters

DPU stands for Data Processing Unit. The DPU embodiment matters because it gives MCOS-HFC a physically credible home in the production system — not a speculative future chip, but a class of hardware that is already deployed at scale today.

A modern DPU or SmartNIC already sits at the intersection of all the boundaries the controller needs to govern: PCIe uplinks to host and accelerators, RDMA fabric endpoints, NVMe-oF storage transport, tenant isolation enforcement, and control-plane offload. It is, in other words, already positioned to host a residency map, descriptor issue engines, security controls, and a doorbell-capable completion path.

Why it fits

The DPU sits on the data path between host, accelerator, fabric, and storage. It is already trusted with infrastructure responsibilities. It has the on-chip SRAM budget, crypto engines, and line-rate processing capability MCOS-HFC requires.

Why it matters commercially

It transforms MCOS-HFC from an abstract controller concept into a deployable architectural unit for real AI clusters — one that can be shipped as a hardware SKU, provisioned per node, and updated via firmware.

The DPU is where memory policy becomes infrastructure. It is the difference between a research architecture and a product.

Why this matters now

The old model of system design assumed compute scarcity first and memory management second. AI has inverted that assumption quietly but decisively. Once models grow past the capacity of a single accelerator's HBM, once contexts stretch to hundreds of thousands of tokens, once serving systems juggle hundreds of concurrent KV-cache segments across a fabric, the scheduling of bytes starts to dominate the quality of the machine.

That is why AI needs something closer to an operating system for memory. Not an OS in the narrow historical sense — not a kernel, not a scheduler for CPU threads — but a system-level control plane that understands object semantics, residency, urgency, tier distance, path economics, and safe enforcement at line rate.

"Future AI systems will not be differentiated only by faster chips. They will be differentiated by who best controls the movement, residency, and reuse of state across memory and fabric tiers." — MCOS-HFC design thesis

That is the role MCOS-HFC is trying to define. And the window for defining it is now — before the default answers get locked into the wrong abstractions, and before the memory movement tax becomes a permanent cost of doing business at scale.

Closing thought

MCOS-HFC is ultimately an argument about where complexity belongs. Today, too much of the burden sits in scattered runtimes, opaque heuristics, and reactive software loops that are perpetually too slow for the problem they are trying to solve. A memory-centric AI machine needs a cleaner contract: software declares intent, hardware tracks truth, and the fabric executes policy with enough precision to keep compute fed without wasting movement budget.

If AI is becoming a memory system with compute attached, then the control plane for memory is no longer optional. It is foundational — and it belongs in hardware, near the fabric, operating at line rate.

© 2026 Manish KL