Hardware-Resident MCOS · AI Memory Fabrics

MCOS Must Live in Hardware

From JBOD to NAS, storage became powerful when it stopped being passive. The same shift is coming for AI memory systems: data movement needs a brain, and that brain cannot live only in software.

Manish KL · April 2026 · ~14 min read · Architecture Essay
[Figure: MCOS architecture. A software-defined policy layer (observe, predict, assign intent) installs policy into a hardware-resident execution layer (DMA/RDMA scheduling, admission/eviction, prefetch and KV placement) that executes at line rate across SRAM/HBM, DPU/SmartNIC, NVMe/fabric, and GPU/accelerator resources. Software provides intelligence; hardware provides immediacy, with no CPU in the hot path.]

The earlier idea of MCOS framed it as a software system: a control layer that manages memory placement, movement, and reuse. That is directionally correct, but incomplete.

The real insight is this: MCOS cannot succeed if it lives only in software.

A useful MCOS must become a hardware-adjacent control substrate that executes movement policy at line rate, near the fabric, memory tiers, and accelerators themselves. Software alone cannot get there — not because software is weak, but because the hot path is too fast for software round-trips.

Why software-only MCOS hits a ceiling

A pure software MCOS introduces intelligence, but it also introduces latency. Even if the policy is brilliant, once every important decision must pass through the CPU, the OS scheduler, runtime locks, driver queues, and user-space callbacks, some of the benefit is lost.

observe → decide → syscall → driver → DMA → move

Every step in that chain can add cost:

Scheduling delay

The CPU and runtime are no longer observers — they become bottlenecks in the hot path.

Queueing jitter

Movement decisions arrive late, variability rises, and p99 behavior degrades even when average-case looks fine.

Lost overlap

Data could have been moving already, but waits for software to catch up. Prefetch value evaporates.

In AI systems, the bottleneck is increasingly data movement itself. Adding more software layers to the critical path can make the system smarter and slower at the same time.
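The scale mismatch behind that chain can be made concrete with a rough latency budget. The numbers below are illustrative order-of-magnitude assumptions, not measurements, and the step names simply mirror the chain above:

```python
# Illustrative latency budget (numbers are rough assumptions, not measurements).
# Software hot path: every movement decision round-trips through the host stack.
software_path_ns = {
    "observe (callback)": 500,
    "decide (runtime locks + policy)": 1_000,
    "syscall": 1_500,
    "driver queue": 2_000,
    "DMA launch": 500,
}

# Hardware-resident path: the decision is a table lookup at the fabric edge.
hardware_path_ns = {"table lookup": 20, "DMA launch": 100}

hbm_fetch_ns = 300  # order-of-magnitude cost of the operation being scheduled

sw = sum(software_path_ns.values())
hw = sum(hardware_path_ns.values())
print(f"software decision path ~ {sw} ns ({sw / hbm_fetch_ns:.0f}x an HBM fetch)")
print(f"hardware decision path ~ {hw} ns")
```

Even with generous assumptions, the software decision path costs an order of magnitude more than the memory operation it is trying to schedule, which is exactly why overlap and prefetch value evaporate.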

The JBOD → NAS analogy

The best mental model comes from storage history. Once upon a time, storage looked like this:

Disks → OS → Application

That was JBOD: just a bunch of disks. Raw capacity existed, but intelligence was elsewhere. Then came network-attached filers:

Disks → Smart Controller / Filer → Network → Application

The key shift was not merely putting disks on a network. It was that storage stopped being passive. The filer began making real-time decisions about caching, placement, prefetch, eviction, namespace management, replication, and failure handling.

[Figure 1 panels: JBOD era (passive disks behind the host OS and application; intelligence lives only in the host, no placement logic) vs. NAS era (a smart filer handling cache, placement, eviction, and prefetch near the data).]
Figure 1. The JBOD-to-NAS transition is the best historical analogy for what AI memory systems are going through now. Intelligence moved closer to the data — and so must MCOS.

JBOD became NAS when storage got a brain.

Why AI memory is at the same moment

Today's AI infrastructure still often resembles the JBOD era. We have raw primitives — HBM, SRAM, NVMe, RDMA, DPU queues — but much of the orchestration remains manual, runtime-driven, or reactive. Memory is still treated as a set of resources to be micromanaged indirectly rather than as an intelligent subsystem in its own right.

The next step is not just better APIs. It is the emergence of an intelligent memory fabric: a system that understands hotness, reuse, deadlines, KV locality, bandwidth pressure, topology, and movement policy.

Memory must become an active subsystem, not a passive collection of tiers.

The correct architecture

MCOS should not be a software layer sitting politely on top of GPUs. It should be built as a split system:

AI Application / Runtime
        ↓
MCOS Policy Brain     ← software-defined: global view, priority, intent
        ↓
Hardware MCOS Layer   ← hardware-resident: placement, prefetch, eviction at line rate
        ↓
GPU / SRAM / HBM / DPU / NVMe / Fabric

The software side maintains the global view and installs policy. The hardware side enforces placement, prefetch, admission, eviction, and movement decisions without CPU involvement in the hot path.
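This split can be sketched in a few lines. Everything here is hypothetical (the class names, tiers, and thresholds are illustrative): the point is that the software brain runs arbitrary policy code off the hot path and compiles it into a table, while the per-access decision on the hardware side reduces to a single lookup, the kind of match-action rule a DPU or SmartNIC could enforce:

```python
# Sketch of the MCOS split (all names and thresholds are illustrative).

class PolicyBrain:
    """Software-defined: global view, runs off the hot path."""

    def compile_policy(self, hotness: dict[str, float]) -> dict[str, str]:
        # Hotter pages are assigned to faster tiers; thresholds are made up.
        table = {}
        for page, heat in hotness.items():
            if heat > 0.9:
                table[page] = "SRAM"
            elif heat > 0.5:
                table[page] = "HBM"
            else:
                table[page] = "NVMe"
        return table

class HardwarePath:
    """Hardware-resident: the per-access decision is one lookup, no policy code."""

    def __init__(self):
        self.table: dict[str, str] = {}

    def install(self, table: dict[str, str]) -> None:
        self.table = table  # policy installed out-of-band, not per access

    def place(self, page: str) -> str:
        return self.table.get(page, "NVMe")  # default tier as fallback

brain, hw = PolicyBrain(), HardwarePath()
hw.install(brain.compile_policy({"kv_block_7": 0.95, "weights_3": 0.6, "ckpt_1": 0.1}))
print(hw.place("kv_block_7"))  # SRAM
```

The design choice to note: `compile_policy` can be as slow and clever as it likes, because it never sits on the access path; `place` must stay trivial enough to run at line rate.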

What must move into hardware

These are exactly the functions that become far more valuable when enforced near the memory and fabric edges — where latency is measured in nanoseconds, not microseconds:

DMA / RDMA scheduling

Movement timing and route selection without CPU involvement in the hot path.

Admission / eviction

Promotion and demotion of hot data without round-trips through runtime software.

Prefetch triggers

Hardware stages likely-needed state before the accelerator stalls on demand fetch.

KV cache placement

Token-serving workloads need direct, fast decisions about what stays close to compute.

Reuse enforcement

A smart controller keeps hot tiles resident as long as reuse warrants it, rather than only until eviction pressure forces them out.

Fallback handling

Graceful degradation without the CPU micromanaging every byte when the fast path is unavailable.

These decisions must execute at wire speed, not at software speed.
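To make one of these functions concrete, here is a minimal sketch of a prefetch trigger of the kind that could live in hardware: a stride detector that stages the next block before the accelerator stalls on a demand fetch. The class and its `stage` list are illustrative, not a real device API:

```python
# Illustrative stride prefetcher: once two consecutive accesses show the same
# stride, the next block is staged with no software round-trip.

class StridePrefetcher:
    def __init__(self):
        self.last_addr: int | None = None
        self.stride: int | None = None
        self.staged: list[int] = []  # blocks staged ahead of demand

    def on_access(self, addr: int) -> None:
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride == self.stride and stride != 0:
                # Confirmed stride: predict and stage the next block now,
                # while the current access is still in flight.
                self.staged.append(addr + stride)
            self.stride = stride
        self.last_addr = addr

pf = StridePrefetcher()
for addr in (0x1000, 0x1100, 0x1200, 0x1300):
    pf.on_access(addr)
print([hex(a) for a in pf.staged])  # ['0x1300', '0x1400']
```

Real hardware prefetchers are far more elaborate, but the essential property is visible even here: the trigger fires from the access stream itself, with no callback into runtime software.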

The new control loop

A useful MCOS becomes a distributed control loop:

observe → predict → stage → reuse → evict

But the split matters enormously:

Software-defined

Observe patterns, build a global view, learn workload behavior, assign priorities, and install policy into the hardware layer.

Hardware-enforced

Stage movement, keep hot state resident, prefetch likely-needed blocks, and evict cold state — without CPU involvement in the hot path.

That is the right balance: software provides intelligence, hardware provides immediacy.
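The hardware-enforced half of that loop can be sketched as a reuse-aware residency engine (the name and rule are illustrative): the only work in the hot path is a counter update and a compare, yet hot state stays resident because reuse warrants it, not merely until pressure evicts it:

```python
# Sketch of the hardware-enforced half of the loop: reuse-aware residency.
from collections import OrderedDict

class ResidencyEngine:
    """Keeps hot blocks resident via reuse counters; no callbacks, no locks."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.reuse: OrderedDict[str, int] = OrderedDict()  # block -> reuse count

    def touch(self, block: str) -> None:
        self.reuse[block] = self.reuse.get(block, 0) + 1
        if len(self.reuse) > self.capacity:
            # Hardware-friendly rule: evict the least-reused block.
            coldest = min(self.reuse, key=self.reuse.get)
            del self.reuse[coldest]

engine = ResidencyEngine(capacity=2)
for blk in ("kv_a", "kv_a", "kv_b", "kv_c"):  # kv_a is hot; kv_b is touched once
    engine.touch(blk)
print(list(engine.reuse))  # ['kv_a', 'kv_c']: the hot block stayed resident
```

The software-defined half would sit above this, adjusting capacities and priorities as the global view changes, without ever appearing in the `touch` path.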

How such a system would evolve

The architecture likely emerges in stages — each version moving more execution closer to the fabric:

[Figure 2 panels: v1 software runtime; v2 DPU-assisted; v3 hardware-resident controller; v4 rack-scale memory appliance.]
Figure 2. MCOS evolution from pure software runtime to rack-scale memory appliance. Each generation moves more execution closer to the data plane.
v1

Software runtime + driver hooks. Proves policy value, exposes memory-intent APIs.

v2

DPU-assisted movement. Hot-path execution begins shifting closer to fabric and off the host CPU.

v3

Hardware-resident controller with software-defined policy, executing movement at near line rate.

v4

Rack-scale memory appliance or substrate — memory movement as a coherent shared service.

The important thing is that software-only MCOS is a stepping stone, not the destination.

The one-line thesis

AI infrastructure becomes a memory fabric when data movement gets a brain.

A software-only MCOS is useful for proving policy and exposing a better programming model. But the real system is a hardware-resident movement controller with software-defined intelligence — the architecture that can remove bounce buffers, reduce redundant transfers, keep hot state resident, and feed accelerators at line rate.