AI chips · HBM · fabrics · inference systems

Zhenwu M890: Why Alibaba Bet on Memory Fabrics, Not Just FLOPS

Everyone looked for TFLOPS. The real number was 144 GB.

The next AI infrastructure war may not be won by the fastest tensor core alone. It may be won by the best memory orchestration system.

The thesis

The Alibaba Zhenwu M890 is interesting not because it is another domestic AI accelerator, but because the announcement centers on the same architectural reality that NVIDIA, hyperscalers, and ML infra teams are converging on: inference is becoming a memory, fabric, and scheduling problem.

Chips do not win inference alone. Systems do.

  • 144 GB: HBM3 capacity
  • 800 GB/s: claimed inter-chip bandwidth, not HBM bandwidth
  • 25.6 Tbps: ICN Switch 1.0 fabric
  • 128: accelerators per AL128 rack

The Published M890 Signal

The raw public specifications are useful, but the more important column is the systems implication. Each number points toward a design center where active memory state, chip-to-chip movement, and rack-scale inference become first-class constraints.

| Item | Published / Claimed Detail | Why It Matters |
| --- | --- | --- |
| Memory | 144 GB HBM3 | Large KV-cache residency, long-context inference, higher batch concurrency, and reduced pressure to spill active state. |
| Inter-chip | 800 GB/s claimed inter-chip bandwidth | This should be read as accelerator-to-accelerator communication bandwidth, not internal HBM bandwidth. The internal HBM bandwidth was not fully disclosed in public coverage. |
| Fabric | ICN Switch 1.0, 25.6 Tbps, claimed <150 ns P2P latency | Shows the product is not only a chip. It is a rack-scale scale-up fabric strategy. |
| Rack | Panjiu AL128, 128 accelerators | The unit of competition is moving from chip to rack, pod, topology, and memory orchestration layer. |
| Precision | FP32 / FP16 / FP8 / FP4 | FP4 support suggests optimization for inference efficiency, compressed compute, and future long-context serving economics. |
| MoE | Not fully detailed publicly | MoE routing at rack scale is one of the key workloads where fabric latency, expert placement, and memory locality become decisive. |
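
The per-chip figures in the table are easier to reason about as rack-level aggregates. The following is a minimal sketch that derives two of them from the published numbers only; decimal units (1 GB = 10^9 bytes) and the helper names are assumptions made for illustration, not anything Alibaba has published.

```python
# Published / claimed figures from the table above.
HBM_PER_ACCEL_GB = 144   # HBM3 capacity per accelerator
ACCELS_PER_RACK = 128    # Panjiu AL128
FABRIC_TBPS = 25.6       # ICN Switch 1.0, terabits per second


def rack_hbm_tb(hbm_gb: float = HBM_PER_ACCEL_GB,
                accels: int = ACCELS_PER_RACK) -> float:
    """Aggregate HBM across one rack, in terabytes (decimal units assumed)."""
    return hbm_gb * accels / 1000


def tbps_to_tb_per_s(tbps: float) -> float:
    """Convert terabits per second to terabytes per second."""
    return tbps / 8


if __name__ == "__main__":
    print(f"HBM per AL128 rack: {rack_hbm_tb():.3f} TB")
    print(f"Switch capacity:    {tbps_to_tb_per_s(FABRIC_TBPS):.1f} TB/s")
```

Roughly 18.4 TB of HBM per rack is the budget that weights, KV cache, and activations all have to share, which is why the residency question dominates the rest of this piece.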

Old Story vs. New Story

Old AI hardware story

CPU → GPU → Tensor Cores → Faster Training

The old story was compute-centric: more tensor throughput, denser matrix engines, and better utilization for large training runs.

New AI infrastructure story

KV Cache → HBM → Fabric → Runtime → Tokens/sec

The new story is memory-centric: where state lives, how it moves, and whether the runtime can hide movement behind useful work.

Make the KV-Cache Math Hurt

The easiest way to understand why 144 GB HBM and a 128-accelerator rack matter is to look at active inference state. Long-context inference is not just more compute; it is more resident memory.

Illustrative workload

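Approximate KV-cache footprint for a Llama-3-70B-class model, assuming FP16 KV, 80 layers, hidden size around 8192, batch size 32, and 1M-token context. The estimate treats K and V as spanning the full hidden dimension, as in standard multi-head attention, so it is an upper bound for models that use grouped-query attention.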

  • 32: batch size
  • 1M: tokens/context
  • 80: layers
  • FP16: KV state

KV cache roughly scales as:

KV bytes ≈ batch_size × sequence_length × layers × 2 (K, V) × hidden_dimension × bytes_per_element
         = 32 × 1,000,000 × 80 × 2 × 8192 × 2 bytes
         ≈ 84 TB

That is the theoretical active KV state for this extreme batch/context combination, roughly 4.5 times the 18.4 TB of HBM in a full 128-accelerator rack (128 × 144 GB). Even reducing the example dramatically, KV state becomes a multi-terabyte systems problem very quickly. With 144 GB of HBM per accelerator, a few terabytes of hot KV state already implies dozens of chips before compute, or even model weights, are considered.

Note: some short-form estimates online use smaller hidden dimensions, fewer active layers, smaller batch, grouped-query attention reductions, or quantized KV, producing lower figures such as a few TB. The point survives every variant: long-context batched inference rapidly turns memory residency into the bottleneck.
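
The formula and its variants can be checked with a few lines of Python. This is a sketch under stated assumptions: the first scenario reproduces the example above, while the grouped-query scenarios assume 8 KV heads of dimension 128 (a KV width of 1024), which is how Llama-3-70B-style models are commonly configured. The function and scenario names are illustrative, and the chip counts cover KV state only, ignoring weights and activations.

```python
import math

HBM_BYTES = 144e9  # 144 GB per accelerator, decimal units assumed


def kv_cache_bytes(batch, seq_len, layers, kv_dim, bytes_per_elem):
    """KV bytes = batch x tokens x layers x 2 (K and V) x kv_dim x element size."""
    return batch * seq_len * layers * 2 * kv_dim * bytes_per_elem


# name -> (kv_dim, bytes_per_element); the GQA rows are assumptions.
SCENARIOS = {
    "full hidden dim, FP16 (example above)": (8192, 2),
    "GQA, 8 KV heads x 128, FP16": (1024, 2),
    "GQA, 8 KV heads x 128, FP8": (1024, 1),
}


def summarize(batch=32, seq_len=1_000_000, layers=80):
    """Return (name, terabytes, minimum accelerators for KV alone) per scenario."""
    rows = []
    for name, (kv_dim, width) in SCENARIOS.items():
        total = kv_cache_bytes(batch, seq_len, layers, kv_dim, width)
        chips = math.ceil(total / HBM_BYTES)
        rows.append((name, total / 1e12, chips))
    return rows


if __name__ == "__main__":
    for name, tb, chips in summarize():
        print(f"{name}: {tb:.1f} TB, at least {chips} accelerators for KV alone")
```

Even the most favorable row still needs several dozen accelerators' worth of HBM for KV state, so the conclusion does not depend on which variant of the estimate is used.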

The Fabric May Be More Important Than the Chip

A standalone accelerator is no longer the full product. The full product is the accelerator plus fabric plus runtime plus compiler plus orchestration layer. Alibaba’s emphasis on ICN Switch 1.0 and the Panjiu AL128 rack suggests it understands that rack-scale inference is a distributed systems problem.

The new inference bottleneck chain

Request → KV in HBM → Fabric Move → Runtime Schedule → Next Token

If the claimed sub-150 ns peer-to-peer latency is real under meaningful load, it is aggressive. The critical question is whether that latency remains useful under 64-chip or 128-chip collective traffic, expert routing, KV migration, and concurrent multi-tenant inference.
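
A first-order way to see when latency matters and when bandwidth matters is the usual latency-plus-size-over-bandwidth model for a single point-to-point message. The sketch below plugs in the two claimed figures (150 ns, 800 GB/s) and deliberately ignores contention, protocol overhead, and collective effects, which are exactly the unknowns raised above. The function names are illustrative.

```python
LINK_BYTES_PER_S = 800e9   # claimed inter-chip bandwidth
P2P_LATENCY_S = 150e-9     # claimed upper bound on peer-to-peer latency


def transfer_time_s(num_bytes, latency_s=P2P_LATENCY_S,
                    bandwidth=LINK_BYTES_PER_S):
    """Idealized cost of one message: fixed latency plus wire time."""
    return latency_s + num_bytes / bandwidth


def crossover_bytes(latency_s=P2P_LATENCY_S, bandwidth=LINK_BYTES_PER_S):
    """Message size at which wire time equals the fixed latency."""
    return latency_s * bandwidth


if __name__ == "__main__":
    print(f"crossover size: {crossover_bytes() / 1e3:.0f} KB")
    for size in (4e3, 120e3, 16e6, 1e9):
        micros = transfer_time_s(size) * 1e6
        print(f"{size / 1e6:10.3f} MB -> {micros:10.2f} us")
```

Under these idealized numbers, messages well below about 120 KB (small expert-routing payloads, synchronization) are latency-bound, while bulk KV migration is bandwidth-bound. Whether either figure holds under rack-wide load is not something this model can answer.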

How to Think About It vs. Known Systems

The right comparison is not simply “M890 vs GPU.” The better comparison is system strategy: NVIDIA-style rack-scale coherence and software maturity versus Alibaba’s domestic memory-fabric stack.

| System | Memory / Fabric Emphasis | Likely Strength | Open Question |
| --- | --- | --- | --- |
| NVIDIA GB200 / NVL-class systems | Very high bandwidth NVLink/NVSwitch scale-up, mature CUDA/NCCL/TensorRT ecosystem, Grace CPU integration in some configurations. | Software depth, collectives, topology-aware scheduling, mature model serving stack. | Cost, supply constraints, export restrictions, and lock-in. |
| AMD MI300 / MI400-class roadmap | Large HBM pools, chiplet strategy, ROCm growth, strong memory capacity positioning. | Open ecosystem appeal and strong memory-centric accelerator roadmap. | Software maturity versus NVIDIA and rack-scale collective story. |
| Alibaba Zhenwu M890 + AL128 | 144 GB HBM3, ICN Switch 1.0, 25.6 Tbps fabric, 128-accelerator rack positioning. | Domestic cloud integration, memory-heavy inference positioning, sovereign stack control. | Exact compute throughput, HBM bandwidth, software maturity, ICN openness, and real workload benchmarks. |

The Software Stack Is Still the Real Moat

Silicon gets attention, but the hard part is making the system useful under real workloads. The runtime has to make memory movement predictable, hide latency, and coordinate many accelerators without drowning in synchronization.

What the stack must solve

  • Graph capture for repeated inference patterns.
  • Prefill/decode disaggregation and scheduling.
  • KV-cache placement, reuse, compression, and eviction.
  • Topology-aware allreduce, allgather, and expert routing.
  • Compiler lowering into deterministic memory movement plans.
  • Profiling tools that expose fabric stalls and memory residency misses.
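
To make one item on that list concrete, KV-cache placement and eviction can be sketched as a least-recently-used cache of KV blocks under a fixed HBM byte budget. This is a toy illustration, not a description of any vendor's runtime: the class name, the block granularity, and the choice of LRU are all assumptions, and production systems typically also weigh prefix reuse, priorities, and recomputation cost.

```python
from collections import OrderedDict


class KVBlockCache:
    """Toy LRU cache of KV blocks under a fixed HBM byte budget."""

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.used = 0
        self.blocks = OrderedDict()  # block_id -> size in bytes, oldest first
        self.evicted = []            # ids that would spill to a slower tier

    def touch(self, block_id, size_bytes):
        """Record a use of block_id. Returns True on a hit, False on a miss."""
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # now the most recently used
            return True
        if size_bytes > self.budget:
            raise ValueError("block is larger than the whole budget")
        # Evict least recently used blocks until the new block fits.
        while self.used + size_bytes > self.budget:
            old_id, old_size = self.blocks.popitem(last=False)
            self.used -= old_size
            self.evicted.append(old_id)
        self.blocks[block_id] = size_bytes
        self.used += size_bytes
        return False


if __name__ == "__main__":
    cache = KVBlockCache(budget_bytes=3)
    for block in ["a", "b", "c", "a", "d"]:
        hit = cache.touch(block, size_bytes=1)
        print(block, "hit" if hit else "miss")
    print("evicted:", cache.evicted)
```

Every entry in `evicted` is a residency miss waiting to happen, which is the kind of event the profiling tools in the last bullet would need to surface.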

Conceptual runtime object

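#include <stdint.h>

// Conceptual sketch: every name here is hypothetical, not a published Alibaba API.
enum precision { KV_FP16, KV_FP8, KV_FP4, KV_QUANTIZED };
enum phase     { PHASE_PREFILL, PHASE_DECODE, PHASE_AGENT_LOOP };

struct m890_fabric_context {
    uint32_t node_count;      // accelerators in scope; AL128 rack-level target
    uint32_t native_group;    // possible 64-node native switch boundary
    uint64_t kv_bytes_hot;    // KV-cache bytes that must stay resident
    uint64_t hbm_bytes_free;  // HBM bytes still available for placement
    enum precision kv_format; // fp16, fp8, fp4, quantized
    enum phase phase;         // prefill, decode, agent_loop
};

// Keep a token range's KV in local HBM and replicate it across the fabric.
// ctx, token_range, and the flag constants are left undeclared on purpose.
m890_schedule_kv_region(ctx, token_range,
                        HBM_LOCAL | FABRIC_REPLICATED);

// Gather KV across accelerators with a topology-aware, latency-sensitive collective.
m890_collective_op(ctx, ALL_GATHER_KV,
                   TOPOLOGY_AWARE | LATENCY_SENSITIVE);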

The Compiler Becomes a Memory Traffic Optimizer

As inference patterns become more repeatable, deciding everything dynamically at run time becomes an avoidable cost. Future systems increasingly benefit from compile-time or graph-time planning: where KV lives, when DMA occurs, what buffers are reused, and when data can be released.

  • Application / Agent Loop: requests, tools, retrieval, memory
  • Serving Runtime: batching, prefill/decode split, admission control
  • Graph Capture / Compiler: static windows, memory placement, lowering
  • DMA + Fabric Scheduler: KV migration, collectives, synchronization
  • HBM / SRAM / CXL / DRAM: resident state and spill tiers
  • Rack Fabric: ICN, NVLink-like fabrics, RDMA, optics

That is the deeper architectural shift: the compiler is no longer only a compute scheduler. It becomes a memory traffic optimizer.
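
One small piece of that planning is buffer lifetime analysis: if the compiler knows when each buffer is first and last used, it can compute the peak memory actually live at any step and reuse space accordingly. The sketch below is a generic illustration of the idea with made-up buffer sizes; it is not tied to any specific compiler, and the function name is hypothetical.

```python
def peak_live_bytes(buffers):
    """Peak bytes live at once.

    buffers: iterable of (first_step, last_step, size_bytes), where a buffer
    is live from first_step through last_step inclusive.
    """
    events = []
    for first, last, size in buffers:
        events.append((first, size))      # allocated at the start of first_step
        events.append((last + 1, -size))  # released once last_step has finished
    live = peak = 0
    # Tuple sorting places releases (negative deltas) before allocations
    # that happen at the same step, so freed space can be reused immediately.
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak


if __name__ == "__main__":
    # Illustrative lifetimes only: (first_step, last_step, size_bytes).
    plan = [(0, 1, 4), (1, 2, 2), (2, 3, 4), (3, 3, 1)]
    naive = sum(size for _, _, size in plan)
    print(f"no reuse: {naive} bytes, with lifetime reuse: {peak_live_bytes(plan)} bytes")
```

The gap between the two numbers is memory a static plan can hand back to KV cache, which is why graph-time knowledge of lifetimes is worth more as HBM becomes the scarce resource.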

What to Ask Alibaba

The announcement is directionally important, but the unanswered details determine whether this becomes an ecosystem or a high-performance island.

  • What exactly is ICN? Is it custom, RDMA-like, CXL-inspired, Ethernet-compatible, optical-ready, or a proprietary scale-up island?
  • What is the real HBM bandwidth? Public coverage highlights inter-chip bandwidth. Internal HBM bandwidth matters enormously for decode throughput.
  • Does <150 ns hold under load? Point-to-point latency is one thing; 128-chip allreduce and MoE routing under contention are another.
  • Where is the software? The decisive layer is compiler/runtime support for KV reuse, prefill/decode disaggregation, graph capture, and fabric-aware scheduling.
  • What is the native topology boundary? If the native non-blocking group is smaller than the 128-accelerator rack, hierarchical scheduling becomes critical.
  • What workloads are optimized? Training, dense inference, MoE inference, long-context serving, and agent loops stress different parts of the system.
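
The HBM bandwidth question can be made concrete with a simple memory-bound estimate: during decode, each step has to stream the resident weights and the relevant KV cache out of HBM at least once, so bandwidth sets a floor on step time. The sketch below uses placeholder values throughout. The M890's HBM bandwidth has not been disclosed, so the 2 TB/s and 4 TB/s figures, the 70 GB of weights, and the 40 GB of KV read per step are hypothetical inputs chosen only to show the shape of the relationship.

```python
def decode_step_floor_s(weight_bytes, kv_bytes_read, hbm_bytes_per_s):
    """Lower bound on one decode step if weights and KV are each read once."""
    return (weight_bytes + kv_bytes_read) / hbm_bytes_per_s


def max_tokens_per_s(batch, weight_bytes, kv_bytes_read, hbm_bytes_per_s):
    """Upper bound on tokens/s: one new token per sequence per decode step."""
    step = decode_step_floor_s(weight_bytes, kv_bytes_read, hbm_bytes_per_s)
    return batch / step


if __name__ == "__main__":
    WEIGHTS = 70e9   # hypothetical: 70B parameters at one byte each
    KV_READ = 40e9   # hypothetical KV bytes touched per step
    for bw in (2e12, 4e12):  # hypothetical HBM bandwidths, NOT M890 specs
        tps = max_tokens_per_s(32, WEIGHTS, KV_READ, bw)
        print(f"{bw / 1e12:.0f} TB/s -> at most {tps:.0f} tokens/s per accelerator")
```

In this regime the bound scales linearly with bandwidth and is independent of peak FLOPS, which is why the undisclosed HBM figure matters more for decode than any compute number.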

Market Context: Sovereign AI Infrastructure

The M890 also fits a broader market reality. Chinese cloud providers need believable domestic alternatives to constrained global accelerator supply, especially Hopper-generation systems impacted by export controls. That does not make the software problem disappear, but it gives Alibaba a strong reason to build the full stack instead of only the chip.

The strategic product is not the accelerator. It is the accelerator, fabric, runtime, compiler, and cloud deployment model as one integrated system.

Final Takeaway

The Zhenwu M890 announcement is valuable because it tells us what to look at. The meaningful numbers are not just performance claims. They are memory capacity, fabric bandwidth, rack topology, and orchestration capability.

From compute-centric AI → memory-orchestrated infrastructure.

That is the real transition. The fastest tensor core still matters. But for long-context, agentic, multi-chip inference, the winning system may be the one that moves, places, reuses, and schedules memory better than everyone else.

Chips don’t win inference. Systems do.