AI chips · HBM · fabrics · inference systems

Zhenwu M890: Why Alibaba Bet on Memory Fabrics, Not Just FLOPS

Everyone looked for TFLOPS. The real number was 144 GB.

The next AI infrastructure war may not be won by the fastest tensor core alone. It may be won by the best memory orchestration system.


The thesis

The Alibaba Zhenwu M890 is not interesting merely as another domestic AI accelerator. It is interesting because the announcement centers on the same architectural reality that NVIDIA, hyperscalers, and ML infra teams are converging on: inference is becoming a memory, fabric, and scheduling problem.

Chips do not win inference alone. Systems do.

144 GB: HBM3 capacity

800 GB/s: claimed inter-chip bandwidth, not HBM bandwidth

25.6 Tbps: ICN Switch 1.0 fabric

128: accelerators per AL128 rack

The Published M890 Signal

The raw public specifications are useful, but the more important column is the systems implication. Each number points toward a design center where active memory state, chip-to-chip movement, and rack-scale inference become first-class constraints.

Item	Published / Claimed Detail	Why It Matters
Memory	144 GB HBM3	Large KV-cache residency, long-context inference, higher batch concurrency, and reduced pressure to spill active state.
Inter-chip	800 GB/s claimed inter-chip bandwidth	This should be read as accelerator-to-accelerator communication bandwidth, not internal HBM bandwidth. The internal HBM bandwidth was not fully disclosed in public coverage.
Fabric	ICN Switch 1.0, 25.6 Tbps, claimed <150 ns P2P latency	Shows the product is not only a chip. It is a rack-scale scale-up fabric strategy.
Rack	Panjiu AL128, 128 accelerators	The unit of competition is moving from chip to rack, pod, topology, and memory orchestration layer.
Precision	FP32 / FP16 / FP8 / FP4	FP4 support suggests optimization for inference efficiency, compressed compute, and future long-context serving economics.
MoE	Not fully detailed publicly	MoE routing at rack scale is one of the key workloads where fabric latency, expert placement, and memory locality become decisive.

Old Story vs. New Story

Old AI hardware story

CPU → GPU → Tensor Cores → Faster Training

The old story was compute-centric: more tensor throughput, denser matrix engines, and better utilization for large training runs.

New AI infrastructure story

KV Cache → HBM → Fabric → Runtime → Tokens/sec

The new story is memory-centric: where state lives, how it moves, and whether the runtime can hide movement behind useful work.

Make the KV-Cache Math Hurt

The easiest way to understand why 144 GB HBM and a 128-accelerator rack matter is to look at active inference state. Long-context inference is not just more compute; it is more resident memory.

Illustrative workload

Approximate KV-cache footprint for a Llama-3-70B-class model, assuming FP16 KV, 80 layers, hidden size around 8192, batch size 32, and 1M-token context.

Batch size: 32

Context: 1M tokens

Layers: 80

KV state: FP16

KV cache roughly scales as:

batch_size × sequence_length × layers × 2 (K and V) × hidden_dimension × bytes_per_element

= 32 × 1,000,000 × 80 × 2 × 8192 × 2 bytes ≈ 84 TB

That is the theoretical active KV state for this extreme batch/context combination. For scale, a full AL128 rack holds 128 × 144 GB ≈ 18.4 TB of HBM, so 84 TB would span at least five racks before weights are counted. Even reducing the example dramatically, KV state becomes a multi-terabyte systems problem very quickly: at 144 GB of HBM per accelerator, 4 TB of hot KV state already needs 28 chips for residency alone, before compute is even considered.

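Note: the 84 TB figure treats K and V as full hidden-width tensors. Llama-3-70B itself uses grouped-query attention with 8 KV heads of dimension 128, so its per-token KV width is 1024 rather than 8192 and the same workload comes to roughly 10.5 TB in FP16, or about 5 TB with FP8 KV. Short-form estimates online that assume smaller batches, fewer layers, or quantized KV land lower still. The point survives every variant: long-context batched inference rapidly turns memory residency into the bottleneck.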
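To make the arithmetic above reproducible, here is a minimal C sketch of the scaling formula. The names kv_shape, kv_cache_bytes, and chips_for_kv are illustrative, and two simplifications are assumptions: 144 GB is taken as 144 × 10^9 bytes, and the chip count covers KV residency only, ignoring weights, activations, and fragmentation.

```c
#include <stdint.h>

/* Illustrative workload shape; field names are assumptions for this sketch. */
struct kv_shape {
    uint64_t batch;         /* concurrent sequences */
    uint64_t seq_len;       /* tokens of context per sequence */
    uint64_t layers;        /* transformer layers */
    uint64_t kv_dim;        /* per-token width of K (and of V), in elements */
    uint64_t bits_per_elem; /* 16 for FP16, 8 for FP8, 4 for FP4 */
};

/* batch x seq_len x layers x 2 (K and V) x kv_dim x bytes per element */
double kv_cache_bytes(struct kv_shape s) {
    double elems = (double)s.batch * (double)s.seq_len * (double)s.layers
                 * 2.0 * (double)s.kv_dim;
    return elems * (double)s.bits_per_elem / 8.0;
}

/* Smallest accelerator count whose combined HBM can hold kv_bytes.
 * Residency only: weights and activations are deliberately ignored. */
uint64_t chips_for_kv(double kv_bytes, double hbm_bytes_per_chip) {
    uint64_t n = (uint64_t)(kv_bytes / hbm_bytes_per_chip);
    return ((double)n * hbm_bytes_per_chip < kv_bytes) ? n + 1 : n;
}
```

Called with the illustrative workload (batch 32, 1,000,000 tokens, 80 layers, kv_dim 8192, 16 bits), kv_cache_bytes returns about 8.39 × 10^13 bytes, and chips_for_kv reports 583 accelerators at 144 GB each. Dropping kv_dim to 1024, the grouped-query width Llama-3-70B actually uses, brings that to 73 accelerators, which is still most of a 128-accelerator rack spent on KV state alone.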

The Fabric May Be More Important Than the Chip

A standalone accelerator is no longer the full product. The full product is the accelerator plus fabric plus runtime plus compiler plus orchestration layer. Alibaba’s emphasis on ICN Switch 1.0 and the Panjiu AL128 rack suggests it understands that rack-scale inference is a distributed systems problem.

The new inference bottleneck chain

Request → KV in HBM → Fabric Move → Runtime Schedule → Next Token

If the claimed sub-150 ns peer-to-peer latency is real under meaningful load, it is aggressive. The critical question is whether that latency remains useful under 64-chip or 128-chip collective traffic, expert routing, KV migration, and concurrent multi-tenant inference.
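One way to see why point-to-point latency does not translate directly into collective latency is the standard alpha-beta cost model. The sketch below applies it to a ring all-gather, which is only one textbook algorithm and not something Alibaba has said ICN uses. The function name and every parameter value discussed afterwards are assumptions: the claimed 150 ns is treated as a per-hop cost and the claimed 800 GB/s as usable per-link bandwidth, with no contention.

```c
/* Alpha-beta estimate for a ring all-gather across `ranks` accelerators.
 * Each of the (ranks - 1) steps forwards one shard to the next rank, so the
 * per-hop latency is paid (ranks - 1) times rather than once. */
double ring_allgather_seconds(unsigned ranks, double shard_bytes,
                              double hop_latency_s, double link_bytes_per_s) {
    if (ranks < 2) {
        return 0.0; /* nothing to exchange */
    }
    double per_step = hop_latency_s + shard_bytes / link_bytes_per_s;
    return (double)(ranks - 1) * per_step;
}
```

With 128 ranks, 64 KiB shards, 150 ns per hop, and 800 GB/s per link, the model gives roughly 29 µs, of which about 19 µs (127 × 150 ns) is pure hop latency. Under those idealized assumptions a sub-150 ns link already becomes a collective measured in tens of microseconds, which is why behavior under contention matters more than the headline figure.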

How to Think About It vs. Known Systems

The right comparison is not simply “M890 vs GPU.” The better comparison is system strategy: NVIDIA-style rack-scale coherence and software maturity versus Alibaba’s domestic memory-fabric stack.

System	Memory / Fabric Emphasis	Likely Strength	Open Question
NVIDIA GB200 / NVL-class systems	Very high bandwidth NVLink/NVSwitch scale-up, mature CUDA/NCCL/TensorRT ecosystem, Grace CPU integration in some configurations.	Software depth, collectives, topology-aware scheduling, mature model serving stack.	Cost, supply constraints, export restrictions, and lock-in.
AMD MI300 / MI400-class roadmap	Large HBM pools, chiplet strategy, ROCm growth, strong memory capacity positioning.	Open ecosystem appeal and strong memory-centric accelerator roadmap.	Software maturity versus NVIDIA and rack-scale collective story.
Alibaba Zhenwu M890 + AL128	144 GB HBM3, ICN Switch 1.0, 25.6 Tbps fabric, 128-accelerator rack positioning.	Domestic cloud integration, memory-heavy inference positioning, sovereign stack control.	Exact compute throughput, HBM bandwidth, software maturity, ICN openness, and real workload benchmarks.

The Software Stack Is Still the Real Moat

Silicon gets attention, but the hard part is making the system useful under real workloads. The runtime has to make memory movement predictable, hide latency, and coordinate many accelerators without drowning in synchronization.

What the stack must solve

Graph capture for repeated inference patterns.
Prefill/decode disaggregation and scheduling.
KV-cache placement, reuse, compression, and eviction.
Topology-aware allreduce, allgather, and expert routing.
Compiler lowering into deterministic memory movement plans.
Profiling tools that expose fabric stalls and memory residency misses.
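The KV-cache eviction item in the list above can be illustrated with a least-recently-used victim picker. LRU is a common baseline chosen here for clarity, not a policy Alibaba has described, and the kv_block layout and pick_eviction_victim name are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one HBM-resident KV block. */
struct kv_block {
    uint64_t bytes;          /* size of the block in HBM */
    uint64_t last_used_step; /* decode step at which the block was last read */
    int pinned;              /* nonzero while an in-flight request needs it */
};

/* Returns the index of the least-recently-used unpinned block,
 * or -1 when every block is pinned and nothing can be spilled. */
int pick_eviction_victim(const struct kv_block *blocks, size_t n) {
    int victim = -1;
    for (size_t i = 0; i < n; i++) {
        if (blocks[i].pinned) {
            continue; /* never evict state an active request depends on */
        }
        if (victim < 0 ||
            blocks[i].last_used_step < blocks[(size_t)victim].last_used_step) {
            victim = (int)i;
        }
    }
    return victim;
}
```

A production runtime would likely weigh more than recency, for example the fabric cost of fetching the block back or whether a shared prefix makes it reusable across requests. The sketch only shows where such a decision sits.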

Conceptual runtime object

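/* Conceptual sketch: every name below is hypothetical, not a published Alibaba API. */
#include <stdint.h>

enum precision { KV_FP16, KV_FP8, KV_FP4, KV_QUANTIZED };
enum phase { PHASE_PREFILL, PHASE_DECODE, PHASE_AGENT_LOOP };
enum collective { ALL_GATHER_KV };
enum { HBM_LOCAL = 1, FABRIC_REPLICATED = 2, TOPOLOGY_AWARE = 4, LATENCY_SENSITIVE = 8 };
struct token_range { uint64_t first, last; };

struct m890_fabric_context {
    uint32_t node_count;      // AL128 rack-level target
    uint32_t native_group;    // possible 64-node native switch boundary
    uint64_t kv_bytes_hot;    // KV state the decode loop is actively reading
    uint64_t hbm_bytes_free;  // HBM still available on this accelerator
    enum precision kv_format; // fp16, fp8, fp4, quantized
    enum phase phase;         // prefill, decode, agent_loop
};

// Prototypes only; a real runtime would supply the implementations.
int m890_schedule_kv_region(struct m890_fabric_context *ctx, struct token_range range, unsigned flags);
int m890_collective_op(struct m890_fabric_context *ctx, enum collective op, unsigned flags);

void place_then_gather(struct m890_fabric_context *ctx, struct token_range token_range) {
    m890_schedule_kv_region(ctx, token_range,
                            HBM_LOCAL | FABRIC_REPLICATED);
    m890_collective_op(ctx, ALL_GATHER_KV,
                       TOPOLOGY_AWARE | LATENCY_SENSITIVE);
}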

The Compiler Becomes a Memory Traffic Optimizer

As inference becomes more deterministic, runtime dynamism becomes expensive. Future systems increasingly benefit from compile-time or graph-time planning: where KV lives, when DMA occurs, what buffers are reused, and when data can be released.

Application / Agent Loop: requests, tools, retrieval, memory

Serving Runtime: batching, prefill/decode split, admission control

Graph Capture / Compiler: static windows, memory placement, lowering

DMA + Fabric Scheduler: KV migration, collectives, synchronization

HBM / SRAM / CXL / DRAM: resident state and spill tiers

Rack Fabric: ICN, NVLink-like fabrics, RDMA, optics

That is the deeper architectural shift: the compiler is no longer only a compute scheduler. It becomes a memory traffic optimizer.
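A small example of what graph-time memory planning means in practice: if the compiler knows the first and last step at which each buffer is touched, it can compute the peak footprint ahead of time and release buffers deterministically. The planned_buffer struct and peak_live_bytes function below are hypothetical names for a sketch of that idea, not part of any published toolchain.

```c
#include <stddef.h>
#include <stdint.h>

/* Lifetime of one buffer in a captured graph (illustrative layout). */
struct planned_buffer {
    uint64_t bytes;
    uint32_t first_step; /* first graph step that touches the buffer */
    uint32_t last_step;  /* last step that touches it; releasable afterwards */
};

/* Peak bytes live at any single step, assuming every buffer is released
 * immediately after its last_step. Steps are numbered 0 .. steps - 1. */
uint64_t peak_live_bytes(const struct planned_buffer *bufs, size_t n,
                         uint32_t steps) {
    uint64_t peak = 0;
    for (uint32_t s = 0; s < steps; s++) {
        uint64_t live = 0;
        for (size_t i = 0; i < n; i++) {
            if (bufs[i].first_step <= s && s <= bufs[i].last_step) {
                live += bufs[i].bytes;
            }
        }
        if (live > peak) {
            peak = live;
        }
    }
    return peak;
}
```

Three buffers of 100, 200, and 50 bytes with lifetimes [0,1], [1,2], and [3,3] peak at 300 bytes rather than the 350 a naive allocate-everything plan would reserve. The same reasoning, applied to KV regions and DMA staging buffers, is what lets a planner decide placement before the first token is generated.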

What to Ask Alibaba

The announcement is directionally important, but the unanswered details determine whether this becomes an ecosystem or a high-performance island.

What exactly is ICN? Is it custom, RDMA-like, CXL-inspired, Ethernet-compatible, optical-ready, or a proprietary scale-up island?

What is the real HBM bandwidth? Public coverage highlights inter-chip bandwidth. Internal HBM bandwidth matters enormously for decode throughput.

Does <150 ns hold under load? Point-to-point latency is one thing; 128-chip allreduce and MoE routing under contention are another.

Where is the software? The decisive layer is compiler/runtime support for KV reuse, prefill/decode disaggregation, graph capture, and fabric-aware scheduling.

What is the native topology boundary? If the native non-blocking group is smaller than the 128-accelerator rack, hierarchical scheduling becomes critical.

What workloads are optimized? Training, dense inference, MoE inference, long-context serving, and agent loops stress different parts of the system.
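Returning to the HBM bandwidth question above, a rough memory-bound ceiling shows why the undisclosed figure matters. In decode, each step has to stream the resident weights and the hot KV state out of HBM at least once, so bandwidth divided by bytes per step caps the step rate. The function name is hypothetical, and any bandwidth value passed in is a placeholder, since the M890's HBM bandwidth has not been published.

```c
/* Upper bound on decode steps per second for one accelerator, assuming the
 * step is purely HBM-bandwidth-bound and reads weights plus hot KV once.
 * hbm_bytes_per_s is a caller-supplied placeholder, not an M890 spec. */
double decode_steps_per_second_bound(double weight_bytes, double kv_bytes,
                                     double hbm_bytes_per_s) {
    double bytes_per_step = weight_bytes + kv_bytes;
    if (bytes_per_step <= 0.0) {
        return 0.0; /* no resident state, so the bound is not meaningful */
    }
    return hbm_bytes_per_s / bytes_per_step;
}
```

As a purely hypothetical illustration, 35 GB of resident weights plus 15 GB of hot KV at a placeholder 1 TB/s caps out at 20 decode steps per second, each producing one token per sequence in the batch. Doubling the bandwidth doubles that ceiling, which is why the missing number carries more weight for decode than any TFLOPS figure.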

Market Context: Sovereign AI Infrastructure

The M890 also fits a broader market reality. Chinese cloud providers need believable domestic alternatives to constrained global accelerator supply, especially Hopper-generation systems impacted by export controls. That does not make the software problem disappear, but it gives Alibaba a strong reason to build the full stack instead of only the chip.

The strategic product is not the accelerator. It is the accelerator, fabric, runtime, compiler, and cloud deployment model as one integrated system.

Final Takeaway

The Zhenwu M890 announcement is valuable because it tells us what to look at. The meaningful numbers are not just performance claims. They are memory capacity, fabric bandwidth, rack topology, and orchestration capability.

From compute-centric AI to memory-orchestrated infrastructure.

That is the real transition. The fastest tensor core still matters. But for long-context, agentic, multi-chip inference, the winning system may be the one that moves, places, reuses, and schedules memory better than everyone else.

Chips don’t win inference. Systems do.