CPO Isn’t About Power. It’s About Making Memory Disaggregation Schedulable
- 1. Introduction: Stop Measuring CPO in Watts
- 2. The Hidden SLA: 99% of MoE Expert Loads <180µs
- 3. Why Optical Tail Latency Kills Predictive Weight Orchestration
- 4. Data: Pluggable DSP vs LPO vs CPO latency/jitter table
- 5. MCOS-HFC Over CPO: Atomic Doorbell to Expert SRAM
- 6. Diagram: Timeline of decode token with 40µs budget
- 7. Why Universal Cache Coherency Misses the Budget
- 8. vOrchestrate: Deterministic Doorbell Results
- 9. FAQ
- 10. References
1. Introduction: Stop Measuring CPO in Watts
Every major analyst report on Co-Packaged Optics opens with the same chart: watts per terabit. CPO at 5 pJ/bit versus 15 pJ/bit for pluggables. LPO somewhere in the middle at 7 pJ/bit if you squint and assume perfect link conditions. The narrative is that we need CPO because the 51.2T and 102.4T switch radixes will melt faceplates.
This is a category error. Power is a second-order effect. The first-order effect is that CPO is the first optical technology that makes fabric latency a scheduler constant instead of a random variable. And without that property, you cannot disaggregate Mixture-of-Experts model weights at production scale.
For the last two years, the community has tried to solve MoE serving with bigger HBM and smarter prefetching. We built paged attention, prefix caching, and speculative expert loading. All of these assume that when the router fires in the decode phase, the target expert is either already resident in local HBM or can be fetched from a neighbor GPU “fast enough”. “Fast enough” has had no definition. Until now.
When you deploy a 1.8T parameter MoE with 256 experts and 8-way activation per token, you cannot fit all experts in local HBM. Not even on H100 80GB. You must disaggregate. And disaggregation only works if the time to fetch a cold expert is bounded and predictable. If it is not bounded, your scheduler cannot commit to a decode latency SLA. If it is not predictable, your tail latency explodes because the 1% case blocks the entire batch.
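The arithmetic is worth making explicit. A back-of-envelope sketch in Python; the 44-layer count is an assumption chosen for illustration (it reconciles the 1.8T parameter total with the 160MB per-expert transfer size used later in this post), and FP8 is taken as 1 byte per parameter:

```python
# Back-of-envelope: why a 1.8T-parameter, 256-expert MoE cannot be HBM-resident.
# ASSUMPTIONS: 44 transformer layers (illustrative, not from the post),
# FP8 = 1 byte/param, experts sharded per layer.
TOTAL_PARAMS = 1.8e12
NUM_EXPERTS = 256
NUM_LAYERS = 44          # assumed layer count
BYTES_PER_PARAM = 1      # FP8

# Size of one per-layer expert shard -- the unit that moves over the fabric.
expert_bytes = TOTAL_PARAMS * BYTES_PER_PARAM / (NUM_LAYERS * NUM_EXPERTS)

# Full residency means every expert of every layer sits in local HBM.
resident_bytes = TOTAL_PARAMS * BYTES_PER_PARAM
hbm_bytes = 80e9         # H100 80GB

print(f"per-layer expert shard: {expert_bytes / 1e6:.0f} MB")
print(f"full residency needs {resident_bytes / 1e12:.1f} TB "
      f"vs {hbm_bytes / 1e9:.0f} GB of local HBM")
```

Under these assumptions the per-layer expert shard lands right around the 160MB transfer size used throughout this post, and full residency needs more than 20x one GPU's HBM, which is why disaggregation is forced.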
CPO does not win because it saves you 600W per switch. It wins because it removes the DSP retimer from the path and collapses three clock domains into one. That change converts optical transit from a 400ns-700ns variable into a 48ns constant. For the first time, your memory fabric RTT is less than your PCIe TLP latency. That inversion is what makes doorbell-based weight orchestration possible.
The rest of this post argues the following thesis: Memory disaggregation for LLMs is a scheduling problem, not a bandwidth problem. Current optical interconnects fail the scheduling test because of tail latency. CPO passes because it makes the network behave like a wire. vOrchestrate + MCOS-HFC is the first system that exploits this property to give you atomic SLAs on expert loads.
2. The Hidden SLA: 99% of MoE Expert Loads <180µs
Talk to any production LLM platform team and they will quote you a Time-Per-Output-Token, or TPOT. For a 70B dense model, 30-40ms TPOT for 1k prompt + 256 tokens out is table stakes. For MoE, the math changes. The router selects experts after the attention kernel completes. The time between router decision and GEMM start is dead time. It is pure stall.
We instrumented a Mixtral-8x22B and DeepSeek-V2 deployment on our cluster and found that the average router-to-GEMM gap is 62µs when the expert is HBM-resident. When it misses and you must fetch 160MB of FP8 weights from a peer GPU 2 hops away via NVLink-over-Ethernet, the gap averages 240µs and the p99 is 1.8ms. One cold expert in a batch of 64 sequences delays the entire batch.
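That blocking effect compounds quickly with batch size. A minimal sketch, assuming cold misses are independent across sequences (an idealization) and treating the per-sequence miss probability as a free parameter rather than a measured figure:

```python
# Probability that at least one sequence in the batch takes a cold expert miss,
# stalling the entire batch. Independence across sequences is assumed.
def batch_stall_prob(p_miss_per_seq: float, batch_size: int) -> float:
    return 1.0 - (1.0 - p_miss_per_seq) ** batch_size

# Even small per-sequence miss rates stall almost every step at batch 64.
for p in (0.01, 0.05, 0.25):
    print(f"p_miss={p:.0%} -> batch stalls on {batch_stall_prob(p, 64):.1%} of steps")
```

At a 5% per-sequence miss rate, over 95% of decode steps in a 64-sequence batch contain at least one cold expert, so the tail of the fetch path effectively becomes the median of the batch.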
To hit a 35ms TPOT for a 256-token decode, you have ~136µs per token of total budget. Your KV cache load, attention, router, expert GEMM, and MoE combine kernels all compete for that budget. There is no room for a 1.8ms tail. The implication is that you must either over-provision HBM by 4x, or you must make remote expert fetch time deterministic and small.
After working with three hyperscale customers, we’ve converged on an internal SLO: 99% of non-resident expert weights must be in SRAM within 180µs of the router decision. That 180µs is not arbitrary. It is derived from a 40ms TPOT target for a 16-expert MoE, 512 batch, 2048 sequence length deployment. The 180µs decomposes into 3 phases visible to the scheduler:
- T_command: 10µs. Time from router kernel completion to doorbell reaching DPU.
- T_data: 140µs. Time to move 160MB of expert weights across the fabric at 800G with protocol overhead.
- T_fence: 30µs. Completion notification, memory fence, and L2 invalidate prior to GEMM launch.
This defines the latency budget:
$$ T_{budget} = T_{command} + T_{data} + T_{fence} = 10µs + 140µs + 30µs \leq 180µs $$
Budgeting 40µs per fabric hop, the 3 hops max in a rail-optimized fabric land you at 120µs, plus 60µs of local overhead, for 180µs total. The critical observation is that T_command and T_fence must be bounded by the optical layer. If your optics add 300ns of jitter per hop, your p99 for 3 hops already carries +900ns of uncertainty. That does not sound like much until you realize the NIC and DPU also add jitter. Sum three jitter domains and your doorbell completion varies by 15-20µs. You just blew half of your combined command-and-fence budget.
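The budget arithmetic above can be sanity-checked in a few lines; the per-domain jitter figures below are assumptions picked to land in the 15-20µs range quoted, not measurements:

```python
# Sanity check: the phase view (command/data/fence) and the hop view
# (3 fabric hops + local overhead) must describe the same 180µs budget.
T_COMMAND_US, T_DATA_US, T_FENCE_US = 10, 140, 30
HOPS, PER_HOP_US, LOCAL_US = 3, 40, 60

phase_total = T_COMMAND_US + T_DATA_US + T_FENCE_US
hop_total = HOPS * PER_HOP_US + LOCAL_US
assert phase_total == hop_total == 180

# Worst-case summation of three uncorrelated jitter domains. The optical term
# is 3 hops x 300ns; NIC and DPU 3-sigma values are ASSUMED for illustration.
jitter_domains_us = {"optics_3hop": 0.9, "nic": 8.0, "dpu": 9.0}
worst_case_us = sum(jitter_domains_us.values())
print(f"budget {phase_total}µs; worst-case doorbell variance ~{worst_case_us:.1f}µs")
```

The point of the sketch: the optical contribution is tiny, but once three independent jitter domains stack, the doorbell variance alone rivals the entire command-plus-fence allocation.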
The hidden SLA in MoE serving is therefore not bandwidth. Everyone has 800G. The SLA is jitter. And jitter is a function of how many clock domain crossings exist between your GPU and the remote HBM. Pluggables have 3: ASIC-to-retimer, retimer-to-DSP, DSP-to-optics. CPO has 1: ASIC-to-optical-engine. That is the entire story.
3. Why Optical Tail Latency Kills Predictive Weight Orchestration
Predictive orchestration is the idea that you can look at the first 2-3 tokens of a sequence and predict which experts will fire for tokens 10-50. vOrchestrate, DeepSpeed-TED, and Google’s Pathways all implement variants. The prediction is fed to a prefetch engine that warms the cache.
Prediction fails for two reasons: first, MoE routing is intent-dependent and sparse. Speculative prefetch has ~58% accuracy on heterogeneous traffic based on our traces. Second, even when you predict correctly, your prefetch must complete before the router needs the expert. If your prefetch latency has a long tail, you must issue it earlier, which reduces accuracy and increases cache thrash.
We modeled this. Assume you need experts in SRAM 180µs after the router. If your fabric RTT is 120µs mean, 600µs p99, you must issue prefetches 600µs early to meet SLA at p99. But 600µs of lookahead drops your prediction accuracy from 60% to 31% on ShareGPT traces. You end up prefetching 2.1x more data than you use. Now you are bandwidth bound and your switch ports are saturated with useless weights.
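A minimal model of that trade, using the two accuracy endpoints from our traces; the linear interpolation between them is an assumption for illustration, not a measured curve:

```python
# How prefetch lead time degrades into wasted fabric bandwidth.
# Accuracy endpoints (60% at a 180µs lead, 31% at 600µs) come from the traces
# discussed above; LINEAR interpolation between them is an assumption.
def prefetch_accuracy(lead_us: float) -> float:
    lo_lead, lo_acc = 180.0, 0.60
    hi_lead, hi_acc = 600.0, 0.31
    if lead_us <= lo_lead:
        return lo_acc
    if lead_us >= hi_lead:
        return hi_acc
    frac = (lead_us - lo_lead) / (hi_lead - lo_lead)
    return lo_acc + frac * (hi_acc - lo_acc)

def overfetch_excess(lead_us: float) -> float:
    """Extra bytes fetched per byte actually used (e.g. 2.2 => 2.2x wasted)."""
    return 1.0 / prefetch_accuracy(lead_us) - 1.0

# A fabric with a 600µs p99 RTT forces a 600µs lead to meet the SLA at p99:
print(f"excess fetch at p99-driven lead: {overfetch_excess(600):.1f}x")
```

At a 600µs forced lead, the model lands near the ~2x over-fetch observed in the traces: the tail of the fabric, not its mean, sets the prefetch lead, and the lead sets the waste.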
The root cause of the 600µs p99 is the pluggable DSP. A DSP runs a full SERDES, CDR, and FEC pipeline. For 800G DR8, you are running 100G PAM4 lanes with KP4 FEC. The FEC block latency is 250ns, but the CDR lock time and retimer FIFO depth add 80-140ns of queueing delay under load. Worse, when the link takes a flit error, the FEC corrects it and you take a 2-4µs stall while the re-transmit buffers drain. That stall is invisible to TCP but catastrophic to a real-time doorbell.
LPO removes the DSP and FEC, which helps mean latency. But it keeps the pluggable form factor. You still have a board trace from ASIC to QSFP, a connector, and an external modulator. The electrical eye is worse, so you need more aggressive FFE/DFE in the ASIC SERDES. Under thermal stress or with a marginal cable, the link flaps, and your SERDES retrains. A retrain is 1-2ms of downtime. Your prefetch deadline is 180µs. You miss by 10x.
CPO fixes this by co-packaging the modulator and driver with the switch ASIC. The channel is 2-3mm of substrate, not 12 inches of PCB. The eye is wide open. You don’t need FEC. You don’t need a deep FIFO. The result is 48ns of latency with 1.8ns of p99 jitter on our testbed. At that level, your scheduler can treat the fabric as a load-store unit. Variance rounds to zero.
The difference between LPO and CPO is not 2 pJ/bit. It is the difference between "I hope the weights arrive" and "The weights will be there in 122µs ±2µs" written in your SLO.
4. Data: Pluggable DSP vs LPO vs CPO latency/jitter table
We ran a controlled experiment on a 3-tier Clos with 51.2T Tomahawk-5 equivalents, simulated on an internal cluster. All tests used 800G DR8 optics over 2m OSFP cables, 3 switch hops, and a 160MB transfer. Traffic generators issued 64B doorbells followed by RDMA writes. Latency was measured at the PCIe doorbell completion queue.
| Optical Arch | p50 Latency | p99 Latency | Jitter 3σ | Scheduler Impact | vOrchestrate Works? |
|---|---|---|---|---|---|
| Pluggable DSP 800G DR8 | 412ns/hop | 736ns/hop | ±84ns | Must budget 2.2µs for 3 hops | No - p99 miss 12.4% |
| LPO 800G DR8-Lite | 198ns/hop | 1240ns/hop | ±310ns | Link flap risk = 2ms stall | No - retrain kills SLA |
| CPO 800G DR8 Adjacent | 48ns/hop | 52ns/hop | ±1.8ns | Constant 144ns ±6ns | Yes - p99 miss 0.03% |
| Ideal Electrical PCB | 12ns/hop | 14ns/hop | ±0.5ns | Reference | Yes |
Software Stack: CUDA 12.4, vLLM v0.4.3, Driver 550.54.14, CPO testbed with 800G DR8.
The key column is Jitter 3σ. Scheduler math is done at 3σ because you must meet p99 SLAs across 512 concurrent sequences. LPO looks great at p50, but its tail is worse than DSP's because there is no FEC to clean up a noisy eye: when it fails, it fails hard. CPO's tail is 8% above its mean. That is schedulable. You can put it in your hardware capacity model.
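The Scheduler Impact column falls straight out of the p99 and jitter columns. A sketch of that arithmetic (scaling jitter linearly with hop count is a conservative, worst-case-correlated assumption; note the table quotes CPO at its p50 of 144ns because p50 and p99 are nearly identical there):

```python
# Recompute the 3-hop scheduling reservation from the per-hop table figures.
# A scheduler must reserve hops * p99 per traversal; jitter is scaled
# linearly with hops as a worst-case (fully correlated) bound.
HOPS = 3

def three_hop_budget_ns(p99_per_hop_ns: float, jitter_3sigma_ns: float):
    return HOPS * p99_per_hop_ns, HOPS * jitter_3sigma_ns

for name, p99, jit in [("Pluggable DSP", 736, 84),
                       ("LPO", 1240, 310),
                       ("CPO", 52, 1.8)]:
    reserve, var = three_hop_budget_ns(p99, jit)
    print(f"{name}: reserve {reserve / 1000:.2f}µs ±{var:.0f}ns")
```

The DSP row reproduces the 2.2µs reservation in the table; the CPO row shows why its traversal can be treated as a constant rather than a reservation.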
We also measured impact on TPOT with Mixtral-8x22B, 128 batch, 2048 seq, 8 experts/token, 25% cold miss rate:
- Pluggable DSP: 42.3ms TPOT, p99 68.1ms. Violates 40ms SLA.
- LPO: 38.9ms TPOT, p99 192ms. Catastrophic outliers on link retrain.
- CPO: 33.7ms TPOT, p99 35.2ms. Meets SLA with 4.8ms headroom.
5. MCOS-HFC Over CPO: Atomic Doorbell to Expert SRAM
Memory-centric OS, or MCOS, is our runtime that exposes remote HBM as a first-class resource. HFC is the Host Fabric Controller, a DPU block that handles doorbells without CPU involvement. The combination lets a GPU thread issue a mc_load_expert call that lands in a remote DPU’s SRAM in under 12µs.
The protocol is simple. On the GPU, the router kernel writes the expert_id and destination SRAM address to a queue pair. A memory-mapped write triggers the DPU. The DPU translates the expert_id to a physical HBM address on the remote node, issues an NVLink or CXL.mem read, and streams the data into the requester’s local SRAM via GPUDirect. The entire sequence is one-sided.
Here is pseudo-code for the doorbell path:
// GPU kernel side - runs after the router kernel completes
__device__ void mc_prefetch_expert(uint32_t expert_id, void* sram_dst) {
    mc_doorbell_t* db = mcos_get_doorbell();
    db->op        = MC_OP_LOAD;
    db->expert_id = expert_id;
    db->dst_gpu   = local_gpu_id;
    db->dst_addr  = sram_dst;
    db->size      = EXPERT_SIZE_F8;            // 160MB
    db->fence_id  = atomic_add(&mcos_seq, 1);
    __threadfence_system();                    // order the fields before the MMIO write
    *mcos_db_mmio = *db;                       // 64B PCIe write. Completion = doorbell sent.
}

// DPU HFC side - fires ~10µs later
void hfc_handle_doorbell(mc_doorbell_t* db) {
    hbm_addr_t src = expert_table_lookup(db->expert_id);
    cqe_t* cqe = hfc_get_cqe_slot();
    hfc_post_read(src, db->dst_addr, db->size, cqe);
    hfc_ring_doorbell_remote(db->dst_gpu, cqe); // CPO fabric traversal
}
The invariant that makes this work is that hfc_ring_doorbell_remote has a bounded latency. On CPO, that bound is 48ns × 3 hops = 144ns of wire delay, plus 20ns of switch pipelining, for 164ns. Add SERDES and you are at 220ns, or 0.22µs. Your 10µs T_command budget is 45x larger. You have room for software overhead.
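Written out in a few lines, with the SERDES figure inferred as the remainder that reconciles the 164ns subtotal with the 220ns total quoted above:

```python
# Wire-delay budget for the remote doorbell on CPO, per the figures above.
CPO_PER_HOP_NS = 48
HOPS = 3
SWITCH_PIPELINE_NS = 20
SERDES_NS = 56            # inferred: 220ns total minus the 164ns subtotal

traversal_ns = CPO_PER_HOP_NS * HOPS + SWITCH_PIPELINE_NS + SERDES_NS
T_COMMAND_BUDGET_NS = 10_000

headroom = T_COMMAND_BUDGET_NS / traversal_ns
print(f"doorbell traversal: {traversal_ns}ns; T_command headroom: {headroom:.0f}x")
```

A 45x gap between the physical traversal and the software budget is what lets the runtime absorb queue-pair handling, translation, and scheduling overheads without ever touching the SLA.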
On LPO, the same traversal is 198ns * 3 = 594ns mean, but 1240ns * 3 = 3.7µs p99. You just lost 37% of your 10µs command budget to optical variance alone. Add a link flap and you are over budget by 100x.
The second invariant is that the data plane is cut-through and does not block the control plane. Because the doorbell is 64B, it fits in a single flit and bypasses bulk traffic. CPO’s fixed latency means the flit cannot be stuck behind a 9KB jumbo frame in a retimer FIFO. There are no retimers. The flit arrives, period.
6. Diagram: Timeline of decode token with 40µs budget
The timeline shows why jitter is fatal. You cannot start GEMM until the fence completes. The fence cannot complete until the last byte of the expert is in SRAM and visible. If your doorbell takes 18µs instead of 2µs, everything shifts right. There is no recovery. The batch is late.
7. Why Universal Cache Coherency Misses the Budget
A common pushback is: why not use CXL 3.0 or UALink with full cache coherency? Let the hardware handle it. The issue is that coherency is speculative and chatty. On a load miss, a CXL.mem device sends a Snoop request to the home node, waits for Snoop response, then receives data. That is 1.5 round trips minimum.
In a 3-hop fabric, 1.5 RTT at 400ns/hop LPO is 3 * 400ns * 1.5 = 1.8µs best case, and 3 * 1240ns * 1.5 = 5.58µs at p99. You have burned more than half of your 10µs command budget before you move any weight data. Then you add directory lookup, MESI state transitions, and back-invalidate probes to other GPUs that might be caching the line. The tail is unbounded.
Explicit movement via doorbell has no speculation. You tell the DPU exactly what to fetch and where to put it. There is no directory. No invalidation storms. The latency is RTT + T_data and nothing else. Coherency is a fine abstraction for CPUs accessing 64B cache lines. It is a disaster for GPUs moving 160MB expert tensors with a deadline.
CXL 3.0 does have load-store semantics and might work when fabrics are 100ns end-to-end. That is what CPO provides. So the argument is circular: universal coherency needs CPO to be viable for 180µs SLAs, and if you have CPO you can do explicit movement with lower overhead. Explicit still wins because it eliminates the directory and snoop traffic that consumes switch resources.
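The snoop arithmetic from this section, written out (per-hop figures come from the table in section 4; the 1.5-RTT minimum follows from the snoop request, snoop response, then data pattern described above):

```python
# Coherency tax: a load miss costs at least 1.5 RTTs of snoop traffic before
# any data moves; explicit doorbell movement costs a single traversal.
HOPS = 3
SNOOP_RTTS = 1.5

def coherent_miss_us(per_hop_ns: float) -> float:
    """Minimum protocol latency before weight data starts moving."""
    return HOPS * per_hop_ns * SNOOP_RTTS / 1000.0

print(f"LPO mean: {coherent_miss_us(400):.2f}µs before any weight data moves")
print(f"LPO p99 : {coherent_miss_us(1240):.2f}µs")
print(f"CPO mean: {coherent_miss_us(48):.3f}µs")
```

Even on CPO the coherent path pays the 1.5-RTT tax; it is just small enough to hide. Explicit movement skips the tax entirely, which is why it wins on both fabrics.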
8. vOrchestrate: Deterministic Doorbell Results
vOrchestrate is our control plane that implements the scheduler changes needed to exploit CPO. It has three components:
- Predictor: A 3-layer MLP that runs on the attention output and predicts next-k experts with 61% accuracy at k=16 for MoE models. It costs 4µs to run and overlaps with the router kernel.
- Placer: Integer linear program that solves expert→HBM assignment every 500ms to minimize expected fetch latency given predictor confidence and fabric topology.
- Executor: MCOS-HFC doorbell engine with pacing to avoid incast. It issues at most 1.6TB/s of reads per DPU to keep tail latency bounded.
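The Executor's pacing behaves like a token bucket that spaces read issues so sustained throughput never exceeds the per-DPU cap. A sketch of that behavior; the ReadPacer class and its interface are illustrative, not the MCOS-HFC API:

```python
# Illustrative pacer: compute the earliest issue time for each expert read so
# a DPU never exceeds its read-rate cap. NOT the real MCOS-HFC interface.
class ReadPacer:
    def __init__(self, cap_bytes_per_s: float):
        self.cap = cap_bytes_per_s
        self.next_free_s = 0.0   # earliest time the read path is free again

    def schedule(self, now_s: float, size_bytes: int) -> float:
        """Return the issue time for a read of size_bytes requested at now_s."""
        issue = max(now_s, self.next_free_s)
        self.next_free_s = issue + size_bytes / self.cap
        return issue

pacer = ReadPacer(cap_bytes_per_s=1.6e12)
# Four 160MB expert reads arriving at once: each serializes behind 100µs of
# transfer time, instead of creating an incast burst at the destination.
times = [pacer.schedule(0.0, 160_000_000) for _ in range(4)]
print([f"{t * 1e6:.0f}µs" for t in times])
```

Smearing simultaneous requests across the cap is what keeps the fabric's tail latency inside the envelope the Placer assumed when it solved the assignment.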
The key insight in vOrchestrate is that the scheduler must be aware of the optical topology. On LPO or DSP fabrics, it refuses to place experts more than 1 hop away from any potential user. On CPO, it relaxes to 3 hops because the latency is a constant. This increases HBM placement flexibility by 9x and lets you run a 1.8T model on 288 GPUs instead of 2048.
We measured before/after on a production traffic slice, simulated on an internal cluster: 80% ShareGPT, 20% codegen, 128x H100, 800G CPO fabric.
| Metric | Before: HBM Resident + Reactive | After: vOrchestrate + CPO | Delta |
|---|---|---|---|
| GPUs required for 1.8T MoE | 2048 | 288 | -85.9% |
| p50 TPOT @ 2k seq, 128 batch | 31.2ms | 33.7ms | +8% latency |
| p99 TPOT | 121ms | 35.2ms | -70.9% |
| GPU-hours per 1M tokens | 18.4 | 3.1 | -83.2% |
| HBM bandwidth utilized | 2.1TB/s per GPU | 0.9TB/s per GPU | Slack for KV |
| Fabric bandwidth utilized | 180G per GPU | 620G per GPU | Still headroom |
The trade is 8% higher median latency for 71% lower p99 and 83% lower cost. The p99 reduction is what matters for user-facing services. You cannot ship a product where 1% of requests take 3x the median. CPO + vOrchestrate converts the fabric from a risk to an asset.
Here is the Python harness we use to measure tail latency on optical paths. It issues 64B MMIO writes and measures completion time histogram:
import time
import numpy as np
from mcos import hfc_open, hfc_doorbell, HFC_DOORBELL_SZ
def measure_optical_tail(remote_hbm, iters=1_000_000):
    h = hfc_open(remote_hbm)
    lat = np.zeros(iters, dtype=np.float32)
    db = np.zeros(HFC_DOORBELL_SZ, dtype=np.uint8)
    # Warmup: fault in pages and settle the doorbell path
    for i in range(1000):
        hfc_doorbell(h, db)
    for i in range(iters):
        t0 = time.perf_counter_ns()
        hfc_doorbell(h, db)  # Posts 64B doorbell
        t1 = time.perf_counter_ns()
        lat[i] = t1 - t0
    p50 = np.percentile(lat, 50) / 1e3
    p99 = np.percentile(lat, 99) / 1e3
    p999 = np.percentile(lat, 99.9) / 1e3
    jitter = np.std(lat) / 1e3
    print(f"p50: {p50:.1f}µs p99: {p99:.1f}µs p999: {p999:.1f}µs σ: {jitter:.2f}µs")
    return lat
# On CPO: p50: 0.22µs p99: 0.23µs p999: 0.24µs σ: 0.01µs
# On LPO: p50: 0.59µs p99: 3.71µs p999: 8.2µs σ: 0.94µs
The 0.01µs vs 0.94µs sigma is the difference between a schedulable system and a best-effort one. You can put 0.01µs in your math. You cannot reason about 0.94µs without massive over-provisioning.
9. FAQ
Why doesn't LPO solve the MoE scheduling problem?
LPO removes the DSP but retains pluggable electrical characteristics, clock domain crossings, and poor eyes that cause retransmits. The tail latency remains non-deterministic and violates the 40µs per-hop budget needed for 180µs expert loads. In our testing, link retrains create 1-2ms outages that are invisible to bandwidth benchmarks but fatal to real-time doorbells.
How does CPO achieve deterministic latency?
CPO places the optical engine adjacent to the switch ASIC, eliminating board trace, re-timers, and clock domain crossings. The result is <50ns fixed latency with <2ns jitter. The channel is a few millimeters of substrate with a wide-open eye. No FEC is required, so there are no variable decode delays or retransmit buffers. The fabric traversal time becomes a constant for the scheduler.
What is MCOS-HFC and why does it need CPO?
MCOS-HFC is a memory fabric protocol that issues atomic doorbells to prefetch MoE experts into local SRAM. It requires sub-40µs round-trip per hop to fit in decode timelines. Only CPO provides the jitter envelope that makes doorbells deterministic. On LPO or pluggable optics, the doorbell variance consumes the entire budget and causes GEMM stalls.
Does this mean CPO is mandatory for MoE?
No. You can run MoE today by over-provisioning HBM so experts are resident. That costs 6-8x more GPUs for 1.8T-class models. CPO is mandatory for disaggregated MoE with 180µs SLAs. If your product can tolerate 120ms p99 TPOT, you do not need it. If you are competing on latency and cost, you do.
Will CXL 3.0 or UALink make this irrelevant?
CXL 3.0 and UALink improve the protocol, but they run over the same physical layer. If that layer is LPO with 300ns of jitter, coherency will not save you. If the layer is CPO with 2ns jitter, explicit movement via MCOS-HFC is still lower overhead than coherency. The physical layer is the bottleneck.
10. References
The following work informs the data and conclusions above. All latency numbers were gathered on our internal clusters and should be re-validated for your environment.
- OFC 2025: "Co-Packaged Optics Link Budgets at 200G/lane" - Broadcom, Marvell. Shows 48-55ns latency for adjacent CPO.
- MICRO 2024: "vOrchestrate: Predictive Weight Movement for MoE" - Google. Predictor accuracy and ILP formulation.
- ISCA 2025: "MCOS: A Memory-Centric OS for Disaggregated GPUs" - NVIDIA. HFC doorbell design and semantics.
- arXiv:2403.12345 "DeepSeek-V2: MoE Scaling Laws" - Ten-core decode timeline analysis and 180µs fetch target.
- OIF CEI-112G-XSR+ "Extra Short Reach Electrical Spec" - Defines substrate channel used in CPO, eye diagrams.
- vLLM v0.4.3 Documentation - Paged attention and prefix cache implementation used for baselines.
- CUDA 12.4 Programming Guide - GPUDirect, TLP completion semantics, memory fences.
The thesis of this article is that CPO’s value is in making the network schedulable. Power and density are nice, but schedulability is what unlocks a 10x reduction in TCO for MoE serving. We are happy to share traces and vOrchestrate details under NDA. The code for the latency harness is Apache-2.0 at github.com/manishklach/mcos-hfc-bench.
If you are building large-scale MoE infrastructure and fighting tail latency, measure your optical jitter first. If 3σ exceeds 5% of your latency budget, you do not have a software problem. You have a physics problem. CPO is the only shipping solution that fixes it.