1. Why FLOPS stopped being enough
The AI industry spent its first wave talking as if the main problem were arithmetic throughput. That made sense when the question was whether accelerators could do enough matrix math to keep frontier models training at all. But once clusters became large, multi-tiered, and communication-heavy, the bottleneck started moving outward.
The modern AI system is no longer one chip solving one problem. It is a choreography problem across accelerators, high-radix switches, retimers, optical modules, NICs, host memory, local storage, and remote racks. The system’s effective performance is increasingly controlled by how expensive it is to move activations, gradients, checkpoints, expert traffic, optimizer state, and inference context across the fabric.
2. What interconnect power density really means
“Interconnect power density” is not just watts per module. It is the compounded cost of pushing more and more communication through a fixed amount of electrical escape, board area, package margin, cooling headroom, and rack envelope.
Compute-centric thinking
- Add more accelerators
- Increase per-chip throughput
- Assume the network will scale behind it
Fabric-centric reality
- Every extra bit moved costs power twice: once to transmit it, and again to remove the heat that transmission generates
- Signal conditioning, retimers, DSPs, and packaging all accumulate
- Cooling, serviceability, and layout become first-order architectural constraints
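The "pay twice" point can be made concrete with a tiny sketch. The cooling overhead fraction below is an illustrative assumption, not a measured figure:

```python
# Sketch: how transport watts compound into rack-level watts.
# The 0.3 cooling overhead is a hypothetical assumption for illustration.

def delivered_cost_watts(transport_w: float, cooling_overhead: float = 0.3) -> float:
    """Every watt spent moving bits must also be removed as heat;
    cooling_overhead is the assumed extra cooling watts per transport watt."""
    return transport_w * (1.0 + cooling_overhead)

# A hypothetical link path burning 15 W in SerDes + retimers + module logic:
print(delivered_cost_watts(15.0))  # costs ~19.5 W at the rack, not 15 W
```

The point is not the specific fraction; it is that transport power never stays a transport-only line item.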
3. A practical power-budget decomposition
The easiest way to make this concrete is to stop thinking of “network power” as one line item. In real AI systems, the transport path often includes several distinct budgets that each rise under scale pressure.
Interconnect cost is a stack, not a single number
| Budget component | What it does | Why it grows | Architectural consequence |
|---|---|---|---|
| SerDes / PHY | Drives and receives high-speed electrical lanes | Higher lane rates and denser escape stress equalization and signal integrity | Package edges and local board design become harder |
| Retimers / signal conditioning | Recover margin on difficult electrical paths | Longer or noisier traces need more help | Extra power and extra heat for mere transport |
| DSP / optical module logic | Supports optical signaling and recovery | Longer reach and more complex modulation raise burden | Module-level watts become a systems concern |
| Switch silicon | Moves traffic through high-radix fabrics | More ports and more concurrent flows raise radix and scheduling pressure | Fabric design becomes topology-sensitive |
| Cooling overhead | Removes heat generated by all of the above | Heat is concentrated in dense zones, not spread evenly | Transport watts cascade into cooling watts |
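The stacked nature of the budget is easy to express numerically. The per-component watts below are hypothetical placeholders, chosen only to show how the components accumulate:

```python
# Sketch: interconnect power as a stack of budgets, not one line item.
# All per-component figures are illustrative assumptions.

link_budget_w = {
    "serdes_phy": 6.0,           # electrical lane drive and receive
    "retimers": 4.0,             # signal conditioning on hard paths
    "dsp_optical_module": 8.0,   # optical signaling and recovery
    "switch_share": 3.0,         # per-link share of switch silicon
}

transport_w = sum(link_budget_w.values())
cooling_w = 0.3 * transport_w    # assumed cooling overhead fraction
total_w = transport_w + cooling_w
print(f"transport={transport_w:.1f} W, total with cooling={total_w:.1f} W")
```

Each row of the table above contributes a term, and the cooling row multiplies all of them.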
4. The hidden topology tax of collectives
AI clusters do not move traffic uniformly. They produce synchronized bursts: all-reduce, all-gather, expert dispatch, parameter synchronization, checkpoint writes, and remote state fetches. Those patterns are topology-sensitive. A design that looks fine under averaged throughput can still perform badly if collective phases line up with the wrong fabric shape.
Collective phase
Many devices communicate in structured bursts, not random flows.
Topology stress
Hot links, oversubscribed stages, and switch contention appear unevenly.
Power and heat spike
Transport components light up precisely when the workload is most synchronized.
Effective compute loss
Accelerators stall waiting on movement rather than arithmetic.
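One way to see why bursts matter is the standard traffic count for a ring all-reduce, where every device moves roughly 2(N-1)/N of the tensor size, and every link carries that traffic in the same phase. The device count and gradient size below are hypothetical:

```python
# Sketch: why collectives stress links in synchronized bursts.
# A ring all-reduce moves 2*(N-1)/N of the tensor bytes per device,
# and all links carry that load at the same time.

def ring_allreduce_bytes_per_device(tensor_bytes: int, n_devices: int) -> float:
    return 2 * (n_devices - 1) / n_devices * tensor_bytes

n = 64
grad_bytes = 10 * 2**30  # hypothetical 10 GiB gradient exchange
per_dev = ring_allreduce_bytes_per_device(grad_bytes, n)
# All n links light up at once: peak fabric load and peak transport power
# arrive together, exactly when the workload is most synchronized.
print(f"{per_dev / 2**30:.2f} GiB per device, simultaneously on {n} links")
```

Averaged over a training step this traffic looks modest; concentrated into a collective phase, it is what heats the fabric.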
5. Order-of-magnitude energy view
Precise numbers vary by implementation, distance, packaging, modulation, and whether the path is package-level, board-level, rack-level, or row-level. The important point is directional: short electrical links can be efficient at tiny distances, but the energy and thermal cost of longer, denser, higher-speed electrical movement rises quickly enough that optics becomes attractive not just for bandwidth, but for power-density relief.
Illustrative energy-per-bit intuition
| Path type | Typical context | Directionally plausible pJ/bit range | System implication |
|---|---|---|---|
| Very short electrical | Package / board-adjacent | Low single digits to low tens | Still attractive when reach is tiny and packaging can tolerate it |
| Longer electrical with heavy conditioning | Board-to-board / dense rack escape | Tens and rising | Retimers, equalization, and thermals dominate the story |
| Optical for longer reach | Rack-to-rack / scale-out / reconfigurable paths | Often more favorable at system level than equivalent long electrical | The win is not just speed; it is lower power-density pain at useful reach |
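The conversion from energy-per-bit to sustained link power is simple and worth internalizing. The pJ/bit values below are illustrative picks from the directional ranges in the table, applied to a hypothetical 800 Gb/s path:

```python
# Sketch: converting pJ/bit into sustained link watts.
# pJ/bit figures are illustrative, not vendor specifications.

def link_watts(pj_per_bit: float, gbps: float) -> float:
    """Power = energy per bit x bits per second."""
    return pj_per_bit * 1e-12 * gbps * 1e9

print(link_watts(5.0, 800))   # short electrical: about 4 W
print(link_watts(30.0, 800))  # heavily conditioned electrical: about 24 W
```

At these rates, a handful of pJ/bit in either direction is the difference between a manageable module and a cooling problem.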
6. Board vs rack vs row regimes
Not all interconnect is the same problem. One reason discussions get muddy is that people collapse very different transport regimes into one word: network.
Board / package regime
- Main problem: electrical escape, signal integrity, local heat
- Main tools: short electrical, near-package optics, CPO, NPO
- Main question: how close can optics move toward the silicon?
Rack / row regime
- Main problem: cable density, reach, switch stages, topology oversubscription
- Main tools: pluggables, OCS, VCSEL arrays, rack-scale optics
- Main question: how much machine coherence can the fabric preserve?
7. Interconnect and the new memory wall
It would be a mistake to treat this essay as replacing the memory wall with a separate network story. The more interesting truth is that the two are converging.
In modern AI systems, memory hierarchy increasingly extends beyond local HBM. Once you start thinking in terms of pooled memory, remote memory, checkpoint tiers, model-state disaggregation, or distributed shared-memory semantics, the cost of movement across the fabric becomes part of the memory problem itself.
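The convergence can be sketched as an effective-energy calculation across tiers. Tier names, access fractions, and pJ/byte figures below are all hypothetical, chosen only to show how a small remote fraction shifts the mix:

```python
# Sketch: fabric movement as part of the memory problem.
# Tier energies (pJ/byte) and access fractions are hypothetical.

tiers = [
    ("local_hbm",     0.85, 5.0),    # (name, access fraction, pJ/byte)
    ("pooled_memory",  0.10, 60.0),
    ("remote_fabric",  0.05, 300.0),
]

effective_pj_per_byte = sum(frac * pj for _, frac, pj in tiers)
# Even a 5% remote fraction can dominate the energy mix.
print(f"{effective_pj_per_byte:.2f} pJ/byte effective")
```

In this toy mix, the 5% of accesses that cross the fabric contribute more energy than the 85% served locally, which is exactly why fabric cost is now a memory-hierarchy question.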
8. Where optical switching actually helps
Optical circuit switching is attractive because it changes the shape of the problem: instead of permanently paying for one static topology, the system can reconfigure fabric paths around communication phases.
Good fit for OCS
- Epoch-like communication phases
- Large synchronized tensor exchanges
- Workloads whose path demand can be forecast
Weak fit for OCS
- Highly random fine-grained traffic
- Flows too short to amortize the reconfiguration cost
- Topologies where static overprovisioning is cheaper than control complexity
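The fit question reduces to an amortization check: does the time saved on a better path exceed the cost of switching to it? The sketch below assumes a fixed reconfiguration time and a bandwidth speedup on the new path; both parameters are illustrative:

```python
# Sketch: when does an OCS reconfiguration pay off?
# Assumes a fixed reconfiguration time and a bandwidth speedup on
# the reconfigured path; both values are hypothetical.

def ocs_worthwhile(flow_seconds: float, reconfig_seconds: float,
                   speedup: float) -> bool:
    """Reconfigure only if time saved on the flow exceeds the switch cost."""
    time_saved = flow_seconds * (1 - 1 / speedup)
    return time_saved > reconfig_seconds

print(ocs_worthwhile(10.0, 0.1, 2.0))   # long, phase-like exchange: worth it
print(ocs_worthwhile(0.05, 0.1, 2.0))   # short random flow: not worth it
```

This is why epoch-like, forecastable phases sit on the "good fit" side: their durations dwarf the reconfiguration window.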
9. The scheduler must understand fabric
If the first post argued that the AI cluster operating system must understand light, this post explains why: because the network is becoming too expensive, thermally and electrically, to remain invisible to higher-level scheduling policy.
A serious future scheduler will not just place jobs on GPUs. It will reason about communication classes: which transfers are latency-critical, which ones are bursty but deferrable, which paths deserve optical reservation, and which flows can be degraded, delayed, compressed, or rerouted.
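A minimal sketch of that reasoning, with illustrative class names and policies that are assumptions rather than any real scheduler's API:

```python
# Sketch: a scheduler that reasons about communication classes.
# Class names and policy strings are hypothetical illustrations.

from enum import Enum

class TransferClass(Enum):
    LATENCY_CRITICAL = "latency_critical"    # e.g. expert dispatch
    BULK_COLLECTIVE = "bulk_collective"      # e.g. all-reduce phases
    BURSTY_DEFERRABLE = "bursty_deferrable"  # e.g. checkpoint writes

def route_policy(cls: TransferClass) -> str:
    policies = {
        TransferClass.LATENCY_CRITICAL: "pin to shortest path, never reroute",
        TransferClass.BULK_COLLECTIVE: "reserve optical circuit for the phase",
        TransferClass.BURSTY_DEFERRABLE: "delay or compress until off-peak",
    }
    return policies[cls]

print(route_policy(TransferClass.BULK_COLLECTIVE))
```

The interesting design work is not the lookup table; it is feeding these classes with fabric state, thermal headroom, and phase forecasts.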
10. What winners will optimize for
- Power per useful transferred bit, not just theoretical link efficiency
- Topology-aware scheduling, where fabric constraints shape placement and execution
- Transport-class awareness, distinguishing collectives, checkpointing, remote memory, and control traffic
- Serviceable optical architectures, especially where external laser sources preserve operability while deeper optical integration attacks the power-density problem
- Holistic cluster economics, where the network is judged by delivered model throughput, not isolated component metrics
© 2026 Manish KL