AI Infrastructure / Photonics Series / Essay 3 — Deepened Edition

The Real AI Bottleneck Is Moving From Compute to Interconnect Power Density

For years the conversation was about FLOPS. Then it became about memory bandwidth. Now the harder wall is not raw transport speed but the full-stack cost of moving bits: SerDes power, retimers, DSP burden, switch radix pressure, optical engines, cooling overhead, routing complexity, and the topology tax of making many accelerators behave like one machine.

The future of AI infrastructure will not be won by the chip with the biggest headline throughput alone. It will be won by the system that can move data across a cluster without turning interconnect into the dominant power, heat, packaging, and scheduling problem.

Theme: power per useful bit. Focus: scale-up + scale-out fabrics. Bridge: memory scheduling → optical scheduling.

Yesterday: More accelerators, faster SerDes, bigger training runs.
Today: Link power, retimers, DSPs, thermals, cabling, and topology overhead.
Tomorrow: Fabric becomes an allocatable system resource, not background plumbing.
Real test: Can the machine move tensors cheaply enough to preserve effective compute?

1. Why FLOPS stopped being enough

The AI industry spent its first wave talking as if the main problem were arithmetic throughput. That made sense when the question was whether accelerators could do enough matrix math to keep frontier models training at all. But once clusters became large, multi-tiered, and communication-heavy, the bottleneck started moving outward.

The modern AI system is no longer one chip solving one problem. It is a choreography problem across accelerators, high-radix switches, retimers, optical modules, NICs, host memory, local storage, and remote racks. The system’s effective performance is increasingly controlled by how expensive it is to move activations, gradients, checkpoints, expert traffic, optimizer state, and inference context across the fabric.

A cluster does not get value from raw bandwidth alone. It gets value from delivered bandwidth that arrives at the right time, at tolerable energy cost, without collapsing thermals or operational simplicity.

2. What interconnect power density really means

“Interconnect power density” is not just watts per module. It is the compounded cost of pushing more and more communication through a fixed amount of electrical escape, board area, package margin, cooling headroom, and rack envelope.

Compute-centric thinking

  • Add more accelerators
  • Increase per-chip throughput
  • Assume the network will scale behind it

Fabric-centric reality

  • Every extra bit moved costs power twice: once in transport, once in the cooling that removes the resulting heat
  • Signal conditioning, retimers, DSPs, and packaging all accumulate
  • Cooling, serviceability, and layout become first-order architectural constraints

3. A practical power-budget decomposition

The easiest way to make this concrete is to stop thinking of “network power” as one line item. In real AI systems, the transport path often includes several distinct budgets that each rise under scale pressure.

Interconnect cost is a stack, not a single number

| Budget component | What it does | Why it grows | Architectural consequence |
|---|---|---|---|
| SerDes / PHY | Drives and receives high-speed electrical lanes | Higher lane rates and denser escape stress equalization and signal integrity | Package edges and local board design become harder |
| Retimers / signal conditioning | Recover margin on difficult electrical paths | Longer or noisier traces need more help | Extra power and extra heat for mere transport |
| DSP / optical module logic | Supports optical signaling and recovery | Longer reach and more complex modulation raise burden | Module-level watts become a systems concern |
| Switch silicon | Moves traffic through high-radix fabrics | More ports and more concurrent flows raise radix and scheduling pressure | Fabric design becomes topology-sensitive |
| Cooling overhead | Removes heat generated by all of the above | Heat is concentrated in dense zones, not spread evenly | Transport watts cascade into cooling watts |

This is why "bits moved" and "useful compute delivered" diverge: the machine pays several times for the same tensor movement.
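To make the stacked budget concrete, here is a minimal sketch that sums hypothetical per-component energy costs into wall power at a given bandwidth. Every figure and name below is an illustrative assumption, not a measurement.

```python
# Hypothetical per-component energy costs in pJ/bit; illustrative only.
BUDGET_PJ_PER_BIT = {
    "serdes_phy": 4.0,
    "retimers": 3.0,
    "dsp_optical_module": 8.0,
    "switch_silicon": 5.0,
}

# Assume roughly 30% of transport watts reappear as cooling watts.
COOLING_OVERHEAD = 0.3

def transport_watts(bandwidth_tbps: float) -> float:
    """Wall power to move bits at a given bandwidth, cooling included."""
    pj_per_bit = sum(BUDGET_PJ_PER_BIT.values())        # the whole stack
    watts = pj_per_bit * 1e-12 * bandwidth_tbps * 1e12  # pJ/bit * bits/s
    return watts * (1.0 + COOLING_OVERHEAD)

print(round(transport_watts(10.0), 1))  # 10 Tb/s through a 20 pJ/bit stack
```

The point of the sketch is the shape, not the numbers: the total is a sum of independent budgets, and cooling multiplies whatever the transport stack already spent.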

4. The hidden topology tax of collectives

AI clusters do not move traffic uniformly. They produce synchronized bursts: all-reduce, all-gather, expert dispatch, parameter synchronization, checkpoint writes, and remote state fetches. Those patterns are topology-sensitive. A design that looks fine under averaged throughput can still perform badly if collective phases line up with the wrong fabric shape.

1. Collective phase: many devices communicate in structured bursts, not random flows.
2. Topology stress: hot links, oversubscribed stages, and switch contention appear unevenly.
3. Power and heat spike: transport components light up precisely when the workload is most synchronized.
4. Effective compute loss: accelerators stall waiting on movement rather than arithmetic.
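The burst sizes behind these phases are easy to estimate. In a standard ring all-reduce, each device transfers roughly 2(N−1)/N times the tensor size; the sketch below (numbers and names chosen for illustration) turns that into a per-device link demand for a given phase window.

```python
def ring_allreduce_bytes_per_device(tensor_bytes: int, n_devices: int) -> float:
    """Bytes each device sends in a ring all-reduce: 2 * (N - 1) / N * size."""
    return 2.0 * (n_devices - 1) / n_devices * tensor_bytes

def burst_gbps(tensor_bytes: int, n_devices: int, phase_seconds: float) -> float:
    """Per-device link demand if the exchange must finish inside one phase."""
    moved = ring_allreduce_bytes_per_device(tensor_bytes, n_devices)
    return moved * 8 / phase_seconds / 1e9

# A 10 GiB gradient tensor across 64 devices, squeezed into a 50 ms window:
print(f"{burst_gbps(10 * 2**30, 64, 0.050):.0f} Gb/s per device")
```

Note what the formula implies: the per-device volume barely shrinks as N grows, so shortening the phase window is what drives the link demand, and every device hits that demand at the same instant.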

5. Order-of-magnitude energy view

Precise numbers vary by implementation, distance, packaging, modulation, and whether the path is package-level, board-level, rack-level, or row-level. The important point is directional: short electrical links can be efficient at tiny distances, but the energy and thermal cost of longer, denser, higher-speed electrical movement rises quickly enough that optics becomes attractive not just for bandwidth, but for power-density relief.

Illustrative energy-per-bit intuition

| Path type | Typical context | Directionally plausible pJ/bit range | System implication |
|---|---|---|---|
| Very short electrical | Package / board-adjacent | Low single digits to low tens | Still attractive when reach is tiny and packaging can tolerate it |
| Longer electrical with heavy conditioning | Board-to-board / dense rack escape | Tens and rising | Retimers, equalization, and thermals dominate the story |
| Optical for longer reach | Rack-to-rack / scale-out / reconfigurable paths | Often more favorable at system level than equivalent long electrical | The win is not just speed; it is lower power-density pain at useful reach |

This is a systems chart, not a vendor claim. The point is architectural intuition, not fake precision.
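The same pJ/bit intuition can be scaled to rack-level watts. The figures below are assumptions picked to match the directional ranges above, not vendor data; the comparison across path types is the point.

```python
# Directionally plausible pJ/bit assumptions for each path type (illustrative).
PATHS_PJ_PER_BIT = {
    "short_electrical": 5,
    "long_electrical_conditioned": 30,
    "optical_longer_reach": 15,
}

def fabric_watts(per_gpu_tbps: float, gpus: int, pj_per_bit: float) -> float:
    """Aggregate transport power for one rack's worth of fabric traffic."""
    return pj_per_bit * 1e-12 * per_gpu_tbps * 1e12 * gpus

for path, pj in PATHS_PJ_PER_BIT.items():
    print(f"{path}: {fabric_watts(1.6, 32, pj):.0f} W")  # 32 GPUs at 1.6 Tb/s each
```

Even with made-up constants, the spread between a well-conditioned long electrical path and an optical one is hundreds of watts per rack before cooling is counted.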

6. Board vs rack vs row regimes

Not all interconnect is the same problem. One reason discussions get muddy is that people collapse very different transport regimes into one word: network.

Board / package regime

  • Main problem: electrical escape, signal integrity, local heat
  • Main tools: short electrical, near-package optics, CPO, NPO
  • Main question: how close can optics move toward the silicon?

Rack / row regime

  • Main problem: cable density, reach, switch stages, topology oversubscription
  • Main tools: pluggables, OCS, VCSEL arrays, rack-scale optics
  • Main question: how much machine coherence can the fabric preserve?
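One way to keep the regimes from blurring into "network" is to make physical reach explicit. A toy classifier, with thresholds chosen only for illustration:

```python
def transport_regime(reach_m: float) -> str:
    """Map physical reach to the likely transport toolbox (toy thresholds)."""
    if reach_m < 0.5:
        return "board/package: short electrical, NPO/CPO candidates"
    if reach_m < 5.0:
        return "rack: pluggables, VCSEL arrays, rack-scale optics"
    return "row/cluster: longer-reach optics, possibly OCS"

print(transport_regime(0.1))
print(transport_regime(20.0))
```

Real boundaries depend on lane rate, materials, and packaging, but the discipline of asking "which regime is this link in?" before arguing about technology is what matters.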

7. Interconnect and the new memory wall

It would be a mistake to treat this essay as replacing the memory wall with a separate network story. The more interesting truth is that the two are converging.

In modern AI systems, memory hierarchy increasingly extends beyond local HBM. Once you start thinking in terms of pooled memory, remote memory, checkpoint tiers, model-state disaggregation, or distributed shared-memory semantics, the cost of movement across the fabric becomes part of the memory problem itself.

If moving data to another rack is too hot, too power-hungry, or too topology-sensitive, then the cluster cannot honestly treat that remote state as cheap logical memory. Interconnect power density becomes a limit on how large a machine can behave like one coherent memory entity.
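A crude way to see that limit is to compare the energy of a local HBM access with the same byte fetched across the fabric. Every constant here is an illustrative assumption, not a measured value.

```python
HBM_PJ_PER_BYTE = 40.0     # assumed local HBM access energy
FABRIC_PJ_PER_BIT = 25.0   # assumed rack-to-rack transport energy, full stack

def remote_pj_per_byte(hops: int) -> float:
    """Energy to fetch one byte from pooled memory `hops` fabric hops away."""
    return HBM_PJ_PER_BYTE + hops * FABRIC_PJ_PER_BIT * 8

def energy_ratio(hops: int) -> float:
    """How much more a remote byte costs than a local one."""
    return remote_pj_per_byte(hops) / HBM_PJ_PER_BYTE

print(energy_ratio(1), energy_ratio(3))  # remote bytes are several times pricier
```

Under these assumptions a single-hop remote byte costs several local accesses, and each additional hop widens the gap, which is exactly why "cheap logical memory" across racks is an energy claim, not just a latency claim.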

8. Where optical switching actually helps

Optical circuit switching is attractive because it changes the shape of the problem: instead of permanently paying for one static topology, the system can reconfigure fabric paths around communication phases.

Good fit for OCS

  • Epoch-like communication phases
  • Large synchronized tensor exchanges
  • Workloads whose path demand can be forecast

Weak fit for OCS

  • Highly random fine-grained traffic
  • Flows whose duration is shorter than reconfiguration benefit
  • Topologies where static overprovisioning is cheaper than control complexity
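The fit/no-fit split above is really an amortization question: does the traffic carried on the reconfigured circuit beat what a static fallback path would have carried in the same time? A minimal model, with all parameters hypothetical and the circuit assumed dark during setup:

```python
def bytes_delivered(flow_s: float, reconfig_s: float,
                    optical_gbps: float, fallback_gbps: float) -> tuple:
    """(Gb delivered via an OCS reconfiguration, Gb on the static fallback)."""
    with_ocs = optical_gbps * max(flow_s - reconfig_s, 0.0)  # dark during setup
    without = fallback_gbps * flow_s
    return with_ocs, without

def worth_reconfiguring(flow_s: float, reconfig_s: float,
                        optical_gbps: float, fallback_gbps: float) -> bool:
    with_ocs, without = bytes_delivered(flow_s, reconfig_s,
                                        optical_gbps, fallback_gbps)
    return with_ocs > without

print(worth_reconfiguring(1.0, 0.05, 800, 400))   # long phase: reconfigure
print(worth_reconfiguring(0.01, 0.05, 800, 400))  # flow shorter than setup: don't
```

The second case is the "weak fit" bullet in miniature: a flow that ends before the circuit comes up delivers nothing on the optical path, so the static fallback wins regardless of its lower rate.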

9. The scheduler must understand fabric

If the first post argued that the AI cluster operating system must understand light, this post explains why: because the network is becoming too expensive, thermally and electrically, to remain invisible to higher-level scheduling policy.

A serious future scheduler will not just place jobs on GPUs. It will reason about communication classes: which transfers are latency-critical, which ones are bursty but deferrable, which paths deserve optical reservation, and which flows can be degraded, delayed, compressed, or rerouted.
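A sketch of what transfer-class awareness might look like inside such a scheduler. The class names, thresholds, and policy below are invented for illustration; a real system would derive them from workload profiles and fabric state.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class TransferClass(Enum):
    LATENCY_CRITICAL = auto()  # e.g. expert dispatch on the forward path
    BULK_RESERVABLE = auto()   # e.g. phase-aligned collectives worth an optical circuit
    DEFERRABLE = auto()        # e.g. checkpoint writes; can be delayed or rerouted

@dataclass
class Transfer:
    name: str
    nbytes: int
    deadline_ms: Optional[float]  # None means best-effort

def classify(t: Transfer) -> TransferClass:
    """Toy policy: tight deadlines first, then size decides reservation."""
    if t.deadline_ms is not None and t.deadline_ms < 10.0:
        return TransferClass.LATENCY_CRITICAL
    if t.nbytes >= 2**30:  # >= 1 GiB earns a reserved path
        return TransferClass.BULK_RESERVABLE
    return TransferClass.DEFERRABLE

print(classify(Transfer("all_reduce", 10 * 2**30, None)).name)
print(classify(Transfer("expert_dispatch", 2**20, 2.0)).name)
```

The shape, not the policy, is the claim: once transfers carry a class, the scheduler can map classes to paths, reservations, and degradation rules instead of treating the fabric as uniform.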

10. What winners will optimize for

  • Power per useful transferred bit, not just theoretical link efficiency
  • Topology-aware scheduling, where fabric constraints shape placement and execution
  • Transport-class awareness, distinguishing collectives, checkpointing, remote memory, and control traffic
  • Serviceable optical architectures, especially where external laser sources preserve operability while deeper optical integration attacks the power-density problem
  • Holistic cluster economics, where the network is judged by delivered model throughput, not isolated component metrics
Compute still matters. But in modern AI systems, the decisive question is increasingly whether the cluster can afford to move the data that compute generates.