1. The old scale-out mindset
Classical optics narratives were built around distance and throughput: longer reaches, higher speeds, cleaner transport between independent boxes. That model still matters. But AI clusters have changed the meaning of networking.
Once a training or inference system spans many accelerators that are expected to behave like one tightly coordinated machine, the network stops being a peripheral transport fabric and starts becoming part of the machine’s internal nervous system.
2. Why scale-up is taking over
Scale-up is taking over because AI workloads are intolerant of wasted movement inside tightly coupled domains. Gradient exchange, expert traffic, remote memory access, checkpoint coordination, and multi-chip execution all punish latency and fabric inefficiency much more harshly than older, looser distributed workloads did.
Scale-out priorities
- More aggregate bandwidth across a broader fabric.
- Longer reach and large switching domains.
- Optics as high-capacity inter-box plumbing.
Scale-up priorities
- Deterministic low-latency behavior inside a tightly coupled compute domain.
- Higher packaging and power-density sensitivity.
- Optics as an enabler of local machine coherence, not just remote transport.
That is exactly why current vendor demos are interesting. Lumentum’s OFC 2026 VCSEL demonstration is explicitly aimed at rack-level scale-up architectures using “slow and wide” protocols such as UCIe and PCIe, and Marvell and Lumentum are showing optical circuit switching as a next-generation AI scale-up fabric tool.
3. What scale-up optics actually means
Scale-up optics does not simply mean “the same module, but closer.” It means optics being used to solve a different kind of problem.
- Old framing: move more bits between separate systems.
- New framing: preserve low-latency, high-bandwidth behavior inside one logical machine.
- Design implication: packaging, lane strategy, topology, and switch behavior matter more.
- Operational result: optics becomes part of system architecture, not just interconnect procurement.
That is why scale-up attracts technologies that might have looked niche in a purely distance-centric world: VCSEL arrays, OCS, near-package optics, co-packaged optics, and even optical paths that are explicitly designed around package and rack topology rather than metro-style reach.
4. The technologies that matter
What becomes more important in a scale-up world
| Technology | Why it matters for scale-up | What it changes |
|---|---|---|
| VCSEL arrays | Good fit for short-reach, slow-and-wide, rack-level links | Make dense local optical fabrics more practical |
| Optical circuit switching | Lets the fabric become reconfigurable and more topology-aware | Moves optics from static links toward dynamic machine composition |
| CPO / NPO | Reduce electrical escape burden and power density inside local domains | Push optics inward toward the heart of the machine |
| ELS architectures | Preserve serviceability while deeper optical integration increases | Balance efficiency with operability |
Coherent’s OFC 2026 portfolio and CPO announcements reinforce this point too: the company is not pitching one monolithic optical future, but multiple CPO approaches spanning silicon photonics, VCSEL, and InP-on-silicon, explicitly across both scale-out and scale-up scenarios.
5. Reliability tax inside the machine
There is a cost to moving optics inward: the blast radius of failure grows. In a loose scale-out network, a bad link is often an infrastructure event. In a tightly coupled 256-GPU or 512-GPU scale-up domain, a bad local optical component can become a machine-coherence event.
Old scale-out failure model
- A failed link is often absorbed by routing around it.
- The affected domain may be large in reach but loose in coupling.
- Performance loss matters, but coherence is less fragile.
Scale-up failure model
- A failed internal optical path can impair one tightly coupled machine, not just one route.
- Latency asymmetry and degraded bandwidth can break collective efficiency quickly.
- Redundancy, de-rating, and stronger error handling become architectural requirements.
This is why the scale-up future is not only about faster optics. It is also about making those optics software-visible, degradable, and survivable. If optics becomes part of the machine’s internal nervous system, then fault isolation, redundancy, and correction have to evolve with it.
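As a concrete sketch of what “software-visible and degradable” could mean, the toy model below de-rates an optical link to its healthy lanes and checks the domain against a minimum-bandwidth floor instead of declaring the whole machine failed. All names, lane counts, and thresholds here are illustrative assumptions, not any vendor’s telemetry model.

```python
# Hypothetical sketch: de-rating a degraded optical link inside a tightly
# coupled scale-up domain. Names and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class OpticalLink:
    lanes_total: int
    lanes_healthy: int
    gbps_per_lane: float

    @property
    def usable_gbps(self) -> float:
        # De-rate to the healthy lanes only, instead of failing the link.
        return self.lanes_healthy * self.gbps_per_lane

def domain_floor_met(links: list[OpticalLink], floor_gbps: float) -> bool:
    # A collective is gated by its slowest member, so the domain's effective
    # bandwidth is the minimum across its internal links.
    return min(link.usable_gbps for link in links) >= floor_gbps

links = [OpticalLink(8, 8, 100.0), OpticalLink(8, 6, 100.0)]
print(domain_floor_met(links, 500.0))  # True: the de-rated link still gives 600 Gb/s
print(domain_floor_met(links, 700.0))  # False: the de-rated link is below the floor
```

The design point is the `usable_gbps` de-rate: the fabric reports a degraded capability upward rather than a binary up/down, which is what lets the layers above decide whether the machine is still coherent enough to run.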
6. Why software has to care
Scale-up optics is where this series’ earlier “optics is an OS problem” thesis comes back with force. Inside a scale-up domain, software cannot afford to treat the fabric as a black box. The scheduler needs to know which optical paths are local, which are degraded, which are power-expensive, and which topologies support the communication phase the workload is about to enter. This matters most in collective-heavy phases such as All-Reduce and All-to-All, where a few bad internal paths can create disproportionate tail-latency penalties across the whole machine.
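The tail-latency point can be made with simple arithmetic. In a ring all-reduce, every step moves a chunk over every link, so step time is set by the slowest link rather than the average. The sketch below uses illustrative numbers, not measurements, to show one de-rated link taxing every participant equally.

```python
# Hypothetical sketch: why one degraded internal path taxes the whole machine.
# In a ring all-reduce, each step pushes the same chunk over every link, so
# step time is governed by the slowest link. Numbers are illustrative.

def ring_allreduce_time_s(chunk_bytes: int, link_gbps: list[float]) -> float:
    steps = 2 * (len(link_gbps) - 1)        # reduce-scatter + all-gather passes
    slowest_bps = min(link_gbps) * 1e9 / 8  # bytes/s of the worst link
    return steps * (chunk_bytes / slowest_bps)

healthy = [400.0] * 8
one_bad = [400.0] * 7 + [100.0]             # a single link de-rated to 25%

t_ok = ring_allreduce_time_s(256 * 2**20, healthy)
t_bad = ring_allreduce_time_s(256 * 2**20, one_bad)
print(round(t_bad / t_ok, 1))  # 4.0: one bad link slows all eight participants 4x
```

The asymmetry is the point: the degraded link represents one eighth of the fabric but imposes its full slowdown on the entire collective, which is why the scheduler needs per-path visibility rather than aggregate bandwidth counters.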
This is also where optics starts to blur into memory semantics. In a 100 kW-plus rack world, the winning local fabric is not merely the one with the lowest latency; it is the one with the best joules-per-bit at the topology points the workload hits hardest. Once scale-up fabrics help determine whether a cluster behaves like one large logical machine, the cost of optical movement becomes part of the cost of remote memory, collective execution, and even model partitioning.
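One way to make “joules per bit at the hot topology points” concrete is a weighted sum: total movement energy is the traffic at each topology tier times the picojoules-per-bit at that tier. The figures below are illustrative assumptions, not vendor data; the point is that the fabric with the cheaper package escape wins when the workload’s traffic is escape-heavy.

```python
# Hypothetical sketch: comparing local fabrics by joules-per-bit at the
# topology tiers a workload actually hits. All figures are illustrative.

def fabric_energy_joules(traffic_bits: dict, pj_per_bit: dict) -> float:
    # Total movement energy = sum over tiers of (bits moved * pJ/bit), in joules.
    return sum(traffic_bits[tier] * pj_per_bit[tier] for tier in traffic_bits) / 1e12

# One training step's traffic by tier (bits), skewed toward package escape.
traffic = {"package_escape": 8e12, "rack_fabric": 2e12}

fabric_a = {"package_escape": 5.0, "rack_fabric": 10.0}   # CPO-like: cheap escape
fabric_b = {"package_escape": 12.0, "rack_fabric": 6.0}   # pluggable-like: cheap fabric

print(fabric_energy_joules(traffic, fabric_a))  # 60.0 J per step
print(fabric_energy_joules(traffic, fabric_b))  # 108.0 J per step
```

Fabric B has the better rack-tier efficiency, yet loses overall because this workload’s traffic concentrates at the escape tier, which is exactly the sense in which the cost of optical movement becomes part of model-partitioning decisions.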
7. 2027–2028 prediction: CPO first, OCS next
The likely winner in the 2027–2028 cycle is not “CPO or OCS” in the abstract. It is a layered scale-up stack where different technologies win at different distances and duties.
A practical prediction by role
| Layer | Likely near-term winner | Why |
|---|---|---|
| Innermost fixed, always-hot links near high-radix silicon | CPO / near-package optics | Best aligned with electrical escape relief, local power density, and fixed high-duty-cycle paths |
| Rack / pod-scale dynamic composition layer | OCS | Best aligned with topology reconfiguration, workload-phase adaptation, and flexible machine composition |
| Short-reach, slow-and-wide local fabrics | VCSEL-rich scale-up paths | Compelling where lane parallelism and dense local optical movement matter more than long reach |
So my forecast is simple: CPO and near-package optics likely win first where the problem is immediate and brutally physical — getting out of the package and across the board without drowning in electrical loss and heat. OCS becomes more important next where the problem is not local escape but dynamic rack or pod composition.
8. Where the next battle will be fought
The next battle will not only be over who ships the next transceiver generation. It will be over who defines the local optical fabric of the AI machine. And in practice that means solving not just for bandwidth and latency, but for joules per bit under real rack-scale power constraints.
- Who wins the rack-scale optical topology?
- Who controls the tradeoff between electrical proximity and optical flexibility?
- Who makes scale-up fabrics software-visible enough to become schedulable resources?
- Who solves the packaging, thermal, and serviceability problems well enough that local optics becomes operationally normal?
That is why “scale-out was yesterday” should not be read literally. Scale-out is still huge. The better point is that the most underappreciated optical opportunity has moved inward, toward scale-up, where AI infrastructure looks least like a conventional network and most like one giant machine trying to stay coherent under pressure.
Series note: after building the case for optics as an OS problem, a power-density problem, a failure-domain problem, a materials problem, and a market problem, this essay argues that the next real optical contest is moving toward scale-up fabrics inside the AI machine itself.
© 2026 Manish KL