
Fat-Tree Is Lying to You: Network Topology as a First-Class Inference Constraint

Published · 22 min read · By MANISH AI
Abstract. Fat-tree is the default network topology for AI clusters. It was designed for uniform all-to-all traffic — collective communications in distributed training where every node talks to every other node with roughly equal probability. Modern AI inference workloads are not uniform. MoE expert routing creates skewed all-to-few traffic. Prefill-decode disaggregation creates asymmetric producer-consumer pairs. KV cache transfer creates directed high-bandwidth flows between fixed endpoint pairs. Fat-tree is a bad match for all three. This essay argues that network topology is not a data-center infrastructure decision made once at cluster build time — it is a first-class inference constraint that should influence expert co-location decisions, disaggregation placement, and KV routing policy. We derive the traffic patterns of modern inference workloads from first principles, map them onto the bandwidth and latency profiles of Fat-tree, Dragonfly, and Rail-Optimized topologies, and show that topology-unaware scheduling wastes 30-50% of available network capacity on typical production inference loads. The right answer is not a new topology. It is making topology a visible input to the scheduler.
- 30–50% network capacity wasted by topology-unaware MoE scheduling
- 4× bisection bandwidth reduction in 3-tier Fat-tree under adversarial traffic
- 2–4 μs extra latency per KV transfer hop crossing a spine switch vs. ToR
- 1 traffic model assumption underlying Fat-tree: uniform all-to-all. AI inference violates it.
Contents
  1. The fat-tree assumption and where it breaks
  2. The three traffic patterns of modern inference
  3. Fat-tree under non-uniform load: the bisection bandwidth trap
  4. Dragonfly: global routing and the locality problem
  5. Rail-optimized: the right topology for the wrong reason
  6. MoE expert routing as a topology problem
  7. Prefill-decode disaggregation and asymmetric flow
  8. KV transfer topology: the missing scheduling input
  9. Making topology a scheduler input
  10. The topology-aware cluster is not a hardware redesign

1. The fat-tree assumption and where it breaks

Fat-tree became the dominant AI cluster topology for a specific reason: it was designed to handle all-to-all collective communications in distributed training with full bisection bandwidth. In an ideal Fat-tree with no oversubscription (a 1:1 ratio at every tier), any node can communicate with any other node at line rate simultaneously. The topology is symmetric, non-blocking (in theory), and administratively simple — every port has the same bandwidth, routing is ECMP, and the cluster looks like a flat non-hierarchical network to the application.

The traffic model that justifies Fat-tree is: every node talks to every other node with equal probability, and the aggregate traffic matrix is close to uniform. This is a reasonable model for AllReduce in data-parallel training, where gradient updates from every GPU need to be aggregated and broadcast to every other GPU with roughly equal bandwidth in all directions.

It is not a reasonable model for inference.

Modern inference workloads have structured, non-uniform traffic matrices. The traffic is not uniformly distributed — it concentrates in specific patterns determined by the model architecture, the serving topology, and the request distribution. A Fat-tree network does not distinguish between traffic that crosses three switch hops and traffic that crosses zero — it routes all traffic the same way, burning bandwidth and adding latency on flows that could have stayed local, while creating congestion on uplinks that carry disproportionate traffic.

The thesis: The network topology you choose determines which traffic patterns are expensive and which are cheap. Modern inference has a specific, analyzable traffic matrix. Deploying it on a topology designed for a different traffic matrix — uniform all-to-all — is not infrastructure agnosticism. It is leaving performance on the table in ways that are measurable and avoidable.

2. The three traffic patterns of modern inference

To understand why topology matters, we need to derive the traffic matrix of modern inference workloads from first principles. There are three distinct patterns, each with different bandwidth, latency, and locality requirements.

2.1 MoE expert-to-expert traffic: all-to-few, skewed

In a Mixture-of-Experts model, each token is routed by a learned router to a small subset of expert FFN blocks — typically 2 out of N experts, where N is 8, 64, or larger in frontier models. The router makes this decision independently for each token. In a distributed serving scenario where experts are sharded across GPUs, a token arriving at GPU 0 may be routed to experts on GPU 7 and GPU 23 — requiring two network transfers to process that token.

MoE traffic matrix structure
For a 64-expert MoE with top-2 routing, 32 GPUs, 2 experts/GPU:

Per-token routing:
  token → router → [expert_i, expert_j], i ≠ j with high probability

Traffic matrix entry T[src][dst] = tokens routed from src GPU to dst GPU

If routing were uniform:  T[src][dst] = 1/(N-1) for all src ≠ dst
                           → uniform all-to-all matrix

Observed in practice:     router learns expert specialization
                           → hot experts receive 3-5× average load
                           → cold experts receive 0.2-0.4× average load
                           → traffic matrix is sparse and skewed

Consequence: 3-5 "popular" expert GPUs receive 3-5× expected traffic
             Those GPUs' uplinks become bottlenecks under Fat-tree routing

The skewness is not random — it is learned and stable. Popular experts are popular because they specialize in common linguistic patterns. This means the traffic matrix of a deployed MoE model is predictable and persistent, making it possible to optimize placement for it. The question is whether the infrastructure layer is designed to exploit that predictability.
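The skew and its per-GPU consequence can be sketched with a toy simulation. The popularity weights below are illustrative assumptions chosen to match the 3–5× hot / 0.2–0.4× cold figures above, not measured values:

```python
import random
from collections import Counter

# Toy model: top-2 MoE routing with a learned, skewed expert distribution.
# Weights are assumptions: 6 hot experts (~4x), a body, and a cold tail.
N_EXPERTS, EXPERTS_PER_GPU, TOP_K = 64, 2, 2
N_GPUS = N_EXPERTS // EXPERTS_PER_GPU  # 32 GPUs
weights = [4.0] * 6 + [1.0] * 38 + [0.3] * 20  # 64 entries

def route_token(rng):
    """Sample top-2 distinct experts, proportional to popularity."""
    picks = set()
    while len(picks) < TOP_K:
        picks.add(rng.choices(range(N_EXPERTS), weights=weights)[0])
    return picks

rng = random.Random(0)
gpu_load = Counter()
for _ in range(100_000):
    for expert in route_token(rng):
        gpu_load[expert // EXPERTS_PER_GPU] += 1

avg = sum(gpu_load.values()) / N_GPUS
hottest = max(gpu_load.values())
# The GPU holding two hot experts absorbs several times the average
# dispatch traffic -- exactly the uplink hotspot described above.
print(f"hottest GPU load: {hottest / avg:.1f}x average")
```

Because the skew is learned and stable, the same GPUs stay hot run after run, which is what makes the hotspot both a problem for ECMP and an opportunity for placement.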

2.2 Prefill-decode disaggregation: asymmetric producer-consumer

Prefill-decode disaggregation (covered extensively in prior essays) routes prefill computation to high-compute GPUs and decode computation to high-bandwidth GPUs. The traffic pattern this creates is asymmetric: prefill GPUs produce KV caches and transfer them to decode GPUs. Every request generates one large KV transfer from a prefill node to a decode node.

For a 128K-context request on a 70B model, that transfer is approximately 139 GB. At typical inference throughput, a single prefill GPU might produce 2-4 such requests per minute, generating sustained 4-8 GB/s of directed traffic to a specific decode GPU pool. This is not all-to-all traffic. It is a structured producer-consumer flow between two dedicated GPU pools.

2.3 KV cache mobility: rack-local preference

For KV cache disaggregation and hierarchical KV storage (HBM → CXL DRAM → NVMe), the ideal traffic pattern is rack-local or pod-local. A decode GPU fetching KV pages from a CXL-attached memory device in the same rack generates no network traffic. The same fetch from a KV store in a different rack generates a network transfer with latency determined by the number of switch hops between them.

Under Fat-tree routing, a transfer between racks in different pods traverses ToR → aggregation → spine → aggregation → ToR: five switch hops, each adding 2–4 μs of switching latency on top of the base propagation delay. For latency-sensitive decode, where every KV fetch adds directly to inter-token latency, this is not negligible.

3. Fat-tree under non-uniform load: the bisection bandwidth trap

Fat-tree achieves full bisection bandwidth under uniform traffic. Under non-uniform traffic, its performance degrades in a way that is analytically predictable but operationally surprising to teams that have not studied it.

Consider a 3-tier Fat-tree built from k-port switches (k=64 is common for modern 400G switches). The canonical construction has k pods, each containing k/2 top-of-rack and k/2 aggregation switches, plus (k/2)² = 1,024 core/spine switches, and supports k³/4 = 65,536 servers. At full provisioning, the bisection bandwidth between any two halves of the cluster is (k³/8) × link_bandwidth: every server in one half can send to the other half at line rate.

Fat-tree bisection bandwidth under skewed traffic
Uniform traffic assumption:
  Each server sends traffic to all other 1,023 servers equally
  → Each uplink carries 1/32 of server traffic
  → Spine switches see balanced load across all ports
  → Full bisection bandwidth maintained

MoE-skewed traffic (hot experts on 3 GPUs):
  Hot GPUs receive 4× average inbound traffic
  → Their ToR uplinks saturate at 4× average rate
  → Aggregation switches serving those ToRs saturate
  → ECMP begins hashing flows onto congested uplinks
  → Effective bisection bandwidth to hot GPUs: ~1/4 of nominal

Practical consequence:
  Cluster with 400 Gbps uplinks and "full bisection" Fat-tree
  delivers ~100 Gbps effective bandwidth to hot expert GPUs
  during peak MoE routing load — a 4× degradation from spec.

The problem is that ECMP (Equal-Cost Multi-Path routing) — the standard routing mechanism used in Fat-tree networks — is flow-based. It assigns a flow to a path based on a hash of the flow's 5-tuple and keeps that flow on the same path for its duration. When multiple high-bandwidth flows hash onto the same uplinks, those uplinks become congested while other uplinks remain idle. The topology has the capacity, but the routing algorithm cannot redistribute it to where the load is.

The ECMP Failure Mode

ECMP is designed to balance load across paths for uniform traffic. For structured traffic with persistent hot destinations (MoE expert GPUs, prefill nodes, KV servers), ECMP will consistently hash disproportionate traffic onto a subset of paths. This is not a configuration error. It is an architectural mismatch between a routing algorithm designed for uniform traffic and workloads with non-uniform traffic matrices.
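A minimal sketch of the failure mode. The hash function and addresses below are stand-ins (real switches use vendor-specific hashes over the 5-tuple), but the lumpiness they produce is the point:

```python
import hashlib

# Flow-based ECMP sketch: a flow's 5-tuple hashes to one of the
# equal-cost uplinks and stays there for the flow's lifetime.
N_UPLINKS = 8

def ecmp_uplink(src_ip, dst_ip, src_port, dst_port, proto="tcp"):
    """Pick an uplink by hashing the 5-tuple (illustrative hash)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % N_UPLINKS

# 16 long-lived dispatch flows from different source GPUs, all aimed
# at one hot expert GPU (addresses are made up).
flows = [("10.0.0.%d" % s, "10.0.7.23", 49152 + s, 9000) for s in range(16)]
per_uplink = [0] * N_UPLINKS
for f in flows:
    per_uplink[ecmp_uplink(*f)] += 1

# Ideal balancing would give [2]*8; hashing gives a lumpy split, and
# every collision divides the colliding flows' bandwidth.
print("flows per uplink:", per_uplink)
```

Because the flows are persistent (hot experts stay hot), the collisions are persistent too: the congested uplinks stay congested for as long as the traffic matrix holds.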

4. Dragonfly: global routing and the locality problem

Dragonfly was designed to reduce the diameter of large-scale networks by adding global links between groups. In a canonical Dragonfly, each group of routers connects to every other group through at least one global link. This dramatically reduces the worst-case hop count — from O(log N) switch tiers in Fat-tree to a small constant in an ideal Dragonfly — at the cost of more complex routing and adaptive load balancing requirements.

For uniform all-to-all traffic, Dragonfly outperforms Fat-tree at scale because it achieves lower latency with fewer hops. For AI inference, Dragonfly introduces a different problem: it treats all inter-group traffic as equally costly, regardless of whether the source and destination are geographically adjacent or distant.

In practice, AI inference benefits strongly from locality. MoE experts that co-occur frequently should be co-located on the same ToR or same group to avoid global links. Prefill-decode pairs with high KV transfer volume should be in adjacent groups. KV cache stores should be rack-local to the decode GPUs that consume them most frequently. Dragonfly's global routing — which deliberately spreads traffic across all global links to avoid hotspots — works against locality optimization.

| Topology | Bisection BW (uniform) | Latency (worst case) | Locality support | MoE hot-expert handling |
|---|---|---|---|---|
| 3-tier Fat-tree | Full (1:1) | 4–6 hops | Rack-level | Poor — ECMP congestion |
| Dragonfly | Full at scale | 2–3 hops | None — global routing | Moderate — adaptive routing helps, locality hurt |
| Rail-Optimized | Within-rail: full; cross-rail: limited | 1–2 hops intra-rail | Rail = locality group | Good if hot experts on same rail |
| Torus (TPU Pod) | Topology-dependent | O(√N) hops | Strong — distance-aware | Excellent if workload mapped to geometry |

5. Rail-optimized: the right topology for the wrong reason

Rail-optimized topology — used by Meta for large-scale FSDP training and increasingly adopted for inference clusters — partitions GPUs into groups called rails, where each rail is a fully connected subset of GPUs with high intra-rail bandwidth. Cross-rail bandwidth is lower. The design assumption is that most communication stays within a rail, with cross-rail traffic being the exception.

For distributed training with FSDP, the rail is sized to hold one full model replica. AllGather and ReduceScatter operations for parameter synchronization stay within the rail. Only gradient synchronization crosses rail boundaries. This makes the traffic pattern rail-local with predictable cross-rail overhead.

For inference, the rail abstraction maps naturally onto a different partition: the rail becomes a serving domain. MoE experts for a single model are co-located within a rail. Prefill and decode GPUs for a given request pair are within the same rail. KV cache stores are rail-local to their decode consumers. The rail boundary becomes a natural boundary for the inference topology, and cross-rail traffic is the exception that occurs only for model parallelism across very large models that don't fit within a single rail.

5.1 Why rail-optimized is not universally superior

Rail-optimized topology trades flexibility for locality. When the workload fits the rail partition, performance is excellent. When it does not — heterogeneous models requiring cross-rail parallelism, bursty request patterns that violate static partition assumptions, or multi-model serving where requests have different locality requirements — the restricted cross-rail bandwidth becomes a bottleneck rather than a feature.

The deeper issue is that rail-optimized topology is a static design choice made at cluster build time. The serving workload changes dynamically: model mix, request rate, context length distribution. A static topology cannot adapt to dynamic workload requirements.

6. MoE expert routing as a topology problem

The standard treatment of MoE routing focuses on the load balancing problem: how to ensure experts receive roughly equal traffic so none becomes a throughput bottleneck. The standard solution is auxiliary loss terms that penalize unbalanced routing during training. The topology dimension is almost entirely absent from the MoE literature.

This is a mistake. The MoE routing problem and the topology placement problem are coupled. Consider:

MoE placement optimization (network-aware)
Naive placement:
  Experts 0..7  → GPUs 0..7  (rack 0)
  Experts 8..15 → GPUs 8..15 (rack 1)
  Experts 16..23 → GPUs 16..23 (rack 2)
  ...
  Expert co-occurrence ignored. Random token routing crosses rack boundaries uniformly.

Network-aware placement:
  Step 1: Collect expert co-occurrence statistics over production traffic
          co_occur[i][j] = fraction of tokens routed to both expert i and expert j

  Step 2: Build expert affinity graph where edge weight = co_occur[i][j]

  Step 3: Graph partition to maximize intra-partition affinity under rack capacity
          → Frequently co-occurring experts placed in same rack
          → Most token dispatches stay intra-rack (zero spine switches crossed)

  Expected result:
  For top-2 routing, if 60% of token pairs co-occur within a rack:
  → 60% of dispatch traffic is intra-rack (0-1 hops)
  → 40% of dispatch traffic crosses rack (2-4 hops)
  vs. naive: 100% of cross-GPU traffic potentially crosses rack boundary

The graph partition step is computationally tractable because expert co-occurrence distributions are typically long-tailed — a small number of expert pairs account for a large fraction of co-occurrences. Placing the high-frequency pairs on the same rack produces most of the locality benefit.
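The three placement steps above can be sketched with a greedy partitioner. A production system would use a proper graph partitioner (e.g., METIS); the co-occurrence weights here are synthetic, drawn long-tailed to match the distribution described above:

```python
import itertools
import random

# Greedy affinity partition of experts into racks (illustrative sketch).
N_EXPERTS, RACKS, RACK_CAP = 16, 4, 4
rng = random.Random(1)

# Step 1 (synthetic): long-tailed co-occurrence weights per expert pair.
co = {p: rng.paretovariate(1.5)
      for p in itertools.combinations(range(N_EXPERTS), 2)}

# Steps 2-3: walk pairs by descending affinity, co-locating when
# rack capacity allows.
rack_of, members = {}, {r: [] for r in range(RACKS)}

def place(e, r):
    rack_of[e] = r
    members[r].append(e)

for (a, b), _w in sorted(co.items(), key=lambda kv: -kv[1]):
    if a in rack_of and b in rack_of:
        continue
    if a not in rack_of and b not in rack_of:
        r = min(members, key=lambda r: len(members[r]))
        if len(members[r]) + 2 <= RACK_CAP:
            place(a, r); place(b, r)
    else:
        placed, free = (a, b) if a in rack_of else (b, a)
        r = rack_of[placed]
        if len(members[r]) < RACK_CAP:
            place(free, r)

# Any leftover experts go wherever there is room.
for e in range(N_EXPERTS):
    if e not in rack_of:
        place(e, min(members, key=lambda r: len(members[r])))

intra = sum(w for (a, b), w in co.items() if rack_of[a] == rack_of[b])
print(f"intra-rack affinity captured: {intra / sum(co.values()):.0%}")
```

Because the weight distribution is long-tailed, the first few greedy co-locations capture most of the achievable affinity, which is why even this crude pass gets close to the benefit of an optimal partition.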

More importantly: this optimization requires the serving system to be aware of the network topology. The scheduler needs to know which GPUs are in the same rack and what the bandwidth cost of different placements would be. This information is available — it is in the cluster's network configuration — but it is not today exposed to the inference scheduler in any major serving framework.

7. Prefill-decode disaggregation and asymmetric flow

Prefill-decode disaggregation generates a directed, asymmetric, high-bandwidth flow from prefill nodes to decode nodes. The network implication depends entirely on where prefill and decode nodes are placed relative to each other in the topology.

Best case: prefill and decode nodes are in the same rack, connected through the same ToR switch. KV transfer is intra-rack, consuming zero spine bandwidth, with latency measured in microseconds.

Typical case under random placement: prefill and decode nodes are distributed across the cluster. KV transfer crosses aggregation and spine switches. Spine bandwidth is consumed. Latency is 4-8× higher than intra-rack.

Worst case: prefill nodes and decode nodes are in different pods or different fabrics. Every KV transfer crosses maximum hops, consuming maximum bandwidth at every switch tier.

KV transfer cost vs. topology distance
KV transfer size: 70B FP8, 32K context ≈ 35 GB

Intra-rack transfer (ToR only):
  Available BW: 400 Gbps (50 GB/s) per link
  Transfer time: 35 GB / 50 GB/s ≈ 0.7 seconds
  Switch hops: 1 (ToR)
  Spine BW consumed: 0

Cross-rack, same pod (ToR → Agg → ToR):
  Available BW: 100-200 Gbps (oversubscribed uplinks)
  Transfer time: 35 GB / 12.5-25 GB/s ≈ 1.4 – 2.8 seconds
  Switch hops: 3
  Spine BW consumed: 0, Agg BW consumed: yes

Cross-pod (ToR → Agg → Spine → Agg → ToR):
  Available BW: 50-100 Gbps (heavily oversubscribed)
  Transfer time: 35 GB / 6-12.5 GB/s ≈ 2.8 – 5.8 seconds
  Switch hops: 5
  Spine BW consumed: yes

A 4-8× increase in KV transfer time directly increases TTFT for disaggregated serving. For latency-sensitive applications, this is not acceptable. The only solution is topology-aware placement of prefill-decode pairs — ensuring that for any given model, the prefill and decode pools are topologically adjacent.
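The cost table above reduces to a simple two-term model: serialization time plus per-hop switching latency. A sketch, using the essay's assumed effective bandwidths and a 3 μs midpoint of the 2–4 μs per-hop figure:

```python
# KV transfer time as a function of topology tier.
# Bandwidths and hop counts are the essay's assumed figures, not
# measurements from any specific cluster.
TIERS = {
    # tier: (effective_bw_GBps, switch_hops)
    "intra_rack": (50.0, 1),
    "cross_rack": (12.5, 3),
    "cross_pod":  (6.0, 5),
}
PER_HOP_US = 3.0  # midpoint of the 2-4 us per-hop switching latency

def kv_transfer_seconds(kv_gb, tier):
    """Serialization time plus cumulative switch-hop latency."""
    bw, hops = TIERS[tier]
    return kv_gb / bw + hops * PER_HOP_US * 1e-6

kv = 35.0  # 70B FP8, 32K context, from the table above
for tier in TIERS:
    print(f"{tier:10s}: {kv_transfer_seconds(kv, tier):6.2f} s")

# For whole-cache transfers the bandwidth term dominates; the hop-latency
# term only matters for small page-granular fetches.
```

The model makes the tradeoff explicit: bulk prefill-to-decode transfers care about which tier's bandwidth they land on, while fine-grained KV page fetches care about hop count.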

8. KV transfer topology: the missing scheduling input

The essays in this series on KV fabrics, disaggregated inference, and RDMA have established that KV cache transfer is a first-class infrastructure concern. What has not been addressed is the topology layer: the scheduler that decides which KV store serves which decode request needs to know the network distance between every KV store and every decode GPU.

This is the missing input. Current KV routing systems (including the Seam Orchestrator essay's design) treat the network as a flat fabric with uniform cost. The actual cost of a KV fetch depends on the topology distance between the KV store and the requesting GPU. A KV page on a CXL-attached memory device in the same rack costs fundamentally less to fetch than the same page on a remote memory server two spine hops away.

The Missing Scheduler Input

Topology distance between KV store and decode GPU should be a first-class input to the KV admission and routing decision. A KV page that is "cold" by recency but physically close to the requesting GPU may be cheaper to serve than a "warm" page that requires spine-crossing network transfers. Topology-unaware eviction policies optimize for the wrong cost function.

9. Making topology a scheduler input

The practical question is: what does it mean to make topology a scheduler input? It does not require redesigning the network. It requires four things:

First: A topology distance matrix. For every pair of GPUs in the cluster, compute the hop count and available bandwidth between them. This is derivable from the switch configuration and does not change at serving time. Store it as a lookup table accessible to the serving scheduler.

Second: Topology-aware placement for MoE experts. When a new model is deployed, run the co-occurrence-based graph partition algorithm to assign experts to GPUs that minimize expected dispatch traffic. Update the placement when co-occurrence statistics shift (e.g., after a new model version).

Third: Topology-aware prefill-decode pairing. When a request arrives that will use disaggregated serving, assign it to a prefill-decode pair that is topologically adjacent. This requires the serving scheduler to have visibility into which prefill pools are topologically near which decode pools.

Fourth: Topology distance in KV routing. When deciding which KV store to serve a fetch from, include topology distance as a cost term alongside recency and size. A KV page two hops away costs more to fetch than a page with slightly lower hit probability in the same rack.
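A minimal sketch of how these pieces compose: a static hop-count matrix derived from rack and pod assignment, used as a cost term when selecting a KV fetch source. The 32-GPU layout and the weighting between hit probability and hop distance are illustrative assumptions, not a prescribed policy:

```python
from dataclasses import dataclass

# Illustrative layout: 32 GPUs, 8 per rack, 2 racks per pod.
RACK_OF = {g: g // 8 for g in range(32)}
POD_OF = {r: r // 2 for r in range(4)}

def hops(gpu_a, gpu_b):
    """Hop count under a 3-tier Fat-tree: ToR-only, within-pod, or spine-crossing."""
    ra, rb = RACK_OF[gpu_a], RACK_OF[gpu_b]
    if ra == rb:
        return 1  # same ToR
    if POD_OF[ra] == POD_OF[rb]:
        return 3  # ToR -> agg -> ToR
    return 5      # ToR -> agg -> spine -> agg -> ToR

@dataclass
class KVReplica:
    gpu: int          # GPU/host the replica is attached to
    hit_prob: float   # probability the needed pages are resident

def fetch_cost(replica, decode_gpu, hop_penalty=1.0):
    # Expected miss cost plus a normalized distance term. The weighting
    # is a made-up policy choice for illustration.
    return (1 - replica.hit_prob) + hop_penalty * hops(replica.gpu, decode_gpu) / 5

replicas = [KVReplica(gpu=1, hit_prob=0.7),    # same rack as decode GPU 3
            KVReplica(gpu=28, hit_prob=0.9)]   # cross-pod, warmer cache
best = min(replicas, key=lambda r: fetch_cost(r, decode_gpu=3))
print("chosen replica on GPU", best.gpu)  # nearby replica wins despite lower hit rate
```

Note that the hop matrix is computed once from the switch configuration and never changes at serving time; only the per-request cost comparison runs on the hot path.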

| Optimization | Required Topology Information | Scheduling Decision Modified | Expected Gain |
|---|---|---|---|
| MoE expert placement | Rack assignment map, intra-/inter-rack BW | Expert-to-GPU assignment at deployment | 30–50% reduction in cross-rack MoE dispatch traffic |
| P-D pair placement | ToR-level proximity, uplink BW | Request assignment to prefill pool | 2–4× reduction in KV transfer latency for disaggregated serving |
| KV store routing | Hop count + BW to each KV store | KV fetch source selection | 15–25% reduction in average KV fetch latency |
| Tensor parallel placement | NVLink vs. InfiniBand cost model | TP group assignment to GPU set | Near-zero cross-rack TP traffic for properly placed groups |

10. The topology-aware cluster is not a hardware redesign

The argument of this essay is not that AI clusters need to adopt a new topology. Fat-tree is entrenched for good reasons: it is administratively simple, it scales, it is understood. Rail-optimized has merit for specific workloads. The point is not which topology is best.

The point is that topology should not be invisible to the serving software. Every inference scheduling decision that involves inter-GPU communication has an implicit network cost that varies with topology distance. Making that cost explicit — by exposing a topology distance matrix to the scheduler and using it in placement, routing, and eviction decisions — does not require any hardware change. It requires that the serving layer stop treating the network as a flat uniform fabric.

The network is not flat. It is a hierarchy with well-defined cost gradients. Intra-rack is cheap. Cross-rack is moderate. Cross-pod is expensive. The traffic patterns of modern inference — MoE dispatch, KV transfer, prefill-decode communication — vary in cost by 4-8× depending on where in the hierarchy they land. Topology-unaware scheduling wastes that 4-8× spread.

The scheduler is the right place to fix it. And the fix starts with treating topology not as a physical plant problem, but as a scheduling input.