MAN\SH AI
Architecture Deep Dive · AI Infrastructure · Agentic AI

SambaNova Disaggregated Inference: The Architecture Reinventing How AI Thinks at Scale

A complete technical guide to SambaNova's heterogeneous inference blueprint, from the fundamentals of prefill vs decode, through the RDU dataflow architecture, to the full three-way split with Intel Xeon 6 that defines the agentic AI stack of 2026.

The Inference Crisis Nobody Saw Coming

For years, the AI industry obsessed over training. More GPUs. Bigger clusters. Faster interconnects. The assumption was simple: the same parallel compute that trained a model would serve it too. That assumption is quietly breaking.

Today, inference, the act of running a trained model on real user requests, consumes the majority of production AI compute cycles. And it turns out inference has a fundamentally different computational character than training. Where training is a data processing problem, inference is a data flow problem: sequential, memory-hungry, and brutally sensitive to latency.

31×: projected growth in AI data-center power demand from 2024 to 2035
120 kW: per rack for top-end GPU systems like Blackwell NVL72
~10×: lower power draw claimed for SambaNova RDU racks versus comparable GPU systems

GPUs were engineered for maximum throughput on large, batched operations. But the decode phase of inference, where a model generates one token at a time, is nothing like that. It is sequential, it is memory-bandwidth-bound, and it is where GPU efficiency collapses. This is the crack in the foundation that SambaNova is building on.

LLM Inference 101: Prefill vs Decode

Before understanding SambaNova's architecture, you need to internalize the two distinct phases of every LLM inference request. They are computationally almost nothing alike.

Prefill Phase

The entire input prompt is processed in parallel. The model reads all input tokens simultaneously, building a KV cache of attention states. This phase is compute-bound and naturally fits GPU-style matrix engines.

Decode Phase

Tokens are generated one at a time. Each new token requires reading the KV cache and large weight slices from memory. This phase is memory-bandwidth-bound, sequential, and latency-sensitive.

This asymmetry is profound. During prefill, you can throw thousands of GPU CUDA cores at the problem simultaneously. During decode, you are waiting on memory reads, and adding more compute cores does almost nothing to speed it up. The two phases want entirely different hardware profiles.
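The asymmetry can be made concrete with a rough roofline-style estimate. The sketch below is a simplification, not a measurement: it assumes a dense model, counts about 2 FLOPs per parameter per token, assumes weights are read once per forward pass, and ignores KV-cache and activation traffic. The `Accelerator` numbers are hypothetical placeholders, not specifications of any real chip.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    """Hypothetical device; the numbers used below are placeholders, not vendor specs."""
    peak_flops: float      # FLOP/s
    mem_bandwidth: float   # bytes/s

    @property
    def ridge_point(self) -> float:
        """FLOPs per byte required before compute, not memory, becomes the limit."""
        return self.peak_flops / self.mem_bandwidth


def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float) -> float:
    """FLOPs per weight byte for one forward pass of a dense model.

    Each parameter contributes about 2 FLOPs (multiply and add) per token,
    while the weights are read from memory once per pass.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def bound(intensity: float, device: Accelerator) -> str:
    """Classify a workload relative to the device's ridge point."""
    return "compute-bound" if intensity >= device.ridge_point else "memory-bound"


# Assumed values for illustration only.
device = Accelerator(peak_flops=1.0e15, mem_bandwidth=3.0e12)
prefill = arithmetic_intensity(tokens_per_pass=2048, bytes_per_param=2)  # 2048-token prompt
decode = arithmetic_intensity(tokens_per_pass=1, bytes_per_param=2)      # one token, batch of 1

print(f"ridge point: {device.ridge_point:.0f} FLOPs/byte")
print(f"prefill: {prefill:.0f} FLOPs/byte -> {bound(prefill, device)}")
print(f"decode:  {decode:.0f} FLOPs/byte -> {bound(decode, device)}")
```

Under these assumptions prefill sits far above the ridge point and decode far below it, which is why adding arithmetic units helps the first phase and barely moves the second.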

Key insight

A GPU running the decode phase is like a sports car in a traffic jam. The engine is powerful, but the bottleneck is the road: memory bandwidth, not arithmetic throughput. RDUs are purpose-built for exactly that road.

[Figure: LLM inference pipeline. Input prompt ("Tell me about AI") → PREFILL (parallel processing, compute-bound) → KV cache (attention states stored) → DECODE (sequential generation, memory-bandwidth-bound) → output tokens one at a time (tok1, tok2, tok3, ...). The TTFT window is marked on the pipeline.]
Fig 1. Prefill processes the prompt in parallel and builds the KV cache. Decode then generates output one token at a time using that cache.

The key metrics here are TTFT (Time to First Token) and TPS (Tokens per Second). Optimizing only one is not enough for real workloads. A system with great TTFT but weak decode throughput will still feel slow once long reasoning chains begin.
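As a minimal sketch of how the two metrics relate, the helpers below compute TTFT and decode TPS from per-token arrival timestamps and then combine them into an end-to-end latency estimate. The function names and the sample numbers are illustrative assumptions, not part of any benchmark harness.

```python
from typing import Sequence


def ttft(request_sent: float, token_times: Sequence[float]) -> float:
    """Time to first token: delay between sending the request and the first token."""
    if not token_times:
        raise ValueError("no tokens were generated")
    return token_times[0] - request_sent


def decode_tps(token_times: Sequence[float]) -> float:
    """Decode tokens per second, measured from the first token onward."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure decode rate")
    elapsed = token_times[-1] - token_times[0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return (len(token_times) - 1) / elapsed


def end_to_end_seconds(ttft_s: float, tps: float, output_tokens: int) -> float:
    """Total response time: wait for the first token, then stream the rest."""
    return ttft_s + (output_tokens - 1) / tps


# Hypothetical trace: request at t=0, tokens arriving every 0.25 s after 0.5 s.
times = [0.5, 0.75, 1.0, 1.25, 1.5]
print(ttft(0.0, times), decode_tps(times))

# Great TTFT, weak decode: a 2000-token reasoning chain still takes over 100 s.
print(end_to_end_seconds(ttft_s=0.2, tps=20.0, output_tokens=2000))
```

The last call illustrates the point in the text: with a long output, the decode term dominates and a fast first token does little for the overall wait.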

The RDU: A Chip Born for Dataflow

SambaNova's core hardware insight is that a reconfigurable dataflow architecture fundamentally outperforms a GPU for the decode phase of inference. The SN40L and SN50 Reconfigurable Dataflow Units were built around this premise.

What Makes a GPU Inefficient for Decode

GPUs use a SIMT (SIMD-style) execution model and shine when the same operation is broadcast across large tensors. Decode is the opposite. Each step streams large weight slices and KV state, performs relatively modest compute, and then immediately stalls on the next memory movement. The compute-to-memory-access ratio collapses, so the expensive arithmetic fabric waits on data.

The Dataflow Difference

An RDU flips the architecture. Rather than a pool of generic cores repeatedly launching kernels, the compute fabric is reconfigured to match the model's dataflow graph at compile time. Data moves through fused operators with fewer redundant trips to external memory. SambaNova's compiler stack translates PyTorch or TensorFlow models into this fused execution graph.

[Figure: GPU versus RDU memory pattern. GPU approach: repeated DRAM trips per kernel (DRAM ↔ Op A, Op B, Op C); memory dominates and kernels bounce in and out. RDU approach: fused dataflow graph (Weights / KV → Fuse A → Fuse B → Fuse C); data stays in a tighter graph with fewer wasted trips.]
Fig 2. GPUs repeatedly shuttle data between memory and kernels. RDUs try to keep the execution graph fused so decode spends less time paying that tax.
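A back-of-the-envelope traffic model shows why fusion matters. This is a deliberately simplified sketch: it assumes a chain of elementwise operators where each unfused kernel reads its input from external memory and writes its output back, while a fused chain reads once and writes once. It does not model SambaNova's compiler or any real GPU kernel schedule.

```python
def unfused_traffic(num_elements: int, num_ops: int, bytes_per_element: int = 2) -> int:
    """Bytes moved when every op is its own kernel: one read and one write per op."""
    if num_ops < 1:
        raise ValueError("need at least one op")
    return num_ops * 2 * num_elements * bytes_per_element


def fused_traffic(num_elements: int, num_ops: int, bytes_per_element: int = 2) -> int:
    """Bytes moved when the whole chain is fused: one read in, one write out."""
    if num_ops < 1:
        raise ValueError("need at least one op")
    return 2 * num_elements * bytes_per_element


# Three ops (Op A, Op B, Op C) over one million fp16 elements.
n, ops = 1_000_000, 3
print(unfused_traffic(n, ops))  # 12000000 bytes
print(fused_traffic(n, ops))    # 4000000 bytes
print(unfused_traffic(n, ops) / fused_traffic(n, ops))  # 3.0
```

In this toy model the saving grows linearly with the length of the fused chain, since intermediate results never leave the chip.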

The punchline is not that GPUs are bad. It is that decode exposes a very specific weakness in GPU-style execution, and SambaNova designed a different machine around that weakness.

Disaggregated Inference: The Three-Way Split

SambaNova's broader, systems-level claim is that inference should not run on a single hardware type at all. Instead, each phase should land on the device that actually matches its physics.

Architecture thesis

GPU for prefill, RDU for decode, x86 CPUs for action and orchestration. That is not vendor politics. It is workload decomposition.

[Figure: Three-way disaggregated inference architecture. GPU racks (prefill layer): PREFILL, compute-bound, parallel matrix ops; existing GPU investment. → KV cache → SambaRacks with SN40 / SN50 RDUs: DECODE, memory-bandwidth-bound, token-by-token; decode-optimized silicon. → actions → Intel Xeon 6 action CPUs: ORCHESTRATION, tool calls and APIs, compilation and systems; enterprise x86 compatibility. vLLM, SGLang, and NIXL run across the tiers.]
Fig 3. SambaNova's heterogeneous inference stack: GPUs handle prefill, RDUs handle decode, and Intel Xeon 6 handles the action layer around the model.

Why This Makes Sense Economically

Most data centers already have GPU racks. Those GPUs are excellent at prefill. Rather than replacing them entirely, this architecture lets organizations keep existing GPU investment where it performs well and deploy RDUs specifically where GPUs underperform: long-running, memory-bound decode.

The Agentic AI Problem: Why Disaggregation Goes Deeper

Simple chatbot inference is one thing. Modern AI systems are increasingly agentic: they reason in loops, call external tools, write and execute code, query databases, and interact with messy enterprise software stacks. That introduces a third hardware requirement beyond prefill and decode: tool execution.

The agentic bottleneck

A coding agent can spend most of its wall-clock time waiting on token generation, not because the model is bad, but because decode throughput has not kept pace with longer reasoning chains and tool-rich workflows.
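A simple time budget illustrates how decode can dominate an agent loop. The sketch below is an assumption-laden model: the `AgentStep` fields, the throughput figures, and the three-step trace are all hypothetical, chosen only to show the arithmetic.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class AgentStep:
    """One reasoning-plus-tool iteration of a hypothetical agent."""
    prompt_tokens: int    # tokens processed in prefill
    output_tokens: int    # tokens generated in decode
    tool_seconds: float   # time spent in the tool call


def time_breakdown(steps: Sequence[AgentStep],
                   prefill_tps: float,
                   decode_tps: float) -> Dict[str, float]:
    """Split total wall-clock time into prefill, decode, and tool execution."""
    return {
        "prefill": sum(s.prompt_tokens for s in steps) / prefill_tps,
        "decode": sum(s.output_tokens for s in steps) / decode_tps,
        "tools": sum(s.tool_seconds for s in steps),
    }


def decode_share(breakdown: Dict[str, float]) -> float:
    """Fraction of wall-clock time spent generating tokens."""
    return breakdown["decode"] / sum(breakdown.values())


# Hypothetical three-step coding-agent trace and assumed throughputs.
trace = [
    AgentStep(prompt_tokens=4000, output_tokens=800, tool_seconds=2.0),
    AgentStep(prompt_tokens=6000, output_tokens=1200, tool_seconds=5.0),
    AgentStep(prompt_tokens=8000, output_tokens=1000, tool_seconds=1.0),
]
budget = time_breakdown(trace, prefill_tps=10_000.0, decode_tps=50.0)
print(budget)
print(f"decode share: {decode_share(budget):.0%}")
```

With these made-up inputs, decode accounts for 60 of roughly 70 seconds. Different workloads will shift the split, but the structure of the calculation stays the same.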

This is where Intel Xeon 6 enters the design. Those CPUs act both as host CPUs on SambaNova's own cards and as the action layer for everything that is not raw model inference: tool calls, API requests, compilation, database queries, and general orchestration.

| Operation | Hardware | Why it fits |
| --- | --- | --- |
| Prompt processing / prefill | GPU racks | Compute-dense, parallel matrix operations |
| Token generation / decode | SambaNova RDUs | Memory-bandwidth optimized, fused dataflow execution |
| Tool calls and APIs | Intel Xeon 6 | Native compatibility with enterprise middleware and services |
| Code compilation | Intel Xeon 6 | CPU-native workflows and mature compiler ecosystem |
| Vector DB and system orchestration | Intel Xeon 6 | Good fit for general-purpose infrastructure work |
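
The mapping in the table above can be sketched as a small routing table. This is an illustrative toy, not SambaNova's scheduler: the operation keys, the `Tier` enum, and the `route` and `plan` helpers are hypothetical names introduced here.

```python
from enum import Enum
from typing import Iterable, List, Tuple


class Tier(Enum):
    GPU = "GPU racks"
    RDU = "SambaNova RDUs"
    CPU = "Intel Xeon 6"


# Operation names are hypothetical labels for the rows of the table.
ROUTES = {
    "prefill": Tier.GPU,
    "decode": Tier.RDU,
    "tool_call": Tier.CPU,
    "compile": Tier.CPU,
    "vector_db": Tier.CPU,
    "orchestration": Tier.CPU,
}


def route(operation: str) -> Tier:
    """Return the hardware tier assigned to an operation."""
    try:
        return ROUTES[operation]
    except KeyError:
        raise ValueError(f"unknown operation: {operation!r}") from None


def plan(operations: Iterable[str]) -> List[Tuple[str, str]]:
    """Annotate a sequence of operations with the tier that would run each one."""
    return [(op, route(op).value) for op in operations]


# One agentic iteration: read the prompt, generate, call a tool, then repeat.
for op, tier in plan(["prefill", "decode", "tool_call", "prefill", "decode"]):
    print(f"{op:>10} -> {tier}")
```

A real deployment would also have to move state such as the KV cache between tiers, which is the job of the transport layer discussed in the software stack section.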

Enterprise AI cannot operate in a vacuum. It needs to read from databases, trigger webhooks, query search systems, and interact with decades-old operational software. The x86 layer is what makes agentic AI actually deployable rather than merely benchmarkable.

The Software Stack: Standards Over Silos

Hardware disaggregation only works if the software pieces can communicate cleanly across heterogeneous boundaries. SambaNova's strategy here is notable because it aligns with open-source interfaces rather than pretending the whole stack will remain proprietary forever.

Inference engine

vLLM and SGLang

Open-source inference frameworks that increasingly support heterogeneous and disaggregated execution patterns.

Transport

NIXL

A standardized data-transfer layer for moving artifacts like KV cache between heterogeneous hardware tiers.

Compiler

SambaStack

SambaNova's compiler and runtime layer, evolved from SambaFlow, for translating models into optimized RDU execution graphs.
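
To get a feel for what the transport layer has to move, the sketch below estimates the size of a KV cache and the time to ship it between tiers. It does not use NIXL or any real transfer API. The model shape and the 50 GB/s link are illustrative assumptions, and the formula covers standard multi-head or grouped-query attention only.

```python
def kv_cache_bytes(num_layers: int,
                   num_kv_heads: int,
                   head_dim: int,
                   seq_len: int,
                   bytes_per_element: int = 2,
                   batch_size: int = 1) -> int:
    """Size of the KV cache: keys and values for every layer, head, and token."""
    keys_and_values = 2
    return (keys_and_values * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_element * batch_size)


def transfer_seconds(num_bytes: int, link_bytes_per_second: float) -> float:
    """Ideal transfer time over a link, ignoring latency and protocol overhead."""
    return num_bytes / link_bytes_per_second


# Hypothetical model shape: 32 layers, 8 KV heads of dimension 128, fp16 cache.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)
print(f"{size / 2**30:.2f} GiB for an 8192-token prompt")

# Assumed 50 GB/s link between the prefill and decode tiers.
print(f"{transfer_seconds(size, 50e9) * 1000:.1f} ms to hand off")
```

Under these assumptions a single 8192-token session carries about 1 GiB of cache, so the hand-off cost is small next to a long decode but not free, and it scales with both context length and concurrency.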

The use of standardized transport and open inference stacks suggests something bigger than one vendor announcement: the industry increasingly understands that heterogeneous inference only becomes credible when the seams between hardware types are treated as product surfaces, not as proprietary traps.

The Goldilocks Zone: Balancing Throughput and Latency

SambaNova frames the problem around a practical trade-off space. The target is not maximum throughput in isolation or minimum latency in isolation. The target is fast per-session decode combined with a high number of concurrent sessions, which is what real agentic workloads demand.

[Figure: Inference Goldilocks zone. Axes: token speed and throughput. SRAM systems: fast but rigid. GPU clusters: throughput-heavy. SambaNova RDUs sit in the Goldilocks zone (fast decode plus concurrency), the target sweet spot for agentic inference.]
Fig 4. The trade-off space as framed by heterogeneous inference advocates: GPUs are strong at throughput-heavy prefill, SRAM systems at raw speed, and RDUs aim for the middle ground that suits agentic decode.
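The tension between per-session speed and concurrency can be sketched with a purely memory-bound decode model. This is a simplification under stated assumptions: every decode step streams the weights once for the whole batch plus each session's KV cache, compute time is ignored, and the `CONFIG` values are hypothetical rather than measured on any device.

```python
def decode_step_seconds(batch_size: int,
                        weight_bytes: float,
                        kv_bytes_per_session: float,
                        mem_bandwidth: float) -> float:
    """Time for one decode step when memory traffic is the only cost."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return (weight_bytes + batch_size * kv_bytes_per_session) / mem_bandwidth


def per_session_tps(batch_size: int, **cfg: float) -> float:
    """Tokens per second seen by each individual session."""
    return 1.0 / decode_step_seconds(batch_size, **cfg)


def aggregate_tps(batch_size: int, **cfg: float) -> float:
    """Tokens per second summed over all concurrent sessions."""
    return batch_size / decode_step_seconds(batch_size, **cfg)


# Assumed: 16 GB of weights, 1 GB of KV cache per session, 2 TB/s of bandwidth.
CONFIG = dict(weight_bytes=16e9, kv_bytes_per_session=1e9, mem_bandwidth=2e12)

for batch in (1, 16, 64):
    print(f"batch={batch:>2}  per-session={per_session_tps(batch, **CONFIG):6.1f} tok/s"
          f"  aggregate={aggregate_tps(batch, **CONFIG):7.1f} tok/s")
```

In this model, raising concurrency lifts aggregate throughput while each session slows down. The "zone" in the figure is the range of batch sizes where both numbers remain acceptable for the workload.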

Whether every benchmark will agree on the exact shape of this zone is secondary. The systems argument is the important part: the frontier is moving toward mixed architectures because no single device profile cleanly wins every phase of inference anymore.

The Power Angle: Why Air Cooling Matters

There is a dimension to this story that does not show up in isolated throughput charts: deployment realism. Extremely dense GPU racks increasingly drag facilities into liquid-cooling upgrades, power-delivery rewrites, and painful row-level retrofits.

SambaNova's air-cooled story matters because many enterprises, sovereign clouds, and public-sector environments cannot casually absorb the rack power densities of the newest GPU generations. Lower power and easier cooling are not aesthetic benefits. They are adoption multipliers.
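
The facility constraint reduces to simple division. The sketch below uses the 120 kW per-rack figure quoted earlier and derives an RDU rack figure from the claimed ~10× reduction. That derived 12 kW value is an assumption for illustration, not a published specification, and the 600 kW facility budget is hypothetical.

```python
def racks_within_budget(facility_kw: float, rack_kw: float) -> int:
    """Number of whole racks that fit inside a facility power budget."""
    if rack_kw <= 0:
        raise ValueError("rack_kw must be positive")
    return int(facility_kw // rack_kw)


GPU_RACK_KW = 120.0               # top-end GPU rack figure cited in this article
RDU_RACK_KW = GPU_RACK_KW / 10.0  # assumption derived from the ~10x claim
FACILITY_KW = 600.0               # hypothetical power budget for one hall

print(racks_within_budget(FACILITY_KW, GPU_RACK_KW))  # 5
print(racks_within_budget(FACILITY_KW, RDU_RACK_KW))  # 50
```

Rack count is not the same as delivered performance, so this only bounds what the building can host. Comparing useful work per kilowatt would need per-rack throughput figures that this article does not provide.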

The practical infrastructure constraint is often not whether a model can run somewhere, but whether the building can support the system that runs it.
A useful way to read the SambaNova value proposition

How We Got Here: A Brief Timeline

2017: SambaNova Systems is founded around reconfigurable dataflow hardware for machine learning acceleration.
2023: SN40L arrives and the company increasingly pivots toward inference rather than training as the primary commercial wedge.
2024: SambaNova Cloud appears, turning the hardware thesis into an externally consumable inference service.
2025: Managed deployments and broader cloud-service positioning sharpen the company around production inference use cases.
2026: SN50, Intel Xeon 6 collaboration, and the public framing of a three-way heterogeneous inference architecture bring the disaggregated story into sharper focus.

Where This Fits in the Competitive Landscape

| Approach | Representative | Strength | Limitation |
| --- | --- | --- | --- |
| GPU-only inference | NVIDIA H100/H200 | Massive ecosystem, excellent prefill | Decode underutilization, high power density |
| SRAM-first accelerator | Groq | Very low decode latency | Capacity constraints and tighter scaling envelopes |
| RDU dataflow | SambaNova SN50 | Decode optimization, air-cooled deployments, large-model support | Smaller ecosystem than GPUs |
| Heterogeneous disaggregation | SambaNova + Intel | Best hardware for each phase, leverages existing GPU investment | Higher system complexity |

The Bigger Picture

SambaNova's disaggregated inference architecture represents a broader shift: the recognition that inference is not a monolithic workload, and that the era of one accelerator type doing everything gracefully is ending.

The prefill-decode split exposes a real hardware mismatch that the industry has often papered over by throwing more GPUs at the problem. As models get larger, reasoning chains get longer, and agentic workflows demand both concurrency and fast token emission, that workaround becomes harder to justify economically and physically.

The three-way disaggregation of GPUs, RDUs, and x86 CPUs is a clean systems answer to that reality. Whether SambaNova wins the category or not, the architectural direction it is naming feels much bigger than one company.

This essay covers the systems logic behind SambaNova's disaggregated inference strategy: why prefill and decode want different silicon, why agentic AI adds a third orchestration layer, and why open software seams matter as much as raw chip benchmarks.