A complete technical guide to SambaNova's heterogeneous inference blueprint, from the fundamentals of prefill vs decode, through the RDU dataflow architecture, to the full three-way split with Intel Xeon 6 that defines the agentic AI stack of 2026.
For years, the AI industry obsessed over training. More GPUs. Bigger clusters. Faster interconnects. The assumption was simple: the same parallel compute that trained a model would serve it too. That assumption is quietly breaking.
Today, inference, the act of running a trained model on real user requests, consumes the majority of production AI compute cycles. And it turns out inference has a fundamentally different computational character than training. Where training is a data processing problem, inference is a data flow problem: sequential, memory-hungry, and brutally sensitive to latency.
GPUs were engineered for maximum throughput on large, batched operations. But the decode phase of inference, where a model generates one token at a time, is nothing like that. It is sequential, it is memory-bandwidth-bound, and it is where GPU efficiency collapses. This is the gap SambaNova is building its architecture around.
Before understanding SambaNova's architecture, you need to internalize the two distinct phases of every LLM inference request. They are computationally almost nothing alike.
In prefill, the entire input prompt is processed in parallel. The model reads all input tokens simultaneously, building a KV cache of attention states. This phase is compute-bound and naturally fits GPU-style matrix engines.
In decode, tokens are generated one at a time. Each new token requires reading the KV cache and large weight slices from memory. This phase is memory-bandwidth-bound, sequential, and latency-sensitive.
This asymmetry is profound. During prefill, you can throw thousands of GPU CUDA cores at the problem simultaneously. During decode, you are waiting on memory reads, and adding more compute cores does almost nothing to speed it up. The two phases want entirely different hardware profiles.
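The structural difference between the two phases can be sketched in a few lines. This is a toy illustration, not a real model: `KVCache`, `prefill`, and `decode` are hypothetical names, and the "attention state" is just a placeholder tuple. What it shows is the shape of the work: one pass for the whole prompt, then one pass per generated token, each of which re-reads everything cached so far.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Toy stand-in for per-token attention state (hypothetical structure)."""
    entries: list = field(default_factory=list)
    reads: int = 0  # total number of cached entries read during decode


def prefill(prompt_tokens, cache):
    """Process the whole prompt in one pass and populate the cache."""
    # Prompt positions are all known up front, so real hardware can
    # handle them together as one large batched matrix operation.
    cache.entries.extend(("kv", tok) for tok in prompt_tokens)
    return 1  # one forward pass covers every prompt token


def decode(num_new_tokens, cache):
    """Generate tokens one at a time; each step re-reads the full cache."""
    passes = 0
    for step in range(num_new_tokens):
        cache.reads += len(cache.entries)   # read all prior KV state
        new_token = f"gen{step}"            # placeholder for sampling
        cache.entries.append(("kv", new_token))
        passes += 1                         # one forward pass per token
    return passes


cache = KVCache()
prefill_passes = prefill(["the", "quick", "brown", "fox"], cache)
decode_passes = decode(3, cache)
print(prefill_passes, decode_passes, cache.reads)  # prints: 1 3 15
```

Even in this toy, the cache reads grow with every generated token while the number of prefill passes stays at one, which is the asymmetry described above.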
A GPU running the decode phase is like a sports car in a traffic jam. The engine is powerful, but the bottleneck is the road: memory bandwidth, not arithmetic throughput. RDUs are purpose-built for exactly that road.
The key metrics here are TTFT (Time to First Token) and TPS (Tokens per Second). Optimizing only one is not enough for real workloads. A system with great TTFT but weak decode throughput will still feel slow once long reasoning chains begin.
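As a minimal sketch of how these two metrics are typically derived, the helper below computes both from token arrival timestamps. The function name `latency_metrics` and the sample timestamps are illustrative assumptions, and TPS is measured here over the decode window only (after the first token), which is one common convention among several.

```python
def latency_metrics(request_start, token_times):
    """Return (ttft_seconds, decode_tps) from token arrival timestamps."""
    if not token_times:
        raise ValueError("need at least one generated token")
    # TTFT: delay between sending the request and seeing the first token.
    ttft = token_times[0] - request_start
    # TPS: tokens emitted after the first one, divided by the time they took.
    decode_window = token_times[-1] - token_times[0]
    if len(token_times) < 2 or decode_window <= 0:
        return ttft, 0.0
    tps = (len(token_times) - 1) / decode_window
    return ttft, tps


# Hypothetical trace: first token at 0.5 s, then one token every 0.25 s.
ttft, tps = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(ttft, tps)  # prints: 0.5 4.0
```

Separating the two numbers makes it visible when a system starts quickly but then emits tokens slowly, which is the failure mode described above.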
SambaNova's core hardware insight is that a reconfigurable dataflow architecture fundamentally outperforms a GPU for the decode phase of inference. The SN40L and SN50 Reconfigurable Dataflow Units were built around this premise.
GPUs use a SIMT execution model, a close relative of SIMD, and shine when the same operation is broadcast across large tensors. Decode is the opposite. Each step streams large weight slices and KV state, performs relatively modest compute, and then immediately stalls on the next memory movement. The compute-to-memory-access ratio, often called arithmetic intensity, collapses, so the expensive arithmetic fabric waits on data.
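The collapse in compute-to-memory ratio can be made concrete with a back-of-the-envelope calculation for a single square weight matrix. This is a simplified sketch: it assumes 2-byte values, counts only one read of the weights plus activation input and output, and ignores caching and the KV cache. The function name and the chosen sizes are illustrative.

```python
def arithmetic_intensity(num_tokens, d_model, bytes_per_value=2):
    """FLOPs per byte moved for one (d_model x d_model) matrix multiply."""
    # Each token needs d_model * d_model multiply-adds, i.e. 2 FLOPs each.
    flops = 2 * num_tokens * d_model * d_model
    # The weights are read once regardless of how many tokens share them.
    weight_bytes = bytes_per_value * d_model * d_model
    # Activations: one input vector and one output vector per token.
    activation_bytes = bytes_per_value * 2 * num_tokens * d_model
    return flops / (weight_bytes + activation_bytes)


prefill_ai = arithmetic_intensity(num_tokens=2048, d_model=4096)
decode_ai = arithmetic_intensity(num_tokens=1, d_model=4096)
print(f"prefill: {prefill_ai:.1f} FLOPs/byte")  # 1024.0
print(f"decode:  {decode_ai:.3f} FLOPs/byte")   # just under 1
```

With a 2048-token prompt the weights are reused across every token, so each byte fetched feeds about a thousand operations. With a single decode token the same weights are fetched to do roughly one operation per byte, which is why adding arithmetic units does little for that phase.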
An RDU flips the architecture. Rather than a pool of generic cores repeatedly launching kernels, the compute fabric is reconfigured to match the model's dataflow graph at compile time. Data moves through fused operators with fewer redundant trips to external memory. SambaNova's compiler stack translates PyTorch or TensorFlow models into this fused execution graph.
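The benefit of fusing operators can be illustrated with a toy memory-traffic counter. This is not SambaNova's compiler or runtime; `ExternalMemory`, `run_unfused`, and `run_fused` are hypothetical names, and the model counts only element reads and writes to a pretend off-chip memory. It shows why a fused pipeline makes fewer round trips for the same result.

```python
class ExternalMemory:
    """Counts element-level reads and writes to a pretend off-chip memory."""

    def __init__(self):
        self.reads = 0
        self.writes = 0

    def load(self, values):
        self.reads += len(values)
        return list(values)

    def store(self, values):
        self.writes += len(values)
        return list(values)


def run_unfused(values, ops, mem):
    """Kernel-per-op style: every op round-trips through external memory."""
    buf = list(values)
    for op in ops:
        x = mem.load(buf)
        buf = mem.store([op(v) for v in x])
    return buf


def run_fused(values, ops, mem):
    """Fused style: load once, apply every op on-chip, store once."""
    out = []
    for v in mem.load(values):
        for op in ops:
            v = op(v)
        out.append(v)
    return mem.store(out)


ops = [lambda v: v * 2, lambda v: v + 1, lambda v: max(v, 0)]
data = [-1, 0, 1, 2]

unfused_mem, fused_mem = ExternalMemory(), ExternalMemory()
unfused_out = run_unfused(data, ops, unfused_mem)
fused_out = run_fused(data, ops, fused_mem)

print(unfused_out == fused_out)               # True: same result
print(unfused_mem.reads, unfused_mem.writes)  # 12 12
print(fused_mem.reads, fused_mem.writes)      # 4 4
```

In this toy the traffic saving scales with the number of chained operators, since intermediates never leave the chip in the fused version.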
The punchline is not that GPUs are bad. It is that decode exposes a very specific weakness in GPU-style execution, and SambaNova designed a different machine around that weakness.
SambaNova's bigger systems claim is that inference should not run on one hardware type at all. Instead, each phase should land on the device that actually matches its physics.
GPUs for prefill, RDUs for decode, x86 CPUs for action and orchestration. That is not vendor politics. It is workload decomposition.
Most data centers already have GPU racks. Those GPUs are excellent at prefill. Rather than replacing them entirely, this architecture lets organizations keep existing GPU investment where it performs well and deploy RDUs specifically where GPUs underperform: long-running, memory-bound decode.
Simple chatbot inference is one thing. Modern AI systems are increasingly agentic: they reason in loops, call external tools, write and execute code, query databases, and interact with messy enterprise software stacks. That introduces a third hardware requirement beyond prefill and decode: tool execution.
A coding agent can spend most of its wall-clock time waiting on token generation, not because the model is bad, but because decode throughput has not kept pace with longer reasoning chains and tool-rich workflows.
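A rough wall-clock model shows how decode can dominate an agent run. Every number below is a hypothetical assumption chosen for illustration, not a measurement, and `agent_wall_clock` is an illustrative helper rather than part of any real framework.

```python
def agent_wall_clock(turns, prefill_s, output_tokens, decode_tps, tool_s):
    """Split an agent run's wall-clock time into prefill, decode, and tools."""
    totals = {
        "prefill": turns * prefill_s,
        "decode": turns * output_tokens / decode_tps,
        "tools": turns * tool_s,
    }
    totals["total"] = sum(totals.values())
    return totals


# Assumed workload: 5 reasoning turns, 800 generated tokens per turn,
# 0.5 s of prefill and 2 s of tool execution per turn.
baseline = agent_wall_clock(5, 0.5, 800, decode_tps=40.0, tool_s=2.0)
faster = agent_wall_clock(5, 0.5, 800, decode_tps=80.0, tool_s=2.0)

decode_share = baseline["decode"] / baseline["total"]
print(f"decode share of wall clock: {decode_share:.0%}")
print(f"total: {baseline['total']} s -> {faster['total']} s")
```

Under these assumptions decode accounts for nearly nine tenths of the run, so doubling decode throughput cuts total time far more than any plausible improvement to prefill or tool latency would.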
This is where Intel Xeon 6 enters the design. These CPUs serve both as host processors for SambaNova's own cards and as the action layer for everything that is not raw model inference: tool calls, API requests, compilation, database queries, and general orchestration.
| Operation | Hardware | Why it fits |
|---|---|---|
| Prompt processing / prefill | GPU racks | Compute-dense, parallel matrix operations |
| Token generation / decode | SambaNova RDUs | Memory-bandwidth optimized, fused dataflow execution |
| Tool calls and APIs | Intel Xeon 6 | Native compatibility with enterprise middleware and services |
| Code compilation | Intel Xeon 6 | CPU-native workflows and mature compiler ecosystem |
| Vector DB and system orchestration | Intel Xeon 6 | Good fit for general-purpose infrastructure work |
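The mapping in the table can be expressed as a small routing sketch. The `Tier` enum, the step names, and the `plan` helper are hypothetical and exist only to illustrate the decomposition; a real scheduler would also account for queueing, placement, and data movement.

```python
from enum import Enum


class Tier(Enum):
    GPU = "gpu"  # prompt processing / prefill
    RDU = "rdu"  # token generation / decode
    CPU = "cpu"  # tools, compilation, orchestration


# Hypothetical step names mirroring the rows of the table above.
ROUTES = {
    "prefill": Tier.GPU,
    "decode": Tier.RDU,
    "tool_call": Tier.CPU,
    "compile": Tier.CPU,
    "vector_db": Tier.CPU,
}


def route(step_kind):
    """Return the hardware tier for one step, failing loudly if unknown."""
    try:
        return ROUTES[step_kind]
    except KeyError:
        raise ValueError(f"unknown step kind: {step_kind!r}") from None


def plan(agent_steps):
    """Annotate each step of an agent trace with its target tier."""
    return [(kind, route(kind).value) for kind in agent_steps]


trace = plan(["prefill", "decode", "tool_call", "prefill", "decode"])
for kind, tier in trace:
    print(f"{kind:>9} -> {tier}")
```

A single agent turn bounces between all three tiers, which is why the handoffs between them matter as much as the speed of any one device.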
Enterprise AI cannot operate in a vacuum. It needs to read from databases, trigger webhooks, query search systems, and interact with decades-old operational software. The x86 layer is what makes agentic AI actually deployable rather than merely benchmarkable.
Hardware disaggregation only works if the software pieces can communicate cleanly across heterogeneous boundaries. SambaNova's strategy here is notable because it aligns with open-source interfaces rather than pretending the whole stack will remain proprietary forever.
Three software layers carry that strategy. The first is a set of open-source inference frameworks that increasingly support heterogeneous and disaggregated execution patterns.
The second is a standardized data-transfer layer for moving artifacts such as the KV cache between heterogeneous hardware tiers.
The third is SambaNova's own compiler and runtime layer, evolved from SambaFlow, which translates models into optimized RDU execution graphs.
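To get a feel for what a KV cache handoff between tiers involves, the sketch below sizes the cache for one request and estimates its transfer time. The size formula is the standard one (keys plus values, per layer, per KV head). The model shape and the link bandwidth are hypothetical assumptions, not figures for any specific model or interconnect.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of one request's KV cache: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


def transfer_seconds(num_bytes, link_bytes_per_second):
    """Idealized transfer time, ignoring protocol overhead and latency."""
    return num_bytes / link_bytes_per_second


# Assumed model shape: 32 layers, 8 KV heads of dimension 128, 16-bit values.
size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=8192)
# Assumed link: 50 GB/s between the prefill tier and the decode tier.
seconds = transfer_seconds(size, 50e9)

print(f"{size / 2**30:.1f} GiB")    # 1.0 GiB
print(f"{seconds * 1000:.1f} ms")   # about 21.5 ms
```

Under these assumptions an 8K-token prompt produces about a gibibyte of state that must cross the boundary before decode can start, so the transfer layer sits directly on the TTFT path.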
The use of standardized transport and open inference stacks suggests something bigger than one vendor announcement: the industry increasingly understands that heterogeneous inference only becomes credible when the seams between hardware types are treated as product surfaces, not as proprietary traps.
SambaNova frames the problem around a practical trade-off space. The target is not maximum throughput in isolation or minimum latency in isolation. The target is strong decode speed and strong concurrent session count for real agentic workloads.
Whether every benchmark will agree on the exact shape of this trade-off zone is secondary. The systems argument is the important part: the frontier is moving toward mixed architectures because no single device profile cleanly wins every phase of inference anymore.
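The tension between per-session decode speed and concurrent session count can be sketched with a simple bandwidth-bound model. It assumes each decode step reads the weights once for the whole batch and reads each session's KV cache once, and that memory bandwidth is the only limit. The constants are hypothetical and the model ignores compute, scheduling, and capacity limits.

```python
def decode_operating_point(batch, weight_bytes, kv_bytes_per_session, bandwidth):
    """Return (per_session_tps, aggregate_tps) for a bandwidth-bound step."""
    # One step reads the shared weights once plus every session's KV cache.
    step_seconds = (weight_bytes + batch * kv_bytes_per_session) / bandwidth
    per_session_tps = 1.0 / step_seconds
    return per_session_tps, batch * per_session_tps


WEIGHT_BYTES = 16e9  # assumed: 16 GB of weights
KV_BYTES = 1e9       # assumed: 1 GB of KV cache per session
BANDWIDTH = 2e12     # assumed: 2 TB/s of memory bandwidth

points = {
    b: decode_operating_point(b, WEIGHT_BYTES, KV_BYTES, BANDWIDTH)
    for b in (1, 4, 16, 64)
}
for batch, (per_session, aggregate) in points.items():
    print(f"sessions={batch:>2}  per-session={per_session:6.1f} tok/s  "
          f"aggregate={aggregate:7.1f} tok/s")
```

In this model, adding sessions raises aggregate throughput but lowers the speed each user sees, so a target that demands both pushes toward hardware with more usable memory bandwidth rather than more arithmetic.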
There is a dimension to this story that does not show up in isolated throughput charts: deployment realism. Extremely dense GPU racks increasingly drag facilities into liquid-cooling upgrades, power-delivery rewrites, and painful row-level retrofits.
SambaNova's air-cooled story matters because many enterprises, sovereign clouds, and public-sector environments cannot casually absorb the rack power densities of the newest GPU generations. Lower power and easier cooling are not aesthetic benefits. They are adoption multipliers.
The practical infrastructure constraint is often not whether a model can run somewhere, but whether the building can support the system that runs it.
| Approach | Representative | Strength | Limitation |
|---|---|---|---|
| GPU-only inference | NVIDIA H100/H200 | Massive ecosystem, excellent prefill | Decode underutilization, high power density |
| SRAM-first accelerator | Groq | Very low decode latency | Capacity constraints and tighter scaling envelopes |
| RDU dataflow | SambaNova SN50 | Decode optimization, air-cooled deployments, large-model support | Smaller ecosystem than GPUs |
| Heterogeneous disaggregation | SambaNova + Intel | Best hardware for each phase, leverages existing GPU investment | Higher system complexity |
SambaNova's disaggregated inference architecture represents a broader shift: the recognition that inference is not a monolithic workload and that the era of one accelerator type doing everything gracefully is ending.
The prefill-decode split exposes a real hardware mismatch that the industry has often papered over by throwing more GPUs at the problem. As models get larger, reasoning chains get longer, and agentic workflows demand both concurrency and fast token emission, that workaround becomes harder to justify economically and physically.
The three-way disaggregation of GPUs, RDUs, and x86 CPUs is a clean systems answer to that reality. Whether SambaNova wins the category or not, the architectural direction it is naming feels much bigger than one company.
This essay covers the systems logic behind SambaNova's disaggregated inference strategy: why prefill and decode want different silicon, why agentic AI adds a third orchestration layer, and why open software seams matter as much as raw chip benchmarks.