There is a natural next step once you start thinking about SmartNICs, GPUDirect, and zero-copy system design: if storage can reach the accelerator without all the old CPU and host-memory overhead, can the rest of the path become just as clean?
The answer is subtle. At the system level, yes — a great deal of waste can be removed. GPUDirect Storage provides a direct DMA path between storage and GPU memory while avoiding a CPU bounce buffer, and GPUDirect RDMA enables direct data exchange between GPUs and third-party PCIe peer devices such as ConnectX SmartNICs or BlueField DPUs.
But once those host-side copies are reduced, the bottleneck does not disappear. It moves inward. The new seam is often inside the accelerator, between high-bandwidth global memory and the much smaller, faster SRAM and register files that actually feed the compute units.
System-level zero-copy is not the end of the story. It just means the remaining bottleneck becomes easier to see.
## Your pipeline intuition, and where it breaks
The clean mental model to start with is:
SSD → SmartNIC / DPU → RDMA over fabric → GPU HBM → SRAM → compute
That already eliminates a lot:
- **Host DRAM staging**: reduced dramatically when the payload no longer bounces through host DRAM.
- **CPU copy work**: bypassed or minimized when GPUDirect Storage is used with a GPU-first application.
- **NIC↔GPU indirection**: reduced when RDMA and GPUDirect RDMA let the NIC/DPU and GPU exchange data more directly.
But the model breaks at the final step. Even with a nearly ideal system path, there is still a fundamentally different kind of movement remaining:
GPU HBM → on-chip SRAM / registers → tensor cores / ALUs
That movement is not a bad software artifact. It is a consequence of the accelerator's physical memory hierarchy.
## Why HBM → SRAM movement exists
Accelerators use multiple memory tiers because no single memory technology gives you huge capacity, ultra-low latency, and extreme energy efficiency per access simultaneously. HBM gives large capacity and high bandwidth, but it is still "farther" from the compute units than on-chip SRAM and register files.
- **HBM**: large and high-bandwidth (tens of GB), but still off-array and relatively expensive to touch every cycle.
- **On-chip SRAM and register files**: tiny (on the order of MB), ultra-fast, physically close to compute. This is what actually feeds the math engines efficiently.
So the question "can we avoid HBM → SRAM movement?" has a hard answer: not really. You can reduce it, structure it, overlap it, and reuse it better. But if compute units run from SRAM-like local storage, some form of staging into that local storage remains fundamental to how accelerators work.
Once the bytes have arrived at the accelerator, the remaining problem is no longer "host copies." It is "how efficiently do I feed local compute from the global memory tier?"
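The amortization logic can be sketched with a toy blocked matmul in NumPy, where explicit `.copy()` calls stand in for staging tiles into SRAM. This is an illustrative model, not a real kernel; the function name and the byte accounting are assumptions made for the sketch:

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Blocked matmul that stages tiles into small local buffers,
    mimicking HBM -> SRAM staging, and counts the bytes staged."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=np.float32)
    bytes_staged = 0
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            acc = np.zeros((tile, tile), dtype=np.float32)  # "SRAM" accumulator
            for k in range(0, n, tile):
                a = A[i:i+tile, k:k+tile].copy()  # stage one A tile
                b = B[k:k+tile, j:j+tile].copy()  # stage one B tile
                bytes_staged += a.nbytes + b.nbytes
                acc += a @ b   # tile^3 multiply-adds per 2*tile^2 staged values
            C[i:i+tile, j:j+tile] = acc
    return C, bytes_staged

n = 128
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)
C, staged = tiled_matmul(A, B, tile=32)
assert np.allclose(C, A @ B, atol=1e-2)
# Staged traffic scales as 2*n^3/tile * 4 bytes: doubling the tile size
# halves the bytes pulled from "HBM" for the same arithmetic.
print(staged)  # 524288 bytes for tile=32
```

The larger the tile (the stand-in for a bigger resident working set), the fewer bytes cross the global-to-local boundary per unit of arithmetic, which is the whole reuse argument in miniature.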
## Training vs. inference

### Training
Training is extremely movement-heavy because you are not only reading weights and activations but also writing gradients, optimizer state, and updated parameters. The HBM↔SRAM path is exercised constantly, and data reuse matters enormously. The system-level storage path is still important for checkpointing, dataset streaming, and distributed state movement, but the dominant steady-state traffic during training is usually inside the accelerator.
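A back-of-envelope traffic model makes the "movement-heavy" claim concrete. The byte counts below are illustrative assumptions (bf16 weights and gradients, Adam-style fp32 moment pairs), not measurements of any particular system:

```python
def training_bytes_per_step(params, dtype_bytes=2, opt_bytes_per_param=8):
    """Rough global-memory traffic for one optimizer update step.

    Assumed accounting: read bf16 weights and gradients (2 bytes each),
    read and write Adam-style optimizer state (two fp32 moments,
    8 bytes per parameter), write updated weights.
    """
    read_weights = params * dtype_bytes
    read_grads = params * dtype_bytes
    read_opt_state = params * opt_bytes_per_param
    write_opt_state = params * opt_bytes_per_param
    write_weights = params * dtype_bytes
    return (read_weights + read_grads + read_opt_state
            + write_opt_state + write_weights)

# Hypothetical 7B-parameter model: ~154 GB of traffic per update step,
# before counting any forward/backward activation traffic at all.
step_bytes = training_bytes_per_step(7_000_000_000)
print(step_bytes / 1e9)  # 154.0 GB
```

Even at multi-TB/s HBM bandwidth, that is tens of milliseconds per step spent purely on the optimizer's memory traffic, which is why reuse across the HBM↔SRAM boundary dominates training performance.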
### Inference
Inference is where the on-chip question becomes especially interesting, because inference increasingly looks memory-bound rather than purely compute-bound. Long-context serving, KV cache pressure, and weight streaming all increase pressure on the memory hierarchy. Once inference is underway, the hot issue is still how weights, activations, and cache state move between global memory and the on-chip working set.
| Mode | Main system concern | Main on-chip concern |
|---|---|---|
| Training | Dataset ingest, checkpoints, distributed synchronization | Weight / activation / gradient movement and reuse across HBM↔SRAM |
| Inference | Request-time data supply, model loading, cache/state movement | Weight / KV / activation movement between HBM and local SRAM per token |
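The KV-cache pressure mentioned above can be quantified with simple arithmetic. The configuration numbers here are assumed for illustration and do not describe any specific model or device:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    """KV cache a decoder keeps resident, and re-reads on every generated
    token: one K and one V tensor per layer (hence the factor of 2)."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# Illustrative configuration (assumed, not any specific model):
cache = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=32_000)
print(cache / 1e9)  # ~4.19 GB of cache state

# If every decoded token must stream the whole cache from HBM,
# bandwidth alone caps the decode rate, regardless of FLOPs:
hbm_bandwidth = 3.0e12                     # assume ~3 TB/s HBM
max_tokens_per_s = hbm_bandwidth / cache   # ~715 tokens/s upper bound
```

This is the sense in which long-context inference is memory-bound: the ceiling falls straight out of cache size divided by global-memory bandwidth, before compute enters the picture.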
## What changes in SRAM-heavy architectures
If inference shifts toward more SRAM-centric designs, the architecture improves in a specific way. It does not make movement disappear — it changes movement from something reactive and repetitive into something structured and schedulable.
Today's common pattern:

- HBM ↔ SRAM (frequent, repeating working-set churn)

The SRAM-heavy, deliberate pattern:

- HBM → SRAM (less frequent, more deliberate residency)
- SRAM → multiple operations → writeback later
The win is not "no movement." The win is:

- **Residency**: keep data in SRAM long enough to amortize the cost of bringing it there across multiple operations.
- **Tiling**: move smaller, more useful chunks that map cleanly to the local compute structure, not entire tensors reactively.
- **Overlap**: load tile N+1 while tile N is being computed, hiding transfer latency under execution.
The better framing is not "can SRAM remove movement?" but "can SRAM-based designs turn movement into a deterministic schedule instead of a reactive penalty?"
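The deterministic-schedule framing is easy to see in a small timing model. The tile counts and per-tile costs below are arbitrary illustrative units, and the formulas assume an idealized two-buffer pipeline:

```python
def serial_time(n_tiles, t_load, t_compute):
    """Load a tile, then compute on it, strictly one after the other."""
    return n_tiles * (t_load + t_compute)

def double_buffered_time(n_tiles, t_load, t_compute):
    """Load tile N+1 into a second buffer while tile N is computed.
    Steady state is paced by max(load, compute); only the first load
    and the last compute stick out past the overlapped region."""
    if n_tiles == 0:
        return 0.0
    return t_load + (n_tiles - 1) * max(t_load, t_compute) + t_compute

# Illustrative costs (arbitrary units): loads slightly cheaper than compute.
s = serial_time(64, t_load=8.0, t_compute=10.0)
d = double_buffered_time(64, t_load=8.0, t_compute=10.0)
print(s, d)  # 1152.0 648.0 -- transfer latency almost fully hidden
```

Once loads are cheaper than compute, the pipeline runs at pure compute speed and the HBM→SRAM movement stops showing up in the end-to-end time at all: it has become a schedule, not a penalty.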
## How this works on TPU
TPU is a useful reference because it makes the dataflow philosophy explicit. Google's public TPU architecture documentation describes the heart of the chip as large systolic arrays, which already implies a dataflow style: the array wants a steady, structured stream of operands rather than random, cache-like access patterns.
HBM → on-chip buffer / local memory → systolic array → on-chip buffer → HBM
TPU does not magically eliminate the off-chip to on-chip transition. What it does is make that transition feel more like part of the intended execution model. The compiler and runtime know they are feeding a systolic array with tiles — not hoping a general-purpose cache hierarchy behaves well enough.
- **GPU**: more general-purpose execution model, more flexible kernels, but often more reactive to kernel structure and runtime choices.
- **TPU**: more explicit array-feeding model; data motion into local buffers is part of the design center rather than an emergent side effect.
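To make the "steady, structured stream" concrete, here is a toy cycle-by-cycle simulation of an output-stationary systolic array. This is a didactic sketch of the general technique using the textbook skewed-injection schedule, not a description of actual TPU internals:

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy output-stationary systolic array computing C = A @ B.
    A streams in from the left edge, B from the top edge, each skewed
    by one cycle per row/column; every PE multiply-accumulates whatever
    operand pair flows past it on each cycle."""
    n = A.shape[0]
    C = np.zeros((n, n))
    a_reg = np.zeros((n, n))   # A operand currently held by PE (i, j)
    b_reg = np.zeros((n, n))   # B operand currently held by PE (i, j)
    for t in range(3 * n - 2):                 # cycles until last wavefront exits
        a_reg[:, 1:] = a_reg[:, :-1].copy()    # A values march right
        b_reg[1:, :] = b_reg[:-1, :].copy()    # B values march down
        for i in range(n):                     # inject skewed A column at left edge
            k = t - i
            a_reg[i, 0] = A[i, k] if 0 <= k < n else 0.0
        for j in range(n):                     # inject skewed B row at top edge
            k = t - j
            b_reg[0, j] = B[k, j] if 0 <= k < n else 0.0
        C += a_reg * b_reg                     # every PE: one MAC per cycle
    return C

rng = np.random.default_rng(0)
A = rng.random((4, 4))
B = rng.random((4, 4))
assert np.allclose(systolic_matmul(A, B), A @ B)
```

Notice that the array never fetches anything on demand: operands must be pre-staged at the edges in exactly the right order, which is why compiler-planned movement into local buffers is the design center rather than an afterthought.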
## System-level movement vs. on-chip movement
This is the most important conceptual split in the whole discussion:
| Problem class | Example | Can it be reduced? | How |
|---|---|---|---|
| System-level | SSD ↔ SmartNIC/DPU ↔ GPU | Yes, substantially | GPUDirect Storage, RDMA, GPUDirect RDMA, better fabric placement, fewer bounce buffers |
| On-chip | HBM ↔ SRAM / registers | Not eliminated, only optimized | Tiling, streaming, overlap, compiler scheduling, longer residency, more local reuse |
We are close to eliminating many host-side bounce buffers. What remains is the accelerator's own internal memory hierarchy.
## What the frontier actually is
The frontier is not "make SRAM directly RDMA-addressable tomorrow." SRAM is not generally exposed as a normal external DMA target, and local memory structures inside accelerators are tightly coupled to timing, buffering, and compute orchestration. What matters more in the near term is making the path from external storage or fabric into the accelerator's working set as predictable as possible.
The winning architectures will likely do three things simultaneously:

- **Remove host-side waste**: use SmartNICs, DPUs, GPUDirect Storage, and GPUDirect RDMA so the host CPU is not the payload janitor.
- **Schedule on-chip movement**: move data from HBM into SRAM in a planned, compiler- or runtime-aware way rather than through reactive churn.
- **Keep the accelerator first and last**: design the software so the accelerator is the first and last meaningful consumer of the payload.
System-level zero-copy is becoming real. The next bottleneck is on-chip memory orchestration. The winner will not be the architecture that pretends HBM→SRAM movement disappears, but the one that makes that movement deterministic, overlapped, and reusable.
## Reference notes
- NVIDIA's GPUDirect Storage overview describes a direct DMA path between storage and GPU memory that avoids a CPU bounce buffer.
- NVIDIA's GPUDirect RDMA documentation describes direct data exchange between GPUs and third-party PCIe peer devices including ConnectX SmartNICs and BlueField DPUs.
- NVIDIA's GDS design guide notes that the GPU must be the first and/or last meaningful agent touching the transferred data for GDS to help materially.
- NVIDIA's planning documentation notes that GPUDirect performance is best when the GPU and peer devices are close in PCIe topology, ideally under the same switch.
- Google Cloud TPU architecture documentation describes TPUs as systolic-array-based processors built for large matrix operations — which is why data feeding and local buffering are central to their design.