Systems for AI · SmartNICs · HBM/SRAM · TPU vs GPU

From SSD to GPU to SRAM: Why the Last Bottleneck Is Now On-Chip

Once host-memory bounce buffers are reduced, the bottleneck shifts inward. The next frontier is not just storage-to-GPU delivery, but deterministic movement between HBM and on-chip SRAM.

Manish KL · April 2026 · ~14 min read · Technical Essay
Figure: The full pipeline — system-level copies vs on-chip movement. The system-level path (SSD, local or remote → GDS → SmartNIC / DPU → RDMA → GPU HBM) can be cleaned up: GPUDirect Storage plus RDMA eliminates CPU copies, and host bounce buffers, page cache, and CPU rendezvous points are reduced dramatically. The on-chip path (GPU HBM → tile / block SRAM and registers → compute units) cannot be eliminated: HBM→SRAM movement is physical and can only be scheduled, overlapped, tiled, and reused better. System-level zero-copy is becoming real; the next bottleneck is on-chip memory orchestration. The winner makes HBM→SRAM movement deterministic, overlapped, and reusable — not eliminated.

There is a natural next step once you start thinking about SmartNICs, GPUDirect, and zero-copy system design: if storage can reach the accelerator without all the old CPU and host-memory overhead, can the rest of the path become just as clean?

The answer is subtle. At the system level, yes — a great deal of waste can be removed. GPUDirect Storage provides a direct DMA path between storage and GPU memory while avoiding a CPU bounce buffer, and GPUDirect RDMA enables direct data exchange between GPUs and third-party PCIe peer devices such as ConnectX SmartNICs or BlueField DPUs.

But once those host-side copies are reduced, the bottleneck does not disappear. It moves inward. The new seam is often inside the accelerator, between high-bandwidth global memory and the much smaller, faster SRAM and register files that actually feed the compute units.

System-level zero-copy is not the end of the story. It just means the remaining bottleneck becomes easier to see.

Your pipeline intuition, and where it breaks

The clean mental model to start with is:

SSD → SmartNIC / DPU → RDMA over fabric → GPU HBM → SRAM → compute

That already eliminates a lot:

CPU copies

Reduced dramatically when the payload no longer bounces through host DRAM.

Page-cache-heavy paths

Bypassed or minimized when GPUDirect Storage is used with a GPU-first application.

Host rendezvous points

Reduced when RDMA and GPUDirect RDMA let the NIC/DPU and GPU exchange data more directly.

But the model breaks at the final step. Even with a nearly ideal system path, there is still a fundamentally different kind of movement remaining:

GPU HBM → on-chip SRAM / registers → tensor cores / ALUs

That movement is not a bad software artifact. It is a consequence of the accelerator's physical memory hierarchy.

Why HBM → SRAM movement exists

Accelerators use multiple memory tiers because no single memory technology gives you huge capacity, ultra-low latency, and extreme energy efficiency per access simultaneously. HBM gives large capacity and high bandwidth, but it is still "farther" from the compute units than on-chip SRAM and register files.

HBM / global memory

Large and high-bandwidth, but off the compute die and relatively expensive to touch every cycle. Tens of GB of capacity.

SRAM / registers / shared buffers

Tiny (~MB), ultra-fast, physically close to compute. This is what actually feeds the math engines efficiently.

So the question "can we avoid HBM → SRAM movement?" has a hard answer: not really. You can reduce it, structure it, overlap it, and reuse it better. But if compute units run from SRAM-like local storage, some form of staging into that local storage remains fundamental to how accelerators work.

Once the bytes have arrived at the accelerator, the remaining problem is no longer "host copies." It is "how efficiently do I feed local compute from the global memory tier?"
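The payoff of feeding compute from local storage can be made concrete with the classic traffic arithmetic for a blocked matrix multiply. The sketch below is illustrative only — the matrix and tile sizes are hypothetical, not tied to any real device:

```python
# Illustrative arithmetic: HBM operand traffic for an N x N matmul,
# with and without SRAM tiling. Sizes are hypothetical.

def hbm_words_naive(n):
    # Every multiply-add fetches both operands from HBM:
    # 2 * n^3 reads, with no on-chip reuse.
    return 2 * n**3

def hbm_words_tiled(n, t):
    # Classic blocked matmul: each t x t tile is loaded once and reused
    # t times from SRAM, cutting HBM traffic to O(n^3 / t) words.
    return 2 * n**3 // t

n, t = 4096, 128
naive = hbm_words_naive(n)
tiled = hbm_words_tiled(n, t)
print(f"naive HBM reads : {naive:.3e} words")
print(f"tiled HBM reads : {tiled:.3e} words")
print(f"reduction       : {naive // tiled}x")  # equals the tile size t
```

The reduction factor is exactly the tile edge t — which is why tile size and SRAM capacity, not raw FLOPs, often decide whether a kernel is fed well.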

Training vs. inference

Training

Training is extremely movement-heavy because you are not only reading weights and activations but also writing gradients, optimizer state, and updated parameters. The HBM↔SRAM path is exercised constantly, and data reuse matters enormously. The system-level storage path is still important for checkpointing, dataset streaming, and distributed state movement, but the dominant steady-state traffic during training is usually inside the accelerator.
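A rough accounting makes the movement-heaviness visible. The sketch below assumes fp16 weights and gradients with fp32 Adam moments; the parameter count is illustrative, and real mixed-precision recipes (e.g. fp32 master weights) add even more state:

```python
# Rough, illustrative accounting of optimizer-step state in training.
# Assumes fp16 weights/grads and fp32 Adam moments.

def training_state_bytes(n_params):
    weights = 2 * n_params   # fp16 parameters
    grads   = 2 * n_params   # fp16 gradients
    adam_m  = 4 * n_params   # fp32 first moment
    adam_v  = 4 * n_params   # fp32 second moment
    return weights + grads + adam_m + adam_v

n = 7_000_000_000  # hypothetical 7B-parameter model
gib = training_state_bytes(n) / 2**30
print(f"~{gib:.0f} GiB of state streamed across HBM<->SRAM every step")
```

All of that state crosses the HBM↔SRAM boundary every optimizer step — before counting activations — which is why on-chip reuse dominates training steady state.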

Inference

Inference is where the on-chip question becomes especially interesting, because inference increasingly looks memory-bound rather than purely compute-bound. Long-context serving, KV cache pressure, and weight streaming all increase pressure on the memory hierarchy. Once inference is underway, the hot issue is still how weights, activations, and cache state move between global memory and the on-chip working set.
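The KV cache pressure mentioned above is easy to quantify with a back-of-envelope sizing. The shapes below are hypothetical and not tied to any specific model:

```python
# Back-of-envelope KV cache sizing for long-context inference.
# Shapes are illustrative, not from any specific model.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # Two cached tensors (K and V) per layer, per token position.
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# A hypothetical 32-layer model, 8 KV heads of dim 128, fp16 cache:
per_seq = kv_cache_bytes(32, 8, 128, 32_768) / 2**30
print(f"KV cache at 32k context: {per_seq:.1f} GiB per sequence")
```

At these (modest) shapes a single 32k-context sequence holds 4 GiB of cache — and every generated token re-touches it, which is why serving becomes memory-bound rather than compute-bound.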

Mode      | Main system concern                                           | Main on-chip concern
Training  | Dataset ingest, checkpoints, distributed synchronization      | Weight / activation / gradient movement and reuse across HBM↔SRAM
Inference | Request-time data supply, model loading, cache/state movement | Weight / KV / activation movement between HBM and local SRAM per token

What changes in SRAM-heavy architectures

If inference shifts toward more SRAM-centric designs, the architecture improves in a specific way. It does not make movement disappear — it changes movement from something reactive and repetitive into something structured and schedulable.

Today's common pattern:
HBM ↔ SRAM  (frequent, repeating working-set churn)

SRAM-heavy / deliberate pattern:
HBM → SRAM  (less frequent, more deliberate residency)
SRAM → multiple operations → writeback later

The win is not "no movement." The win is:

Longer residency

Keep data in SRAM long enough to amortize the cost of bringing it there across multiple operations.

Better tiling

Move smaller, more useful chunks that map cleanly to the local compute structure — not entire tensors reactively.

Overlap

Load tile N+1 while tile N is being computed, hiding transfer latency under execution.
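The overlap win above can be captured in a two-line timing model. Times are arbitrary units chosen for illustration, and the double-buffering model is deliberately simplified (steady-state cost per tile is the max of load and compute):

```python
# Simple timing model for double-buffered tile execution:
# load tile N+1 while tile N computes. Units are arbitrary.

def serial_time(tiles, t_load, t_compute):
    # No overlap: every tile pays load + compute back to back.
    return tiles * (t_load + t_compute)

def overlapped_time(tiles, t_load, t_compute):
    # Double buffering: after the first load, each step costs
    # max(load, compute); the transfer hides under execution.
    return t_load + tiles * max(t_load, t_compute)

tiles, t_load, t_compute = 100, 3, 5
print(serial_time(tiles, t_load, t_compute))      # 800
print(overlapped_time(tiles, t_load, t_compute))  # 503
```

Note the asymmetry: as long as loads are faster than compute, overlap makes transfer nearly free; once loads exceed compute, the pipeline is memory-bound and no amount of buffering hides it.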

The better framing is not "can SRAM remove movement?" but "can SRAM-based designs turn movement into a deterministic schedule instead of a reactive penalty?"

How this works on TPU

TPU is a useful reference because it makes the dataflow philosophy more explicit. Google's public TPU architecture documentation describes the heart of computation as large systolic arrays — which already implies a dataflow style: the array wants a steady, structured stream of operands rather than random, cache-like access patterns.

HBM → on-chip buffer / local memory → systolic array → on-chip buffer → HBM

TPU does not magically eliminate the off-chip to on-chip transition. What it does is make that transition feel more like part of the intended execution model. The compiler and runtime know they are feeding a systolic array with tiles — not hoping a general-purpose cache hierarchy behaves well enough.
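What "compiler-scheduled" means can be sketched as a static schedule: every step's DMA and array work is decided ahead of time rather than discovered at runtime. This is a toy model — the tile names, the single DMA engine, and the one-step lookahead are all assumptions for illustration:

```python
# Sketch of a compiler-style static schedule for feeding a systolic
# array: each step pairs one prefetch DMA with one array computation.
# Tile names and the one-deep pipeline are hypothetical.

def static_schedule(n_tiles):
    steps = []
    for i in range(n_tiles + 1):
        dma = f"load tile {i}" if i < n_tiles else None      # prefetch ahead
        mac = f"compute tile {i - 1}" if i > 0 else None     # consume behind
        steps.append((i, dma, mac))  # (step, DMA engine, systolic array)
    return steps

for step, dma, mac in static_schedule(3):
    print(step, dma, mac)
```

The point is not the code but the shape of the output: movement appears as fixed slots in a plan, so the array never stalls waiting for a cache to warm up.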

Figure 1. GPU vs TPU memory feeding philosophy. Both require HBM→local movement — the difference is how intentional and compiler-scheduled that movement is.
GPU tendency

More general-purpose execution model, more flexible kernels, but often more reactive to kernel structure and runtime choices.

TPU tendency

More explicit array feeding model — data motion into local buffers is part of the design center rather than an emergent side effect.

System-level movement vs. on-chip movement

This is the most important conceptual split in the whole discussion:

Problem class | Example                    | Can it be reduced?            | How
System-level  | SSD ↔ SmartNIC/DPU ↔ GPU   | Yes, substantially            | GPUDirect Storage, RDMA, GPUDirect RDMA, better fabric placement, fewer bounce buffers
On-chip       | HBM ↔ SRAM / registers     | Not eliminated, only optimized | Tiling, streaming, overlap, compiler scheduling, longer residency, more local reuse

We are close to eliminating many host-side bounce buffers. What remains is the accelerator's own internal memory hierarchy.

What the frontier actually is

The frontier is not "make SRAM directly RDMA-addressable tomorrow." SRAM is not generally exposed as a normal external DMA target, and local memory structures inside accelerators are tightly coupled to timing, buffering, and compute orchestration. What matters more in the near term is making the path from external storage or fabric into the accelerator's working set as predictable as possible.

The winning architectures will likely do three things simultaneously:

Cleaner system ingress

SmartNICs, DPUs, GPUDirect Storage, and GPUDirect RDMA so the host CPU is not the payload janitor.

Deterministic local scheduling

Move data from HBM into SRAM in a planned, compiler- or runtime-aware way rather than through reactive churn.

GPU-first or array-first

Design the software so the accelerator is the first and last meaningful consumer of the payload.

System-level zero-copy is becoming real. The next bottleneck is on-chip memory orchestration. The winner will not be the architecture that pretends HBM→SRAM movement disappears, but the one that makes that movement deterministic, overlapped, and reusable.

Reference notes

  • NVIDIA's GPUDirect Storage overview describes a direct DMA path between storage and GPU memory that avoids a CPU bounce buffer.
  • NVIDIA's GPUDirect RDMA documentation describes direct data exchange between GPUs and third-party PCIe peer devices including ConnectX SmartNICs and BlueField DPUs.
  • NVIDIA's GDS design guide notes that the GPU must be the first and/or last meaningful agent touching the transferred data for GDS to help materially.
  • NVIDIA's planning documentation notes that GPUDirect performance is best when the GPU and peer devices are close in PCIe topology, ideally under the same switch.
  • Google Cloud TPU architecture documentation describes TPUs as systolic-array-based processors built for large matrix operations — which is why data feeding and local buffering are central to their design.