
Deterministic Memory-Orchestrated Inference Using DMA and Bounded On-Chip Buffers

A technical deep dive into a compiler-scheduled inference architecture where DMA, explicit fences, and bounded on-chip buffers pull DRAM off the compute critical path.

Technical blog post · 14 min read

Executive summary

This patent argues for a different way to run neural network inference: treat off-chip memory as a deterministic streaming substrate and keep compute fed from bounded on-chip buffers through compiler-generated DMA schedules, explicit fences, and controlled buffer reuse.

The core architectural invariant is powerful: compute units should not directly touch DRAM for most of steady-state inference. DRAM traffic is moved off the compute critical path and handled by scheduled DMA transfers hidden behind prefetch and double buffering.

The core idea: make memory movement scheduled and explicit

Most inference platforms pay a hidden tax for memory uncertainty. Caches miss. Prefetchers guess. Page behavior varies. A large model works well until it overflows a comfortable capacity threshold, and then stalls become dominant. This patent’s answer is clean: stop relying on speculative behavior and instead let a compiler generate a deterministic schedule for moving tiles into bounded on-chip buffers.

That shifts the architecture away from “hope the cache hierarchy behaves” and toward “prove that the required tile is ready before compute starts.”

Architectural invariant: during steady-state inference, compute runs on pre-staged data in bounded on-chip buffers, while DRAM access is pushed behind DMA and fence logic.
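That invariant can be sketched in a few lines of C. The fence and buffer primitives here are hypothetical stand-ins (none of these names come from the patent): compute never issues a DRAM address; it waits on a buffer-ready fence, consumes SRAM, and releases the buffer back to the DMA engine.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical fence: set by the DMA engine, cleared by compute. */
typedef struct { volatile bool ready; } fence_t;

typedef struct {
    float data[1024];   /* bounded on-chip tile buffer */
    fence_t fill_done;  /* DMA signals: this tile is legally consumable */
} sram_buf_t;

/* Steady-state compute: no DRAM pointers in sight. */
static float consume_tile(sram_buf_t *buf) {
    assert(buf->fill_done.ready);      /* fence-gated legality check */
    float acc = 0.0f;
    for (int i = 0; i < 1024; i++) acc += buf->data[i];
    buf->fill_done.ready = false;      /* release buffer back to DMA */
    return acc;
}
```

The assertion is the software analogue of the hardware fence: compute is structurally unable to start on a buffer the schedule has not yet filled.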

System architecture

[Figure: pipeline diagram — pinned DRAM (weights / activations / KV blocks) → streaming DMA (descriptors · channels · fences) → bounded SRAM ping-pong buffers A / B → compute (CPU / GPU / NPU / ASIC), all under a compiler / runtime scheduler]
Figure 1. The patent’s central pipeline: pinned memory, descriptor-driven DMA, bounded on-chip buffers, and a scheduler that decides when data becomes legally consumable.

Why bounded on-chip SRAM is the point

The design does not try to fit the whole model on chip. It embraces the opposite strategy: keep on-chip storage bounded and let it scale with tile size and parallelism rather than with total model size. That is a strong silicon argument. If the compiler can prove tile readiness and reuse, you can get deterministic behavior without building huge SRAM structures.

This is what gives the patent a useful hardware-software split: the hardware provides DMA, buffers, and fence semantics; the compiler provides the schedule discipline that makes those small buffers sufficient.
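The scaling claim is easy to make concrete with back-of-envelope arithmetic (the numbers below are illustrative, not from the patent): on-chip footprint depends only on tile shape, element width, and buffering depth, never on total parameter count.

```c
#include <stddef.h>

/* Bounded SRAM footprint: scales with tile size and parallelism,
 * not with model size. All parameters are illustrative. */
static size_t sram_bytes(size_t tile_rows, size_t tile_cols,
                         size_t elem_bytes,
                         size_t depth,    /* 2 = ping-pong double buffering */
                         size_t engines)  /* parallel compute engines */
{
    return tile_rows * tile_cols * elem_bytes * depth * engines;
}
```

A 128×128 fp16 tile, double buffered across 4 engines, needs 128 × 128 × 2 × 2 × 4 = 256 KiB of SRAM, whether the model has one billion parameters or a hundred billion.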

Double buffering and fence-gated compute

[Figure: timeline — DMA row: Load Tile 0 → A, Load Tile 1 → B, Load Tile 2 → A; Compute row (overlapped, one step behind): Compute A, Compute B, Compute A]
Figure 2. The point of double buffering is not just overlap. It is deterministic legality: compute starts only after the matching buffer-ready event.
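The schedule in Figure 2 can be simulated in software (hypothetical names; the real mechanism is hardware DMA plus fences): while compute drains one buffer, the "DMA" fills the other, and an assertion stands in for the buffer-ready fence.

```c
#include <assert.h>
#include <string.h>

#define TILE   256
#define NTILES 6

static float dram[NTILES][TILE];   /* pinned DRAM region (simulated) */
static float sram[2][TILE];        /* ping-pong buffers A (0) and B (1) */
static int   ready[2];             /* stands in for hardware fences */

static void dma_load(int tile, int buf) {       /* scheduled, not speculative */
    memcpy(sram[buf], dram[tile], sizeof(sram[buf]));
    ready[buf] = 1;                             /* signal buffer-ready */
}

static float compute(int buf) {
    assert(ready[buf]);                         /* fence-gated start */
    float acc = 0.0f;
    for (int i = 0; i < TILE; i++) acc += sram[buf][i];
    ready[buf] = 0;                             /* release buffer for reuse */
    return acc;
}

static float run_pipeline(void) {
    float total = 0.0f;
    dma_load(0, 0);                             /* prefetch the first tile */
    for (int t = 0; t < NTILES; t++) {
        if (t + 1 < NTILES)
            dma_load(t + 1, (t + 1) & 1);       /* fill the other buffer */
        total += compute(t & 1);                /* drain the current one */
    }
    return total;
}
```

In real hardware the load and the compute run concurrently; the simulation preserves the essential property that every `compute` consumes a buffer whose fill was scheduled strictly earlier.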

Compiler responsibility

The patent’s compiler does real systems work:

  • graph analysis over a fixed operator order,
  • tensor lifetime and reuse analysis,
  • bounded buffer placement under fixed capacity,
  • DMA descriptor generation,
  • fence insertion and explicit release points,
  • static determinism verification.

That is not a toy “prefetch hint” compiler. It is a full orchestration compiler.
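The descriptor-generation, fence-insertion, and verification steps can be illustrated with a toy pass (the structures and names below are invented, not the patent's compiler): given a fixed operator order, the pass emits one load descriptor plus a fence per operator, and a static check confirms every compute waits only on a fence emitted at or before its position.

```c
#include <stdint.h>

typedef struct {
    uint64_t src_addr;   /* pinned DRAM source */
    uint32_t bytes;      /* tile size */
    uint32_t fence_id;   /* fence signaled on completion */
} dma_desc_t;

/* Toy pass: one descriptor + fence per operator in a fixed order. */
static int emit_schedule(const uint64_t *weight_addrs, const uint32_t *sizes,
                         int n_ops, dma_desc_t *out) {
    for (int i = 0; i < n_ops; i++) {
        out[i].src_addr = weight_addrs[i];
        out[i].bytes    = sizes[i];
        out[i].fence_id = (uint32_t)i;   /* compute op i waits on fence i */
    }
    return n_ops;
}

/* Static determinism check: every fence an op waits on must be
 * emitted at or before that op's position in the schedule. */
static int verify_schedule(const dma_desc_t *d, int n) {
    for (int i = 0; i < n; i++)
        if (d[i].fence_id > (uint32_t)i) return 0;
    return 1;
}
```

The real pass also does lifetime analysis and placement under a fixed SRAM capacity; the point here is only that legality is decided at compile time, not discovered at runtime.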

Hardware hooks that make it enabling

The draft goes further than a high-level concept by specifying:

  • example fixed-width DMA descriptors,
  • hardware semaphores and completion counters,
  • multi-channel DMA to reduce head-of-line blocking,
  • compressed streaming and on-the-fly decompression,
  • multicast DMA for multi-engine fanout,
  • snapshot / resume rules at fence boundaries.
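One plausible shape for such a fixed-width descriptor (the field layout here is invented for illustration; the patent's draft specifies its own format): a packed record sized so a DMA channel can fetch descriptors at a constant stride.

```c
#include <stdint.h>

/* Illustrative 32-byte fixed-width DMA descriptor. */
typedef struct {
    uint64_t src;        /* pinned DRAM address */
    uint64_t dst;        /* on-chip buffer address */
    uint32_t bytes;      /* transfer length */
    uint16_t channel;    /* multi-channel DMA: avoids head-of-line blocking */
    uint16_t flags;      /* e.g. compressed-stream or multicast bits */
    uint32_t fence_id;   /* completion counter / semaphore to signal */
    uint32_t next;       /* index of the next descriptor in the chain */
} dma_desc32_t;

/* Fixed width lets hardware fetch descriptors at a constant stride. */
_Static_assert(sizeof(dma_desc32_t) == 32, "descriptor must be 32 bytes");
```

Fields like `flags` and `channel` are where the draft's compressed-streaming and multi-channel hooks would plug in; the `_Static_assert` captures the fixed-width property that makes descriptor fetch itself deterministic.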

Compiler flow as a first-class diagram

[Figure: compiler flow — Graph → Lifetime Analysis → Placement → Descriptors + Fences → Static Determinism Check]
Figure 3. A good way to explain the invention: not memory management as runtime improvisation, but memory management as a verified compile-time plan.

Why this matters

There are at least three reasons this direction is attractive. First, it gives more predictable latency, which is crucial for real-time and safety-adjacent deployments. Second, it keeps SRAM requirements bounded instead of letting them balloon with model size. Third, it creates a path to architecture portability because the idea is not tied to one compute ISA. It is a cross-layer scheduling discipline.

In plain English: this patent says that instead of designing inference systems around cache luck, we can design them around explicit memory choreography.