A technical deep dive into a compiler-scheduled inference architecture where DMA, explicit fences, and bounded on-chip buffers pull DRAM off the compute critical path.
This patent argues for a different way to run neural network inference: treat off-chip memory as a deterministic streaming substrate and keep compute fed from bounded on-chip buffers through compiler-generated DMA schedules, explicit fences, and controlled buffer reuse.
The core architectural invariant is powerful: compute units should not directly touch DRAM for most of steady-state inference. DRAM traffic is moved off the compute critical path and handled by scheduled DMA transfers hidden behind prefetch and double buffering.
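The overlap pattern can be made concrete with a small simulation. This is a minimal sketch, not the patent’s mechanism: all names (`dma_copy`, `run_double_buffered`, the two-buffer pool) are hypothetical, and the sequential loop only models which tile must be resident when; in hardware the DMA would proceed concurrently with compute.

```python
# Minimal simulation of double-buffered DMA prefetch. Hypothetical names;
# illustrates the invariant that compute reads only on-chip buffers.

def dma_copy(dram, tile_idx, tile_size, buf):
    """Scheduled DMA: copy one tile from 'DRAM' into an on-chip buffer."""
    start = tile_idx * tile_size
    buf[:] = dram[start:start + tile_size]

def compute(buf):
    """Stand-in compute kernel: consumes only the tile already on chip."""
    return sum(buf)

def run_double_buffered(dram, tile_size):
    n_tiles = len(dram) // tile_size
    bufs = [[0] * tile_size, [0] * tile_size]  # two bounded on-chip buffers
    dma_copy(dram, 0, tile_size, bufs[0])      # warm-up: prefetch tile 0
    total = 0
    for k in range(n_tiles):
        if k + 1 < n_tiles:
            # In hardware this transfer overlaps compute on the other buffer.
            dma_copy(dram, k + 1, tile_size, bufs[(k + 1) % 2])
        total += compute(bufs[k % 2])          # compute never touches 'dram'
    return total
```

The point of the loop structure is that `compute` only ever dereferences a bounded buffer; the `dram` list is touched exclusively by the scheduled `dma_copy` calls.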
Most inference platforms pay a hidden tax to memory uncertainty. Caches miss. Prefetchers guess. Page behavior varies. A large model works well until its working set overflows on-chip capacity, and then stalls dominate. This patent’s answer is clean: stop relying on speculative behavior and instead let a compiler generate a deterministic schedule for moving tiles into bounded on-chip buffers.
That shifts the architecture away from “hope the cache hierarchy behaves” and toward “prove that the required tile is ready before compute starts.”
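That proof obligation can be phrased as a schedule invariant: every compute op must be preceded in the static schedule by the DMA that makes its tile resident. The checker below is my own sketch of that invariant, not an API from the patent.

```python
# Hypothetical schedule checker: verifies that every compute op follows a
# DMA completion (fence signal) for the same tile earlier in the schedule.

def check_schedule(schedule):
    signaled = set()               # tiles whose DMA fence has been signaled
    for op, tile in schedule:
        if op == "dma":
            signaled.add(tile)     # DMA completion signals the tile's fence
        elif op == "compute":
            if tile not in signaled:
                return False       # tile not proven resident: schedule bug
    return True

ok = check_schedule([("dma", 0), ("dma", 1), ("compute", 0), ("compute", 1)])
bad = check_schedule([("compute", 0), ("dma", 0)])
```

A compiler that enforces this check at schedule-emission time replaces "hope the cache hierarchy behaves" with a statically verified readiness guarantee.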
The design does not try to fit the whole model on chip. It embraces the opposite strategy: keep on-chip storage bounded and let it scale with tile size and parallelism rather than with total model size. That is a strong silicon argument. If the compiler can prove tile readiness and reuse, you can get deterministic behavior without building huge SRAM structures.
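The scaling claim is easy to see with back-of-envelope arithmetic. The numbers below are illustrative, not from the patent: on-chip capacity is a function of tile size, parallelism, and buffering depth only.

```python
# Illustrative buffer sizing: on-chip SRAM scales with tile size and lane
# count, never with total model size (hypothetical parameters).

def onchip_bytes(tile_elems, elem_bytes, lanes, buffers_per_lane=2):
    return tile_elems * elem_bytes * lanes * buffers_per_lane

# e.g. 2048-element tiles, fp16 (2 bytes), 8 parallel lanes, double buffered:
sram = onchip_bytes(tile_elems=2048, elem_bytes=2, lanes=8)
# 2048 * 2 * 8 * 2 = 65536 bytes (64 KiB), whether the model is
# 10 MB or 10 GB; only the length of the streaming schedule changes.
```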
This is what gives the patent a useful hardware-software split: the hardware provides DMA, buffers, and fence semantics; the compiler provides the schedule discipline that makes those small buffers sufficient.
The patent’s compiler does real systems work: it is not a toy “prefetch hint” pass but a full orchestration compiler that plans DMA transfers, fence placement, and buffer reuse ahead of time.
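To make "orchestration" concrete, here is one way such a pass could emit a static schedule: interleave DMA, fence waits, and compute over a fixed buffer pool, reusing each buffer as soon as its tile is consumed. This is my sketch under assumed op names (`dma`, `wait_fence`, `compute`), not the patent's actual IR.

```python
# Hypothetical orchestration pass: from a tile count and a fixed buffer
# pool, emit a static schedule with explicit prefetch and buffer reuse.

def emit_schedule(n_tiles, n_bufs=2):
    ops = []  # each op: (kind, tile_index, buffer_index)
    for k in range(min(n_bufs, n_tiles)):
        ops.append(("dma", k, k % n_bufs))          # warm-up prefetches
    for k in range(n_tiles):
        ops.append(("wait_fence", k, k % n_bufs))   # prove tile residency
        ops.append(("compute", k, k % n_bufs))
        nxt = k + n_bufs
        if nxt < n_tiles:
            # buffer k % n_bufs is now free: reuse it for a future tile
            ops.append(("dma", nxt, nxt % n_bufs))
    return ops
```

Because the schedule is fully static, the readiness invariant can be checked at compile time rather than enforced by speculative hardware at run time.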
The draft goes further than a high-level concept, specifying the DMA schedules, fence semantics, and buffer-reuse discipline in concrete detail.
There are at least three reasons this direction is attractive. First, it gives more predictable latency, which is crucial for real-time and safety-adjacent deployments. Second, it keeps SRAM requirements bounded instead of letting them balloon with model size. Third, it creates a path to architecture portability because the idea is not tied to one compute ISA. It is a cross-layer scheduling discipline.
In plain English: this patent says that instead of designing inference systems around cache luck, we can design them around explicit memory choreography.