SRAM-Based Linux Inference Servers: A Kernel-Level Latency Autopsy
SRAM-heavy inference accelerators make the compute path more deterministic. That is exactly why the Linux path around them becomes impossible to ignore. When the device stops being noisy, the host operating system becomes the noise floor.
1. Why SRAM-Based Inference Feels Different
In traditional inference systems, much of the performance variance is hidden inside memory hierarchy behavior: cache misses, DRAM access patterns, HBM pressure, contention, kernel launch overhead, queueing, and runtime scheduling. In an SRAM-heavy design, more of the working set is moved into fast on-chip memory, and the execution plan is often shaped more aggressively by the compiler or runtime.
DRAM/HBM-style uncertainty
- External memory access can dominate execution time.
- Row-buffer locality, bank conflicts, refresh behavior, and queuing can create variable service time.
- Large memory hierarchies hide latency but do not eliminate it.
- Runtime scheduling and data movement introduce additional jitter.
SRAM-heavy inference behavior
- On-chip SRAM access is much more predictable.
- The compiler/runtime can schedule dataflow more deterministically.
- There are fewer surprise stalls from external memory.
- The chip execution window becomes a smaller and more stable part of total latency.
2. The End-to-End Request Lifecycle
The inference dataplane may be SRAM-powered, but the request lifecycle is still host-driven. A user request has to be admitted, batched or routed, copied or referenced, submitted to a device queue, completed, and delivered back to userspace.
Userspace inference request
→ syscall / io_uring
→ kernel submission path
→ driver + DMA programming
→ device execution // most deterministic part
→ interrupt or polling completion
→ softirq / completion handling
→ scheduler wakeup
→ userspace continuation
3. Submission Path: Where the Request Enters the Kernel
The submit path is often underestimated. Before an accelerator can read request data safely, the host has to ensure that memory is stable, mapped, and visible to the device. That is where page pinning, IOMMU mapping, descriptor construction, and doorbell writes appear.
io_uring_enter()
→ __do_sys_io_uring_enter()
→ io_submit_sqes()
→ io_issue_sqe()
→ driver submit()
Driver submission:
→ pin_user_pages() // modern replacement for get_user_pages() on DMA paths
→ dma_map_sg() // IOMMU translation + DMA visibility
→ build DMA descriptors
→ write device doorbell // PCIe MMIO
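For concreteness, here is the pin-then-map pattern in driver-flavored C. This is a hedged, schematic sketch, not code from any real driver: toy_pin_and_map() is a hypothetical name, and it assumes a recent kernel where pin_user_pages() takes (start, nr_pages, gup_flags, pages).

```c
#include <linux/mm.h>
#include <linux/errno.h>
#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Hypothetical sketch: pin userspace pages, then map them for the device. */
static int toy_pin_and_map(struct device *dev, unsigned long uaddr,
			   unsigned long nr_pages, struct page **pages,
			   struct sg_table *sgt)
{
	long pinned;
	int ret;

	/* Guarantee pages cannot move or be reclaimed while the device DMAs. */
	pinned = pin_user_pages(uaddr, nr_pages,
				FOLL_WRITE | FOLL_LONGTERM, pages);
	if (pinned <= 0)
		return pinned ? pinned : -EFAULT; /* may pin fewer pages */

	/* Describe the pinned pages as a scatter-gather table. */
	ret = sg_alloc_table_from_pages(sgt, pages, pinned, 0,
					pinned * PAGE_SIZE, GFP_KERNEL);
	if (ret)
		goto unpin;

	/* Establish the device-visible (IOMMU) mapping. */
	ret = dma_map_sgtable(dev, sgt, DMA_TO_DEVICE, 0);
	if (ret)
		goto free_table;

	return 0; /* descriptor build + doorbell write would follow */

free_table:
	sg_free_table(sgt);
unpin:
	unpin_user_pages(pages, pinned);
	return ret;
}
```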
Why pin_user_pages() matters
DMA cannot safely target arbitrary pageable userspace memory unless the kernel guarantees those pages will not disappear or move during device access. That guarantee costs time: page-table walks, reference count updates, TLB effects, and sometimes page faults or NUMA surprises.
On a slow compute path, this overhead may be hidden. On a deterministic SRAM inference path, this overhead can become visible and sometimes dominant.
Use pre-registration when possible. With IORING_REGISTER_BUFFERS, buffers can be registered ahead of time so the hot request path avoids repeated pinning work. The goal is to pay the GUP cost during setup, not per inference request.
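A minimal liburing sketch of that setup-time registration; the buffer size and queue depth are illustrative, and error handling is trimmed:

```c
#include <liburing.h>
#include <stdlib.h>

int main(void)
{
	struct io_uring ring;
	struct iovec iov;

	iov.iov_len  = 1 << 20;            /* illustrative 1 MiB buffer */
	iov.iov_base = malloc(iov.iov_len);

	io_uring_queue_init(64, &ring, 0);

	/* Pay the pinning (GUP) cost once, here, at setup time. */
	io_uring_register_buffers(&ring, &iov, 1);

	/*
	 * Hot path: *_fixed opcodes reference buffer index 0 directly,
	 * e.g. io_uring_prep_read_fixed(sqe, fd, iov.iov_base,
	 *                               iov.iov_len, 0, 0);
	 * so no per-request pinning occurs.
	 */

	io_uring_queue_exit(&ring);
	free(iov.iov_base);
	return 0;
}
```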
Submit-path costs that show up in P99
- Page pinning: page-table walks, page references, and potential TLB/cache disturbance.
- IOMMU mapping: address translation setup for the device view of memory.
- Descriptor construction: CPU cacheline activity and queue ownership transitions.
- MMIO doorbell: posted writes, ordering rules, and PCIe interaction.
- Queue contention: multiple submitters fighting over shared rings or locks.
4. Completion Path: IRQ → SoftIRQ → Wakeup → Schedule
The completion path is the latency autopsy. People often focus on the interrupt handler, but the ISR itself is usually designed to do minimal work. The expensive and variable work is deferred: softirq processing, completion queue handling, waking the target task, and actually getting that task scheduled.
Hardware completion event
→ do_IRQ()
→ irq_enter()
→ handle_irq_event()
→ driver ISR
// acknowledge device, collect completion metadata,
// defer heavy processing
→ __raise_softirq_irqoff() // raised from the ISR path
→ irq_exit() // pending softirqs run on IRQ exit
→ __do_softirq()
→ net_rx_action() / blk_done_softirq() / driver completion handler
→ update completion queue
→ mark userspace-visible completion
→ try_to_wake_up()
→ ttwu_queue()
→ sched_ttwu_pending()
→ enqueue task on runqueue
→ schedule()
→ pick_next_task_fair()
→ context_switch()
→ userspace resumes
Why softirq is often the real trap
Softirq is a deferred execution mechanism. It is fast when the system is quiet, but it can become a batching point under load. If a CPU is handling a burst of completions, packets, or block I/O events, the target inference thread may not see completion immediately. The hardware finished. The kernel did not finish delivering that fact to userspace.
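One low-tech way to see this batching point is to watch which softirq classes (NET_RX, TIMER, SCHED, ...) are climbing on the CPUs that host the completion path. A minimal sketch that dumps /proc/softirqs twice:

```c
#include <stdio.h>
#include <unistd.h>

/* Print the per-CPU softirq counters. */
static void dump_softirqs(void)
{
	char line[512];
	FILE *f = fopen("/proc/softirqs", "r");

	if (!f)
		return;
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
}

int main(void)
{
	dump_softirqs();
	sleep(1);                 /* let load accumulate */
	puts("---- 1s later ----");
	dump_softirqs();          /* diff the counters by eye or script */
	return 0;
}
```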
5. Microsecond-Level Latency Breakdown
Exact numbers depend on CPU generation, kernel version, BIOS settings, IOMMU mode, NUMA placement, interrupt moderation, driver behavior, and load. But the shape of the problem is consistent: Linux overhead can be in the same order of magnitude as the accelerator execution window.
| Component | Typical range | Why it varies |
|---|---|---|
| Syscall entry/exit | ~0.2–1 µs | CPU state, mitigations, cache warmth, syscall type |
| Buffer pinning / GUP | ~2–20+ µs | Page-table walks, page faults, NUMA, TLB/cache effects |
| DMA / IOMMU mapping | ~1–10+ µs | IOMMU state, scatter-gather length, mapping reuse |
| PCIe/MMIO queue doorbell | ~sub-µs to several µs | Ordering, posted writes, bridge behavior, contention |
| IRQ delivery | ~3–15+ µs | Interrupt moderation, CPU state, routing, masking |
| SoftIRQ / deferred work | ~5–50+ µs | Backlog, NAPI budget, batching, local CPU load |
| Scheduler delay | ~10–200+ µs | Runqueue length, CFS/EEVDF decisions, migration, preemption |
| Context switch | ~1–5 µs | Cache warmth, TLB effects, task state |
The uncomfortable conclusion: SRAM execution can be smaller than the scheduler delay alone.
6. Scheduler Effects: The Silent Killer
The scheduler is optimized for fairness, utilization, and general-purpose behavior. Inference serving cares about something narrower: predictable wakeup and bounded tail latency. Those goals often conflict.
try_to_wake_up()
→ task becomes runnable
→ task is enqueued
→ scheduler chooses when it actually runs
→ context_switch()
→ userspace sees completion
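A rough way to put a number on the tail of that chain is to timestamp both sides of an eventfd wakeup: the waker stamps just before write(), the sleeper stamps right after read() returns. A minimal probe (illustrative; pin both threads to fixed CPUs for stable numbers):

```c
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <pthread.h>
#include <time.h>
#include <sys/eventfd.h>

static int efd;
static struct timespec t_wake;

static uint64_t ns_between(struct timespec a, struct timespec b)
{
	return (b.tv_sec - a.tv_sec) * 1000000000ull +
	       (b.tv_nsec - a.tv_nsec);
}

static void *sleeper(void *arg)
{
	uint64_t val;
	struct timespec t_run;

	read(efd, &val, sizeof(val));     /* blocks until woken */
	clock_gettime(CLOCK_MONOTONIC, &t_run);
	printf("wakeup-to-run: %lu ns\n",
	       (unsigned long)ns_between(t_wake, t_run));
	return NULL;
}

int main(void)
{
	pthread_t tid;
	uint64_t one = 1;

	efd = eventfd(0, 0);
	pthread_create(&tid, NULL, sleeper, NULL);
	usleep(100000);                   /* let the sleeper block */

	clock_gettime(CLOCK_MONOTONIC, &t_wake);
	write(efd, &one, sizeof(one));    /* drives try_to_wake_up() */

	pthread_join(tid, NULL);
	return 0;
}
```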
Scheduler-driven latency mechanisms
- Runqueue contention: the completion thread is runnable but waits behind other work.
- CPU migration: the task resumes on a different CPU and loses cache warmth.
- Preemption windows: kernel or userspace code may delay when the target can run.
- Tick behavior: timer and tick configuration can affect latency distribution.
- Fairness vs latency: CFS/EEVDF fairness is not the same as inference deadline control.
The usual mitigations are isolcpus, nohz_full, IRQ affinity control, and dedicated polling threads. The idea is not that Linux disappears; it is that Linux is kept away from the hottest cores.
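A minimal affinity sketch; CPU 3 is illustrative and should match isolcpus=3 nohz_full=3 on the kernel command line so the core is actually quiet:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one isolated core. */
static int pin_self_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	/* Fails if the CPU is invalid or offline. */
	return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```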
7. Interrupt Mode vs Polling Mode
Interrupts are efficient when work is sparse. Polling is wasteful but predictable when work is constant. SRAM inference servers often operate in a regime where burning a CPU core to protect P99 can be a rational trade.
Interrupt-driven completion
- Lower idle CPU usage.
- Better for bursty or low-duty workloads.
- Completion path crosses IRQ and deferred work.
- Tail latency depends on interrupt routing and scheduler behavior.
Polling-driven completion
- Higher CPU usage.
- Better for stable, high-throughput inference service loops.
- Can avoid IRQ/softirq wakeup chains.
- Turns latency variance into reserved CPU capacity.
Dedicated polling loop:
    bind thread to isolated CPU
    pre-register buffers
    submit work through shared ring
    while service is active:
        poll completion queue
        process completions immediately
        submit next ready batch
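Expressed as a hedged liburing sketch, where handle() and submit_next_batch() are hypothetical application hooks:

```c
#include <liburing.h>

/* Hypothetical application hooks, provided elsewhere. */
void handle(struct io_uring_cqe *cqe);
void submit_next_batch(struct io_uring *ring);

void service_loop(struct io_uring *ring, volatile int *active)
{
	struct io_uring_cqe *cqe;

	while (*active) {
		/* Non-blocking peek at the completion ring: no IRQ,
		 * no softirq, no scheduler wakeup on the hot path. */
		if (io_uring_peek_cqe(ring, &cqe) == 0) {
			handle(cqe);                  /* process immediately */
			io_uring_cqe_seen(ring, cqe); /* consume the CQE */
			submit_next_batch(ring);
		}
		/* Otherwise spin: the burned core buys bounded P99. */
	}
}
```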
This is why the CPU should not be treated as a passive host. In these servers, the CPU is the timing controller, queue manager, and completion delivery mechanism for the accelerator.
8. Tail at Scale: Why Small Delays Explode at P99
Variance compounds, as earlier sections suggested. More precisely, the server's tail latency is not determined by one component; it is determined by the slowest unlucky component in the chain. If a request crosses ten stages, and each stage has a small probability of a bad delay, the chance that at least one stage hits a bad delay is much larger than any single stage's bad-delay probability.
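A quick worked illustration: if each of ten independent stages goes slow with probability 1%, a request gets through cleanly with probability 0.99^10 ≈ 0.904. Roughly 9.6% of requests hit at least one slow stage, so what looks like a per-stage P99 problem becomes nearly a P90 problem end to end.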
P50 path: warm thread + no GUP surprise + clean IRQ + empty runqueue
P99 path: GUP slow path + IOMMU churn + IRQ delay + softirq backlog + scheduler wait
If deterministic compute removes the big obvious source of variance, the remaining kernel micro-delays become the new product experience.
9. The CPU’s Responsibility in an SRAM-Based Inference Server
In the GPU-era mental model, the CPU “feeds” the accelerator. In the SRAM-inference mental model, that framing is too weak. The CPU is the control-plane engine that determines whether the deterministic dataplane is actually reachable at low latency.
| CPU responsibility | Why it matters |
|---|---|
| Request admission | Prevents queue overload from turning deterministic execution into user-visible tail latency. |
| Batch shaping | Balances throughput against per-request latency and deadline sensitivity. |
| Memory registration | Moves GUP/pinning cost out of the hot path using fixed or pre-registered buffers. |
| DMA orchestration | Controls descriptor layout, queue depth, IOMMU reuse, and device-visible memory flow. |
| Core affinity | Keeps submission/completion threads on hot, isolated cores. |
| Interrupt steering | Routes completions away from noisy cores and toward service loops that can act immediately. |
| Tail protection | Detects backlog, drops or reroutes work, and avoids P99 collapse. |
This is the heart of the argument: the CPU is no longer just a general-purpose host. It is the inference control plane.
10. Problem → Mitigation Map
This post is not yet a full redesign proposal, but the mitigation map is already visible. Each kernel cost points toward a specific systems technique.
| Problem | Kernel-level source | Mitigation direction |
|---|---|---|
| Per-request memory pinning | pin_user_pages(), page-table walks | IORING_REGISTER_BUFFERS, fixed buffers, pre-pinned DMA pools |
| Syscall overhead | Repeated kernel entry/exit | io_uring, SQPOLL-style submission, shared rings |
| Completion jitter | IRQ + softirq + wakeup | Polling, interrupt affinity, completion coalescing tuned for latency |
| Scheduler delay | Runqueue contention, fairness policy | isolcpus, nohz_full, CPU affinity, dedicated service cores |
| Cache-cold wakeups | Task migration and shared CPU noise | Pin service threads, avoid migration, partition cores by role |
| Tail amplification | Independent jitter sources across layers | Admission control, bounded queues, deadline-aware dispatch |
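As one concrete instance of the last row, a minimal admission-control sketch with a bounded in-flight counter; MAX_DEPTH and all names are illustrative:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_DEPTH 64   /* tune to device queue depth + deadline budget */

static atomic_int inflight;

/* Returns false when the request should be shed or rerouted. */
bool try_admit(void)
{
	int cur = atomic_load(&inflight);

	while (cur < MAX_DEPTH) {
		/* On failure, cur is reloaded and the bound is rechecked. */
		if (atomic_compare_exchange_weak(&inflight, &cur, cur + 1))
			return true;  /* admitted: queueing stays bounded */
	}
	return false;                 /* backlog: fail fast, protect P99 */
}

/* Call once per completed request. */
void complete_one(void)
{
	atomic_fetch_sub(&inflight, 1);
}
```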
11. Architectural Conclusion
SRAM-based inference changes the system bottleneck by making accelerator execution less mysterious. That is a good thing, but it means the host can no longer hide behind the device.
The server now looks like this:
Total latency =
Linux submission path
+ DMA and queueing
+ deterministic SRAM execution
+ Linux completion path
+ scheduler wakeup
The stable part is the accelerator. The unstable parts are the paths into and out of it.
The bottleneck did not disappear. It moved into the Linux control plane.
12. Final Statement
The next phase of inference performance will not be won only by faster dataplanes. It will be won by making the CPU, kernel, driver, DMA, and completion path behave like a deterministic inference pipeline instead of a generic best-effort operating system path.