SRAM-Based Linux Inference Servers: A Kernel-Level Latency Autopsy
SRAM-heavy inference accelerators make the compute path more deterministic. That is exactly why the Linux path around them becomes impossible to ignore. When the device stops being noisy, the host operating system becomes the noise floor.
1. Why SRAM-Based Inference Feels Different
In traditional inference systems, much of the performance variance is hidden inside memory hierarchy behavior: cache misses, DRAM access patterns, HBM pressure, contention, kernel launch overhead, queueing, and runtime scheduling. In an SRAM-heavy design, more of the working set is moved into fast on-chip memory, and the execution plan is often shaped more aggressively by the compiler or runtime.
DRAM/HBM-style uncertainty
- External memory access can dominate execution time.
- Row-buffer locality, bank conflicts, refresh behavior, and queuing can create variable service time.
- Large memory hierarchies hide latency but do not eliminate it.
- Runtime scheduling and data movement introduce additional jitter.
SRAM-heavy inference behavior
- On-chip SRAM access is much more predictable.
- The compiler/runtime can schedule dataflow more deterministically.
- There are fewer surprise stalls from external memory.
- The chip execution window becomes a smaller and more stable part of total latency.
2. The End-to-End Request Lifecycle
The inference dataplane may be SRAM-powered, but the request lifecycle is still host-driven. A user request has to be admitted, batched or routed, copied or referenced, submitted to a device queue, completed, and delivered back to userspace.
Userspace inference request
→ syscall / io_uring
→ kernel submission path
→ driver + DMA programming
→ device execution // most deterministic part
→ interrupt or polling completion
→ softirq / completion handling
→ scheduler wakeup
→ userspace continuation
3. Submission Path: Where the Request Enters the Kernel
The submit path is often underestimated. Before an accelerator can read request data safely, the host has to ensure that memory is stable, mapped, and visible to the device. That is where page pinning, IOMMU mapping, descriptor construction, and doorbell writes appear.
io_uring_enter()
→ __do_sys_io_uring_enter()
→ io_submit_sqes()
→ io_issue_sqe()
→ driver submit()
Driver submission:
→ pin_user_pages() // modern replacement for get_user_pages() on DMA paths
→ dma_map_sg() // IOMMU translation + DMA visibility
→ build DMA descriptors
→ write device doorbell // PCIe MMIO
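For concreteness, here is the pin-then-map pattern in driver-flavored C. This is a hedged, schematic sketch, not code from any real driver: toy_pin_and_map() is a hypothetical name, and it assumes a recent kernel where pin_user_pages() takes (start, nr_pages, gup_flags, pages).

```c
#include <linux/mm.h>
#include <linux/errno.h>
#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Hypothetical sketch: pin userspace pages, then map them for the device. */
static int toy_pin_and_map(struct device *dev, unsigned long uaddr,
			   unsigned long nr_pages, struct page **pages,
			   struct sg_table *sgt)
{
	long pinned;
	int ret;

	/* Guarantee pages cannot move or be reclaimed while the device DMAs. */
	pinned = pin_user_pages(uaddr, nr_pages,
				FOLL_WRITE | FOLL_LONGTERM, pages);
	if (pinned <= 0)
		return pinned ? pinned : -EFAULT; /* may pin fewer pages */

	/* Describe the pinned pages as a scatter-gather table. */
	ret = sg_alloc_table_from_pages(sgt, pages, pinned, 0,
					pinned * PAGE_SIZE, GFP_KERNEL);
	if (ret)
		goto unpin;

	/* Establish the device-visible (IOMMU) mapping. */
	ret = dma_map_sgtable(dev, sgt, DMA_TO_DEVICE, 0);
	if (ret)
		goto free_table;

	return 0; /* descriptor build + doorbell write would follow */

free_table:
	sg_free_table(sgt);
unpin:
	unpin_user_pages(pages, pinned);
	return ret;
}
```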
Why pin_user_pages() matters
DMA cannot safely target arbitrary pageable userspace memory unless the kernel guarantees those pages will not disappear or move during device access. That guarantee costs time: page-table walks, reference count updates, TLB effects, and sometimes page faults or NUMA surprises.
On a slow compute path, this overhead may be hidden. On a deterministic SRAM inference path, this overhead can become visible and sometimes dominant.
Use pre-registration when possible. With IORING_REGISTER_BUFFERS, buffers can be registered ahead of time so the hot request path avoids repeated pinning work. The goal is to pay the GUP cost during setup, not per inference request.
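A minimal liburing sketch of that setup-time registration; the buffer size and queue depth are illustrative, and error handling is trimmed:

```c
#include <liburing.h>
#include <stdlib.h>

int main(void)
{
	struct io_uring ring;
	struct iovec iov;

	iov.iov_len  = 1 << 20;            /* illustrative 1 MiB buffer */
	iov.iov_base = malloc(iov.iov_len);

	io_uring_queue_init(64, &ring, 0);

	/* Pay the pinning (GUP) cost once, here, at setup time. */
	io_uring_register_buffers(&ring, &iov, 1);

	/*
	 * Hot path: *_fixed opcodes reference buffer index 0 directly,
	 * e.g. io_uring_prep_read_fixed(sqe, fd, iov.iov_base,
	 *                               iov.iov_len, 0, 0);
	 * so no per-request pinning occurs.
	 */

	io_uring_queue_exit(&ring);
	free(iov.iov_base);
	return 0;
}
```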
Submit-path costs that show up in P99
- Page pinning: page-table walks, page references, and potential TLB/cache disturbance.
- IOMMU mapping: address translation setup for the device view of memory.
- Descriptor construction: CPU cacheline activity and queue ownership transitions.
- MMIO doorbell: posted writes, ordering rules, and PCIe interaction.
- Queue contention: multiple submitters fighting over shared rings or locks.
4. Completion Path: IRQ → SoftIRQ → Wakeup → Schedule
The completion path is the latency autopsy. People often focus on the interrupt handler, but the ISR itself is usually designed to do minimal work. The expensive and variable work is deferred: softirq processing, completion queue handling, waking the target task, and actually getting that task scheduled.
Hardware completion event
→ do_IRQ()
→ irq_enter()
→ handle_irq_event()
→ driver ISR
// acknowledge device, collect completion metadata,
// defer heavy processing
→ __raise_softirq_irqoff() // raised from the ISR path
→ irq_exit() // pending softirqs run on IRQ exit
→ __do_softirq()
→ net_rx_action() / blk_done_softirq() / driver completion handler
→ update completion queue
→ mark userspace-visible completion
→ try_to_wake_up()
→ ttwu_queue()
→ sched_ttwu_pending()
→ enqueue task on runqueue
→ schedule()
→ pick_next_task_fair()
→ context_switch()
→ userspace resumes
Why softirq is often the real trap
Softirq is a deferred execution mechanism. It is fast when the system is quiet, but it can become a batching point under load. If a CPU is handling a burst of completions, packets, or block I/O events, the target inference thread may not see completion immediately. The hardware finished. The kernel did not finish delivering that fact to userspace.
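One low-tech way to see this batching point is to watch which softirq classes (NET_RX, TIMER, SCHED, ...) are climbing on the CPUs that host the completion path. A minimal sketch that dumps /proc/softirqs twice:

```c
#include <stdio.h>
#include <unistd.h>

/* Print the per-CPU softirq counters. */
static void dump_softirqs(void)
{
	char line[512];
	FILE *f = fopen("/proc/softirqs", "r");

	if (!f)
		return;
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
}

int main(void)
{
	dump_softirqs();
	sleep(1);                 /* let load accumulate */
	puts("---- 1s later ----");
	dump_softirqs();          /* diff the counters by eye or script */
	return 0;
}
```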
5. Microsecond-Level Latency Breakdown
Exact numbers depend on CPU generation, kernel version, BIOS settings, IOMMU mode, NUMA placement, interrupt moderation, driver behavior, and load. But the shape of the problem is consistent: Linux overhead can be in the same order of magnitude as the accelerator execution window.
| Component | Typical range | Why it varies |
|---|---|---|
| Syscall entry/exit | ~0.2–1 µs | CPU state, mitigations, cache warmth, syscall type |
| Buffer pinning / GUP | ~2–20+ µs | Page-table walks, page faults, NUMA, TLB/cache effects |
| DMA / IOMMU mapping | ~1–10+ µs | IOMMU state, scatter-gather length, mapping reuse |
| PCIe/MMIO queue doorbell | ~sub-µs to several µs | Ordering, posted writes, bridge behavior, contention |
| IRQ delivery | ~3–15+ µs | Interrupt moderation, CPU state, routing, masking |
| SoftIRQ / deferred work | ~5–50+ µs | Backlog, NAPI budget, batching, local CPU load |
| Scheduler delay | ~10–200+ µs | Runqueue length, CFS/EEVDF decisions, migration, preemption |
| Context switch | ~1–5 µs | Cache warmth, TLB effects, task state |
The uncomfortable conclusion: SRAM execution can be smaller than the scheduler delay alone.
6. Scheduler Effects: The Silent Killer
The scheduler is optimized for fairness, utilization, and general-purpose behavior. Inference serving cares about something narrower: predictable wakeup and bounded tail latency. Those goals often conflict.
try_to_wake_up()
→ task becomes runnable
→ task is enqueued
→ scheduler chooses when it actually runs
→ context_switch()
→ userspace sees completion
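A rough way to put a number on the tail of that chain is to timestamp both sides of an eventfd wakeup: the waker stamps just before write(), the sleeper stamps right after read() returns. A minimal probe (illustrative; pin both threads to fixed CPUs for stable numbers):

```c
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <pthread.h>
#include <time.h>
#include <sys/eventfd.h>

static int efd;
static struct timespec t_wake;

static uint64_t ns_between(struct timespec a, struct timespec b)
{
	return (b.tv_sec - a.tv_sec) * 1000000000ull +
	       (b.tv_nsec - a.tv_nsec);
}

static void *sleeper(void *arg)
{
	uint64_t val;
	struct timespec t_run;

	read(efd, &val, sizeof(val));     /* blocks until woken */
	clock_gettime(CLOCK_MONOTONIC, &t_run);
	printf("wakeup-to-run: %lu ns\n",
	       (unsigned long)ns_between(t_wake, t_run));
	return NULL;
}

int main(void)
{
	pthread_t tid;
	uint64_t one = 1;

	efd = eventfd(0, 0);
	pthread_create(&tid, NULL, sleeper, NULL);
	usleep(100000);                   /* let the sleeper block */

	clock_gettime(CLOCK_MONOTONIC, &t_wake);
	write(efd, &one, sizeof(one));    /* drives try_to_wake_up() */

	pthread_join(tid, NULL);
	return 0;
}
```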
Scheduler-driven latency mechanisms
- Runqueue contention: the completion thread is runnable but waits behind other work.
- CPU migration: the task resumes on a different CPU and loses cache warmth.
- Preemption windows: kernel or userspace code may delay when the target can run.
- Tick behavior: timer and tick configuration can affect latency distribution.
- Fairness vs latency: CFS/EEVDF fairness is not the same as inference deadline control.
The usual mitigations are isolcpus, nohz_full, IRQ affinity control, and dedicated polling threads. The idea is not that Linux disappears; it is that Linux is kept away from the hottest cores.
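A minimal affinity sketch; CPU 3 is illustrative and should match isolcpus=3 nohz_full=3 on the kernel command line so the core is actually quiet:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one isolated core. */
static int pin_self_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	/* Fails if the CPU is invalid or offline. */
	return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```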
7. Interrupt Mode vs Polling Mode
Interrupts are efficient when work is sparse. Polling is wasteful but predictable when work is constant. SRAM inference servers often operate in a regime where burning a CPU core to protect P99 can be a rational trade.
Interrupt-driven completion
- Lower idle CPU usage.
- Better for bursty or low-duty workloads.
- Completion path crosses IRQ and deferred work.
- Tail latency depends on interrupt routing and scheduler behavior.
Polling-driven completion
- Higher CPU usage.
- Better for stable, high-throughput inference service loops.
- Can avoid IRQ/softirq wakeup chains.
- Turns latency variance into reserved CPU capacity.
Dedicated polling loop:
    bind thread to isolated CPU
    pre-register buffers
    submit work through shared ring
    while service is active:
        poll completion queue
        process completions immediately
        submit next ready batch
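Expressed as a hedged liburing sketch, where handle() and submit_next_batch() are hypothetical application hooks:

```c
#include <liburing.h>

/* Hypothetical application hooks, provided elsewhere. */
void handle(struct io_uring_cqe *cqe);
void submit_next_batch(struct io_uring *ring);

void service_loop(struct io_uring *ring, volatile int *active)
{
	struct io_uring_cqe *cqe;

	while (*active) {
		/* Non-blocking peek at the completion ring: no IRQ,
		 * no softirq, no scheduler wakeup on the hot path. */
		if (io_uring_peek_cqe(ring, &cqe) == 0) {
			handle(cqe);                  /* process immediately */
			io_uring_cqe_seen(ring, cqe); /* consume the CQE */
			submit_next_batch(ring);
		}
		/* Otherwise spin: the burned core buys bounded P99. */
	}
}
```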
This is why the CPU should not be treated as a passive host. In these servers, the CPU is the timing controller, queue manager, and completion delivery mechanism for the accelerator.
8. Tail at Scale: Why Small Delays Explode at P99
Variance compounds, as earlier sections suggested. More precisely, the server's tail latency is not determined by one component; it is determined by the slowest unlucky component in the chain. If a request crosses ten stages, and each stage has a small probability of a bad delay, the chance that at least one stage hits a bad delay is much larger than any single stage's bad-delay probability.
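A quick worked illustration: if each of ten independent stages goes slow with probability 1%, a request gets through cleanly with probability 0.99^10 ≈ 0.904. Roughly 9.6% of requests hit at least one slow stage, so what looks like a per-stage P99 problem becomes nearly a P90 problem end to end.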
P50 path: warm thread + no GUP surprise + clean IRQ + empty runqueue
P99 path: GUP slow path + IOMMU churn + IRQ delay + softirq backlog + scheduler wait
If deterministic compute removes the big obvious source of variance, the remaining kernel micro-delays become the new product experience.
9. The CPU’s Responsibility in an SRAM-Based Inference Server
In the GPU-era mental model, the CPU “feeds” the accelerator. In the SRAM-inference mental model, that framing is too weak. The CPU is the control-plane engine that determines whether the deterministic dataplane is actually reachable at low latency.
| CPU responsibility | Why it matters |
|---|---|
| Request admission | Prevents queue overload from turning deterministic execution into user-visible tail latency. |
| Batch shaping | Balances throughput against per-request latency and deadline sensitivity. |
| Memory registration | Moves GUP/pinning cost out of the hot path using fixed or pre-registered buffers. |
| DMA orchestration | Controls descriptor layout, queue depth, IOMMU reuse, and device-visible memory flow. |
| Core affinity | Keeps submission/completion threads on hot, isolated cores. |
| Interrupt steering | Routes completions away from noisy cores and toward service loops that can act immediately. |
| Tail protection | Detects backlog, drops or reroutes work, and avoids P99 collapse. |
This is the heart of the argument: the CPU is no longer just a general-purpose host. It is the inference control plane.
10. Problem → Mitigation Map
This post is not yet a full redesign proposal, but the mitigation map is already visible. Each kernel cost points toward a specific systems technique.
| Problem | Kernel-level source | Mitigation direction |
|---|---|---|
| Per-request memory pinning | pin_user_pages(), page-table walks | IORING_REGISTER_BUFFERS, fixed buffers, pre-pinned DMA pools |
| Syscall overhead | Repeated kernel entry/exit | io_uring, SQPOLL-style submission, shared rings |
| Completion jitter | IRQ + softirq + wakeup | Polling, interrupt affinity, completion coalescing tuned for latency |
| Scheduler delay | Runqueue contention, fairness policy | isolcpus, nohz_full, CPU affinity, dedicated service cores |
| Cache-cold wakeups | Task migration and shared CPU noise | Pin service threads, avoid migration, partition cores by role |
| Tail amplification | Independent jitter sources across layers | Admission control, bounded queues, deadline-aware dispatch |
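As one concrete instance of the last row, a minimal admission-control sketch with a bounded in-flight counter; MAX_DEPTH and all names are illustrative:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_DEPTH 64   /* tune to device queue depth + deadline budget */

static atomic_int inflight;

/* Returns false when the request should be shed or rerouted. */
bool try_admit(void)
{
	int cur = atomic_load(&inflight);

	while (cur < MAX_DEPTH) {
		/* On failure, cur is reloaded and the bound is rechecked. */
		if (atomic_compare_exchange_weak(&inflight, &cur, cur + 1))
			return true;  /* admitted: queueing stays bounded */
	}
	return false;                 /* backlog: fail fast, protect P99 */
}

/* Call once per completed request. */
void complete_one(void)
{
	atomic_fetch_sub(&inflight, 1);
}
```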
11. Architectural Conclusion
SRAM-based inference changes the system bottleneck by making accelerator execution less mysterious. That is a good thing, but it means the host can no longer hide behind the device.
The server now looks like this:
Total latency =
Linux submission path
+ DMA and queueing
+ deterministic SRAM execution
+ Linux completion path
+ scheduler wakeup
The stable part is the accelerator. The unstable parts are the paths into and out of it.
The bottleneck did not disappear. It moved into the Linux control plane.
12. Final Statement
The next phase of inference performance will not be won only by faster dataplanes. It will be won by making the CPU, kernel, driver, DMA, and completion path behave like a deterministic inference pipeline instead of a generic best-effort operating system path.