Why LLM Prefill and Decode Suffer from Cache Misses, Branch Stalls, and Pipeline Bubbles on CPUs
Raw FLOPs and clock speed tell only half the story. The real performance battle in CPU-based LLM inference is about how well the processor keeps its execution machinery fed—and how often it is forced to wait.
To reduce the impact of cache misses, branch stalls, and pipeline bubbles in LLM prefill and decode, make the hot path regular, vectorized, cache-local, and prefetchable.
Why This Matters
When people first look at LLM inference, they focus on parameter count, tokens per second, or memory bandwidth. Those all matter. But once you get close to actual CPU execution, the real question becomes: how often is the processor forced to wait?
Those waits show up in three recurring forms:
- Cache misses — the data you want is not in the cache level you hoped for, so you fall back to a slower level.
- Branch stalls — the CPU predicts the wrong control-flow path and has to recover.
- Pipeline bubbles — parts of the execution pipeline sit idle because the next useful instruction or data is not ready.
Prefill and decode suffer from these effects differently, but both are deeply shaped by them.
Prefill and Decode Are Different Workloads
Before analyzing the stall types, it helps to understand that prefill and decode place very different pressure on the CPU. They may be two phases of the same request, but they are fundamentally different computational regimes; the sketch after the two lists below makes the difference concrete.
Prefill
- All prompt tokens available upfront
- Large batched matrix operations
- Good SIMD / AVX utilization
- Predictable memory access patterns
- Cache reuse across token batch
Decode
- One token generated at a time
- Narrow GEMV — undersized for SIMD
- Repeated weight reads every step
- Growing KV cache pressure
- Stronger sensitivity to tail latency
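As a rough sketch (hypothetical function names and dimensions, a single weight matrix, no attention, quantization, or tiling), the same weights are consumed very differently in the two phases: prefill amortizes each weight read across a whole block of prompt tokens, while decode streams every weight to produce one token.

```cpp
#include <cstddef>

// Prefill: one GEMM over all M prompt tokens. The weight row loaded for
// output feature n is reused across every token m, so each cache line
// fetched from memory pays for itself many times over.
void prefill_gemm(const float* W,   // [N x K] weights, row-major
                  const float* X,   // [M x K] prompt activations
                  float* Y,         // [M x N] outputs
                  std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t n = 0; n < N; ++n)
        for (std::size_t m = 0; m < M; ++m) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k)
                acc += W[n * K + k] * X[m * K + k];  // same weight row reused for every token m
            Y[m * N + n] = acc;
        }
}

// Decode: one GEMV per generated token. Every weight element is read to
// produce a single output value, so each step is dominated by streaming
// the whole matrix from DRAM rather than by arithmetic.
void decode_gemv(const float* W,    // [N x K] weights, row-major
                 const float* x,    // [K] current-token activation
                 float* y,          // [N] output
                 std::size_t N, std::size_t K) {
    for (std::size_t n = 0; n < N; ++n) {
        float acc = 0.0f;
        for (std::size_t k = 0; k < K; ++k)
            acc += W[n * K + k] * x[k];              // no reuse: each weight is touched exactly once
        y[n] = acc;
    }
}
```

This is why prefill and decode throughput diverge so sharply on the same machine: the first is compute-shaped, the second is memory-shaped.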
Cache Misses
CPUs rely on a hierarchy of caches. L1 is smallest and fastest, L2 larger and slower,
L3 larger still, and DRAM is dramatically slower than any on-core cache.
A cache miss happens when data needed by the core is absent from the expected level and must be fetched from farther away.
Why Cache Misses Hurt LLM Inference
LLM inference constantly touches large model weights, per-layer KV cache, activations, and temporary buffers. For a 7B-parameter model at fp16, weights alone occupy ~14 GB—far beyond any L3 cache. Every decode step re-reads a large portion of those weights from DRAM.
Cache Misses in Prefill vs Decode
Prefill is easier to optimize: multiple prompt tokens are processed together, kernels can reuse activation blocks and weight tiles before eviction, and blocked matrix operations achieve much better locality.
Decode is more fragile. It keeps rereading weights with no natural batching. Long context windows increase the KV footprint further, making DRAM trips directly visible in token latency.
How to Reduce the Impact
- Use cache-friendly tensor layouts (row-major with SIMD alignment)
- Block and tile matrix operations (keep working set ≤ L2/L3 per tile; sketched after this list)
- Quantize weights so more fit into higher cache levels (INT8/INT4)
- Keep NUMA locality correct — pin threads to their memory domain
- Separate hot data from cold data (weights vs activations)
- Stage predictable weight tiles before they are needed (SW prefetch)
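A minimal sketch of the blocking idea, assuming row-major float matrices and made-up tile sizes (`TILE_N`, `TILE_K`); real kernels pick tile sizes from measured cache capacities and add SIMD, packing, and threading on top.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical tile sizes; in practice chosen so one weight tile plus the
// activation slice it touches stays resident in the nearest cache levels.
constexpr std::size_t TILE_N = 64;
constexpr std::size_t TILE_K = 256;

// Cache-blocked GEMV: Y[N] += W[N x K] * X[K], row-major W, Y pre-zeroed.
// The k-dimension is split into tiles so the same slice of X is reused for
// TILE_N weight rows before it can be evicted.
void gemv_blocked(const float* W, const float* X, float* Y,
                  std::size_t N, std::size_t K) {
    for (std::size_t n0 = 0; n0 < N; n0 += TILE_N) {
        const std::size_t n1 = std::min(n0 + TILE_N, N);
        for (std::size_t k0 = 0; k0 < K; k0 += TILE_K) {
            const std::size_t k1 = std::min(k0 + TILE_K, K);
            // Inner loops work on one TILE_N x TILE_K block: X[k0..k1) stays
            // hot in cache while it is combined with TILE_N weight rows.
            for (std::size_t n = n0; n < n1; ++n) {
                float acc = 0.0f;
                for (std::size_t k = k0; k < k1; ++k)
                    acc += W[n * K + k] * X[k];
                Y[n] += acc;
            }
        }
    }
}
```

The same blocking idea extends to GEMM and attention kernels; the tile sizes become a 2-D or 3-D choice, and a packed weight tile is reused across many activation columns instead of a single vector.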
Branch Stalls
Modern CPUs predict branches before they are resolved so the pipeline stays busy. When the prediction is wrong, the CPU must discard speculative work and restart on the correct path—burning 10–20+ cycles in the process.
Branch Stalls in Prefill vs Decode
Prefill often has relatively structured control flow: large matrix operations, stable loops, and predictable token blocks. Well-written prefill kernels keep branch mispredictions low.
Decode has more opportunities for runtime-sensitive behavior to leak inward: sequence-length handling, KV management, dynamic batching policies, and scheduling decisions. If those branches enter the numerical hot loop, performance degrades quickly.
How to Reduce the Impact
- Move policy logic outside tight compute loops (see the sketch after this list)
- Use specialized kernels for common cases (e.g. fixed sequence lengths)
- Prefer table-driven decisions over unpredictable inner-loop branching
- Keep the hot path regular and shape-stable across iterations
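A small illustration of the first two points, with hypothetical flags and kernel names: the branchy version re-asks a policy question on every element, while the restructured version answers it once and dispatches to a specialized, branch-free loop.

```cpp
#include <algorithm>
#include <cstddef>

// Branchy version: a policy decision (apply bias or not) sits inside the hot
// loop. A loop-invariant flag is usually predicted well, but the branch still
// clutters the loop body, can defeat auto-vectorization, and becomes a real
// misprediction source the moment the decision starts varying per element
// (masking, KV-management checks, dynamic batching edges).
void relu_branchy(const float* in, const float* bias, float* out,
                  std::size_t n, bool use_bias) {
    for (std::size_t i = 0; i < n; ++i) {
        float v = in[i];
        if (use_bias)                 // asked n times, answered once
            v += bias[i];
        out[i] = std::max(v, 0.0f);
    }
}

// Restructured version: the policy branch is resolved once, outside the loop,
// and each specialized loop body is regular, branch-free, and easy for the
// compiler to vectorize.
static void relu_with_bias(const float* in, const float* bias, float* out,
                           std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = std::max(in[i] + bias[i], 0.0f);
}

static void relu_no_bias(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = std::max(in[i], 0.0f);
}

void relu_specialized(const float* in, const float* bias, float* out,
                      std::size_t n, bool use_bias) {
    if (use_bias) relu_with_bias(in, bias, out, n);
    else          relu_no_bias(in, out, n);
}
```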
Pipeline Bubbles
A pipeline bubble is an underutilized cycle where part of the CPU's execution machinery goes idle because the next instruction or data is not ready. Bubbles can arise from memory waits, branch recovery, long dependency chains, or poor instruction scheduling.
Pipeline Behavior in Prefill vs Decode
Prefill gives the processor more room to hide bubbles: operations are larger, more vectorized, and richer in instruction-level parallelism. The compiler and kernel can unroll and schedule more effectively.
Decode is more exposed. There is less fresh token-level parallelism, more memory sensitivity, and smaller units of sequential progress. One missing tile or one dependent chain can leave execution units idle for dozens of cycles.
How to Reduce the Impact
- Vectorize aggressively: fill the SIMD units fully (`AVX-512`, `SVE2`)
- Unroll hot loops carefully to expose more independent work (see the sketch after this list)
- Overlap loads with compute wherever the dependency graph allows
- Reduce unnecessary data dependencies in kernel design
- Keep more independent work in-flight for out-of-order execution to exploit
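A sketch of the unrolling idea using AVX2/FMA intrinsics (assumes `n` is a multiple of 32 and a machine with FMA; a real kernel handles remainders and selects the ISA at build time). Four independent accumulators break the single add-chain so the out-of-order core can keep several FMAs in flight while loads for later iterations are still arriving.

```cpp
#include <immintrin.h>
#include <cstddef>

// Dot product with four independent 8-wide accumulators (AVX2 + FMA).
// A single accumulator would serialize every FMA behind the previous one
// (a multi-cycle latency chain); independent accumulators let the core
// overlap those latencies and hide load misses behind useful work.
float dot_avx2(const float* a, const float* b, std::size_t n) {
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();
    for (std::size_t i = 0; i < n; i += 32) {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),      _mm256_loadu_ps(b + i),      acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),  _mm256_loadu_ps(b + i + 8),  acc1);
        acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 16), _mm256_loadu_ps(b + i + 16), acc2);
        acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 24), _mm256_loadu_ps(b + i + 24), acc3);
    }
    // Combine the partial sums and reduce horizontally.
    __m256 acc  = _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));
    __m128 lo   = _mm256_castps256_ps128(acc);
    __m128 hi   = _mm256_extractf128_ps(acc, 1);
    __m128 sum4 = _mm_add_ps(lo, hi);
    __m128 sum2 = _mm_add_ps(sum4, _mm_movehl_ps(sum4, sum4));
    __m128 sum1 = _mm_add_ss(sum2, _mm_shuffle_ps(sum2, sum2, 1));
    return _mm_cvtss_f32(sum1);
}
```

Compilers sometimes perform this transformation on their own, but only when the reduction may be reassociated; writing the accumulators out explicitly keeps the behavior predictable across toolchains.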
What "Regular, Vectorized, Cache-Local, and Prefetchable" Really Means
These four properties form the complete target state for any CPU inference hot path. Regular means control flow and tensor shapes stay stable from iteration to iteration; vectorized means the SIMD units are kept full; cache-local means each tile's working set fits in the nearest cache levels; prefetchable means the access pattern is predictable enough for hardware and software prefetchers to stay ahead of the compute. They are not independent optimizations; they compound each other.
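A compact, hypothetical way to see all four at once: a row-streaming GEMV whose loop body never changes shape (regular), whose inner reduction the compiler can vectorize (vectorized), which consumes one contiguous weight row at a time (cache-local), and which issues software prefetches for rows it will need a few iterations later (prefetchable). The prefetch distance and the two-line prefetch pattern are made-up tuning choices, not a recipe.

```cpp
#include <xmmintrin.h>   // _mm_prefetch
#include <cstddef>

// Hypothetical prefetch distance, in rows: how far ahead of the current row
// we ask the hardware to start pulling data. Tuned empirically per machine.
constexpr std::size_t PREFETCH_ROWS_AHEAD = 4;

// Row-streaming GEMV over a row-major weight matrix W[N x K].
// - regular:      the same fixed-stride, branch-light loop body every iteration
// - vectorized:   the inner k-loop is a clean reduction the compiler can
//                 auto-vectorize (or be replaced by intrinsics)
// - cache-local:  each pass consumes one contiguous weight row end to end
// - prefetchable: upcoming rows are requested while the current row computes
void gemv_prefetch(const float* W, const float* x, float* y,
                   std::size_t N, std::size_t K) {
    for (std::size_t n = 0; n < N; ++n) {
        if (n + PREFETCH_ROWS_AHEAD < N) {
            const float* next = W + (n + PREFETCH_ROWS_AHEAD) * K;
            // Request the first cache lines of a row we will need soon.
            _mm_prefetch(reinterpret_cast<const char*>(next), _MM_HINT_T0);
            _mm_prefetch(reinterpret_cast<const char*>(next) + 64, _MM_HINT_T0);
        }
        float acc = 0.0f;
        for (std::size_t k = 0; k < K; ++k)
            acc += W[n * K + k] * x[k];
        y[n] = acc;
    }
}
```

On a purely sequential stream like this the hardware prefetcher already does most of the work; explicit `_mm_prefetch` earns its keep at tile boundaries and for less regular access such as KV-cache gathers.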
Who Helps Avoid What?
A key insight is that no single actor owns this problem. Performance is co-produced by algorithm structure, runtime decisions, hand-tuned kernels, and CPU microarchitecture working together.
| Stall Type | Primary Mitigations |
|---|---|
| Cache misses | Runtime layout, kernel design, hardware caches, quantization |
| Branch stalls | Kernel design, compiler shaping, branch predictor |
| Pipeline bubbles | Compiler scheduling, kernel design, out-of-order hardware, vectorization |
The important takeaway: this is not "the compiler's job" alone. The compiler helps, but algorithm structure, runtime decisions, hand-tuned libraries, and CPU microarchitecture all must work together.
Why Agentic AI Makes This More Important
Agentic systems amplify these issues because they often have long prompts, tool outputs, retrieved context, scratchpads, and multiple model invocations in a single turn.
Agentic Workload Pressures
- Prefill cost grows with context volume
- Decode latency is directly visible to users
- KV cache grows significantly across turns
- Orchestration overhead competes with model execution
- Multiple model invocations multiply every inefficiency
On CPUs, that makes locality, vectorization, and prefetchability even more critical. It is not enough to "have enough cores." The entire hot path must be shaped so that the processor spends more time computing and less time waiting.
Practical Optimization Checklist
- Use quantized weights (`INT8`/`INT4`) where acceptable for quality
- Use cache-blocked GEMM and attention kernels: keep tiles in L2/L3
- Keep hot loops branch-light — move policy logic outside numerical kernels
- Pin threads to physical cores; benchmark SMT/HT on and off for your workload (see the pinning sketch after this checklist)
- Keep NUMA locality correct — run model weights in the socket's local memory
- Separate weight traffic from KV cache traffic where possible
- Prefetch predictable weight tiles with software prefetch instructions
- Minimize decode-path randomness — stable kernel shapes help branch prediction
- Use `AVX-512`/`SVE2` where available; verify vectorization with compiler reports
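For the pinning item above, a minimal Linux-only sketch using `pthread_setaffinity_np`. It assumes a simple worker-index-to-core mapping, which is rarely the right mapping on SMT or multi-socket machines; real runtimes consult the topology first.

```cpp
// Linux-only sketch: pin the calling thread to one core.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

// Pin the calling thread to `core_id`. Returns 0 on success.
// A real inference runtime maps worker -> core using the machine topology
// (physical vs SMT siblings, NUMA node of the weight allocation).
static int pin_current_thread(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0)
        fprintf(stderr, "pinning to core %d failed (rc=%d)\n", core_id, rc);
    return rc;
}
```

Pair this with a NUMA-aware allocation policy (for example `numactl --membind`, or first-touch initialization from the pinned threads) so the weights a thread streams actually live in its socket's local memory.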
Final Takeaway
Performance comes from how well the system turns large, repetitive tensor workloads into a regular, locality-friendly, vectorized stream that the processor can execute without constantly stalling. Core count and clock speed are the ceiling—not the floor.
X Post Version
To reduce the impact of cache misses, branch stalls, and pipeline bubbles in LLM prefill/decode,
make the hot path regular, vectorized, cache-local, and prefetchable.
Who helps avoid what:
1) Cache misses → mostly runtime + kernel design + hardware caches
2) Branch stalls → kernel design + compiler + branch predictor
3) Pipeline bubbles → compiler + kernel design + hardware scheduling
#AI #CPU #AgenticAI $AMD $INTC $ARM $NVDA