CPU Inference · Deep Dive

Why LLM Prefill and Decode Suffer from Cache Misses, Branch Stalls, and Pipeline Bubbles on CPUs

Raw FLOPs and clock speed tell only half the story. The real performance battle in CPU-based LLM inference is about how well the processor keeps its execution machinery fed—and how often it is forced to wait.

Tags: Prefill · Decode · Intel · AMD · ARM · Agentic AI · Long-form explainer

To reduce the impact of cache misses, branch stalls, and pipeline bubbles in LLM prefill and decode, make the hot path regular, vectorized, cache-local, and prefetchable.

Why This Matters

When people first look at LLM inference, they focus on parameter count, tokens per second, or memory bandwidth. Those all matter. But once you get close to actual CPU execution, the real question becomes: how often is the processor forced to wait?

Those waits show up in three recurring forms:

  • Cache misses: data needed by the core must be fetched from a slower, more distant level of the memory hierarchy
  • Branch stalls: a mispredicted branch forces the pipeline to discard speculative work and restart on the correct path
  • Pipeline bubbles: execution slots go idle because the next instruction or data is not ready

Prefill and decode suffer from these effects differently, but both are deeply shaped by them.

CPU execution — where time is lost
[Figure: three types of CPU stalls in LLM inference. A pipeline timeline contrasting useful compute with cache miss waits, branch recovery, and pipeline bubbles.]

Prefill and Decode Are Different Workloads

Before analyzing the stall types, it helps to understand that prefill and decode place very different pressure on the CPU. They are not two phases of the same operation—they are fundamentally different computational regimes.

Workload structure: prefill vs decode
[Figure: prefill vs decode workload structure. Prefill: tokens T1…Tn processed as wide GEMMs with good SIMD utilization; KV cache built once with sequential writes; parallelism, cache reuse, predictability. Decode: one token at a time as a narrow GEMV with poor SIMD fill; KV cache grows, with repeated DRAM reads; no parallelism, DRAM pressure, latency-visible.]

Prefill

  • All prompt tokens available upfront
  • Large batched matrix operations
  • Good SIMD / AVX utilization
  • Predictable memory access patterns
  • Cache reuse across token batch

Decode

  • One token generated at a time
  • Narrow GEMV — undersized for SIMD
  • Repeated weight reads every step
  • Growing KV cache pressure
  • Stronger sensitivity to tail latency
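To make the contrast concrete, here is a minimal C++ sketch of the two regimes. The naive reference kernels and square dimensions are illustrative, not any runtime's actual API:

    // Sketch: the structural contrast between prefill and decode.
    #include <cstddef>

    // C[m][n] += A[m][k] * B[k][n]: prefill runs one wide GEMM per layer,
    // with one row of A per prompt token.
    void gemm(const float* A, const float* B, float* C,
              std::size_t m, std::size_t k, std::size_t n) {
        for (std::size_t i = 0; i < m; ++i)
            for (std::size_t p = 0; p < k; ++p) {
                const float a = A[i * k + p];
                for (std::size_t j = 0; j < n; ++j)
                    C[i * n + j] += a * B[p * n + j];  // row of B reused per token
            }
    }

    // y[n] += W[n][k] * x[k]: decode runs one narrow GEMV per generated token.
    void gemv(const float* W, const float* x, float* y,
              std::size_t n, std::size_t k) {
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t p = 0; p < k; ++p)
                y[i] += W[i * k + p] * x[p];  // W read once, used once, evicted
    }

In the GEMM, each row of B is reused across every prompt token before eviction; the GEMV has no reuse dimension for W, which is exactly why decode leans so hard on DRAM.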

Cache Misses

CPUs rely on a hierarchy of caches. L1 is smallest and fastest, L2 larger and slower, L3 larger still, and DRAM is dramatically slower than any on-core cache. A cache miss happens when data needed by the core is absent from the expected level and must be fetched from farther away.

CPU cache hierarchy — latency profile
  • L1 cache: ~4 ns · 32–64 KB per core
  • L2 cache: ~10 ns · 256 KB to 4 MB per core
  • L3 cache (shared): ~40 ns · 8–128 MB
  • DRAM: ~100 ns · 32–256 GB

Why Cache Misses Hurt LLM Inference

LLM inference constantly touches large model weights, per-layer KV cache, activations, and temporary buffers. For a 7B-parameter model at fp16, weights alone occupy ~14 GB—far beyond any L3 cache. Every decode step re-reads a large portion of those weights from DRAM.
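A back-of-envelope sketch makes the ceiling explicit. The 50 GB/s sustained-bandwidth figure is an assumption for a mid-range desktop part, not a measurement:

    // Back-of-envelope: bandwidth-bound decode ceiling for a 7B fp16 model.
    #include <cstdio>

    int main() {
        const double params        = 7e9;   // 7B parameters
        const double bytes_per_w   = 2.0;   // fp16
        const double weight_bytes  = params * bytes_per_w;  // ~14 GB
        const double dram_b_per_s  = 50e9;  // assumed sustained DRAM bandwidth

        // If each decode step must stream the full weight set from DRAM,
        // memory traffic alone bounds throughput, regardless of FLOPs.
        const double sec_per_token = weight_bytes / dram_b_per_s;
        std::printf("ceiling: %.1f tokens/s (%.0f ms/token)\n",
                    1.0 / sec_per_token, sec_per_token * 1e3);
        return 0;
    }

At those numbers, streaming 14 GB per step caps decode below ~4 tokens/s no matter how many FLOPs the cores can issue.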

Cache Misses in Prefill vs Decode

Prefill is easier to optimize: multiple prompt tokens are processed together, kernels can reuse activation blocks and weight tiles before eviction, and blocked matrix operations achieve much better locality.

Decode is more fragile. It keeps rereading weights with no natural batching. Long context windows increase the KV footprint further, making DRAM trips directly visible in token latency.

How to Reduce the Impact

  • Block and tile matrix operations so weight and activation tiles are reused from cache before eviction
  • Quantize weights (e.g., int8 or int4) to shrink the working set and the DRAM traffic paid on every decode step
  • Lay out weights and KV cache contiguously so reads stream sequentially
  • Prefetch streaming weight reads ahead of use
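A minimal sketch of the blocking idea, with tile sizes as tuning assumptions rather than fixed values:

    // Sketch: cache-blocked GEMM. Tiles sized so a 64x64 block of B (~16 KB)
    // stays resident while a whole block of A rows consumes it.
    #include <algorithm>
    #include <cstddef>

    constexpr std::size_t BM = 64, BN = 64, BK = 64;  // tuning assumptions

    // C[m][n] += A[m][k] * B[k][n], processed tile by tile so each B tile is
    // reused for BM rows of A before it is evicted.
    void gemm_blocked(const float* A, const float* B, float* C,
                      std::size_t m, std::size_t k, std::size_t n) {
        for (std::size_t i0 = 0; i0 < m; i0 += BM)
        for (std::size_t p0 = 0; p0 < k; p0 += BK)
        for (std::size_t j0 = 0; j0 < n; j0 += BN) {
            const std::size_t i1 = std::min(i0 + BM, m);
            const std::size_t p1 = std::min(p0 + BK, k);
            const std::size_t j1 = std::min(j0 + BN, n);
            for (std::size_t i = i0; i < i1; ++i)
                for (std::size_t p = p0; p < p1; ++p) {
                    const float a = A[i * k + p];   // held in a register
                    for (std::size_t j = j0; j < j1; ++j)
                        C[i * n + j] += a * B[p * n + j];  // B tile stays hot
                }
        }
    }

Blocking helps prefill most, since it is the phase with a reuse dimension to exploit; for decode, shrinking the weights themselves (quantization) is the bigger lever.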

Branch Stalls

Modern CPUs predict branches before they are resolved so the pipeline stays busy. When the prediction is wrong, the CPU must discard speculative work and restart on the correct path—burning 10–20+ cycles in the process.

Branch misprediction — pipeline recovery cost
[Figure: branch misprediction recovery. Timeline of speculative execution down the wrong path, misprediction detection, pipeline flush of the wasted work, and restart on the correct path: ~10–20 cycles of recovery cost.]

Branch Stalls in Prefill vs Decode

Prefill often has relatively structured control flow: large matrix operations, stable loops, and predictable token blocks. Well-written prefill kernels keep branch mispredictions low.

Decode has more opportunities for runtime-sensitive behavior to leak inward: sequence-length handling, KV management, dynamic batching policies, and scheduling decisions. If those branches enter the numerical hot loop, performance degrades quickly.

How to Reduce the Impact

  • Hoist sequence-length handling, KV management, and scheduling decisions out of the numerical hot loop
  • Specialize kernels per shape so inner-loop trip counts stay stable and predictable
  • Replace data-dependent branches with branchless selects or masked (predicated) operations
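A minimal sketch of the branchless rewrite, using ReLU as a stand-in for any data-dependent clamp in a hot loop:

    // Sketch: the same clamp written branchy vs. branchless.
    #include <algorithm>
    #include <cstddef>

    // Data-dependent branch in the inner loop: the predictor must guess per
    // element, and activation signs are effectively random.
    void relu_branchy(float* a, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            if (a[i] < 0.0f)      // ~50% mispredict rate on random activations
                a[i] = 0.0f;
    }

    // Branchless: compiles to a max/blend instruction, needs no prediction,
    // and vectorizes cleanly across SIMD lanes.
    void relu_branchless(float* a, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            a[i] = std::max(a[i], 0.0f);
    }

Modern compilers often if-convert the first form on their own, but relying on that is fragile; writing the select explicitly keeps the loop regular and vectorizable.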

Pipeline Bubbles

A pipeline bubble is an underutilized cycle where part of the CPU's execution machinery goes idle because the next instruction or data is not ready. Bubbles can arise from memory waits, branch recovery, long dependency chains, or poor instruction scheduling.

Classic 5-stage pipeline — bubble propagation
[Figure: classic 5-stage pipeline (Fetch, Decode, Execute, Memory, Write) across 8 cycles. An L3 miss in I1's Memory stage stalls the pipeline, and downstream instructions queue behind it as bubbles: a single L3 cache miss stalls the entire pipeline.]

Pipeline Behavior in Prefill vs Decode

Prefill gives the processor more room to hide bubbles: operations are larger, more vectorized, and richer in instruction-level parallelism. The compiler and kernel can unroll and schedule more effectively.

Decode is more exposed. There is less fresh token-level parallelism, more memory sensitivity, and smaller units of sequential progress. One missing tile or one dependent chain can leave execution units idle for dozens of cycles.

How to Reduce the Impact

  • Unroll hot loops with several independent accumulators to break serial dependency chains
  • Vectorize so each issued instruction retires more useful work
  • Schedule loads ahead of use so memory latency overlaps with compute
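A minimal sketch of breaking a dependency chain with independent accumulators; the unroll factor of four is an assumption to tune per core:

    // Sketch: one accumulator serializes every add behind the previous one;
    // four independent accumulators let the out-of-order core keep several
    // multiply-adds in flight at once.
    #include <cstddef>

    float dot_serial(const float* a, const float* b, std::size_t n) {
        float acc = 0.0f;
        for (std::size_t i = 0; i < n; ++i)
            acc += a[i] * b[i];       // each add waits on the last: bubbles
        return acc;
    }

    float dot_ilp(const float* a, const float* b, std::size_t n) {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {  // four independent chains in flight
            s0 += a[i + 0] * b[i + 0];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        for (; i < n; ++i) s0 += a[i] * b[i];   // scalar remainder
        return (s0 + s1) + (s2 + s3);
    }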

What "Regular, Vectorized, Cache-Local, and Prefetchable" Really Means

These four properties form the complete target state for any CPU inference hot path. They are not independent optimizations—they compound each other.

The four optimization properties
  • Regular: predictable loops, stable shapes, no inner-loop branches · reduces branch stalls
  • Vectorized: full SIMD lane utilization via AVX2/AVX-512, NEON, SVE · reduces pipeline bubbles
  • Cache-local: working set stays in L2/L3; tiles reused before eviction · reduces cache misses
  • Prefetchable: data fetched before it is needed; weight streams are especially amenable · reduces cache misses
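All four properties can coexist in a single inner loop. A sketch using AVX2 with FMA, both assumed available (compile with -mavx2 -mfma); NEON and SVE versions follow the same shape:

    // Sketch: a dot-product kernel that is regular, vectorized, cache-local,
    // and prefetchable at once.
    #include <immintrin.h>
    #include <cstddef>

    float dot_avx2(const float* a, const float* b, std::size_t n) {
        __m256 acc = _mm256_setzero_ps();
        std::size_t i = 0;
        // Regular: fixed-stride loop, no data-dependent branches in the body.
        for (; i + 8 <= n; i += 8) {
            // Prefetchable: hint the lines several iterations ahead toward L1.
            _mm_prefetch(reinterpret_cast<const char*>(a + i + 64), _MM_HINT_T0);
            _mm_prefetch(reinterpret_cast<const char*>(b + i + 64), _MM_HINT_T0);
            // Vectorized: 8 fp32 lanes per fused multiply-add.
            // Cache-local: purely sequential access, every fetched line is
            // fully consumed before moving on.
            acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                                  _mm256_loadu_ps(b + i), acc);
        }
        // Horizontal reduction of the 8 lanes, then the scalar remainder.
        alignas(32) float lanes[8];
        _mm256_store_ps(lanes, acc);
        float s = 0.0f;
        for (int l = 0; l < 8; ++l) s += lanes[l];
        for (; i < n; ++i) s += a[i] * b[i];
        return s;
    }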

Who Helps Avoid What?

A key insight is that no single actor owns this problem. Performance is co-produced by algorithm structure, runtime decisions, hand-tuned kernels, and CPU microarchitecture working together.

Stall type          Primary mitigations
Cache misses        Runtime layout · Kernel design · Hardware caches · Quantization
Branch stalls       Kernel design · Compiler shaping · Branch predictor
Pipeline bubbles    Compiler scheduling · Kernel design · Out-of-order hardware · Vectorization

The important takeaway: none of this is "the compiler's job" alone. Each stall type is attacked at several layers at once, and a weak link at any one layer reintroduces the waits the others worked to remove.

Why Agentic AI Makes This More Important

Agentic systems amplify these issues because they often have long prompts, tool outputs, retrieved context, scratchpads, and multiple model invocations in a single turn.

Agentic workload pressures

  • Prefill cost grows with context volume
  • Decode latency is directly visible to users
  • KV cache grows significantly across turns (sized in the sketch after this list)
  • Orchestration overhead competes with model execution
  • Multiple model invocations multiply every inefficiency
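A back-of-envelope sketch of that KV growth, assuming a 7B-class configuration (32 layers, hidden size 4096, fp16, full multi-head attention) that is illustrative rather than any specific model:

    // Back-of-envelope: KV cache footprint as context grows.
    #include <cstdio>
    #include <initializer_list>

    int main() {
        const double layers     = 32;
        const double hidden     = 4096;  // heads * head_dim
        const double bytes_fp16 = 2;
        // K and V each store one hidden-sized vector per layer per token:
        // ~512 KB of cache state for every token kept in context.
        const double kv_per_token = 2 * layers * hidden * bytes_fp16;

        for (double ctx : {2048.0, 8192.0, 32768.0})
            std::printf("context %6.0f tokens -> KV cache %5.1f GB\n",
                        ctx, kv_per_token * ctx / 1e9);
        return 0;
    }

At 32k tokens of accumulated agentic context, the KV cache alone exceeds any L3 by two orders of magnitude, so attention reads become pure DRAM traffic.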

On CPUs, that makes locality, vectorization, and prefetchability even more critical. It is not enough to "have enough cores." The entire hot path must be shaped so that the processor spends more time computing and less time waiting.


Practical Optimization Checklist

  • Process prompt tokens in batches so prefill runs wide, SIMD-friendly GEMMs
  • Block and tile so working sets stay in L2/L3 and tiles are reused before eviction
  • Quantize weights to shrink the footprint re-read on every decode step
  • Keep inner loops regular: stable shapes, no data-dependent branches in the hot path
  • Vectorize for the target ISA (AVX2/AVX-512, NEON, SVE) and fill the lanes
  • Unroll with independent accumulators to expose instruction-level parallelism
  • Prefetch streaming weight reads before they are needed
  • Watch KV cache growth across turns; long contexts push decode toward DRAM limits

Final Takeaway

Performance comes from how well the system turns large, repetitive tensor workloads into a regular, locality-friendly, vectorized stream that the processor can execute without constantly stalling. Core count and clock speed are the ceiling—not the floor.

X Post Version

Suggested post

To reduce the impact of cache misses, branch stalls, and pipeline bubbles in LLM prefill/decode,
make the hot path regular, vectorized, cache-local, and prefetchable.

Who helps avoid what:
1) Cache misses → mostly runtime + kernel design + hardware caches
2) Branch stalls → kernel design + compiler + branch predictor
3) Pipeline bubbles → compiler + kernel design + hardware scheduling

#AI #CPU #AgenticAI $AMD $INTC $ARM $NVDA