CPU Inference · Deep Dive

Why LLM Prefill and Decode Suffer from Cache Misses, Branch Stalls, and Pipeline Bubbles on CPUs

Raw FLOPs and clock speed tell only half the story. The real performance battle in CPU-based LLM inference is about how well the processor keeps its execution machinery fed—and how often it is forced to wait.

Tags: Prefill · Decode · Intel · AMD · ARM · Agentic AI · Long-form explainer

To reduce the impact of cache misses, branch stalls, and pipeline bubbles in LLM prefill and decode, make the hot path regular, vectorized, cache-local, and prefetchable.

Why This Matters

When people first look at LLM inference, they focus on parameter count, tokens per second, or memory bandwidth. Those all matter. But once you get close to actual CPU execution, the real question becomes: how often is the processor forced to wait?

Those waits show up in three recurring forms:

  • Cache misses: data needed by the core must be fetched from a slower, more distant level of the memory hierarchy
  • Branch stalls: a mispredicted branch forces the pipeline to discard speculative work and restart on the correct path
  • Pipeline bubbles: execution slots go idle because the next instruction or data is not ready

Prefill and decode suffer from these effects differently, but both are deeply shaped by them.

CPU execution — where time is lost
[Figure: three types of CPU stalls in LLM inference. A pipeline timeline contrasting useful compute with cache miss waits, branch recovery, and pipeline bubbles.]

Prefill and Decode Are Different Workloads

Before analyzing the stall types, it helps to understand that prefill and decode place very different pressure on the CPU. They are not two phases of the same operation—they are fundamentally different computational regimes.

Workload structure: prefill vs decode
[Figure: prefill vs decode workload structure. Prefill: tokens T1…Tn processed as wide GEMMs with good SIMD utilization; KV cache built once with sequential writes; parallelism, cache reuse, predictability. Decode: one token at a time as a narrow GEMV with poor SIMD fill; KV cache grows, with repeated DRAM reads; no parallelism, DRAM pressure, latency-visible.]

Prefill

  • All prompt tokens available upfront
  • Large batched matrix operations
  • Good SIMD / AVX utilization
  • Predictable memory access patterns
  • Cache reuse across token batch

Decode

  • One token generated at a time
  • Narrow GEMV — undersized for SIMD
  • Repeated weight reads every step
  • Growing KV cache pressure
  • Stronger sensitivity to tail latency
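To make the contrast concrete, here is a minimal C++ sketch of the two regimes. The naive reference kernels and square dimensions are illustrative, not any runtime's actual API:

    // Sketch: the structural contrast between prefill and decode.
    #include <cstddef>

    // C[m][n] += A[m][k] * B[k][n]: prefill runs one wide GEMM per layer,
    // with one row of A per prompt token.
    void gemm(const float* A, const float* B, float* C,
              std::size_t m, std::size_t k, std::size_t n) {
        for (std::size_t i = 0; i < m; ++i)
            for (std::size_t p = 0; p < k; ++p) {
                const float a = A[i * k + p];
                for (std::size_t j = 0; j < n; ++j)
                    C[i * n + j] += a * B[p * n + j];  // row of B reused per token
            }
    }

    // y[n] += W[n][k] * x[k]: decode runs one narrow GEMV per generated token.
    void gemv(const float* W, const float* x, float* y,
              std::size_t n, std::size_t k) {
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t p = 0; p < k; ++p)
                y[i] += W[i * k + p] * x[p];  // W read once, used once, evicted
    }

In the GEMM, each row of B is reused across every prompt token before eviction; the GEMV has no reuse dimension for W, which is exactly why decode leans so hard on DRAM.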

Cache Misses

CPUs rely on a hierarchy of caches. L1 is smallest and fastest, L2 larger and slower, L3 larger still, and DRAM is dramatically slower than any on-core cache. A cache miss happens when data needed by the core is absent from the expected level and must be fetched from farther away.

CPU cache hierarchy — latency profile
  • L1 cache: ~4 ns · 32–64 KB per core
  • L2 cache: ~10 ns · 256 KB to 4 MB per core
  • L3 cache (shared): ~40 ns · 8–128 MB
  • DRAM: ~100 ns · 32–256 GB

Why Cache Misses Hurt LLM Inference

LLM inference constantly touches large model weights, per-layer KV cache, activations, and temporary buffers. For a 7B-parameter model at fp16, weights alone occupy ~14 GB—far beyond any L3 cache. Every decode step re-reads a large portion of those weights from DRAM.
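A back-of-envelope sketch makes the ceiling explicit. The 50 GB/s sustained-bandwidth figure is an assumption for a mid-range desktop part, not a measurement:

    // Back-of-envelope: bandwidth-bound decode ceiling for a 7B fp16 model.
    #include <cstdio>

    int main() {
        const double params        = 7e9;   // 7B parameters
        const double bytes_per_w   = 2.0;   // fp16
        const double weight_bytes  = params * bytes_per_w;  // ~14 GB
        const double dram_b_per_s  = 50e9;  // assumed sustained DRAM bandwidth

        // If each decode step must stream the full weight set from DRAM,
        // memory traffic alone bounds throughput, regardless of FLOPs.
        const double sec_per_token = weight_bytes / dram_b_per_s;
        std::printf("ceiling: %.1f tokens/s (%.0f ms/token)\n",
                    1.0 / sec_per_token, sec_per_token * 1e3);
        return 0;
    }

At those numbers, streaming 14 GB per step caps decode below ~4 tokens/s no matter how many FLOPs the cores can issue.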

Cache Misses in Prefill vs Decode

Prefill is easier to optimize: multiple prompt tokens are processed together, kernels can reuse activation blocks and weight tiles before eviction, and blocked matrix operations achieve much better locality.

Decode is more fragile. It keeps rereading weights with no natural batching. Long context windows increase the KV footprint further, making DRAM trips directly visible in token latency.

How to Reduce the Impact

  • Block and tile matrix operations so weight and activation tiles are reused from cache before eviction
  • Quantize weights (e.g., int8 or int4) to shrink the working set and the DRAM traffic paid on every decode step
  • Lay out weights and KV cache contiguously so reads stream sequentially
  • Prefetch streaming weight reads ahead of use
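A minimal sketch of the blocking idea, with tile sizes as tuning assumptions rather than fixed values:

    // Sketch: cache-blocked GEMM. Tiles sized so a 64x64 block of B (~16 KB)
    // stays resident while a whole block of A rows consumes it.
    #include <algorithm>
    #include <cstddef>

    constexpr std::size_t BM = 64, BN = 64, BK = 64;  // tuning assumptions

    // C[m][n] += A[m][k] * B[k][n], processed tile by tile so each B tile is
    // reused for BM rows of A before it is evicted.
    void gemm_blocked(const float* A, const float* B, float* C,
                      std::size_t m, std::size_t k, std::size_t n) {
        for (std::size_t i0 = 0; i0 < m; i0 += BM)
        for (std::size_t p0 = 0; p0 < k; p0 += BK)
        for (std::size_t j0 = 0; j0 < n; j0 += BN) {
            const std::size_t i1 = std::min(i0 + BM, m);
            const std::size_t p1 = std::min(p0 + BK, k);
            const std::size_t j1 = std::min(j0 + BN, n);
            for (std::size_t i = i0; i < i1; ++i)
                for (std::size_t p = p0; p < p1; ++p) {
                    const float a = A[i * k + p];   // held in a register
                    for (std::size_t j = j0; j < j1; ++j)
                        C[i * n + j] += a * B[p * n + j];  // B tile stays hot
                }
        }
    }

Blocking helps prefill most, since it is the phase with a reuse dimension to exploit; for decode, shrinking the weights themselves (quantization) is the bigger lever.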

Branch Stalls

Modern CPUs predict branches before they are resolved so the pipeline stays busy. When the prediction is wrong, the CPU must discard speculative work and restart on the correct path—burning 10–20+ cycles in the process.

Branch misprediction — pipeline recovery cost
[Figure: branch misprediction recovery. Timeline of speculative execution down the wrong path, misprediction detection, pipeline flush of the wasted work, and restart on the correct path: ~10–20 cycles of recovery cost.]

Branch Stalls in Prefill vs Decode

Prefill often has relatively structured control flow: large matrix operations, stable loops, and predictable token blocks. Well-written prefill kernels keep branch mispredictions low.

Decode has more opportunities for runtime-sensitive behavior to leak inward: sequence-length handling, KV management, dynamic batching policies, and scheduling decisions. If those branches enter the numerical hot loop, performance degrades quickly.

How to Reduce the Impact

  • Hoist sequence-length handling, KV management, and scheduling decisions out of the numerical hot loop
  • Specialize kernels per shape so inner-loop trip counts stay stable and predictable
  • Replace data-dependent branches with branchless selects or masked (predicated) operations
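A minimal sketch of the branchless rewrite, using ReLU as a stand-in for any data-dependent clamp in a hot loop:

    // Sketch: the same clamp written branchy vs. branchless.
    #include <algorithm>
    #include <cstddef>

    // Data-dependent branch in the inner loop: the predictor must guess per
    // element, and activation signs are effectively random.
    void relu_branchy(float* a, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            if (a[i] < 0.0f)      // ~50% mispredict rate on random activations
                a[i] = 0.0f;
    }

    // Branchless: compiles to a max/blend instruction, needs no prediction,
    // and vectorizes cleanly across SIMD lanes.
    void relu_branchless(float* a, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            a[i] = std::max(a[i], 0.0f);
    }

Modern compilers often if-convert the first form on their own, but relying on that is fragile; writing the select explicitly keeps the loop regular and vectorizable.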

Pipeline Bubbles

A pipeline bubble is an underutilized cycle where part of the CPU's execution machinery goes idle because the next instruction or data is not ready. Bubbles can arise from memory waits, branch recovery, long dependency chains, or poor instruction scheduling.

Classic 5-stage pipeline — bubble propagation
[Figure: classic 5-stage pipeline (Fetch, Decode, Execute, Memory, Write) across 8 cycles. An L3 miss in I1's Memory stage stalls the pipeline, and downstream instructions queue behind it as bubbles: a single L3 cache miss stalls the entire pipeline.]

Pipeline Behavior in Prefill vs Decode

Prefill gives the processor more room to hide bubbles: operations are larger, more vectorized, and richer in instruction-level parallelism. The compiler and kernel can unroll and schedule more effectively.

Decode is more exposed. There is less fresh token-level parallelism, more memory sensitivity, and smaller units of sequential progress. One missing tile or one dependent chain can leave execution units idle for dozens of cycles.

How to Reduce the Impact

  • Unroll hot loops with several independent accumulators to break serial dependency chains
  • Vectorize so each issued instruction retires more useful work
  • Schedule loads ahead of use so memory latency overlaps with compute
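A minimal sketch of breaking a dependency chain with independent accumulators; the unroll factor of four is an assumption to tune per core:

    // Sketch: one accumulator serializes every add behind the previous one;
    // four independent accumulators let the out-of-order core keep several
    // multiply-adds in flight at once.
    #include <cstddef>

    float dot_serial(const float* a, const float* b, std::size_t n) {
        float acc = 0.0f;
        for (std::size_t i = 0; i < n; ++i)
            acc += a[i] * b[i];       // each add waits on the last: bubbles
        return acc;
    }

    float dot_ilp(const float* a, const float* b, std::size_t n) {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {  // four independent chains in flight
            s0 += a[i + 0] * b[i + 0];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        for (; i < n; ++i) s0 += a[i] * b[i];   // scalar remainder
        return (s0 + s1) + (s2 + s3);
    }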

What "Regular, Vectorized, Cache-Local, and Prefetchable" Really Means

These four properties form the complete target state for any CPU inference hot path. They are not independent optimizations—they compound each other.

The four optimization properties
  • Regular: predictable loops, stable shapes, no inner-loop branches · reduces branch stalls
  • Vectorized: full SIMD lane utilization via AVX2/AVX-512, NEON, SVE · reduces pipeline bubbles
  • Cache-local: working set stays in L2/L3; tiles reused before eviction · reduces cache misses
  • Prefetchable: data fetched before it is needed; weight streams are especially amenable · reduces cache misses
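All four properties can coexist in a single inner loop. A sketch using AVX2 with FMA, both assumed available (compile with -mavx2 -mfma); NEON and SVE versions follow the same shape:

    // Sketch: a dot-product kernel that is regular, vectorized, cache-local,
    // and prefetchable at once.
    #include <immintrin.h>
    #include <cstddef>

    float dot_avx2(const float* a, const float* b, std::size_t n) {
        __m256 acc = _mm256_setzero_ps();
        std::size_t i = 0;
        // Regular: fixed-stride loop, no data-dependent branches in the body.
        for (; i + 8 <= n; i += 8) {
            // Prefetchable: hint the lines several iterations ahead toward L1.
            _mm_prefetch(reinterpret_cast<const char*>(a + i + 64), _MM_HINT_T0);
            _mm_prefetch(reinterpret_cast<const char*>(b + i + 64), _MM_HINT_T0);
            // Vectorized: 8 fp32 lanes per fused multiply-add.
            // Cache-local: purely sequential access, every fetched line is
            // fully consumed before moving on.
            acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                                  _mm256_loadu_ps(b + i), acc);
        }
        // Horizontal reduction of the 8 lanes, then the scalar remainder.
        alignas(32) float lanes[8];
        _mm256_store_ps(lanes, acc);
        float s = 0.0f;
        for (int l = 0; l < 8; ++l) s += lanes[l];
        for (; i < n; ++i) s += a[i] * b[i];
        return s;
    }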

Who Helps Avoid What?

A key insight is that no single actor owns this problem. Performance is co-produced by algorithm structure, runtime decisions, hand-tuned kernels, and CPU microarchitecture working together.

Stall type          Primary mitigations
Cache misses        Runtime layout · Kernel design · Hardware caches · Quantization
Branch stalls       Kernel design · Compiler shaping · Branch predictor
Pipeline bubbles    Compiler scheduling · Kernel design · Out-of-order hardware · Vectorization

The important takeaway: none of this is "the compiler's job" alone. Each stall type is attacked at several layers at once, and a weak link at any one layer reintroduces the waits the others worked to remove.

Why Agentic AI Makes This More Important

Agentic systems amplify these issues because they often have long prompts, tool outputs, retrieved context, scratchpads, and multiple model invocations in a single turn.

Agentic workload pressures

  • Prefill cost grows with context volume
  • Decode latency is directly visible to users
  • KV cache grows significantly across turns (sized in the sketch after this list)
  • Orchestration overhead competes with model execution
  • Multiple model invocations multiply every inefficiency
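A back-of-envelope sketch of that KV growth, assuming a 7B-class configuration (32 layers, hidden size 4096, fp16, full multi-head attention) that is illustrative rather than any specific model:

    // Back-of-envelope: KV cache footprint as context grows.
    #include <cstdio>
    #include <initializer_list>

    int main() {
        const double layers     = 32;
        const double hidden     = 4096;  // heads * head_dim
        const double bytes_fp16 = 2;
        // K and V each store one hidden-sized vector per layer per token:
        // ~512 KB of cache state for every token kept in context.
        const double kv_per_token = 2 * layers * hidden * bytes_fp16;

        for (double ctx : {2048.0, 8192.0, 32768.0})
            std::printf("context %6.0f tokens -> KV cache %5.1f GB\n",
                        ctx, kv_per_token * ctx / 1e9);
        return 0;
    }

At 32k tokens of accumulated agentic context, the KV cache alone exceeds any L3 by two orders of magnitude, so attention reads become pure DRAM traffic.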

On CPUs, that makes locality, vectorization, and prefetchability even more critical. It is not enough to "have enough cores." The entire hot path must be shaped so that the processor spends more time computing and less time waiting.


Practical Optimization Checklist

  • Process prompt tokens in batches so prefill runs wide, SIMD-friendly GEMMs
  • Block and tile so working sets stay in L2/L3 and tiles are reused before eviction
  • Quantize weights to shrink the footprint re-read on every decode step
  • Keep inner loops regular: stable shapes, no data-dependent branches in the hot path
  • Vectorize for the target ISA (AVX2/AVX-512, NEON, SVE) and fill the lanes
  • Unroll with independent accumulators to expose instruction-level parallelism
  • Prefetch streaming weight reads before they are needed
  • Watch KV cache growth across turns; long contexts push decode toward DRAM limits

Final Takeaway

Performance comes from how well the system turns large, repetitive tensor workloads into a regular, locality-friendly, vectorized stream that the processor can execute without constantly stalling. Core count and clock speed are the ceiling—not the floor.

X Post Version

Suggested post

To reduce the impact of cache misses, branch stalls, and pipeline bubbles in LLM prefill/decode,
make the hot path regular, vectorized, cache-local, and prefetchable.

Who helps avoid what:
1) Cache misses → mostly runtime + kernel design + hardware caches
2) Branch stalls → kernel design + compiler + branch predictor
3) Pipeline bubbles → compiler + kernel design + hardware scheduling

#AI #CPU #AgenticAI $AMD $INTC $ARM $NVDA