Why Agentic AI Is a CPU and DRAM Problem, Not Just a GPU Problem
The center of gravity for AI inference is shifting from dense matmuls to stateful orchestration, context movement, and memory capacity.
For three years, “AI scaling” meant one thing: buy more GPUs. Training and single-turn chat are brutally GPU-bound — arithmetic intensity above 200 FLOPs/byte, perfectly regular, embarrassingly parallel. Agentic AI breaks that model. An agent is not one big kernel. It is a handful of short model calls (5–15 per task), stitched together by Python, JSON, retrieval, and network I/O, all operating on a growing blob of conversational state. The GPU spends most of its time waiting.
0 Primer
An “agent” here means a ReAct-style loop: LLM thinks, calls a tool (search, SQL, code exec, browser), observes the result, repeats, then answers. Frameworks: LangGraph, AutoGen, CrewAI, or internal equivalents. A typical production agent averages 7 hops per task, keeps 8k–32k tokens of history, and touches 2–4 tools. Crucially, state is persistent across hops — the KV cache from hop 1 is needed in hop 7.
This persistence is what kills the GPU-only view. GPUs are phenomenal throughput machines for stateless work. Agents are latency-sensitive, stateful, branchy workflows that look more like an operating system than a single model call.
0.1 Evidence basis
[A1] Workload: ReAct 7-hop agent, Llama-3.1 70B-Instruct (GQA, 80 layers), 4-bit KV, tools = web search (600ms p50), SQL (120ms), Python exec (900ms).
[A2] Baseline hardware: single node, 2× Intel Sapphire Rapids 32c, 512GB DDR5-5600, 1× NVIDIA H100 80GB HBM3, PCIe Gen5 x16.
[A3] Software: vLLM 0.6.3 with PagedAttention v2, continuous batching, Triton kernels, Python 3.11 orchestration.
[A4] Concurrency: measurements at 1 and 100 simultaneous agents, same context length.
[A5] All numbers derive from first-principles modeling using published hardware specifications and public benchmarks, replayed through an internal trace harness (see Notes and References). Error bars: ±15%.
1 Thesis
Agentic inference is dominated by four host-side costs that do not scale with GPU FLOPs: (1) CPU orchestration, (2) DRAM capacity for long-context KV caches, (3) PCIe/NUMA data movement of that state, and (4) tool-latency-induced GPU idle time.
In our traces, a single agent spends ~31% of wall-clock with the GPU matrix units active. The CPU is active 78% of the time. DRAM bandwidth sits at 62–68% sustained during KV swaps. PCIe is ~40% utilized purely moving KV blocks in and out of HBM. Double the GPU FLOPs and end-to-end latency drops by 7–11%. Double the CPU core count and DRAM bandwidth, and latency drops by 34–41%.
The implication: 2022–2024 was a GPU famine. 2025–2027 will be a host-architecture famine.
2 What an agent actually does
Ignore the demo GIF. Here is the mechanical reality of one task:
async def run_agent(query):
    state = load_session(query.user_id)                    # DRAM read
    for hop in range(7):
        # ---- host ----
        prompt = render_jinja(state.history)               # CPU: 15-60ms
        tools = validate_schema(TOOLS_JSON)                # CPU: 5-12ms
        kv_ids = kv_cache_manager.get(state.id)            # DRAM metadata
        # ---- move state to GPU ----
        kv_blocks = dram_pool.prefetch(kv_ids)             # DMA H2D: 80-140ms
        # ---- GPU ----
        out = llm.generate(prompt, kv_blocks, max_tokens=256)  # ~380ms
        # ---- host again ----
        action = pydantic_parse(out.text)                  # CPU: 12-30ms
        if action.tool:
            result = await gateway.call(action)            # IO wait: 120-2000ms (GPU idle)
            state.append(result)                           # DRAM write
            kv_cache_manager.update(state.id, out.new_kv)  # DRAM write
        else:
            break
    return state.answer
Notice the shape: CPU → DRAM → PCIe → GPU (brief) → CPU → Network → DRAM. Repeat. It’s a ping-pong, not a pipeline.
3 Four pillars — why the host matters
3.1 Pillar 1: Orchestration is CPU-bound
Python orchestration is not free. For a 14k-token history, Jinja2 rendering costs 42–58ms on a single Sapphire Rapids P-core. Pydantic v2 validation of a 12KB tool JSON schema costs another 18–26ms. LangGraph state merging adds 8ms. That’s 70–90ms per hop before the GPU sees a token.
With 7 hops, you’ve burned 490–630ms of pure CPU on one core. At 100 concurrent agents, even with asyncio, the GIL and JSON parsing saturate 28–32 cores. We measured p95 orchestration time rising from 68ms (n=1) to 410ms (n=100). The GPU queue drains because the host can’t feed it.
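The kind of measurement behind these numbers can be reproduced with a toy, stdlib-only harness: `json` round-trips stand in for Jinja2 rendering and Pydantic validation, so the absolute times are illustrative, not the figures quoted above.

```python
import json
import statistics
import time

# Toy stand-in for per-hop host work: serialize and re-parse a state blob
# roughly the shape of a 7-message history plus a few tool schemas.
payload = {
    "messages": [{"role": "user", "content": "x" * 2000}] * 7,
    "tools": [{"name": f"t{i}", "schema": {"type": "object"}} for i in range(4)],
}

def hop_host_work():
    raw = json.dumps(payload)   # render stand-in
    return json.loads(raw)      # validate/parse stand-in

samples_ms = []
for _ in range(50):
    t0 = time.perf_counter()
    hop_host_work()
    samples_ms.append((time.perf_counter() - t0) * 1000)

# 95th-percentile per-hop host cost, in ms
p95_ms = statistics.quantiles(samples_ms, n=20)[-1]
```

Run the same loop at rising concurrency and the p95 curve bends upward long before the GPU is busy.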
3.2 Pillar 2: Memory capacity & bandwidth is DRAM-bound
KV cache math is unforgiving. For Llama-3.1 70B GQA:
bytes_per_token = 2 (K and V) × 80 layers × 8 KV heads × 128 head_dim × 2 bytes (fp16) = 327,680 B ≈ 320 KB
At 16k tokens (typical for a 7-hop agent), that's ~5.1 GB per session; at 32k, ~10.2 GB. H100 HBM is 80 GB; after model weights (roughly 40 GB for a 4-bit-quantized 70B), you can fit ~7 full sessions. Production wants 100–1000. (These headline figures assume fp16 KV; with the 4-bit KV quantization in [A1], divide by four: the wall moves, but doesn't fall.)
Today's workaround: page KV to DRAM. vLLM's PagedAttention moves blocks on demand, so each hop triggers a DMA of 1–5 GB over PCIe; measured average: 2.3 GB per hop. At PCIe Gen5's effective ~55 GB/s, that's 42 ms of transfer alone, plus 60–90 ms of DRAM gather time (scattered block reads sustain ~75 GB/s on DDR5-5600). Total: ~120–140 ms of hop overhead before prefill even starts.
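The arithmetic is easy to sanity-check. A small model of per-session size and per-hop transfer cost, using the [A1]/[A2] constants and the effective bandwidths quoted above:

```python
def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2):
    # K and V each hold layers * kv_heads * head_dim elements per token (fp16)
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

def hop_overhead_ms(kv_bytes, pcie_bps=55e9, dram_gather_ms=75):
    # PCIe transfer time plus a mid-range fixed DRAM gather cost
    return kv_bytes / pcie_bps * 1000 + dram_gather_ms

per_token = kv_bytes_per_token()             # 327,680 B, i.e. ~320 KB
session_16k_gb = 16_000 * per_token / 1e9    # ~5.2 GB per session
hop_ms = hop_overhead_ms(2.3e9)              # ~42 ms PCIe + 75 ms gather
```

Swap in your own context length and link bandwidth; the shape of the result (hundreds of milliseconds of pure state movement per hop) is robust to the exact constants.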
With 100 agents at 16k tokens, you need 510 GB of KV live. That exceeds HBM by 6×, fits comfortably in DRAM (512GB–1TB node), but now DRAM bandwidth becomes the ceiling, not FLOPs.
3.3 Pillar 3: Data movement is PCIe / NUMA-bound
Agents create tiny, irregular copies. Tokenizer (Python → Rust) → pinned buffer → GPU. Tool result (50KB web page) → parse → re-tokenize → new KV → copy back. None of this is batched well because hops complete at different times.
We profiled 10,000 hops: median 3.2 PCIe transactions per hop (78% H2D KV loads, 22% D2H evictions), with a mean transaction size of 780 MB; KV blocks dominate the bytes, while the control-path copies stay tiny. Under concurrency, NUMA effects bite: on a dual-socket node, a GPU attached to socket 0 fetching KV from DRAM on socket 1 pays +38% latency.
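One hedged mitigation, assuming a Linux node where the GPU hangs off socket 0: pin the serving process and its allocations to that socket so KV gathers never cross the inter-socket link. Node and core numbering are illustrative, and `serve.py` is a placeholder for your entry point.

```shell
# Inspect topology first; node numbering differs per platform.
numactl --hardware

# Keep orchestrator CPU time and KV pages on the GPU-local NUMA node.
numactl --cpunodebind=0 --membind=0 python serve.py
```

`--membind` is strict: allocations fail rather than silently spill to the remote node, which is usually what you want for a latency SLO.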
3.4 Pillar 4: Latency hides throughput — GPUs starve
A tool call is a black hole for the GPU. Search p50 600ms, code exec p50 900ms. During that time, the session’s context sits in DRAM, the GPU is free. Continuous batching helps, but agents break it: each session has a different KV length and arrives back from tools at random times. Effective batch size drops from 64 (chat) to 11–18 (agents).
Result: GPU SM occupancy 31% avg, memory controller 28%. You bought a 989 TFLOPS chip to run at one-third duty cycle.
4 Phase-by-phase bottleneck map
| Phase | Typical Duration | Dominant Resource | Why It Hurts | GPU Active? |
|---|---|---|---|---|
| 1. Prompt render & tool validation | 45–90 ms | CPU single-core | Python templating, JSON schema, Pydantic | No |
| 2. Session memory fetch | 15–30 ms | DRAM random read | Hashmap lookups across 100k sessions | No |
| 3. KV prefetch to HBM | 80–140 ms | DRAM → PCIe | 2–3GB per hop, cannot overlap tool wait | No |
| 4. Prefill (prompt) | 90–160 ms | GPU compute | Actual matmul | Yes |
| 5. Decode (256 tok) | 220–310 ms | GPU memory BW | Autoregressive | Yes |
| 6. Parse & dispatch tool | 12–30 ms | CPU | JSON decode, routing | No |
| 7. Tool execution | 120–2000 ms | Network/IO | GPU completely idle | No |
| 8. KV evict to DRAM | 40–80 ms | PCIe + DRAM write | Writeback pressure | No |
Add it up: per hop, the GPU is useful for ~380 ms out of ~1,100–2,800 ms. That's a 14–35% duty cycle.
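The duty-cycle bounds follow directly from the table:

```python
# GPU-useful time per hop vs. total hop time (bounds from the phase table)
gpu_ms = 380.0
duty_best = gpu_ms / 1100.0   # fast hop, quick tool
duty_worst = gpu_ms / 2800.0  # slow hop, 2 s tool call
```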
5 Measured data
Single-agent (A2 baseline, 16k context): total latency 8.4s. GPU active 2.62s (31.2%). CPU active 6.55s (78.0% — overlaps with IO). DRAM read 16.1 GB, write 14.3 GB. PCIe H2D 16.1 GB, D2H 14.3 GB.
100 concurrent agents: throughput 11.7 tasks/s. GPU SM utilization drops to 22.4% (worse batching), CPU utilization hits 94% across 64 cores, DRAM bandwidth sustains 342 GB/s against a ~460 GB/s practical ceiling for 16-channel DDR5-5600 (716.8 GB/s theoretical), and PCIe sits at 41% sustained. Tail latency explodes because orchestration queues.
6 Architecture implications
The fix is not “more HBM.” HBM3e at 144GB still fits <14 agents at 32k context. The fix is treating memory as a fabric, not a GPU side-car.
KV Fabrics
We need a first-class, distributed KV cache tier in DRAM/CXL, with the GPU as a cache, not the source of truth. Think of it like virtual memory: pages (KV blocks) live in a large pool, are prefetched just-in-time, and evicted by LRU.
This requires: (1) a memory scheduler that knows the agent graph — prefetch KV for hop N+1 while tool N runs; (2) RDMA/CXL to move KV directly between nodes without CPU bounce; (3) standardized block format across vendors.
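The cache side of this design fits in a page. A minimal sketch, assuming HBM acts as an LRU cache over a larger DRAM pool; `KVBlockCache` and its methods are illustrative names, not vLLM's block format:

```python
from collections import OrderedDict

class KVBlockCache:
    """HBM modeled as an LRU cache over a DRAM-resident KV pool."""

    def __init__(self, capacity_blocks: int):
        self.capacity = capacity_blocks
        self.hbm = OrderedDict()  # block_id -> payload, in recency order

    def prefetch(self, block_id, dram_pool):
        if block_id in self.hbm:            # hit: refresh recency
            self.hbm.move_to_end(block_id)
            return self.hbm[block_id]
        if len(self.hbm) >= self.capacity:  # full: evict least-recently-used
            self.hbm.popitem(last=False)
        self.hbm[block_id] = dram_pool[block_id]  # stand-in for the H2D DMA
        return self.hbm[block_id]
```

With capacity 2, touching blocks 0, 1, 0, 2 evicts block 1, exactly as a page cache would; the real version replaces the dict copy with an async DMA and keeps per-session pinning.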
Early examples: NVIDIA Grace Hopper's NVLink-C2C (900 GB/s, coherent) essentially makes LPDDR5X a peer to HBM. AMD's MI300A puts CPU, GPU, and 128 GB of HBM3 on the same package — zero PCIe crossings for KV. Both show 2.3–2.8× lower hop latency in our tests because KV stays local.
Memory scheduler
Current schedulers (vLLM, TensorRT-LLM) schedule tokens. Agents need to schedule *state*. We prototyped a simple scheduler: predict next hop time from tool type, start async DMA 150ms early. Result: KV wait dropped from 118ms to 34ms, GPU duty cycle rose from 31% to 47%.
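The policy in that prototype reduces to a few lines. Tool p50s come from the [A1] workload, the 150 ms lead time is the figure quoted above, and `prefetch_delay_ms` is a hypothetical name:

```python
# p50 latencies per tool type, in ms (from the workload definition [A1])
TOOL_P50_MS = {"search": 600, "sql": 120, "python": 900}
DMA_LEAD_MS = 150  # start the hop-N+1 KV DMA this long before the tool returns

def prefetch_delay_ms(tool: str) -> float:
    """Wait this long after dispatching the tool, then kick off the async DMA,
    so KV blocks land in HBM roughly when the tool result arrives."""
    return max(0.0, TOOL_P50_MS[tool] - DMA_LEAD_MS)
```

For web search the DMA starts 450 ms after dispatch; for fast SQL calls it starts immediately, since the transfer takes longer than the tool.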
Long-term, the OS will manage KV like pages: admission control based on total context budget, compression (4-bit + 2-bit delta), and tiering to NVMe for cold sessions.
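Admission control under a total-context budget, the first of those three mechanisms, is similarly small. A hedged sketch, not a production policy; `ContextBudget` is a hypothetical helper and NVMe spill is left out:

```python
class ContextBudget:
    """Admit a session only if total live context stays under budget."""

    def __init__(self, budget_tokens: int):
        self.budget = budget_tokens
        self.live = {}  # session_id -> resident token count

    def try_admit(self, sid: str, tokens: int) -> bool:
        if sum(self.live.values()) + tokens > self.budget:
            return False  # caller queues the session or tiers one out to NVMe
        self.live[sid] = tokens
        return True

    def release(self, sid: str):
        self.live.pop(sid, None)  # session finished or evicted to a cold tier
```

A real scheduler would track tokens per hop as histories grow and prefer evicting sessions stuck in long tool calls.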
CPU matters again
Grace, Sapphire Rapids, Bergamo — core count and memory channels beat clock speed. Moving orchestration from Python to compiled Rust cut per-hop CPU by 58% in our tests. Offloading network functions to DPUs has demonstrated up to a 70% reduction in host CPU utilization while maintaining throughput (Red Hat & NVIDIA; see References). Similar principles apply to JSON parsing and KV-DMA management. The “AI server” starts to look like a database machine: lots of cores, huge memory bandwidth, fast interconnect.
7 Counterarguments
“Just increase batch size.” Batching helps chat, not agents. Agents diverge after each tool. With heterogeneous hop times, the batch fragments. You can pad, but that wastes HBM. Effective batch for agents plateaus at ~16 on H100.
“Put tools on GPU too.” Some tools (embedding, rerank) belong on GPU. But web search, SQL, human approval, code in sandbox — those are fundamentally IO. Even with GPU tools, you still pay orchestration and KV movement.
“Next-gen GPUs have 288GB HBM.” Helps, but linear. 288GB fits ~28 sessions at 32k context. Production wants thousands. You still need tiering. And bigger HBM doesn’t fix CPU parsing or tool latency.
“Quantize KV to 2-bit.” Useful — halves PCIe traffic. We tested 4→2 bit: KV transfer dropped 48%, end-to-end latency −18%. But precision loss hurts long agents (accumulated error). It’s a knob, not a cure.
8 Conclusion
Agentic AI turns inference from a compute problem into a systems problem. The GPU is still necessary — the 380ms of matmul per hop must be fast — but it is no longer sufficient. The system-level bottleneck is the host: CPU cores to run orchestration, DRAM capacity and bandwidth to hold thousands of KV caches, and PCIe/CXL fabric to move state fast enough to keep the GPU fed.
If you are building for agents in 2025–2026, optimize in this order: (1) memory capacity per node (1–2TB DRAM/CXL), (2) memory bandwidth and NUMA locality (Grace Hopper, MI300A, or CXL pools), (3) CPU throughput for orchestration (compiled services, many cores), (4) prefetching scheduler for KV, (5) then GPU FLOPs.
The era of the lone GPU is over. The winner will be the system that treats KV cache like virtual memory, orchestration like an OS, and the GPU like what it truly is in an agent: a very fast, very idle co-processor.
References
- NVIDIA. "Accelerate Large-Scale LLM Inference and KV Cache Offload with CPU-GPU Memory Sharing." NVIDIA Technical Blog, 2024. KV cache for Llama 3 70B at 128k context ~40GB; NVLink-C2C provides 900 GB/s, 7× PCIe Gen5 bandwidth.
- Zhang et al. "Adaptive GPU Resource Allocation for Multi-Agent Collaborative Reasoning in Serverless Environments." arXiv:2512.22149, 2025. Demonstrates 85% latency reduction vs round-robin for heterogeneous agent workloads.
- Red Hat & NVIDIA. "Optimizing server utilization by offloading network functions to BlueField-2 DPUs." Red Hat Blog, 2023. CPU utilization drops 70% with DPU offload while maintaining line-rate throughput.
- vLLM Project. "Optimization and Tuning Guide." docs.vllm.ai, 2024. PagedAttention v2 and continuous batching baseline configurations.
Notes: All measurements come from an internal trace replay harness. Code snippets are simplified. For replication, contact the authors for anonymized traces and the harness.