Every essay in this corpus has treated inference as one model, one request. The real 2026 workload is a directed graph of model calls. When the unit of scheduling shifts from a request to a pipeline, everything we know about KV locality, memory residency, and latency SLOs must be rebuilt from scratch.
Every piece of inference infrastructure built over the last three years — vLLM, TGI, TensorRT-LLM, SGLang, the prefill/decode disaggregation proposals, the KV cache eviction policies, the MCOS framework — was designed with the same implicit assumption: a request is a conversation with a single model. A user sends a prompt. A model responds. The serving system manages one KV cache, one GPU pool, one latency SLO.
That assumption is now wrong for a large and growing fraction of production AI workload.
In 2026, the typical interaction with a frontier AI system is not a single model call. It is a directed graph of model calls: an orchestrator that decomposes tasks, a planner that sequences steps, a retrieval system that finds context, a reasoning model that evaluates options, a code executor that validates outputs, and a verifier that checks the result. These are not the same model. They are often different models, different sizes, different precisions, running on different hardware pools, with different latency profiles.
The serving infrastructure has not caught up. Each model call is still scheduled independently. Each KV cache is still scoped to one request on one model. Each latency SLO is still measured per model hop rather than end-to-end. The consequence is that agent pipelines are being served by infrastructure that is individually optimised and globally naive — and the global naivety is often the dominant cost.
The core argument: agent pipelines are not a composition of single-model serving problems. They are a new topology with new resource pathologies that cannot be solved by optimising individual hops in isolation.
Before we can reason about the infrastructure, we need a precise characterisation of the workload. An agent call graph is a directed acyclic graph (DAG) G = (V, E), where the vertices V are individual model calls (each with a model identity, an input context, and an output) and the edges E are data dependencies: a hop cannot begin until the outputs of its parent hops are available.
In practice, production agent graphs have the following structural characteristics:
| Structural Property | Typical range (2026) | Infrastructure implication |
|---|---|---|
| Depth (sequential hops) | 4–15 hops | Latency multiplies; end-to-end SLO requires per-hop budgeting |
| Width (parallel branches) | 2–8 concurrent calls | Memory pressure multiplies; batch contamination across branches |
| Model diversity | 2–5 distinct models | KV caches are disjoint; no direct reuse across model boundaries |
| Context growth rate | 5–50× from root to leaf | KV size at leaf nodes is much larger than at root; HBM pressure peaks late |
| Speculation (branching) | 1.5–3× the committed token count | Memory must be allocated for paths that may be discarded |
| Retrieval injection size | 4K–256K tokens per RAG hop | Each retrieval hop is a large-context prefill; prefill cost dominates agent cost |
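The structural properties above can be made concrete with a small sketch. The hop names, model names, and graph shape below are illustrative, not taken from any real pipeline; the point is that depth (the critical path) is a property of the graph, not of any single call:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: an agent call graph as a DAG of model calls.
# Distinct model identities per hop mean disjoint, non-transferable KV caches.
@dataclass
class Hop:
    name: str
    model: str
    parents: list = field(default_factory=list)

def depth(hop: Hop) -> int:
    """Longest chain of sequential hops ending at `hop` (the critical path)."""
    if not hop.parents:
        return 1
    return 1 + max(depth(p) for p in hop.parents)

plan   = Hop("plan", "planner-8b")
rag_a  = Hop("retrieve_a", "embedder", [plan])   # two retrieval branches
rag_b  = Hop("retrieve_b", "embedder", [plan])   # run in parallel (width 2)
reason = Hop("reason", "reasoner-70b", [rag_a, rag_b])
verify = Hop("verify", "verifier-8b", [reason])

print(depth(verify))   # → 4 sequential hops on the critical path
```

Width (the parallel retrieval branches) adds memory pressure without adding depth; only the longest parent chain contributes to end-to-end latency.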
In a single-model serving system, KV caching across requests is the primary throughput lever. When two requests share a common prefix — a system prompt, a document, a code context — the KV cache for that prefix can be shared, avoiding a redundant prefill. This is the foundation of prefix caching in vLLM, SGLang's RadixAttention, and every modern KV sharing scheme.
In an agent pipeline, this locality property breaks at every edge in the call graph. When the task planner sends its output to the reasoning model, the reasoning model starts a new KV cache from scratch. The KV state of the task planner is not transferable — it is computed by a different model with a different weight matrix, a different attention head structure, and potentially a different vocabulary. KV vectors are model-specific. They cannot be reused across model boundaries.
This means that every edge in the agent call graph is a KV locality reset. The context that took many tokens and many prefill cycles to encode in the upstream model must be re-encoded from text (or token IDs) in the downstream model. The bytes are moved out of the upstream model's KV cache, serialised as tokens, transmitted across the pipeline, and re-prefilled into the downstream model.
The implications are significant. In a single-model serving system, prefill is a small fraction of total cost (decode dominates for long outputs). In an agent pipeline, prefill is re-run at every hop, and the context at each hop includes the accumulated output of all prior hops plus any retrieved content. By the time the pipeline reaches the reasoning model, the prefill cost alone can dwarf the decode cost of the entire original request.
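A back-of-envelope token count shows how quickly re-prefill dominates. All the numbers below are assumed for illustration; the mechanism is the one described above: context accumulates at every hop, and each hop must re-prefill the full accumulated context because KV state cannot cross model boundaries:

```python
# Illustrative pipeline: (name, retrieved tokens injected, output tokens).
root_prompt = 2_000                    # tokens in the original request
hops = [
    ("plan",     0,      300),
    ("retrieve", 32_000, 0),           # RAG hop injects a large document set
    ("reason",   0,      1_500),
    ("verify",   0,      200),
]

context = root_prompt
total_prefill = 0
total_decode = 0
for name, retrieved, out in hops:
    context += retrieved
    total_prefill += context           # full context re-encoded at this hop
    total_decode += out
    context += out

print(total_prefill, total_decode)     # → 106400 2000
```

Under these assumptions the pipeline prefills over 100K tokens to decode 2K: the re-prefill work exceeds the decode work by more than 50×, which is the inversion of the single-model cost profile.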
A serving system optimised for TTFT (time-to-first-token) and TBT (time-between-tokens) at the individual request level has a simple, per-request latency model. In an agent pipeline these per-hop latencies compound: every hop's queue wait, prefill, and decode time sits on the end-to-end critical path. For a representative seven-hop pipeline at roughly 600ms per hop, that is about 4.2 seconds end to end before any contention.
The 4.2-second figure assumes zero queuing delay at each model and no contention for GPU capacity. In a production cluster serving hundreds of concurrent agent pipelines, each hop waits in a queue. Because each hop queues independently of the prior hop, the tails accumulate: the pipeline's p99 is bounded above by the sum of the per-hop p99 latencies, not their maximum, and correlated load spikes push it toward that bound. This is a much worse distribution than single-model serving.
A p99 TTFT of 300ms at each hop of a seven-hop pipeline produces a p99 end-to-end latency of up to approximately 2.1 seconds (7 × 300ms) even if every hop individually meets its SLO. The pipeline SLO is a different, harder problem than the per-hop SLO.
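A quick Monte Carlo makes the compounding concrete. The per-hop latency model below is assumed (a lognormal tuned so the per-hop p99 lands near 300ms); even with fully independent hops, the pipeline p99 comes out at several multiples of the per-hop p99, and correlated cluster load in production pushes it further toward the 7× sum:

```python
import random

random.seed(0)
HOPS = 7
TRIALS = 20_000

def hop_latency():
    # Assumed model: ~120 ms median with a lognormal tail,
    # tuned so the per-hop p99 lands near 0.3 s.
    return random.lognormvariate(-2.1, 0.4)   # seconds

# Empirical distributions: one per-hop sample set, one end-to-end sample set.
e2e = sorted(sum(hop_latency() for _ in range(HOPS)) for _ in range(TRIALS))
per_hop = sorted(hop_latency() for _ in range(TRIALS))

p99 = lambda xs: xs[int(0.99 * len(xs))]
print(round(p99(per_hop), 3), round(p99(e2e), 3))
```

The simulated pipeline p99 lands well above 2× the per-hop p99 but below the 7× worst case, which is exactly the gap between meeting every per-hop SLO and missing the end-to-end one.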
Many agent architectures include speculative execution: launching multiple branches in parallel and committing to the one that returns the best result. This is correct algorithmically — it reduces end-to-end latency by parallelising uncertainty. But it creates a severe memory management problem.
When a scheduler launches N branches speculatively, it must allocate KV cache space for all N branches simultaneously. If only one branch commits, the KV pages allocated for the N-1 discarded branches must be freed. In a system using PagedAttention or a similar block-allocated KV management scheme, these freed pages become fragmented — they are scattered throughout the HBM in non-contiguous blocks that may not be immediately re-usable for the next incoming request.
This fragmentation is not hypothetical. It is the same class of problem that motivated PagedAttention in the first place — but agent pipelines make it significantly worse because the branching and commitment patterns are irregular and hard to predict ahead of time.
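A toy allocator shows the pattern. Page counts and branch names are assumed; the mechanism is the one above: branches grow in lockstep, so their pages interleave, and when all but one branch is discarded, the freed pages are scattered through the pool rather than forming a contiguous run:

```python
# Toy model of speculative branching under a block-allocated KV scheme.
PAGE_POOL = 64
free_list = list(range(PAGE_POOL))
allocations = {}

# The scheduler extends all four branches step by step, so each branch's
# KV pages end up interleaved with its siblings' pages.
for step in range(4):
    for branch in ("a", "b", "c", "d"):
        allocations.setdefault(branch, []).extend(
            free_list.pop(0) for _ in range(2))

# Branch "a" commits; the other three are discarded and their pages freed.
for branch in ("b", "c", "d"):
    free_list.extend(allocations.pop(branch))

print(sorted(allocations["a"]))    # pages held by the surviving branch
print(sorted(free_list)[:10])      # freed pages interleave around them
```

The surviving branch holds pages 0, 1, 8, 9, 16, 17, 24, 25; the freed pages fill the gaps between them. Paged allocation tolerates this non-contiguity for attention itself, but the churn of irregular free/reuse cycles is what agent-scale speculation amplifies.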
If we apply the byte-movement accounting from the previous essay ("The True Cost of a Token") to an agent pipeline, the picture changes materially. In single-model serving, the dominant movement is weights (read once per decode step, amortised over batch) and KV cache (read once per decode step per request). In an agent pipeline, two new movement costs appear:

1. Re-prefill traffic. At every edge, the accumulated context is serialised back to tokens, transmitted, and re-encoded into the downstream model's KV cache. Bytes that were already paid for upstream are paid for again at every hop.
2. Speculative write traffic. KV bytes are written to HBM for branches that may be discarded, at 1.5–3× the committed token count, and the discarded pages must then be reclaimed.
The three problems above — KV locality loss, multiplicative latency compounding, and speculative memory fragmentation — are not solvable by improving individual model serving. They require a scheduling abstraction that spans the full pipeline.
I will call this the Agent Memory Fabric (AMF). The AMF is not a new hardware layer. It is a control-plane abstraction that has three responsibilities:

1. Pipeline-aware placement and routing: route each hop to the GPU node where its context is already warm, treating text-level KV warmth as a first-class routing signal.
2. Pipeline-level latency budgeting: derive per-hop deadlines from the end-to-end SLO and propagate them to each model's scheduler.
3. Speculative memory accounting: track the KV allocations of a pipeline's branches as a unit, so that discarded branches are reclaimed promptly and admission control sees a pipeline's true footprint.
The AMF does not require cross-model KV transfer (which is physically impossible without re-projection through the model). It does not require a shared global memory address space across GPUs. It does not require changes to individual model serving engines. It is a scheduling layer, not a hardware layer — it sits above vLLM, TGI, and similar systems and orchestrates their resource allocation based on pipeline-level context.
A natural question: could KV vectors ever be reused across model boundaries, if the models share architectural properties? The answer depends on what kind of sharing we mean.
For the majority of production agent pipelines, where different models serve different roles in the pipeline, cross-model KV sharing is not possible. The architecture must instead focus on minimising the re-prefill cost — through text-level prefix caching within each model, through aggressive context compression before transmission, and through prefill scheduling that prioritises pipeline-critical hops.
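Text-level prefix caching is the workable substitute for cross-model KV sharing. A minimal sketch, with all names assumed: reuse is keyed on a hash of the exact context text, scoped per model, so the same document is a cache hit on the second call to the same model but a miss on any other model:

```python
import hashlib

def prefix_key(model_id: str, context: str) -> str:
    # Reuse is tracked at the text level, scoped per model, because
    # KV vectors themselves cannot cross model boundaries.
    return model_id + ":" + hashlib.sha256(context.encode()).hexdigest()[:16]

prefix_cache = set()

def prefill_cost(model_id: str, context: str) -> int:
    """Tokens that must actually be prefilled (0 on a text-level hit)."""
    key = prefix_key(model_id, context)
    if key in prefix_cache:
        return 0
    prefix_cache.add(key)
    return len(context.split())        # crude whitespace token stand-in

doc = "shared system prompt plus retrieved document " * 100
print(prefill_cost("reasoner-70b", doc) > 0)   # → True (first call: full prefill)
print(prefill_cost("reasoner-70b", doc))       # → 0 (same model: hit)
print(prefill_cost("verifier-8b", doc) > 0)    # → True (other model: miss)
```

The miss on `verifier-8b` is the KV locality reset from earlier: identical text, zero reuse, because the cache boundary is the model boundary.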
In a serving cluster with many GPU nodes, each serving one or more models, the placement of a model call matters for latency. A call routed to a GPU node where the system prompt and document context are already warm in HBM saves the prefill cost. A call routed to a cold node pays the full prefill.
Current load balancers for model serving use simple round-robin or least-loaded routing. They do not model KV warmth. The AMF should extend routing to include KV locality as a first-class routing signal:
def route_request(pipeline_id, hop_id, model_id, context_hash):
    # Find GPU nodes with a warm KV prefix for this context on this model
    warm_nodes = kv_registry.query(model_id, context_hash)
    if warm_nodes and min(n.queue_depth for n in warm_nodes) < KV_REUSE_THRESHOLD:
        # Route to the least-loaded warm node; the prefill is saved
        return route_to(min(warm_nodes, key=lambda n: n.queue_depth))
    # No warm node within the queue-depth threshold: pay the full prefill
    cold_node = gpu_pool.least_loaded(model_id)
    kv_registry.register_prefill(cold_node, model_id, context_hash)
    return route_to(cold_node)
The KV_REUSE_THRESHOLD controls the tradeoff between KV warmth and queue depth. A tight threshold favours latency (route to the warm node even if it is slightly busier); a loose threshold favours throughput (always route to the least-loaded node). For agent pipelines on the critical path, a tight threshold is almost always preferable: a queue wait at one hop delays every downstream hop, while the prefill saved by routing warm grows with the accumulated context.
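The break-even point is simple arithmetic. With assumed numbers (prefill throughput and context size are illustrative), routing to the warm node wins whenever the extra queue wait is smaller than the prefill it avoids:

```python
# Back-of-envelope check of the warm-vs-cold routing tradeoff.
PREFILL_MS_PER_1K_TOKENS = 40      # assumed prefill throughput
context_tokens = 48_000            # typical post-RAG accumulated context

prefill_saved_ms = context_tokens / 1_000 * PREFILL_MS_PER_1K_TOKENS

def route_warm(extra_queue_wait_ms: float) -> bool:
    # Warm routing wins iff the added wait is below the avoided prefill.
    return extra_queue_wait_ms < prefill_saved_ms

print(prefill_saved_ms)        # → 1920.0 ms of prefill avoided
print(route_warm(500))         # → True  (warm node 0.5 s busier: still worth it)
print(route_warm(2500))        # → False (warm node 2.5 s busier: go cold)
```

Because the avoided prefill scales with context size, the warm-routing margin widens at exactly the late, large-context hops where HBM pressure peaks.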
A well-designed single-model serving system can achieve p99 TTFT under 300ms for short contexts and p99 TBT under 30ms. These are hard numbers, achieved through careful batch management, preemption policies, and priority queuing.
In an agent pipeline with N hops, what is the p99 end-to-end latency? Under the independence assumption (hop latencies are independent), it is bounded above by the sum of the N per-hop p99 latencies. Independence pulls the true value somewhat below that bound, but the shared cluster load that correlates hop latencies in production pushes it back toward it; either way, the pipeline p99 is several multiples of any single hop's p99.
The correct way to manage E2E SLO in an agent pipeline is to work backward from the end-to-end budget and allocate per-hop latency budgets based on criticality and expected cost. The AMF should implement deadline propagation: each hop receives a deadline derived from the E2E SLO, and the serving scheduler for that model uses the deadline as a priority signal rather than first-in-first-out ordering.
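Deadline propagation can be sketched in a few lines. The hop names and cost weights below are assumed; the scheme is the one described above: split the end-to-end budget across hops in proportion to their estimated cost, then hand each hop an absolute deadline its scheduler can use as a priority signal:

```python
import time

def propagate_deadlines(e2e_budget_s, hop_costs, now=None):
    """Return {hop: absolute_deadline_s} from an end-to-end SLO budget.

    Per-hop budgets are proportional to estimated cost; deadlines are
    cumulative along the pipeline (dict insertion order = hop order).
    """
    now = time.monotonic() if now is None else now
    total = sum(hop_costs.values())
    deadlines, elapsed = {}, 0.0
    for hop, cost in hop_costs.items():
        elapsed += e2e_budget_s * cost / total
        deadlines[hop] = now + elapsed
    return deadlines

# Illustrative cost weights: reasoning dominates, so it gets most of the budget.
hop_costs = {"plan": 1, "retrieve": 4, "reason": 10, "verify": 1}
d = propagate_deadlines(4.0, hop_costs, now=0.0)
print({h: round(t, 2) for h, t in d.items()})
# → {'plan': 0.25, 'retrieve': 1.25, 'reason': 3.75, 'verify': 4.0}
```

A hop that arrives at its scheduler with 2.5 seconds of slack can yield to one arriving with 200ms, which is exactly the information a per-model FIFO queue throws away.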
The MCOS framework — memory placement, movement, residency, reuse, admission, and eviction as first-class scheduling objects — was designed with single-model inference in mind. The concepts extend naturally to agent pipelines, but with a change of scope.
| MCOS concept | Single-model scope | Agent pipeline scope |
|---|---|---|
| Placement | Which tier (SRAM/HBM/NVMe) should a KV block reside in? | Which GPU node, and which memory tier on that node, should a pipeline's KV state reside in, given the subsequent hops it will feed? |
| Residency | How long should a KV block stay in HBM before eviction? | How long should a hop's output context stay warm in the downstream model's prefix cache, given the probability of future pipeline calls sharing that context? |
| Reuse | Can this KV block serve multiple concurrent requests? | Can this pipeline's intermediate outputs be shared with other pipelines that share the same sub-graph? Tracked at text-hash level across model boundaries. |
| Admission | Should a new request enter the fast KV tier? | Should a new agent pipeline be admitted to the cluster, given current pipeline depth, speculative branch count, and HBM fragmentation state? |
| Movement | When and how should KV blocks be moved between tiers? | When should cross-hop context be compressed before transmission? When should upstream KV state be proactively staged on the downstream GPU before the hop begins? |
The extension of MCOS to agent scope does not require changing the core MCOS abstractions. It requires extending the information available to the MCOS scheduler: instead of seeing individual request objects, it must see pipeline DAGs with known structure, estimated hop costs, and cross-hop memory dependencies.
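One plausible shape for that extended input, with all field names and sizes assumed: the scheduler receives a pipeline DAG carrying per-hop cost estimates and cross-hop dependencies, from which it can derive quantities like the pipeline's aggregate KV footprint for admission decisions:

```python
from dataclasses import dataclass, field

@dataclass
class HopSpec:
    hop_id: str
    model_id: str
    est_prefill_tokens: int
    est_decode_tokens: int
    upstream: list = field(default_factory=list)   # hop_ids feeding this hop

@dataclass
class PipelineDAG:
    pipeline_id: str
    e2e_deadline_s: float
    hops: dict = field(default_factory=dict)       # hop_id -> HopSpec

    def est_kv_bytes(self, bytes_per_token: int = 160_000) -> int:
        """Rough upper bound on KV footprint if all hops were resident at once."""
        return sum(
            (h.est_prefill_tokens + h.est_decode_tokens) * bytes_per_token
            for h in self.hops.values())

dag = PipelineDAG("p-001", e2e_deadline_s=4.0)
dag.hops["plan"] = HopSpec("plan", "planner-8b", 2_000, 300)
dag.hops["reason"] = HopSpec("reason", "reasoner-70b", 34_000, 1_500,
                             upstream=["plan"])
print(dag.est_kv_bytes() // 10**9)   # → 6 (rough GB of KV across the pipeline)
```

An admission controller that sees this object can reject or defer a pipeline whose aggregate footprint would fragment HBM, something no per-request view can express.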
MCOS at single-model scope is a memory scheduler. MCOS at agent scope is a topology planner with memory awareness. The algebra is the same; the object it operates on is a graph, not a request.
The AI industry is in the middle of an infrastructure debt accumulation. Application developers are building agent pipelines using frameworks (LangChain, AutoGen, LlamaIndex, CrewAI, Claude's computer use API) that abstract away the infrastructure entirely. Each model call goes through an HTTP endpoint. The endpoint dispatches to a vLLM or TGI instance. The instance manages its own KV cache. No component in this stack has a model of the pipeline.
The consequences are visible in production. Agent pipelines that should take 2–3 seconds take 8–12 seconds because of queuing at the RAG hop. Pipelines that should cost $0.02 cost $0.15 because large contexts are being re-prefilled on cold nodes when warm nodes exist. Pipelines that should maintain a 4-second p99 SLO miss it consistently during peak traffic because per-model SLOs are each met but their sum is not.
The infrastructure that will fix this has three components, in order of likely emergence: first, KV-warmth-aware routing layers that sit in front of existing serving engines and treat locality as a routing signal; second, deadline-propagating schedulers that allocate per-hop budgets from the end-to-end SLO; third, a full Agent Memory Fabric control plane that sees the pipeline DAG and schedules placement, residency, reuse, and admission across the cluster.
Single-model serving infrastructure took three years to mature from naive batching to PagedAttention, continuous batching, and speculative decoding. Agent-topology infrastructure is at year zero. The problems are well-defined. The abstractions are transferable from single-model work. What is missing is the systems investment.
The agent pipeline is not a new product. It is a new topology. And topology changes the machine — not the model, not the algorithm, but the machine that holds the memory, routes the bytes, and pays the power bill. If the machine does not understand the topology it is serving, it will serve it expensively and slowly, even as the models themselves become cheaper and faster.
© 2026 Manish KL. All rights reserved.