Every essay in this corpus has treated inference as one model, one request. The real 2026 workload is a directed graph of model calls. When the unit of scheduling shifts from a request to a pipeline, everything we know about KV locality, memory residency, and latency SLOs must be rebuilt from scratch.
Every piece of inference infrastructure built over the last three years — vLLM, TGI, TensorRT-LLM, SGLang, the prefill/decode disaggregation proposals, the KV cache eviction policies, the MCOS framework — was designed with the same implicit assumption: a request is a conversation with a single model. A user sends a prompt. A model responds. The serving system manages one KV cache, one GPU pool, one latency SLO.
That assumption is now wrong for a large and growing fraction of production AI workload.
In 2026, the typical interaction with a frontier AI system is not a single model call. It is a directed graph of model calls: an orchestrator that decomposes tasks, a planner that sequences steps, a retrieval system that finds context, a reasoning model that evaluates options, a code executor that validates outputs, and a verifier that checks the result. These are not the same model. They are often different models, different sizes, different precisions, running on different hardware pools, with different latency profiles.
The serving infrastructure has not caught up. Each model call is still scheduled independently. Each KV cache is still scoped to one request on one model. Each latency SLO is still measured per model hop rather than end-to-end. The consequence is that agent pipelines are being served by infrastructure that is individually optimised and globally naive — and the global naivety is often the dominant cost.
The core argument: agent pipelines are not a composition of single-model serving problems. They are a new topology with new resource pathologies that cannot be solved by optimising individual hops in isolation.
Before we can reason about the infrastructure, we need a precise characterisation of the workload. An agent call graph is a directed acyclic graph (DAG) G = (V, E), where the vertices V are individual model calls (each with a model identity, an input context, and an output) and the edges E are data dependencies: a hop cannot begin until the outputs of its parent hops are available.
In practice, production agent graphs have the following structural characteristics:
| Structural Property | Typical range (2026) | Infrastructure implication |
|---|---|---|
| Depth (sequential hops) | 4–15 hops | Latency multiplies; end-to-end SLO requires per-hop budgeting |
| Width (parallel branches) | 2–8 concurrent calls | Memory pressure multiplies; batch contamination across branches |
| Model diversity | 2–5 distinct models | KV caches are disjoint; no direct reuse across model boundaries |
| Context growth rate | 5–50× from root to leaf | KV size at leaf nodes is much larger than at root; HBM pressure peaks late |
| Speculation (branching) | 1.5–3× the committed token count | Memory must be allocated for paths that may be discarded |
| Retrieval injection size | 4K–256K tokens per RAG hop | Each retrieval hop is a large-context prefill; prefill cost dominates agent cost |
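The structural properties above can be made concrete with a small sketch. The hop names, model names, and graph shape below are illustrative, not taken from any real pipeline; the point is that depth (the critical path) is a property of the graph, not of any single call:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: an agent call graph as a DAG of model calls.
# Distinct model identities per hop mean disjoint, non-transferable KV caches.
@dataclass
class Hop:
    name: str
    model: str
    parents: list = field(default_factory=list)

def depth(hop: Hop) -> int:
    """Longest chain of sequential hops ending at `hop` (the critical path)."""
    if not hop.parents:
        return 1
    return 1 + max(depth(p) for p in hop.parents)

plan   = Hop("plan", "planner-8b")
rag_a  = Hop("retrieve_a", "embedder", [plan])   # two retrieval branches
rag_b  = Hop("retrieve_b", "embedder", [plan])   # run in parallel (width 2)
reason = Hop("reason", "reasoner-70b", [rag_a, rag_b])
verify = Hop("verify", "verifier-8b", [reason])

print(depth(verify))   # → 4 sequential hops on the critical path
```

Width (the parallel retrieval branches) adds memory pressure without adding depth; only the longest parent chain contributes to end-to-end latency.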
In a single-model serving system, KV caching across requests is the primary throughput lever. When two requests share a common prefix — a system prompt, a document, a code context — the KV cache for that prefix can be shared, avoiding a redundant prefill. This is the foundation of prefix caching in vLLM, SGLang's RadixAttention, and every modern KV sharing scheme.
In an agent pipeline, this locality property breaks at every edge in the call graph. When the task planner sends its output to the reasoning model, the reasoning model starts a new KV cache from scratch. The KV state of the task planner is not transferable — it is computed by a different model with a different weight matrix, a different attention head structure, and potentially a different vocabulary. KV vectors are model-specific. They cannot be reused across model boundaries.
This means that every edge in the agent call graph is a KV locality reset. The context that took many tokens and many prefill cycles to encode in the upstream model must be re-encoded from text (or token IDs) in the downstream model. The bytes are moved out of the upstream model's KV cache, serialised as tokens, transmitted across the pipeline, and re-prefilled into the downstream model.
The implications are significant. In a single-model serving system, prefill is a small fraction of total cost (decode dominates for long outputs). In an agent pipeline, prefill is re-run at every hop, and the context at each hop includes the accumulated output of all prior hops plus any retrieved content. By the time the pipeline reaches the reasoning model, the prefill cost alone can dwarf the decode cost of the entire original request.
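A back-of-envelope token count shows how quickly re-prefill dominates. All the numbers below are assumed for illustration; the mechanism is the one described above: context accumulates at every hop, and each hop must re-prefill the full accumulated context because KV state cannot cross model boundaries:

```python
# Illustrative pipeline: (name, retrieved tokens injected, output tokens).
root_prompt = 2_000                    # tokens in the original request
hops = [
    ("plan",     0,      300),
    ("retrieve", 32_000, 0),           # RAG hop injects a large document set
    ("reason",   0,      1_500),
    ("verify",   0,      200),
]

context = root_prompt
total_prefill = 0
total_decode = 0
for name, retrieved, out in hops:
    context += retrieved
    total_prefill += context           # full context re-encoded at this hop
    total_decode += out
    context += out

print(total_prefill, total_decode)     # → 106400 2000
```

Under these assumptions the pipeline prefills over 100K tokens to decode 2K: the re-prefill work exceeds the decode work by more than 50×, which is the inversion of the single-model cost profile.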
A serving system optimised for TTFT (time-to-first-token) and TBT (time-between-tokens) at the individual request level has a simple, per-request latency model. In an agent pipeline these per-hop latencies compound: every hop's queue wait, prefill, and decode time sits on the end-to-end critical path. For a representative seven-hop pipeline at roughly 600ms per hop, that is about 4.2 seconds end to end before any contention.
The 4.2-second figure assumes zero queuing delay at each model and no contention for GPU capacity. In a production cluster serving hundreds of concurrent agent pipelines, each hop waits in a queue. Because each hop queues independently of the prior hop, the tails accumulate: the pipeline's p99 is bounded above by the sum of the per-hop p99 latencies, not their maximum, and correlated load spikes push it toward that bound. This is a much worse distribution than single-model serving.
A p99 TTFT of 300ms at each hop of a seven-hop pipeline produces a p99 end-to-end latency of up to approximately 2.1 seconds (7 × 300ms) even if every hop individually meets its SLO. The pipeline SLO is a different, harder problem than the per-hop SLO.
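A quick Monte Carlo makes the compounding concrete. The per-hop latency model below is assumed (a lognormal tuned so the per-hop p99 lands near 300ms); even with fully independent hops, the pipeline p99 comes out at several multiples of the per-hop p99, and correlated cluster load in production pushes it further toward the 7× sum:

```python
import random

random.seed(0)
HOPS = 7
TRIALS = 20_000

def hop_latency():
    # Assumed model: ~120 ms median with a lognormal tail,
    # tuned so the per-hop p99 lands near 0.3 s.
    return random.lognormvariate(-2.1, 0.4)   # seconds

# Empirical distributions: one per-hop sample set, one end-to-end sample set.
e2e = sorted(sum(hop_latency() for _ in range(HOPS)) for _ in range(TRIALS))
per_hop = sorted(hop_latency() for _ in range(TRIALS))

p99 = lambda xs: xs[int(0.99 * len(xs))]
print(round(p99(per_hop), 3), round(p99(e2e), 3))
```

The simulated pipeline p99 lands well above 2× the per-hop p99 but below the 7× worst case, which is exactly the gap between meeting every per-hop SLO and missing the end-to-end one.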
Many agent architectures include speculative execution: launching multiple branches in parallel and committing to the one that returns the best result. This is correct algorithmically — it reduces end-to-end latency by parallelising uncertainty. But it creates a severe memory management problem.
When a scheduler launches N branches speculatively, it must allocate KV cache space for all N branches simultaneously. If only one branch commits, the KV pages allocated for the N-1 discarded branches must be freed. In a system using PagedAttention or a similar block-allocated KV management scheme, these freed pages become fragmented — they are scattered throughout the HBM in non-contiguous blocks that may not be immediately re-usable for the next incoming request.
This fragmentation is not hypothetical. It is the same class of problem that motivated PagedAttention in the first place — but agent pipelines make it significantly worse because the branching and commitment patterns are irregular and hard to predict ahead of time.
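A toy allocator shows the pattern. Page counts and branch names are assumed; the mechanism is the one above: branches grow in lockstep, so their pages interleave, and when all but one branch is discarded, the freed pages are scattered through the pool rather than forming a contiguous run:

```python
# Toy model of speculative branching under a block-allocated KV scheme.
PAGE_POOL = 64
free_list = list(range(PAGE_POOL))
allocations = {}

# The scheduler extends all four branches step by step, so each branch's
# KV pages end up interleaved with its siblings' pages.
for step in range(4):
    for branch in ("a", "b", "c", "d"):
        allocations.setdefault(branch, []).extend(
            free_list.pop(0) for _ in range(2))

# Branch "a" commits; the other three are discarded and their pages freed.
for branch in ("b", "c", "d"):
    free_list.extend(allocations.pop(branch))

print(sorted(allocations["a"]))    # pages held by the surviving branch
print(sorted(free_list)[:10])      # freed pages interleave around them
```

The surviving branch holds pages 0, 1, 8, 9, 16, 17, 24, 25; the freed pages fill the gaps between them. Paged allocation tolerates this non-contiguity for attention itself, but the churn of irregular free/reuse cycles is what agent-scale speculation amplifies.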
If we apply the byte-movement accounting from the previous essay ("The True Cost of a Token") to an agent pipeline, the picture changes materially. In single-model serving, the dominant movement is weights (read once per decode step, amortised over batch) and KV cache (read once per decode step per request). In an agent pipeline, two new movement costs appear:

1. Re-prefill traffic. At every edge, the accumulated context is serialised back to tokens, transmitted, and re-encoded into the downstream model's KV cache. Bytes that were already paid for upstream are paid for again at every hop.
2. Speculative write traffic. KV bytes are written to HBM for branches that may be discarded, at 1.5–3× the committed token count, and the discarded pages must then be reclaimed.
The three problems above — KV locality loss, multiplicative latency compounding, and speculative memory fragmentation — are not solvable by improving individual model serving. They require a scheduling abstraction that spans the full pipeline.
I will call this the Agent Memory Fabric (AMF). The AMF is not a new hardware layer. It is a control-plane abstraction that has three responsibilities:

1. Pipeline-aware placement and routing: route each hop to the GPU node where its context is already warm, treating text-level KV warmth as a first-class routing signal.
2. Pipeline-level latency budgeting: derive per-hop deadlines from the end-to-end SLO and propagate them to each model's scheduler.
3. Speculative memory accounting: track the KV allocations of a pipeline's branches as a unit, so that discarded branches are reclaimed promptly and admission control sees a pipeline's true footprint.
The AMF does not require cross-model KV transfer (which is physically impossible without re-projection through the model). It does not require a shared global memory address space across GPUs. It does not require changes to individual model serving engines. It is a scheduling layer, not a hardware layer — it sits above vLLM, TGI, and similar systems and orchestrates their resource allocation based on pipeline-level context.
A natural question: could KV vectors ever be reused across model boundaries, if the models share architectural properties? The answer depends on what kind of sharing we mean.
For the majority of production agent pipelines, where different models serve different roles in the pipeline, cross-model KV sharing is not possible. The architecture must instead focus on minimising the re-prefill cost — through text-level prefix caching within each model, through aggressive context compression before transmission, and through prefill scheduling that prioritises pipeline-critical hops.
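Text-level prefix caching is the workable substitute for cross-model KV sharing. A minimal sketch, with all names assumed: reuse is keyed on a hash of the exact context text, scoped per model, so the same document is a cache hit on the second call to the same model but a miss on any other model:

```python
import hashlib

def prefix_key(model_id: str, context: str) -> str:
    # Reuse is tracked at the text level, scoped per model, because
    # KV vectors themselves cannot cross model boundaries.
    return model_id + ":" + hashlib.sha256(context.encode()).hexdigest()[:16]

prefix_cache = set()

def prefill_cost(model_id: str, context: str) -> int:
    """Tokens that must actually be prefilled (0 on a text-level hit)."""
    key = prefix_key(model_id, context)
    if key in prefix_cache:
        return 0
    prefix_cache.add(key)
    return len(context.split())        # crude whitespace token stand-in

doc = "shared system prompt plus retrieved document " * 100
print(prefill_cost("reasoner-70b", doc) > 0)   # → True (first call: full prefill)
print(prefill_cost("reasoner-70b", doc))       # → 0 (same model: hit)
print(prefill_cost("verifier-8b", doc) > 0)    # → True (other model: miss)
```

The miss on `verifier-8b` is the KV locality reset from earlier: identical text, zero reuse, because the cache boundary is the model boundary.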
In a serving cluster with many GPU nodes, each serving one or more models, the placement of a model call matters for latency. A call routed to a GPU node where the system prompt and document context are already warm in HBM saves the prefill cost. A call routed to a cold node pays the full prefill.
Current load balancers for model serving use simple round-robin or least-loaded routing. They do not model KV warmth. The AMF should extend routing to include KV locality as a first-class routing signal:
def route_request(pipeline_id, hop_id, model_id, context_hash):
    # Find GPU nodes with a warm KV prefix for this context on this model
    warm_nodes = kv_registry.query(model_id, context_hash)
    if warm_nodes and min(n.queue_depth for n in warm_nodes) < KV_REUSE_THRESHOLD:
        # Route to the least-loaded warm node; the prefill is saved
        return route_to(min(warm_nodes, key=lambda n: n.queue_depth))
    # No warm node within the queue-depth threshold: pay the full prefill
    cold_node = gpu_pool.least_loaded(model_id)
    kv_registry.register_prefill(cold_node, model_id, context_hash)
    return route_to(cold_node)
The KV_REUSE_THRESHOLD controls the tradeoff between KV warmth and queue depth. A tight threshold favours latency (route to the warm node even if it is slightly busier); a loose threshold favours throughput (always route to the least-loaded node). For agent pipelines on the critical path, a tight threshold is almost always preferable: a queue wait at one hop delays every downstream hop, while the prefill saved by routing warm grows with the accumulated context.
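The break-even point is simple arithmetic. With assumed numbers (prefill throughput and context size are illustrative), routing to the warm node wins whenever the extra queue wait is smaller than the prefill it avoids:

```python
# Back-of-envelope check of the warm-vs-cold routing tradeoff.
PREFILL_MS_PER_1K_TOKENS = 40      # assumed prefill throughput
context_tokens = 48_000            # typical post-RAG accumulated context

prefill_saved_ms = context_tokens / 1_000 * PREFILL_MS_PER_1K_TOKENS

def route_warm(extra_queue_wait_ms: float) -> bool:
    # Warm routing wins iff the added wait is below the avoided prefill.
    return extra_queue_wait_ms < prefill_saved_ms

print(prefill_saved_ms)        # → 1920.0 ms of prefill avoided
print(route_warm(500))         # → True  (warm node 0.5 s busier: still worth it)
print(route_warm(2500))        # → False (warm node 2.5 s busier: go cold)
```

Because the avoided prefill scales with context size, the warm-routing margin widens at exactly the late, large-context hops where HBM pressure peaks.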
A well-designed single-model serving system can achieve p99 TTFT under 300ms for short contexts and p99 TBT under 30ms. These are hard numbers, achieved through careful batch management, preemption policies, and priority queuing.
In an agent pipeline with N hops, what is the p99 end-to-end latency? Under the independence assumption (hop latencies are independent), it is bounded above by the sum of the N per-hop p99 latencies. Independence pulls the true value somewhat below that bound, but the shared cluster load that correlates hop latencies in production pushes it back toward it; either way, the pipeline p99 is several multiples of any single hop's p99.
The correct way to manage E2E SLO in an agent pipeline is to work backward from the end-to-end budget and allocate per-hop latency budgets based on criticality and expected cost. The AMF should implement deadline propagation: each hop receives a deadline derived from the E2E SLO, and the serving scheduler for that model uses the deadline as a priority signal rather than first-in-first-out ordering.
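Deadline propagation can be sketched in a few lines. The hop names and cost weights below are assumed; the scheme is the one described above: split the end-to-end budget across hops in proportion to their estimated cost, then hand each hop an absolute deadline its scheduler can use as a priority signal:

```python
import time

def propagate_deadlines(e2e_budget_s, hop_costs, now=None):
    """Return {hop: absolute_deadline_s} from an end-to-end SLO budget.

    Per-hop budgets are proportional to estimated cost; deadlines are
    cumulative along the pipeline (dict insertion order = hop order).
    """
    now = time.monotonic() if now is None else now
    total = sum(hop_costs.values())
    deadlines, elapsed = {}, 0.0
    for hop, cost in hop_costs.items():
        elapsed += e2e_budget_s * cost / total
        deadlines[hop] = now + elapsed
    return deadlines

# Illustrative cost weights: reasoning dominates, so it gets most of the budget.
hop_costs = {"plan": 1, "retrieve": 4, "reason": 10, "verify": 1}
d = propagate_deadlines(4.0, hop_costs, now=0.0)
print({h: round(t, 2) for h, t in d.items()})
# → {'plan': 0.25, 'retrieve': 1.25, 'reason': 3.75, 'verify': 4.0}
```

A hop that arrives at its scheduler with 2.5 seconds of slack can yield to one arriving with 200ms, which is exactly the information a per-model FIFO queue throws away.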
The MCOS framework — memory placement, movement, residency, reuse, admission, and eviction as first-class scheduling objects — was designed with single-model inference in mind. The concepts extend naturally to agent pipelines, but with a change of scope.
| MCOS concept | Single-model scope | Agent pipeline scope |
|---|---|---|
| Placement | Which tier (SRAM/HBM/NVMe) should a KV block reside in? | Which GPU node, and which memory tier on that node, should a pipeline's KV state reside in, given the subsequent hops it will feed? |
| Residency | How long should a KV block stay in HBM before eviction? | How long should a hop's output context stay warm in the downstream model's prefix cache, given the probability of future pipeline calls sharing that context? |
| Reuse | Can this KV block serve multiple concurrent requests? | Can this pipeline's intermediate outputs be shared with other pipelines that share the same sub-graph? Tracked at text-hash level across model boundaries. |
| Admission | Should a new request enter the fast KV tier? | Should a new agent pipeline be admitted to the cluster, given current pipeline depth, speculative branch count, and HBM fragmentation state? |
| Movement | When and how should KV blocks be moved between tiers? | When should cross-hop context be compressed before transmission? When should upstream KV state be proactively staged on the downstream GPU before the hop begins? |
The extension of MCOS to agent scope does not require changing the core MCOS abstractions. It requires extending the information available to the MCOS scheduler: instead of seeing individual request objects, it must see pipeline DAGs with known structure, estimated hop costs, and cross-hop memory dependencies.
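One plausible shape for that extended input, with all field names and sizes assumed: the scheduler receives a pipeline DAG carrying per-hop cost estimates and cross-hop dependencies, from which it can derive quantities like the pipeline's aggregate KV footprint for admission decisions:

```python
from dataclasses import dataclass, field

@dataclass
class HopSpec:
    hop_id: str
    model_id: str
    est_prefill_tokens: int
    est_decode_tokens: int
    upstream: list = field(default_factory=list)   # hop_ids feeding this hop

@dataclass
class PipelineDAG:
    pipeline_id: str
    e2e_deadline_s: float
    hops: dict = field(default_factory=dict)       # hop_id -> HopSpec

    def est_kv_bytes(self, bytes_per_token: int = 160_000) -> int:
        """Rough upper bound on KV footprint if all hops were resident at once."""
        return sum(
            (h.est_prefill_tokens + h.est_decode_tokens) * bytes_per_token
            for h in self.hops.values())

dag = PipelineDAG("p-001", e2e_deadline_s=4.0)
dag.hops["plan"] = HopSpec("plan", "planner-8b", 2_000, 300)
dag.hops["reason"] = HopSpec("reason", "reasoner-70b", 34_000, 1_500,
                             upstream=["plan"])
print(dag.est_kv_bytes() // 10**9)   # → 6 (rough GB of KV across the pipeline)
```

An admission controller that sees this object can reject or defer a pipeline whose aggregate footprint would fragment HBM, something no per-request view can express.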
MCOS at single-model scope is a memory scheduler. MCOS at agent scope is a topology planner with memory awareness. The algebra is the same; the object it operates on is a graph, not a request.
The AI industry is in the middle of an infrastructure debt accumulation. Application developers are building agent pipelines using frameworks (LangChain, AutoGen, LlamaIndex, CrewAI, Claude's computer use API) that abstract away the infrastructure entirely. Each model call goes through an HTTP endpoint. The endpoint dispatches to a vLLM or TGI instance. The instance manages its own KV cache. No component in this stack has a model of the pipeline.
The consequences are visible in production. Agent pipelines that should take 2–3 seconds take 8–12 seconds because of queuing at the RAG hop. Pipelines that should cost $0.02 cost $0.15 because large contexts are being re-prefilled on cold nodes when warm nodes exist. Pipelines that should maintain a 4-second p99 SLO miss it consistently during peak traffic because per-model SLOs are each met but their sum is not.
The infrastructure that will fix this has three components, in order of likely emergence: first, KV-warmth-aware routing layers that sit in front of existing serving engines and treat locality as a routing signal; second, deadline-propagating schedulers that allocate per-hop budgets from the end-to-end SLO; third, a full Agent Memory Fabric control plane that sees the pipeline DAG and schedules placement, residency, reuse, and admission across the cluster.
Single-model serving infrastructure took three years to mature from naive batching to PagedAttention, continuous batching, and speculative decoding. Agent-topology infrastructure is at year zero. The problems are well-defined. The abstractions are transferable from single-model work. What is missing is the systems investment.
The agent pipeline is not a new product. It is a new topology. And topology changes the machine — not the model, not the algorithm, but the machine that holds the memory, routes the bytes, and pays the power bill. If the machine does not understand the topology it is serving, it will serve it expensively and slowly, even as the models themselves become cheaper and faster.
© 2026 Manish KL. All rights reserved.