AI Infrastructure Essay · gemma4-wdc

gemma4-wdc: A middleware layer that stops local agents from doing the same work twice

As local multi-agent systems get practical, the next bottleneck is not the models. It is the duplicated downstream work underneath them. This essay explains the middleware primitive, the bounded admission window, the safety case, and the laptop-scale prototype I built to make that behavior visible.

April 2026 · Local AI · Middleware · Shared Execution

Something clicked for me about six months ago when I was watching a local agent setup run three parallel branches (planner, coder, reviewer) and noticed that all three were independently scanning the same repo to understand the codebase. Three identical backend operations. Zero shared state between them. The models were fine. The middleware was the problem.

That observation turned into gemma4-wdc, a laptop-first runtime that detects semantically overlapping tool tasks from concurrent agents and collapses them into a single shared execution unit before any backend work is duplicated. The results fan back out to every agent that submitted a matching task.

The models were not the problem. The wasted backend work underneath them was.

The problem, concretely

Local multi-agent systems branch. A planner spawns a coder and a reviewer. A research agent forks into three parallel extraction paths. A coding agent runs a diff pipeline alongside a test-validation pipeline. Each branch is locally rational. It asks for the information it needs. But neighboring branches often need the same information.

On cloud infrastructure, this is annoying but survivable. You pay a bit more, latency spikes, and you move on. On consumer hardware running local models, it is a real problem. Duplicated repo scans, redundant document extraction passes, repeated API calls: these stack up fast and they are entirely avoidable.

The thesis

This is a middleware problem as much as a model problem. You cannot prompt your way out of it. The runtime layer needs to handle it.

How the runtime works

The core primitive is a Shared Execution Unit, or SEU. When a task arrives, the runtime fingerprints it by computing a canonical key, an exact structural hash, and a semantic comparison vector. It then checks whether a compatible SEU is already open.

If one is, the incoming task attaches as a subscriber. If not, a new SEU opens and starts a bounded admission window. This is the key timing primitive: the window is non-resetting, meaning it does not extend when new tasks arrive. It closes, the SEU executes once, and the result fans out to every subscriber.
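The open-or-attach decision can be sketched in a few lines. This is my reading of the design, not the project's actual code: the task schema and field names are invented, and the real runtime also computes a semantic comparison vector and enforces the admission window, both omitted here.

```python
import hashlib
import json
from dataclasses import dataclass, field

@dataclass
class SharedExecutionUnit:
    """One unit of backend work whose single result is shared."""
    key: str
    subscribers: list = field(default_factory=list)

def structural_hash(task: dict) -> str:
    """Exact structural fingerprint: hash of the task's canonical JSON.
    Sorting keys makes the hash independent of field order."""
    canonical = json.dumps(task, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def submit(open_units: dict, task: dict, agent_id: str) -> SharedExecutionUnit:
    """Attach to a compatible open SEU, or open a new one."""
    key = structural_hash(task)
    unit = open_units.get(key)
    if unit is None:
        unit = SharedExecutionUnit(key=key)
        open_units[key] = unit
    unit.subscribers.append(agent_id)
    return unit

# Two agents requesting the same repo scan collapse into one SEU,
# even though they built the request in different field order.
open_units: dict = {}
u1 = submit(open_units, {"tool": "repo_scan", "path": "."}, "planner")
u2 = submit(open_units, {"path": ".", "tool": "repo_scan"}, "coder")
assert u1 is u2 and u1.subscribers == ["planner", "coder"]
```

Canonical JSON serialization is what makes the "exact structural hash" robust: field order stops mattering, so agents that phrase the same request differently still land on the same key.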

Runtime flow

1. Task ingress
2. Fingerprinting
3. Admission window
4. Execute once (SEU)
5. Fan-out
6. Metrics

Matching uses two passes: exact structural overlap first (fast, with zero false positives), then lightweight near-duplicate similarity where it is safe. The two-pass approach means the system does not over-collapse. Tasks that look similar but are not materially the same stay separate.

The admission window design

The non-resetting window is a deliberate choice and worth explaining. A resetting window, one that extends each time a new task arrives, sounds appealing because it maximizes collapse. But it creates unbounded latency: a steady stream of similar tasks would keep the window open indefinitely. Nothing executes.

The bounded non-resetting window trades some collapse opportunity for predictable execution timing. The first arrival opens the window, matching tasks attach, and the window closes at a fixed deadline regardless of what has arrived since. A resetting window feels better at first glance; it only stops feeling that way once you reason about its steady-state behavior under load.

Design note

The admission window is non-resetting by design. A resetting window maximizes collapse but creates unbounded latency under load. Fixed deadline, predictable execution.
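The non-resetting window fits in a few lines. The sketch below uses an injectable clock so the fixed deadline is observable in a test; the class and method names are mine, not the project's.

```python
import time

class AdmissionWindow:
    """Bounded, non-resetting admission window (sketch).

    The deadline is fixed when the window opens; later arrivals
    attach but never extend it, so latency is bounded by `duration`."""

    def __init__(self, duration: float, now=time.monotonic):
        self._now = now
        self.deadline = now() + duration  # fixed once, never moved
        self.subscribers: list = []

    def try_attach(self, agent_id: str) -> bool:
        if self._now() >= self.deadline:
            return False  # window closed: caller opens a fresh SEU
        self.subscribers.append(agent_id)  # deadline deliberately untouched
        return True

# With a fake clock the non-resetting behavior is easy to see.
t = [0.0]
w = AdmissionWindow(duration=0.1, now=lambda: t[0])
assert w.try_attach("planner")       # arrives at t=0.00
t[0] = 0.05
assert w.try_attach("coder")         # attaches; deadline still 0.1
t[0] = 0.12
assert not w.try_attach("reviewer")  # past the fixed deadline
```

The resetting variant would be the one-line change `self.deadline = now() + duration` inside `try_attach`, and that single line is exactly what produces unbounded latency under a steady arrival stream.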

The safety case matters as much as the savings

The benchmark suite includes a false_collapse_safety scenario explicitly for this reason. It constructs tasks that look semantically similar (same resource type, overlapping keywords, structurally close) but differ in ways that make shared execution wrong. The runtime should keep those apart.
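The shape of such a safety check is simple to express: build near-miss tasks and require one execution each. The task fields below are invented for illustration, and this exact-key counter is a simplification of the real matcher.

```python
import hashlib
import json

def structural_key(task: dict) -> str:
    """Canonical-JSON hash, as in the exact matching pass."""
    canonical = json.dumps(task, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def count_executions(tasks: list) -> int:
    """Executions under exact-key collapsing: one per distinct SEU."""
    return len({structural_key(t) for t in tasks})

# Four near-miss tasks: same tool, overlapping fields, but each
# materially different (ref, depth, repo). None may be collapsed.
near_misses = [
    {"tool": "repo_scan", "repo": "demo", "ref": "HEAD"},
    {"tool": "repo_scan", "repo": "demo", "ref": "v0.1.0"},
    {"tool": "repo_scan", "repo": "demo", "ref": "HEAD", "depth": 1},
    {"tool": "repo_scan", "repo": "other", "ref": "HEAD"},
]
# Zero false collapses: 4 tasks in, 4 executions out.
assert count_executions(near_misses) == 4
```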

Current local harness results
Scenario                Tasks  Executions  Saved  Dedup ratio  False collapses
coding_repo_scan          4        2         2       2.0x           0.00
document_research         3        2         1       1.5x           0.00
api_fanout                3        2         1       1.5x           0.00
false_collapse_safety     4        4         0       1.0x           0.00

These are local harness numbers from mock executors. I have been upfront about that throughout the project. The goal is not to claim production throughput. It is to make the middleware behavior visible and auditable on a single laptop.

14 tasks requested across the current benchmark families
10 actual executions after deduplication
0.00 false-collapse rate in the visible safety scenario
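The derived columns in the table follow from tasks and executions alone. Assuming the harness defines the dedup ratio as tasks divided by executions, which matches every row, the arithmetic is:

```python
def dedup_metrics(tasks: int, executions: int) -> dict:
    """Saved work and dedup ratio, as reported per scenario."""
    return {"saved": tasks - executions,
            "dedup_ratio": round(tasks / executions, 2)}

# Reproduce the harness rows: 2.0x, 1.5x, and the 1.0x safety floor.
assert dedup_metrics(4, 2) == {"saved": 2, "dedup_ratio": 2.0}
assert dedup_metrics(3, 2) == {"saved": 1, "dedup_ratio": 1.5}
assert dedup_metrics(4, 4) == {"saved": 0, "dedup_ratio": 1.0}
# Aggregate across families: 14 requested, 10 executed, 4 saved.
assert dedup_metrics(14, 10)["saved"] == 4
```

A ratio of 1.0x in the safety scenario is the desired floor: no work saved, because no collapse was safe.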

Simulation-first is a first-class design choice

gemma4-wdc defaults to simulation mode, where agents are lightweight processes emitting structured tool tasks. No real local model is required. This is intentional. The thesis is about the middleware layer, not about how many Gemma instances you can cram onto a MacBook Pro.

Hybrid mode exists for one optional real model adapter to participate alongside simulated agents, keeping the demo laptop-friendly while still demonstrating the full integration path. But simulation mode is the product, not a fallback. It is the mode that makes the systems thesis legible.

Run it yourself

Terminal 1 (backend):

cd runtime/shared_execution/backend
python -m venv .venv && .\.venv\Scripts\Activate.ps1   # PowerShell; on macOS/Linux: source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload

Terminal 2 (frontend):

cd runtime/shared_execution/frontend
python -m http.server 4173
# open http://localhost:4173

What I would do differently

The “Gemma” in the name is partly aspirational right now. The middleware works independently of the model family, which is actually the right property, but it means the branding slightly misrepresents the current state. I would either lean harder into the real model integration path, or reframe it as a model-agnostic middleware layer and let that be the story. Both are honest. The current middle ground is the weakest position.

I would also build richer explainability earlier. The dashboard shows that tasks collapsed, but not deeply why: what similarity score triggered it, which fingerprint matched, what the admission window state was at each step. That audit trail would make the safety case much stronger and make the runtime more useful as a debugging tool.

Where this is going

The roadmap has three honest next steps: a stronger real-model adapter path; better replay and explainability, with annotated collapse reasoning and replayable scenario logs; and, only after the single-machine thesis is solid, early work on multi-node coordination.

The broader bet is that as agentic workflows become a default pattern rather than an experimental one, the middleware layer between models and backends becomes genuinely important infrastructure. gemma4-wdc is a proof of concept for one slice of that, deduplication and shared execution, but the design surface is much larger.

If you are building in this space, thinking about agent runtimes, or just want to poke at the code, the repo is open. Issues, feedback, and pull requests are welcome.

Try gemma4-wdc

Laptop-scale, simulation-first, offline. No API key required.