There are a lot of AI infrastructure repositories that gesture at performance without choosing a clear systems thesis. This one does choose. It says agentic AI is no longer just a model-serving problem. It is a data-path problem, and the Linux host stack, the NIC queue model, and the memory-movement path all show up on the critical path far earlier than many teams expect.
The point is not simply that it uses AF_XDP, io_uring, or RDMA. It is that it treats them as different answers to different traffic shapes, instead of treating any one of them as the universal “fast path.”

Why agentic AI changes the NIC story
Traditional microservices and one-shot inference systems make networking feel secondary. The model dominates the discussion. The GPU dominates the budget. The host path is often treated as a generic transport layer that can be optimized later.
Agentic systems break that assumption.
- Small RPC pressure: Planner turns, tool calls, retrieval lookups, and memory fetches create lots of small messages where fixed packet overhead starts to dominate.
- Bursty fan-out: One agent step can burst into tens of parallel downstream calls, which means queue depth and CPU locality become first-order latency variables.
- Bulk state movement: Checkpoint copies, shard sync, vector updates, and GPU feed pipelines create a second workload shape that wants high throughput and low CPU per byte.
- Control overhead: Security, fairness, retries, verification, and policy logic make orchestration cost visible even when the model kernel itself is efficient.
The repo’s core framing is exactly right here: agentic AI becomes a queueing and copy problem at scale. That is not just a slogan. It is a statement about what you must measure on real systems:
- softirq CPU time
- queue-to-core locality
- copy and bounce-buffer pressure
- completion delivery latency
- host CPU consumed per GB moved
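As one concrete example of what “measure on real systems” means, here is a minimal sketch that samples the per-CPU NET_RX softirq counters from /proc/softirqs. This is illustrative tooling, not part of the repo; the parsing is deliberately naive.

```c
/* Illustrative only: print the per-CPU NET_RX row from /proc/softirqs.
 * Sampling it twice over an interval gives a rough softirq-pressure
 * signal per core; real tooling would tie this to the benchmark run. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[4096];
    FILE *f = fopen("/proc/softirqs", "r");
    if (!f)
        return 1;
    while (fgets(line, sizeof(line), f)) {
        if (strstr(line, "NET_RX:")) {   /* one count column per CPU */
            fputs(line, stdout);
            break;
        }
    }
    fclose(f);
    return 0;
}
```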
Why one transport path is not enough
The repo is built around the idea that agentic workloads are mixed-shape data-path workloads, not just “larger inference.”
The tri-path dataplane is the right abstraction
The most durable idea in the repository is the tri-path split.
| Path | Primary role | Best fit | Main risk |
|---|---|---|---|
| Path A | Kernel TCP plus tuning | Mainstream orchestrator, tool, memory, and service RPC | Softirq and copy overhead on hot small-message paths |
| Path B | AF_XDP | Routers, ingress edges, retrieval front doors, token gateways | XDP redirect and UMEM complexity can outweigh gains for tiny RPCs |
| Path C | RDMA | Checkpointing, shard sync, repeated bulk east-west movement | Registration cost, queue-pair complexity, and fabric sensitivity |
The tri-path Linux dataplane
- Steering: RSS, XDP classifiers, flow rules, and queue ownership determine the path.
- Path A: SO_REUSEPORT, IRQ affinity, busy_poll, io_uring.
- Path B: AF_XDP, UMEM, ring ownership, redirect maps, zero-copy mode.
- Path C: RDMA, registered memory, CQ/QP setup, bulk movement with low host mediation.

The key design move is matching transport to traffic shape, not treating “fast networking” as one thing.
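A toy way to see that design move in code. Every name and threshold below is invented purely for illustration; it is not the repo’s API.

```c
/* Hypothetical classifier sketch: map a flow's traffic shape onto one
 * of the three paths. Thresholds are placeholders, not tuned values. */
#include <stddef.h>

enum dataplane_path { PATH_A_KERNEL_TCP, PATH_B_AF_XDP, PATH_C_RDMA };

struct flow_shape {
    size_t avg_msg_bytes;   /* typical message size             */
    double msgs_per_sec;    /* sustained small-message rate     */
    int    is_bulk_state;   /* checkpoint/shard/vector movement */
};

static enum dataplane_path select_path(const struct flow_shape *f)
{
    if (f->is_bulk_state)
        return PATH_C_RDMA;          /* low CPU per byte dominates   */
    if (f->avg_msg_bytes <= 512 && f->msgs_per_sec > 100000.0)
        return PATH_B_AF_XDP;        /* only if it beats Path A here */
    return PATH_A_KERNEL_TCP;        /* the tuned default            */
}
```

The important property is that the selector consumes traffic shape, not a global “fast mode” flag; the actual thresholds are exactly what the benchmarks are supposed to decide.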
That split is not just architecturally neat. It creates a benchmarkable hypothesis. For each path, the repo asks a measurable question:
- Can Path A keep mainstream agent RPC simple while pushing copy and wakeup cost down enough?
- Can Path B actually beat Path A for the hot queue classes that justify bypass-like complexity?
- Can Path C cut CPU cost for sustained bulk state traffic enough to justify its setup burden?
What the repo actually implements today
The repo is still an early lab, but it is early in the right way. It contains enough code to teach the initialization shape of each path without pretending the hard parts are already complete.
AF_XDP starter path
The current AF_XDP sample goes beyond a bind-only stub. It now shows the real conceptual lifecycle:
- UMEM allocation and locking
- fill-ring prepopulation
- RX ring polling
- frame recycling back into the fill ring
- completion-ring recycling for TX completion paths
```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#if defined(__has_include)
# if __has_include(<xdp/xsk.h>)
#  include <xdp/xsk.h>
# elif __has_include(<bpf/xsk.h>)
#  include <bpf/xsk.h>
# else
#  error "AF_XDP userspace headers not found"
# endif
#endif

static int configure_umem(struct afxdp_runtime *rt)
{
    struct xsk_umem_config cfg;
    size_t umem_size = (size_t)NUM_FRAMES * FRAME_SIZE;

    memset(&cfg, 0, sizeof(cfg));
    cfg.fill_size  = FQ_RING_SIZE;
    cfg.comp_size  = CQ_RING_SIZE;
    cfg.frame_size = FRAME_SIZE;

    /* Page-aligned backing store for the UMEM frame pool. */
    if (posix_memalign(&rt->buffer, getpagesize(), umem_size) != 0)
        return -1;

    /* Pin the UMEM so the zero-copy path never takes a page fault. */
    if (mlock(rt->buffer, umem_size) != 0) {
        free(rt->buffer);
        return -1;
    }

    return xsk_umem__create(&rt->umem, rt->buffer, umem_size,
                            &rt->fq, &rt->cq, &cfg);
}
```
This is exactly the kind of scaffold that helps people cross the gap between “I know AF_XDP exists” and “I understand the rings, frames, and ownership model enough to reason about it.”
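To make the lifecycle list above more tangible, here is roughly what the fill-ring prepopulation step looks like with the libxdp xsk helpers. This is a sketch continuing the same file and assumed runtime struct, not the repo’s exact code.

```c
/* Sketch: hand every UMEM frame to the kernel via the fill ring so RX
 * has somewhere to land packets. Assumes the afxdp_runtime from above. */
static int prepopulate_fill_ring(struct afxdp_runtime *rt)
{
    uint32_t idx = 0;

    if (xsk_ring_prod__reserve(&rt->fq, FQ_RING_SIZE, &idx) != FQ_RING_SIZE)
        return -1;                       /* ring smaller than expected */

    for (uint32_t i = 0; i < FQ_RING_SIZE; i++)
        *xsk_ring_prod__fill_addr(&rt->fq, idx + i) =
            (uint64_t)i * FRAME_SIZE;    /* frame offset into the UMEM */

    xsk_ring_prod__submit(&rt->fq, FQ_RING_SIZE);
    return 0;
}
```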
io_uring receive path
The io_uring sample is intentionally modest, but the repo made the right compatibility move: it does not blindly assume the newest helper wrappers are available. That matters because real Linux networking work is full of version mismatches between kernel capability, userspace headers, and distribution packaging.
```c
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_recv(sqe, sockfd, buf, sizeof(buf), 0);
/* Upgrade to zero-copy receive only when the headers expose the opcode;
 * on older toolchains this compiles down to the plain recv path. */
#ifdef IORING_OP_RECV_ZC
sqe->opcode = IORING_OP_RECV_ZC;
using_recv_zc = true;
#endif
ret = io_uring_submit(&ring);
ret = io_uring_wait_cqe(&ring, &cqe);
io_uring_cqe_seen(&ring, cqe);
```
That is a small but important sign of maturity: the repo is not only about architecture ideas, it is also dealing with the actual ugliness of Linux build and packaging reality.
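The snippet above is a fragment. A minimal sketch of the state it assumes, with arbitrary queue depth and buffer size chosen here for illustration, might look like this:

```c
#include <liburing.h>
#include <stdbool.h>

static struct io_uring ring;
static struct io_uring_cqe *cqe;
static char buf[2048];
static bool using_recv_zc = false;
static int ret;

/* One-time ring setup; 64 entries is an arbitrary illustrative depth. */
static int setup_ring(void)
{
    return io_uring_queue_init(64, &ring, 0);
}
```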
RDMA starter path
The verbs sample remains a starter, but it correctly centers memory registration and queue-pair setup as the real conceptual boundary.
```c
struct ibv_qp_init_attr attr = {0};

pd = ibv_alloc_pd(ctx);
cq = ibv_create_cq(ctx, 128, NULL, NULL, 0);

attr.qp_type          = IBV_QPT_RC;
attr.send_cq          = cq;   /* ibv_create_qp requires both CQs to be set */
attr.recv_cq          = cq;
attr.cap.max_send_wr  = 128;
attr.cap.max_recv_wr  = 128;
attr.cap.max_send_sge = 1;
attr.cap.max_recv_sge = 1;
qp = ibv_create_qp(pd, &attr);

mr = ibv_reg_mr(pd, buf, sizeof(buf),
                IBV_ACCESS_LOCAL_WRITE |
                IBV_ACCESS_REMOTE_READ |
                IBV_ACCESS_REMOTE_WRITE);
```
The repo is honest that the rest of the work still matters: the full RESET → INIT → RTR → RTS state machine and peer exchange are still ahead.
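As a pointer toward that remaining work, the first transition alone looks something like this. The port number and access flags are illustrative, and the RTR and RTS steps additionally need the peer’s QPN, LID or GID, and PSN from the exchange.

```c
/* Sketch of the first QP state transition (RESET -> INIT). */
static int qp_to_init(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr = {0};

    attr.qp_state        = IBV_QPS_INIT;
    attr.pkey_index      = 0;
    attr.port_num        = 1;            /* illustrative: first HCA port */
    attr.qp_access_flags = IBV_ACCESS_LOCAL_WRITE |
                           IBV_ACCESS_REMOTE_READ |
                           IBV_ACCESS_REMOTE_WRITE;

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                         IBV_QP_PORT  | IBV_QP_ACCESS_FLAGS);
}
```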
What the benchmark and results story gets right
The repository is unusually explicit about benchmark honesty.
Instead of vague “faster networking” claims, it defines workload classes that actually resemble agent traffic:
- Class A: small agent RPC and fan-out traffic
- Class B: retrieval and memory-service traffic
- Class C: bulk east-west state movement
It also emits JSON result envelopes rather than just console screenshots, and it already contains an illustrative E810 baseline artifact plus a plotting script. That is a healthy design choice, because systems work ages badly when results are not machine-readable.
Benchmark posture in the repo
Each result envelope records which transport produced it: tcp, io_uring, af_xdp, or rdma. The repo treats benchmark reproducibility as part of the system, not an afterthought.
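A sketch of what emitting such an envelope could look like. The field names are placeholders for illustration, not the repo’s actual schema.

```c
/* Sketch: serialize one benchmark result as a JSON envelope. */
#include <stddef.h>
#include <stdio.h>

static void emit_envelope(const char *transport, char wl_class,
                          size_t msg_bytes, double p99_us,
                          double cpu_ms_per_gb)
{
    printf("{\"transport\": \"%s\", \"workload_class\": \"%c\", "
           "\"msg_bytes\": %zu, \"p99_us\": %.1f, "
           "\"cpu_ms_per_gb\": %.1f}\n",
           transport, wl_class, msg_bytes, p99_us, cpu_ms_per_gb);
}
```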
The agentic NIC extension is what makes this repo more than a dataplane lab
Many repositories stop at transport optimization. This one does not. It makes a second leap: from “which Linux path should carry the bytes?” to “what if selected dataplane decisions move closer to the NIC itself, but inside a bounded safety envelope?”
That is where the repo’s newer architecture docs become genuinely differentiated.
Intent → Agent → Guardian → Dataplane → Audit
This is the repo’s strongest conceptual move: bounded autonomy rather than free-form AI inside the NIC.
That architecture matters because it addresses the obvious failure mode directly: if you say “agentic NIC” too casually, everyone reasonably imagines a control loop that can destabilize queues, violate fairness, or hide its own mistakes. The repository’s answer is the Guardian layer and the audit model.
- The Guardian is the deterministic shell that prevents unsafe dataplane actions.
- The Audit layer makes decisions inspectable and non-forgeable.
- The tenant quota model prevents one local optimization loop from becoming another tenant’s starvation event.
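A deliberately tiny sketch of what a deterministic Guardian check could look like. Every name and bound here is hypothetical; the point is only that the envelope check is plain, bounded code, not a model.

```c
/* Hypothetical Guardian gate: an agent proposes a queue-weight change,
 * and a deterministic envelope check accepts or rejects it. */
struct proposed_action {
    int      tenant_id;
    int      queue_id;
    unsigned new_weight;      /* proposed scheduling weight */
};

#define WEIGHT_MIN   1u
#define WEIGHT_MAX 256u

static int guardian_admit(const struct proposed_action *a,
                          unsigned tenant_quota_remaining)
{
    if (a->new_weight < WEIGHT_MIN || a->new_weight > WEIGHT_MAX)
        return 0;             /* outside the safety envelope          */
    if (tenant_quota_remaining == 0)
        return 0;             /* quota model: no cross-tenant starvation */
    return 1;                 /* allowed; the audit layer records it  */
}
```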
The repo’s bpftrace hooks make that oversight concrete by counting Guardian scheduler activity:

```
// Count how often the guardian daemon is woken up.
tracepoint:sched:sched_wakeup
/comm == "guardiand"/
{
    @guardian_wakeups = count();
}

// Count context switches into or out of the guardian daemon.
tracepoint:sched:sched_switch
/prev_comm == "guardiand" || next_comm == "guardiand"/
{
    @guardian_switches = count();
}
```
Even the tracing hooks matter here. The repo is no longer saying “trust the guardian.” It is moving toward “instrument the guardian.”
What still has to be proven before this idea grows up
The repo also does something that many ambitious systems projects avoid: it names its own release blockers clearly.
| Open question | Why it matters | What proof looks like |
|---|---|---|
| Is Path B really faster for tiny RPCs? | AF_XDP can lose on sub-512B messages if XDP redirect overhead dominates. | Explicit comparisons at 64B, 128B, 256B, and 512B against Path A, with p99 and CPU metrics. |
| Can the Guardian preserve tail-latency SLOs? | Safety logic that causes jitter is still a regression. | Trace-based evidence that Guardian preemption protects rather than destabilizes the hot path. |
| Who can read reasoning logs? | Audit without a threat model becomes compliance theater. | A scoped read model, an integrity story, and tenant isolation guarantees. |
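For the first row of the table, a skeleton of the size sweep might look like this. run_once is a hypothetical stub standing in for either path’s echo benchmark, so the printed numbers are all zero until it is wired up.

```c
/* Skeleton for the Path A vs Path B tiny-RPC question. The sizes come
 * straight from the table above. */
#include <stddef.h>
#include <stdio.h>

struct run_result { double p99_us; double cpu_pct; };

/* Placeholder: would run one fixed-size echo test on the named path. */
static struct run_result run_once(const char *path, size_t msg_bytes)
{
    (void)path; (void)msg_bytes;
    return (struct run_result){0};
}

int main(void)
{
    static const size_t sizes[] = { 64, 128, 256, 512 };

    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        struct run_result a = run_once("path_a_tcp",    sizes[i]);
        struct run_result b = run_once("path_b_af_xdp", sizes[i]);
        printf("%4zuB  p99: A=%.1fus B=%.1fus  cpu: A=%.1f%% B=%.1f%%\n",
               sizes[i], a.p99_us, b.p99_us, a.cpu_pct, b.cpu_pct);
    }
    return 0;
}
```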
This kind of explicit self-critique makes the project stronger, not weaker. It tells future contributors where the hard work still is, and it keeps the repo from overselling partial implementations as finished systems.
Why this repo matters
Agentic-NIC-Dataplane-Lab matters because it sits in a very useful middle layer. It is concrete enough to compile, benchmark, and extend. But it is also ambitious enough to ask what happens after transport optimization, when network policy itself becomes more local, more autonomous, and more safety-critical.
That makes it interesting to three different audiences at once:
- Linux performance engineers who want to reason about queue ownership, copies, and completion paths.
- AI infrastructure teams who need a sane way to compare kernel TCP, AF_XDP, and RDMA without cargo-culting any of them.
- Systems architects and patent thinkers who care about bounded autonomous NIC behavior, Guardian layers, reasoning logs, and multi-tenant control.