Research · Systems · Implementation · June 2026

From 1,000 tokens per second to decisions per second

Built on Xiaomi MiMo-V2.5-Pro · 5 frontier innovations · Full implementation in Python

01 — The MiMo Moment

What Xiaomi actually achieved

Xiaomi's MiMo-V2.5-Pro-UltraSpeed blog post quietly announced something remarkable: for the first time, a 1-trillion-parameter model runs at over 1,000 tokens per second on commodity 8-GPU hardware. No exotic chips required.

1T parameters · 1000+ tokens / sec · 6.3 accepted tokens (DFlash) · 8 commodity GPUs

Three breakthroughs converged to make this possible. First, selective FP4 quantization: only the MoE expert weights are quantized to MXFP4, while attention and other modules retain higher precision. Second, DFlash speculative decoding: a block-level masked parallel prediction scheme that fills an entire block of draft tokens in one forward pass, which removes much of the serial bottleneck of standard autoregressive decoding. Third, TileRT persistent kernels: an execution model that keeps the entire compute pipeline permanently resident inside the GPU, enabling continuous prefetching that overlaps data movement with computation.
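To put the DFlash figure in context, here is a back-of-envelope calculation. It assumes the 6.3 accepted tokens quoted above is an average per verification pass of the full model, which the summary here does not spell out, and it ignores drafting overhead. The function names are ours, chosen for illustration.

```python
# Back-of-envelope: how often must the large model run a verification pass?
TARGET_TOKENS_PER_S = 1000.0  # headline throughput quoted above
ACCEPTED_PER_PASS = 6.3       # ASSUMPTION: average tokens accepted per verification pass


def passes_per_second(tokens_per_s: float, accepted_per_pass: float) -> float:
    """Verification passes needed each second to sustain the given throughput."""
    return tokens_per_s / accepted_per_pass


def budget_ms_per_pass(tokens_per_s: float, accepted_per_pass: float) -> float:
    """Wall-clock budget for one verification pass, in milliseconds."""
    return 1000.0 / passes_per_second(tokens_per_s, accepted_per_pass)


# With speculation: about 159 passes per second, roughly 6.3 ms each.
print(f"{passes_per_second(TARGET_TOKENS_PER_S, ACCEPTED_PER_PASS):.1f} passes/s")
print(f"{budget_ms_per_pass(TARGET_TOKENS_PER_S, ACCEPTED_PER_PASS):.2f} ms per pass")

# Without speculation (one token per pass): 1000 passes per second, 1 ms each.
print(f"{budget_ms_per_pass(TARGET_TOKENS_PER_S, 1.0):.2f} ms per pass")
```

Under these assumptions, speculation relaxes the per-pass latency budget by the same factor as the acceptance length.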

"Speed itself begins to transmute into intelligence — enabling Best-of-N sampling, tree search, and real-time agent loops that were previously computationally infeasible."

— Xiaomi MiMo team, June 2026

That sentence is the most important thing in the blog. It points beyond raw throughput toward a different kind of capability: not faster text, but qualitatively richer reasoning. The question we set out to answer is what that transmutation looks like in practice, and whether we can build it.

02 — The Five Frontiers

What comes next

Thinking as AI researchers, we identified five distinct innovation trajectories that MiMo's breakthrough makes tractable — each attacking a different layer of the stack, from hardware to evaluation.

Innovation 01

Real-time multi-agent orchestration

At 1000 TPS, you can run N agents in parallel, let them fork competing reasoning branches, and merge the results by weighted consensus, all within a single interactive response window.

Agent layer
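The fork-and-merge pattern can be sketched with `asyncio`. This is a minimal illustration, not the orchestrator's actual code: `run_agent` is a hypothetical stand-in for a model call, its canned answers are invented, and weighting each vote by self-reported confidence is one assumed form of "weighted consensus".

```python
import asyncio
from collections import defaultdict
from dataclasses import dataclass
from typing import List


@dataclass
class AgentAnswer:
    answer: str
    confidence: float  # self-reported, in [0, 1]


async def run_agent(agent_id: int, task: str) -> AgentAnswer:
    """Hypothetical stand-in for one model call."""
    await asyncio.sleep(0)  # yield control, as a network call would
    # Canned behaviour for the demo: agent 2 disagrees, with low confidence.
    if agent_id == 2:
        return AgentAnswer("94", 0.4)
    return AgentAnswer("96", 0.9)


def weighted_consensus(answers: List[AgentAnswer]) -> str:
    """Return the answer with the largest summed confidence."""
    totals = defaultdict(float)
    for item in answers:
        totals[item.answer] += item.confidence
    return max(totals, key=totals.get)


async def orchestrate(task: str, n_agents: int) -> str:
    # All agents run concurrently; gather returns results in launch order.
    answers = await asyncio.gather(*(run_agent(i, task) for i in range(n_agents)))
    return weighted_consensus(list(answers))


result = asyncio.run(orchestrate("What is 12 * 8?", n_agents=4))
print(result)  # 96
```

Summing confidence rather than counting votes lets one well-calibrated agent outweigh several uncertain ones, at the cost of trusting self-reported confidence.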

Innovation 02

Speculative reasoning

DFlash accelerates token prediction. The next leap is predicting entire reasoning steps — multi-step chains drafted by a small model, verified in one batched call to the 1T oracle.

Reasoning layer
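The draft-then-verify loop might look like the sketch below. Both model calls are hypothetical stubs: `draft_chains` returns hard-coded candidate chains and `verify_batch` is a toy scorer that checks only the final step, standing in for a single batched call to the large model. The acceptance threshold is also an assumption.

```python
from typing import List, Optional

Chain = List[str]


def draft_chains(task: str, n: int) -> List[Chain]:
    """Hypothetical small-model drafter: returns up to n candidate chains."""
    candidates = [
        ["12 * 8 = 10 * 8 + 2 * 8", "= 80 + 18", "= 98"],  # contains a slip
        ["12 * 8 = 12 * 10 - 12 * 2", "= 120 - 24", "= 96"],
    ]
    return candidates[:n]


def verify_batch(task: str, chains: List[Chain]) -> List[float]:
    """Hypothetical oracle: scores every chain in one batched call.

    Toy rule for the demo: a chain scores 1.0 if its last step is "= 96".
    """
    return [1.0 if chain[-1] == "= 96" else 0.0 for chain in chains]


def speculative_reason(task: str, n_drafts: int = 2,
                       threshold: float = 0.5) -> Optional[Chain]:
    """Draft several chains cheaply, verify them together, keep the best."""
    chains = draft_chains(task, n_drafts)
    if not chains:
        return None
    scores = verify_batch(task, chains)  # one call covers all drafts
    best_score, best_chain = max(zip(scores, chains), key=lambda pair: pair[0])
    # Reject everything if even the best draft fails verification.
    return best_chain if best_score >= threshold else None


chain = speculative_reason("What is 12 * 8?")
print(chain[-1] if chain else "no draft accepted")
```

When no draft clears the threshold, a real system would presumably fall back to having the large model reason from scratch; the sketch simply returns `None`.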

Innovation 03

Dynamic precision routing

FP4 for all MoE experts is a fixed decision. The frontier is adaptive bit-width — FP4, FP8, or BF16 chosen per-token, per-layer based on entropy signals from the routing distribution.

Inference layer
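A minimal sketch of an entropy-based selector follows. The thresholds are illustrative assumptions, and so is the direction of the mapping: here a confident router (low entropy) is taken as a signal that lower precision is safe, while an uncertain one gets more bits. Neither choice comes from the MiMo post.

```python
import math
from typing import Sequence


def routing_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy, in bits, of a router's distribution over experts."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def choose_precision(probs: Sequence[float],
                     low: float = 0.5, high: float = 1.5) -> str:
    """Map routing entropy to a bit-width (thresholds are assumptions)."""
    entropy = routing_entropy(probs)
    if entropy < low:
        return "FP4"   # router is confident
    if entropy < high:
        return "FP8"   # moderate uncertainty
    return "BF16"      # router is close to uniform


print(choose_precision([0.97, 0.01, 0.01, 0.01]))  # FP4
print(choose_precision([0.70, 0.10, 0.10, 0.10]))  # FP8
print(choose_precision([0.25, 0.25, 0.25, 0.25]))  # BF16
```

In practice the thresholds would need to be calibrated per layer against measured accuracy loss rather than fixed globally.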

Innovation 04

Memory hierarchy codesign

TileRT makes memory a runtime concern. The next step is making it a model design variable — attention patterns, MoE routing, and KV-cache eviction co-designed with GPU NUMA topology.

Memory layer
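As a toy model of the tiering idea, the sketch below spills entries from a small "HBM" tier to "DRAM" and then to an unbounded "SSD" tier. The least-recently-used eviction policy, the capacities counted in entries, and the class name are all assumptions for illustration; a real cache would move tensor blocks and account for bytes.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy three-tier cache: entries spill HBM -> DRAM -> SSD in LRU order."""

    TIERS = ("hbm", "dram", "ssd")

    def __init__(self, hbm_capacity: int, dram_capacity: int):
        # The SSD tier is treated as unbounded (capacity None).
        self.capacity = {"hbm": hbm_capacity, "dram": dram_capacity, "ssd": None}
        self.tiers = {name: OrderedDict() for name in self.TIERS}

    def _insert(self, tier: str, key, value) -> None:
        store = self.tiers[tier]
        store[key] = value
        limit = self.capacity[tier]
        if limit is not None and len(store) > limit:
            # Demote the least recently used entry to the next tier down.
            old_key, old_value = store.popitem(last=False)
            next_tier = self.TIERS[self.TIERS.index(tier) + 1]
            self._insert(next_tier, old_key, old_value)

    def put(self, key, value) -> None:
        for store in self.tiers.values():
            store.pop(key, None)  # drop any stale copy
        self._insert("hbm", key, value)

    def get(self, key):
        for tier in self.TIERS:
            if key in self.tiers[tier]:
                value = self.tiers[tier].pop(key)
                self._insert("hbm", key, value)  # promote on access
                return value
        return None

    def tier_of(self, key):
        for tier in self.TIERS:
            if key in self.tiers[tier]:
                return tier
        return None


cache = TieredKVCache(hbm_capacity=2, dram_capacity=2)
for i in range(5):
    cache.put(f"seq{i}", f"kv-block-{i}")
print(cache.tier_of("seq0"), cache.tier_of("seq1"), cache.tier_of("seq4"))  # ssd dram hbm
```

Reading `seq0` afterwards promotes it back to the top tier and pushes the coldest entries one level down.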

Innovation 05

Decisions per second

The real metric isn't tokens per second. It's correct, verifiable, difficulty-weighted decisions per wall-clock second — a benchmark that only becomes measurable at MiMo's throughput ceiling.

Evaluation layer

Key insight

These aren't independent research directions. They form a vertical stack: dynamic precision feeds the inference layer, tiered KV-cache feeds the memory layer, speculative reasoning feeds the agent layer, and D/s measures the whole thing. Build them in order and the gains should compound.

03 — The Implementation

Five components, one pipeline

We built each innovation as a standalone Python module, then wired them into a unified pipeline. Every component was generated using Codex CLI with detailed task prompts — a meta-demonstration that high-speed LLM inference enables faster software development too.

Entry

Task input

math · coding · triage

↓

Innovation 3 · precision-router/

Dynamic precision router

FP4 · FP8 · BF16 per token via entropy signal

↓

Innovation 4 · tiered-kv-cache/

3-tier KV-cache

HBM → CPU DRAM → NVMe SSD

↓

Innovation 2 · spec-reasoner/

Speculative reasoner

Draft → batch verify → select chain

↓

Innovation 1 · multi-agent-orchestrator/

Multi-agent orchestrator

N parallel agents → consensus vote

↓

Innovation 5 · ds-benchmark/

D/s benchmark harness

Verify · weight by difficulty · score · compare

The D/s formula

The core contribution is formalising a new evaluation metric. Unlike tokens per second — which measures output volume — Decisions per Second measures verified, difficulty-weighted correct outputs per wall-clock second.

# Decisions per Second: the core scoring formula

def compute_ds(results, difficulties, wall_time_s):
    """Return difficulty-weighted correct decisions per wall-clock second.

    Each result needs a boolean `correct` and a float `confidence` in [0, 1].
    """
    if wall_time_s <= 0:
        raise ValueError("wall_time_s must be positive")

    weighted = 0.0
    for result, difficulty in zip(results, difficulties):
        if result.correct:
            weighted += difficulty
        elif result.confidence > 0.8:
            # False confidence is penalised harder than a plain wrong answer
            # (which scores zero): subtract 2x the task's difficulty.
            weighted -= 2.0 * difficulty

    return weighted / wall_time_s  # D/s score

This formula has three important properties. Difficulty weighting means an easy arithmetic problem contributes less than a complex coding task. The 2× false-confidence penalty means a system that is wrong and certain (confidence above 0.8) scores worse than one that is simply wrong: a plain wrong answer contributes zero, while a confident wrong answer subtracts twice the task's difficulty. And normalising by wall time means you cannot game it by being slow and careful; speed genuinely matters.

The ablation structure

Five configurations are tested automatically — disabling one component at a time — so you can isolate each innovation's marginal contribution to D/s. The component whose removal causes the largest D/s drop is your actual paper contribution.
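Reading the ablation comes down to ranking components by the D/s drop their removal causes. The helper below is a hypothetical sketch of that step, not the harness's own code, and the scores are placeholder numbers for illustration only, not measured results.

```python
from typing import Dict, List, Tuple


def component_impact(full_ds: float,
                     ablated_ds: Dict[str, float]) -> List[Tuple[str, float]]:
    """Rank components by how far D/s falls when each one is disabled."""
    drops = {name: full_ds - score for name, score in ablated_ds.items()}
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)


# PLACEHOLDER values, not benchmark output.
full_pipeline_ds = 2.0
ds_with_component_disabled = {
    "precision-router": 1.5,
    "tiered-kv-cache": 1.75,
    "spec-reasoner": 1.0,
    "multi-agent-orchestrator": 1.25,
}

ranking = component_impact(full_pipeline_ds, ds_with_component_disabled)
for name, drop in ranking:
    print(f"{name:26s} D/s drop = {drop:.2f}")
```

A negative drop would mean the pipeline scored higher without that component, which is itself a finding worth reporting.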

Hypothesis going in

The speculative reasoner contributes most to D/s (better answer quality), while the precision router contributes most to raw speed. The ablation will confirm or refute this — and the gap between TPS gain and D/s gain is where the interesting science lives.

04 — The Repository

Everything is open

The full implementation — all five components, the unified pipeline, and the benchmark harness — is structured for immediate use. Each module is independently importable and testable. The stack requires no proprietary hardware, consistent with MiMo's philosophy of running frontier models on commodity nodes.

mimo-pipeline

A suggested repository layout for a five-component AI pipeline built on MiMo-V2.5-Pro, from dynamic precision routing to D/s evaluation.

unified-pipeline/
multi-agent-orchestrator/
spec-reasoner/
precision-router/
tiered-kv-cache/
ds-benchmark/
benchmark_suite.py
compare_runs.py

Five-step quickstart

# 1. Clone and install
git clone <your-repo-url>
cd mimo-pipeline && pip install -r requirements.txt

# 2. Configure
cp .env.example .env
# → set MIMO_API_KEY and MIMO_BASE_URL

# 3. Quick smoke test
python unified-pipeline/run_pipeline.py --task "What is 12 * 8?" --domain math

# 4. Run full D/s benchmark
python unified-pipeline/run_benchmark.py --n 30 --domains math coding triage

# 5. Compare two runs
python unified-pipeline/compare_benchmarks.py results_v1.json results_v2.json
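For step 2, a fail-fast settings loader is a reasonable pattern. The sketch below reads the two variables named above from the process environment; `MimoSettings` and `load_settings` are hypothetical names, and how the repository actually loads `.env` is not stated here, so this assumes the variables have already been exported.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class MimoSettings:
    api_key: str
    base_url: str


def load_settings(env: Mapping[str, str] = os.environ) -> MimoSettings:
    """Read the required settings, failing early with a clear message."""
    required = ("MIMO_API_KEY", "MIMO_BASE_URL")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return MimoSettings(
        api_key=env["MIMO_API_KEY"],
        base_url=env["MIMO_BASE_URL"].rstrip("/"),  # normalise trailing slash
    )


# Demo with an explicit mapping; the values are placeholders.
settings = load_settings({
    "MIMO_API_KEY": "placeholder-key",
    "MIMO_BASE_URL": "https://example.invalid/v1/",
})
print(settings.base_url)  # https://example.invalid/v1
```

Failing at startup avoids discovering a missing key partway through a long benchmark run.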

05 — What's Next

The path to publication

This codebase is a research prototype. With the benchmark harness running, you have everything needed for a NeurIPS systems track or MLSys submission. The roadmap runs in four phases.

Now · this week

Read and interpret the ablation results

The component impact table tells you which innovation drives D/s most. That single finding shapes your paper's contribution statement. If D/s is below 1.0, identify the bottleneck before moving on.

Soon · 2 weeks

Harden and scale

Add resilience (retries, circuit breakers), observability (Prometheus + Grafana), containerisation (single docker-compose up), and a clean REST API with SSE streaming. Make it shareable.
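As one example of the resilience work, here is a minimal retry helper with exponential backoff. The name `call_with_retries`, its defaults, and the choice to retry on any exception are assumptions for illustration; a production version would likely retry only transient errors and add jitter.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(fn: Callable[[], T], attempts: int = 3,
                      base_delay_s: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on failure with delays of base, 2*base, 4*base, ..."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay_s * 2 ** attempt)
    raise AssertionError("unreachable")


# Demo: a call that fails twice, then succeeds. Delays are recorded, not slept.
recorded_delays = []
state = {"calls": 0}


def flaky_call() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


print(call_with_retries(flaky_call, sleep=recorded_delays.append))  # ok
print(recorded_delays)  # [0.5, 1.0]
```

Injecting `sleep` keeps the helper testable without real waiting.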

Month 2

Pick a domain and go deep

Specialise the pipeline for coding (SWE-bench), finance (trade signal generation), or medical triage (ICD-10 verified decisions). A domain-specific result is a stronger paper than a generic one.

Month 3

Submit and open-source

Draft the NeurIPS systems track abstract — you have a novel benchmark (D/s), a novel pipeline (5-component codesign), and ablation results. Package as a PyPI library. Publish the blog post.

The gap between tokens per second and decisions per second is where the next decade of AI research lives. MiMo showed us the ceiling on speed. Now we find out what intelligence costs above it.