From 1,000 tokens per second to decisions per second
What Xiaomi actually achieved
Xiaomi's MiMo-V2.5-Pro-UltraSpeed blog post quietly announced something remarkable: for the first time, a 1-trillion-parameter model runs at over 1,000 tokens per second on commodity 8-GPU hardware. No exotic chips required.
Three breakthroughs converged to make this possible. First, selective FP4 quantization — only the MoE expert weights are quantized to MXFP4, while attention and other modules retain higher precision. Second, DFlash speculative decoding — a block-level masked parallel prediction scheme that fills an entire block of draft tokens in one forward pass, eliminating the serial bottleneck of standard autoregressive decoding. Third, TileRT persistent kernels — an execution model that keeps the entire compute pipeline permanently resident inside the GPU, enabling continuous prefetching that fully overlaps data movement with computation.
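To make the first of these concrete, here is a minimal pure-Python sketch of selective block quantization. It is not Xiaomi's implementation. It assumes the MXFP4 layout of 32-element blocks sharing one power-of-two scale, with elements drawn from the FP4 E2M1 value grid. The tensor names, the `.experts.` naming convention, and the scale-selection and rounding rules are simplifying assumptions made for illustration.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign handled separately).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 32  # MXFP4 shares one power-of-two scale per 32 elements


def quantize_block(block):
    """Round one block to FP4 grid values times a shared power-of-two scale."""
    peak = max(abs(x) for x in block)
    if peak == 0.0:
        return [0.0] * len(block)
    # Simplification: smallest power-of-two scale that keeps the peak <= 6.0.
    scale = 2.0 ** math.ceil(math.log2(peak / FP4_GRID[-1]))
    out = []
    for x in block:
        # Nearest grid magnitude; ties go to the smaller value (a simplification).
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out


def quantize_selectively(weights):
    """Quantize only tensors whose (hypothetical) name marks them as MoE experts."""
    result = {}
    for name, values in weights.items():
        if ".experts." in name:
            result[name] = [
                q
                for i in range(0, len(values), BLOCK)
                for q in quantize_block(values[i:i + BLOCK])
            ]
        else:
            result[name] = list(values)  # attention etc. keep their precision
    return result


weights = {
    "layer0.attn.q_proj": [0.1, -0.2, 0.3],
    "layer0.experts.3.w1": [6.0, -3.0, 1.0, 0.4, 0.0],
}
quantized = quantize_selectively(weights)
```

The attention tensor passes through unchanged, while the expert tensor is snapped to the coarse FP4 grid. Because experts hold most of the parameters in an MoE model, this is where low-bit storage saves the most memory.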
"Speed itself begins to transmute into intelligence — enabling Best-of-N sampling, tree search, and real-time agent loops that were previously computationally infeasible."
(Xiaomi MiMo team, June 2026)
That last sentence is the most important thing in the blog. It points beyond raw throughput toward a different kind of capability: not faster text, but qualitatively richer reasoning. The question we set out to answer: what exactly does that transmutation look like in practice, and can we build it?
What comes next
Thinking as AI researchers, we identified five distinct innovation trajectories that MiMo's breakthrough makes tractable — each attacking a different layer of the stack, from hardware to evaluation.
Agent layer: At 1,000 TPS, you can run N parallel agents simultaneously, let them fork competing reasoning branches, and merge via weighted consensus, all within a single interactive response window.
Reasoning layer: DFlash accelerates token prediction. The next leap is predicting entire reasoning steps: multi-step chains drafted by a small model, verified in one batched call to the 1T oracle.
Inference layer: FP4 for all MoE experts is a fixed decision. The frontier is adaptive bit-width: FP4, FP8, or BF16 chosen per-token, per-layer based on entropy signals from the routing distribution.
Memory layer: TileRT makes memory a runtime concern. The next step is making it a model design variable, with attention patterns, MoE routing, and KV-cache eviction co-designed with GPU NUMA topology.
Evaluation layer: The real metric isn't tokens per second. It's correct, verifiable, difficulty-weighted decisions per wall-clock second, a benchmark that only becomes measurable at MiMo's throughput ceiling.
These aren't independent research directions. They form a vertical stack: dynamic precision feeds the inference layer, tiered KV-cache feeds the memory layer, speculative reasoning feeds the agent layer, and D/s measures the whole thing. Build them in order and you get compounding gains.
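The weighted-consensus merge mentioned for the agent layer can be sketched in a few lines. This is an illustrative assumption about how such a merge might work, not a description of any shipped component: each agent returns an answer and a weight (for example a self-reported confidence or a verifier score), and the answer with the largest total weight wins. The function name `weighted_consensus` and the sample votes are hypothetical.

```python
from collections import defaultdict


def weighted_consensus(candidates):
    """Pick the answer with the largest total weight across agents.

    `candidates` is a list of (answer, weight) pairs, one per agent.
    Returns the winning answer and its share of the total weight.
    """
    if not candidates:
        raise ValueError("need at least one candidate answer")
    totals = defaultdict(float)
    for answer, weight in candidates:
        totals[answer] += weight
    winner = max(totals, key=totals.get)
    total = sum(totals.values())
    share = totals[winner] / total if total else 0.0
    return winner, share


# Four hypothetical agents answering "What is 12 * 8?"
votes = [("96", 0.9), ("96", 0.7), ("98", 0.8), ("96", 0.4)]
answer, share = weighted_consensus(votes)
```

The returned share is a rough agreement signal. A low share could be used to trigger another round of sampling, which is only affordable when generation is fast.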
Five components, one pipeline
We built each innovation as a standalone Python module, then wired them into a unified pipeline. Every component was generated using Codex CLI with detailed task prompts — a meta-demonstration that high-speed LLM inference enables faster software development too.
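As an illustration of what one such standalone module could look like, the adaptive bit-width idea from the inference layer fits in a short sketch. Everything here is an assumption made for illustration: the function names, the thresholds of 0.5 and 1.5 bits, and the premise that a confident (low-entropy) routing distribution tolerates lower precision.

```python
import math


def routing_entropy(probs):
    """Shannon entropy, in bits, of a router's distribution over experts."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def choose_precision(probs, low=0.5, high=1.5):
    """Map routing entropy to a bit-width label (thresholds are illustrative)."""
    h = routing_entropy(probs)
    if h < low:
        return "fp4"   # router is confident: cheapest format
    if h < high:
        return "fp8"   # moderate uncertainty
    return "bf16"      # router is unsure: keep full precision


confident = choose_precision([1.0, 0.0, 0.0, 0.0])
split = choose_precision([0.5, 0.5, 0.0, 0.0])
uniform = choose_precision([0.25, 0.25, 0.25, 0.25])
```

Whether entropy is the right signal, and in which direction it should push precision, is exactly the kind of question the ablation is meant to settle.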
The D/s formula
The core contribution is formalising a new evaluation metric. Unlike tokens per second, which measures output volume, decisions per second (D/s) measures verified, difficulty-weighted correct outputs per wall-clock second.
# Decisions per Second: the core scoring formula
def compute_ds(results, difficulties, wall_time_s):
    """Return difficulty-weighted correct decisions per wall-clock second."""
    if wall_time_s <= 0:
        raise ValueError("wall_time_s must be positive")
    weighted = 0.0
    for result, difficulty in zip(results, difficulties):
        if result.correct:
            weighted += difficulty
        elif result.confidence > 0.8:
            # False confidence is penalised harder than a plain wrong answer,
            # which scores 0: subtract 2x the task's difficulty.
            weighted -= 2.0 * difficulty
    return weighted / wall_time_s  # D/s score
This formula has three important properties. Difficulty weighting means an easy arithmetic problem contributes less than a complex coding task. The false-confidence penalty of 2× means a system that is wrong and certain is worse than one that is simply wrong. And normalising by wall time means you cannot game it by being slow and careful — speed genuinely matters.
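Written out, one reading of the metric that gives the false-confidence penalty a real effect is the following. The symbols are our own notation, introduced here as an assumption: $d_i$ is the difficulty of task $i$, $c_i \in \{0, 1\}$ marks whether the answer was correct, $p_i$ is the reported confidence, $T$ is wall-clock time in seconds, $\lambda$ is the penalty multiplier, and $\tau$ is the confidence threshold.

```latex
\[
\mathrm{D/s} \;=\; \frac{1}{T}\sum_{i=1}^{N} d_i\,\Bigl(c_i \;-\; \lambda\,(1-c_i)\,\mathbf{1}[\,p_i > \tau\,]\Bigr),
\qquad \lambda = 2,\quad \tau = 0.8
\]
```

Under this reading, a correct answer earns $d_i$, a hesitant wrong answer earns 0, and a confident wrong answer costs $2d_i$. For example, with difficulties $(1, 3, 1)$, the first two answers correct, the third confidently wrong, and $T = 4$ seconds, the score is $(1 + 3 - 2)/4 = 0.5$ D/s.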
The ablation structure
Five configurations are tested automatically — disabling one component at a time — so you can isolate each innovation's marginal contribution to D/s. The component whose removal causes the largest D/s drop is your actual paper contribution.
Our working hypothesis is that the speculative reasoner contributes most to D/s (through better answer quality), while the precision router contributes most to raw speed. The ablation will confirm or refute this, and the gap between TPS gain and D/s gain is where the interesting science lives.
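The leave-one-out structure can be sketched as a small harness. This is a hedged illustration rather than the benchmark code itself: the component names, `run_ablation`, and `toy_score` are hypothetical, and the gains in `TOY_GAINS` are invented numbers standing in for real benchmark runs, not measurements.

```python
def run_ablation(components, score_fn):
    """Score the full pipeline, then re-score with each component disabled."""
    baseline = score_fn(frozenset(components))
    drops = {}
    for name in components:
        enabled = frozenset(c for c in components if c != name)
        drops[name] = baseline - score_fn(enabled)
    return baseline, drops


# Toy stand-in for a benchmark run: each enabled component adds a made-up
# amount of D/s. Replace toy_score with a call into the real harness.
TOY_GAINS = {
    "precision_router": 0.2,
    "kv_tiering": 0.1,
    "spec_reasoner": 0.5,
    "agent_swarm": 0.3,
}


def toy_score(enabled):
    return 1.0 + sum(TOY_GAINS[c] for c in enabled)


baseline, drops = run_ablation(list(TOY_GAINS), toy_score)
largest = max(drops, key=drops.get)  # component whose removal hurts most
```

With four toy components this produces five runs in total: one full configuration and four ablated ones. Real components interact, so the drops will generally not sum to the total gain the way they do in this additive toy.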
Everything is open
The full implementation — all five components, the unified pipeline, and the benchmark harness — is structured for immediate use. Each module is independently importable and testable. The stack requires no proprietary hardware, consistent with MiMo's philosophy of running frontier models on commodity nodes.
[Figure: a suggested repository layout for a five-component AI pipeline built on MiMo-V2.5-Pro, from dynamic precision routing to D/s evaluation.]
Five-command quickstart
# 1. Clone and install
git clone <your-repo-url>
cd mimo-pipeline && pip install -r requirements.txt
# 2. Configure
cp .env.example .env
# → set MIMO_API_KEY and MIMO_BASE_URL
# 3. Quick smoke test
python unified-pipeline/run_pipeline.py --task "What is 12 * 8?" --domain math
# 4. Run full D/s benchmark
python unified-pipeline/run_benchmark.py --n 30 --domains math coding triage
# 5. Compare two runs
python unified-pipeline/compare_benchmarks.py results_v1.json results_v2.json
The path to publication
This codebase is a research prototype. With the benchmark harness running, you have everything needed for a NeurIPS systems track or MLSys submission. The roadmap runs in four phases.
The gap between tokens per second and decisions per second is where the next decade of AI research lives. MiMo showed us the ceiling on speed. Now we find out what intelligence costs above it.