How modern AI training efficiency is measured, what low-precision formats actually mean at the bit level, why Hopper changed the economics of large-model training, and how to read utilisation claims without being fooled by the denominator. Now with diagrams, an interactive MFU calculator, and Blackwell FP4.
At scale, even a few percentage points of training efficiency can move total cost by millions of dollars. A 1000-GPU cluster running at 55% MFU instead of 40% represents a 37.5% increase in effective capacity (0.55 / 0.40 = 1.375), or equivalently needs about 27% less hardware to train the same model in the same time.
That is why utilisation metrics attract intense scrutiny. But the metrics are frequently abused, misreported, or simply computed differently by different teams. The confusion usually comes from three layers that get collapsed into one number:
1. Model FLOPs: the theoretical FLOPs the model must compute, dominated by GEMMs in attention and FFN layers. This is fixed by the architecture.
2. Hardware peak: the vendor-advertised peak TFLOPS at a specific precision. Different for FP32, BF16, FP8 and FP4, often 2–4× apart.
3. Achieved throughput: what the full training loop actually achieves, including communication, memory movement, overhead, and idle time.
BF16 is the robust 16-bit workhorse. FP8 is the aggressive low-precision path Hopper made practical. MFU only means something when you know exactly which FLOPs are in the numerator and what precision peak is in the denominator.
Model FLOPs Utilisation (MFU) was defined clearly in the PaLM paper as the ratio between observed throughput and the theoretical throughput you would achieve if the hardware ran at peak the entire time:
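In symbols (a restatement of that definition, with "model FLOPs per token" being the approximation discussed just below):

$$\text{MFU} \;=\; \frac{\text{achieved tokens/s} \times \text{model FLOPs per token}}{N_{\text{GPUs}} \times \text{peak FLOPS per GPU at the matching precision}}$$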
This is not the same as "GPU utilisation" as reported by nvidia-smi. A GPU can show 98% utilisation while doing mostly memory traffic or small inefficient kernels — neither of which appears in the numerator of MFU. High GPU activity ≠ high MFU.
For a transformer, the dominant cost is the matrix multiplications. A useful approximation: for a model with P parameters, one forward pass costs roughly 2P FLOPs per token, and a full forward+backward pass costs roughly 6P FLOPs per token (Chinchilla convention includes the backward at ~2× forward).
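As a quick worked instance of that rule, using a hypothetical dense 70B-parameter model:

$$6P = 6 \times 70 \times 10^{9} \approx 4.2 \times 10^{11}\ \text{FLOPs per trained token}$$

so at 1,000 tokens/s per GPU the model FLOPs alone come to roughly 420 TFLOPS per GPU.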
Every floating-point format divides its bits between three fields: a sign bit, exponent bits (determining numeric range), and mantissa bits (determining precision within a range). The tradeoffs between formats are entirely about how those bits are allocated.
The exponent determines the numeric range — how large or small a number can be before it overflows to infinity or underflows to zero. An 8-bit exponent (like FP32 and BF16) can represent values from roughly 10⁻³⁸ to 10³⁸. A 5-bit exponent (like FP16) caps out around ±65,504 — which is why FP16 gradients frequently overflow during large-model training.
The mantissa determines precision — how finely you can distinguish between two nearby numbers. FP32's 23 mantissa bits give about 7 significant decimal digits. BF16's 7 bits give roughly 2–3 significant digits. FP8 E4M3's 3 bits give 8 distinct magnitude levels per exponent interval. FP4's single mantissa bit is essentially: this number is either in the lower or upper half of its exponent bucket — nothing more.
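To make those range numbers concrete, here is a small Python sketch that computes the largest finite value of an IEEE-style format from its field widths. The function name is mine, and FP8 E4M3 is deliberately left out because it bends the IEEE convention (it reuses the top exponent code and sacrifices one mantissa code to NaN, landing at 448 rather than 480).

```python
def max_finite(exp_bits: int, man_bits: int, ieee_inf_nan: bool = True) -> float:
    """Largest finite value of a binary float with the given exponent/mantissa widths.

    Assumes an IEEE-754-style layout: bias = 2**(exp_bits-1) - 1 and, by default,
    the all-ones exponent code reserved for inf/NaN. FP4 E2M1 keeps that code for
    normal values, so it is called with ieee_inf_nan=False below.
    """
    bias = 2 ** (exp_bits - 1) - 1
    top_code = 2 ** exp_bits - (2 if ieee_inf_nan else 1)
    max_exponent = top_code - bias
    max_mantissa = 2 - 2 ** (-man_bits)            # 1.111...1 in binary
    return max_mantissa * 2.0 ** max_exponent

print(max_finite(8, 23))                     # FP32      -> ~3.4e38
print(max_finite(8, 7))                      # BF16      -> ~3.4e38 (same range, less precision)
print(max_finite(5, 10))                     # FP16      -> 65504.0
print(max_finite(5, 2))                      # FP8 E5M2  -> 57344.0
print(max_finite(2, 1, ieee_inf_nan=False))  # FP4 E2M1  -> 6.0
```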
BF16 (bfloat16) was designed at Google Brain specifically for machine learning. The key design choice: keep FP32's 8-bit exponent, halve the mantissa to 7 bits. You give up precision; you keep range. For training, range matters more.
Gradient magnitudes during training can vary wildly. FP16's narrow exponent (±65,504 max) means gradients frequently overflow to infinity — requiring careful loss scaling and constant vigilance. BF16 eliminates this almost entirely. Most large model training stacks dropped FP16 in favour of BF16 around 2020–2021.
BF16's 7-bit mantissa means you can only distinguish about 2–3 significant decimal digits. For weights and activations in large models, this is usually fine — neural networks are surprisingly tolerant of numerical noise. The exceptions: accumulations in matrix multiplies (always done in FP32) and optimizer state (usually kept in FP32 or FP64).
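A concrete way to see the "keep the exponent, halve the mantissa" choice: a BF16 value is simply the top 16 bits of the corresponding FP32 bit pattern. A minimal numpy sketch (the function names are mine, and it truncates rather than doing the round-to-nearest-even that real hardware uses):

```python
import numpy as np

def fp32_to_bf16_bits(x) -> np.ndarray:
    """Keep only the top 16 bits of each FP32 value: sign + 8 exponent + 7 mantissa bits."""
    bits32 = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits32 >> 16).astype(np.uint16)

def bf16_bits_to_fp32(bits16: np.ndarray) -> np.ndarray:
    """Widen BF16 bit patterns back to FP32 by zero-filling the 16 dropped mantissa bits."""
    return (bits16.astype(np.uint32) << 16).view(np.float32)

x = np.array([3.14159265, 1e30, -2.7e-20], dtype=np.float32)
print(bf16_bits_to_fp32(fp32_to_bf16_bits(x)))
# The huge and tiny values survive intact in range; pi keeps only ~2-3 significant digits.
```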
In a typical BF16 training recipe, the precision assignments look like this:
```
// Standard BF16 mixed-precision training
weights:         BF16   // stored and computed in 16-bit
activations:     BF16   // forward pass outputs
GEMM compute:    BF16   // tensor core input
GEMM accumulate: FP32   // tensor core internal accumulator
gradients:       BF16   // backward pass
optimizer state: FP32   // Adam m, v — must be high precision
master weights:  FP32   // copied down to BF16 each step
```
The "mixed" in mixed-precision means these different tensors live in different formats simultaneously. The GEMM accumulates in FP32 internally even when inputs are BF16 — this is a hardware feature of Tensor Cores, not a software choice.
FP8 is not one format — it is two, with different tradeoffs between range and precision. The FP8 Formats for Deep Learning paper introduced both:
E4M3: 4 exponent bits, 3 mantissa bits. Range: roughly ±448. Used for forward pass activations and weights, where you want finer numerical resolution within a moderate range.
E5M2: 5 exponent bits, 2 mantissa bits. Range: roughly ±57,344. Used for backward pass gradients, which can have larger magnitudes and need the wider range more than they need precision.
FP8 has so few representable values (~256 total per variant) that without careful management, most tensors would be quantised to a handful of magnitude levels and lose all useful information. The solution is per-tensor scaling factors: before casting to FP8, you rescale the tensor so its values span the representable range efficiently, then store the inverse scale to dequantise later.
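A simplified numpy sketch of that bookkeeping (the names are mine; only the scaling and clipping are simulated here, since the actual snap onto the 8-bit E4M3 grid happens inside the FP8 GEMM kernel in a real stack such as Transformer Engine):

```python
import numpy as np

E4M3_MAX = 448.0   # largest finite FP8 E4M3 value

def scale_for_fp8(x: np.ndarray):
    """Per-tensor scaling: map the tensor's largest magnitude onto the FP8 range.

    Returns the scaled, clipped tensor (ready to be cast to E4M3) and the
    inverse scale needed to dequantise results later.
    """
    amax = np.abs(x).max()
    scale = E4M3_MAX / max(float(amax), 1e-12)     # guard against all-zero tensors
    x_scaled = np.clip(x * scale, -E4M3_MAX, E4M3_MAX)
    return x_scaled, 1.0 / scale

def dequantise(x_scaled: np.ndarray, inv_scale: float) -> np.ndarray:
    return x_scaled * inv_scale

w = (np.random.randn(1024, 1024) * 0.02).astype(np.float32)
w_scaled, inv_scale = scale_for_fp8(w)
w_roundtrip = dequantise(w_scaled, inv_scale)   # ≈ w, up to the FP8 rounding not simulated here
```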
FP8 Tensor Core throughput on H100 is 2× BF16 throughput. But FP8 only applies to the compute-bound operations — primarily GEMMs. Everything else (softmax, layernorm, residual adds, attention score masking) stays in BF16 or FP32. In practice, the wall-clock training speedup from FP8 vs BF16 is typically 1.3–1.5× end-to-end, not 2×, because GEMMs are not the only thing consuming time.
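That gap between 2× kernel throughput and ~1.4× wall-clock is just Amdahl's law. Assuming, purely for illustration, that GEMMs account for 70% of BF16 step time:

$$\text{speedup} = \frac{1}{(1 - 0.7) + \frac{0.7}{2}} = \frac{1}{0.65} \approx 1.5\times$$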
```
// FP8 mixed-precision training stack (Transformer Engine)
weight storage:      BF16       // master copy
weight for GEMM:     FP8 E4M3   // cast at kernel boundary + scale
activation storage:  BF16
activation for GEMM: FP8 E4M3
gradient for GEMM:   FP8 E5M2   // wider range for backward
GEMM accumulation:   FP32       // hardware accumulator, always
optimizer state:     FP32       // unchanged
all-reduce comm:     BF16       // FP8 all-reduce still uncommon
```
NVIDIA's Blackwell architecture (GB200, B100, B200) introduced native FP4 Tensor Core support. This is genuinely new territory — FP4 has only 16 bit patterns in total, which means it's operating at the extreme edge of what "floating-point" meaningfully means.
1 sign + 2 exponent + 1 mantissa bit. Representable values: 0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6. Yes, that's the complete list. Maximum representable value: 6.
FP4 Tensor Core throughput on Blackwell is roughly 2× FP8, 4× BF16. B200 peak: ~9 PFLOPS dense FP4. This makes FP4-native inference kernels dramatically faster — when accuracy tolerates it.
The more interesting Blackwell story for training is MX formats (Microscaling) — a block-quantisation scheme where groups of 32 elements share a single 8-bit scale factor. This allows FP4 mantissas to be used effectively because the shared scale adapts to local magnitude variation within the block.
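A rough numpy sketch of the microscaling idea, under assumptions of my own choosing (block size 32, a power-of-two shared scale, round-to-nearest onto the E2M1 magnitude grid; real MX hardware fixes details like rounding and saturation that are glossed over here):

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
FP4_EMAX = 2                                                    # floor(log2(6))

def mx_quantise_block(block: np.ndarray):
    """Quantise one block of 32 values: one shared power-of-two scale + FP4 elements."""
    amax = np.abs(block).max()
    # Shared scale chosen so the block's largest magnitude falls inside the FP4 range.
    scale = 2.0 ** (np.floor(np.log2(max(float(amax), 1e-12))) - FP4_EMAX)
    scaled = block / scale
    # Snap each scaled magnitude to the nearest representable FP4 magnitude.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    dequantised = np.sign(scaled) * FP4_GRID[idx] * scale
    return dequantised, scale

x = np.random.randn(32).astype(np.float32)
x_hat, shared_scale = mx_quantise_block(x)
print("max abs error in this block:", np.abs(x - x_hat).max())
```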
Here is what a single training step actually looks like when you trace data through the precision stack. This is the picture that "trains in FP8" abstracts over.
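A schematic version of the same trace, written as annotated Python (the single linear layer and all numbers are made up; the arithmetic runs in FP32 numpy, and the comments mark the precision each tensor would actually use on H100 with an FP8 recipe):

```python
import numpy as np

# One linear layer, one step. The comments mark the precision each tensor would
# use in a real FP8 recipe; the numpy math itself is all FP32 and purely schematic.

W = (np.random.randn(512, 512) * 0.02).astype(np.float32)   # FP32 master copy (BF16 working copy)
x = np.random.randn(32, 512).astype(np.float32)             # activation arrives in BF16
m, v = np.zeros_like(W), np.zeros_like(W)                    # Adam state, FP32

# --- forward ---
# W and x are cast (with per-tensor scales) to FP8 E4M3 at the kernel boundary.
y = x @ W.T                     # GEMM: FP8 E4M3 inputs, FP32 Tensor Core accumulate
# y is written out as BF16; softmax / layernorm / residuals downstream stay in BF16/FP32.

# --- backward ---
dy = (y - np.random.randn(*y.shape).astype(np.float32)) / y.size   # stand-in loss gradient, BF16
dW = dy.T @ x                   # gradient GEMM: FP8 E5M2 inputs, FP32 accumulate
# dW is all-reduced across data-parallel ranks in BF16.

# --- optimizer step (all FP32; Adam bias correction omitted for brevity) ---
lr, beta1, beta2, eps = 3e-4, 0.9, 0.999, 1e-8
m = beta1 * m + (1 - beta1) * dW
v = beta2 * v + (1 - beta2) * dW ** 2
W -= lr * m / (np.sqrt(v) + eps)   # update FP32 master, recast to BF16 for the next step
```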
The advertised TFLOPS numbers are the denominator of MFU. Getting them wrong by even one precision level produces MFU figures that are not comparable to anyone else's.
Starting from 100% theoretical peak, here is how a real training system loses efficiency at each layer. These are representative numbers for a large transformer on H100s at 100B+ parameter scale.
This is why experts immediately challenge numbers in the high 70s or above for end-to-end training. Hitting 90% would mean losing only 10% total to all communication, memory movement, non-Tensor-Core operations, overhead, and scheduling combined. That is not consistent with any published large-scale training system.
This is the single most common source of inflated MFU claims. Hopper FP8 peak is roughly 2× BF16 peak. If a workload runs its heavy GEMMs in FP8 but MFU is computed against the BF16 denominator, the number exceeds 100% — which is physically impossible and entirely meaningless.
// The "150% MFU" scenario H100 BF16 peak = 989 TFLOPS H100 FP8 peak = 1979 TFLOPS Actual FP8-heavy useful throughput = 1500 TFLOPS // Correct: FP8-relative MFU 1500 / 1979 = 75.8% ← honest // Wrong: BF16-relative MFU 1500 / 989 = 151.7% ← nobody beat physics
Every time you see an MFU headline, ask two questions: which FLOPs are being counted in the numerator, and which precision's peak is sitting in the denominator.
Plug in your own numbers. The calculator computes MFU correctly matched to the right denominator, and flags common mismatches.
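A script equivalent of the same check, using the 6P FLOPs-per-token approximation from earlier; the model size, cluster size and throughput below are placeholders to replace with your own numbers:

```python
def mfu(params: float, tokens_per_second: float, num_gpus: int,
        peak_tflops_per_gpu: float) -> float:
    """Model FLOPs Utilisation, PaLM-style, using the 6P FLOPs-per-token rule.

    peak_tflops_per_gpu must match the precision the heavy GEMMs actually run in,
    otherwise the result is not comparable to anyone else's.
    """
    model_flops_per_second = 6 * params * tokens_per_second
    peak_flops_per_second = num_gpus * peak_tflops_per_gpu * 1e12
    return model_flops_per_second / peak_flops_per_second

# Placeholder scenario: a hypothetical dense 70B model on 1,024 H100s.
tokens_per_s = 1_000_000   # whole-cluster training throughput (made-up number)
print(f"vs FP8 peak (1979 TFLOPS): {mfu(70e9, tokens_per_s, 1024, 1979):.1%}")
print(f"vs BF16 peak (989 TFLOPS): {mfu(70e9, tokens_per_s, 1024, 989):.1%}")
# Only the line whose denominator matches the GEMM precision is an honest MFU figure.
```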
Real systems use multiple datatypes simultaneously. This table shows where each format typically appears in a modern training or inference stack.
| Format | Bits | Exp / Man | Range | Typical Use | Strength | Weakness | Status |
|---|---|---|---|---|---|---|---|
| FP32 | 32 | 8 / 23 | ~±3.4×10³⁸ | Optimizer state, master weights, accumulators | Full precision, wide range | 4× memory vs BF16, slow | Essential backstop |
| TF32 | 19 eff. | 8 / 10 | ~±3.4×10³⁸ | Tensor Core compute mode for FP32-workflow speedup | FP32 range with faster TC math | Not a storage format | Ampere/Hopper default TC mode |
| BF16 | 16 | 8 / 7 | ~±3.4×10³⁸ | Weights, activations, gradients in training | FP32 range, 2× speed vs FP32 | Only 2–3 significant digits | Modern training default |
| FP16 | 16 | 5 / 10 | ~±65,504 | Older mixed-precision, some inference | More mantissa bits than BF16 | Narrow range → overflow risk | Mostly superseded by BF16 |
| FP8 E4M3 | 8 | 4 / 3 | ~±448 | Forward pass GEMMs (weights × activations) | 2× TC throughput vs BF16 on H100 | Needs per-tensor scaling | Hopper production standard |
| FP8 E5M2 | 8 | 5 / 2 | ~±57,344 | Backward pass GEMMs (gradient tensors) | Wider range suits gradient magnitudes | Very low precision (4 levels/interval) | Hopper production standard |
| INT8 | 8 | — | −128 to 127 | Quantised inference (post-training) | Efficient, widely supported | Training harder; PTQ accuracy risk | Inference staple |
| FP4 / NVFP4 | 4 | 2 / 1 | ~±6 | Blackwell inference, experimental training | 4× TC speed vs BF16 on Blackwell | 16 representable values total | Emerging / Blackwell-specific |
| INT4 | 4 | — | −8 to 7 | Aggressive inference quantisation (GPTQ, AWQ) | Tiny footprint, 4× KV compression | Visible quality degradation on some tasks | Common in compressed inference |
BF16 keeps the same exponent width as FP32 (8 bits), which means the same dynamic range. Large-model gradients, which regularly reach values close to FP16's ±65,504 ceiling, simply don't overflow. The precision tradeoff (fewer mantissa bits) turned out not to matter for most training workloads.
When someone says the model "trains in FP8," they almost never mean every tensor everywhere is 8-bit. They mean GEMM inputs are cast to FP8 at kernel boundaries, while accumulation, optimizer state, and some activations remain in BF16/FP32. Mixed-precision training is always a multi-format story.
Start with BF16. It is the default for most modern large-model stacks (Megatron-LM, NeMo, Llama reference implementations). Loss scaling is not needed. Optimizer state stays in FP32.
Investigate FP8 via NVIDIA Transformer Engine. Expect ~1.3–1.5× end-to-end speedup vs BF16. Requires careful validation — some models (particularly those with unusual activation distributions) need tuning of scaling factors.
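For orientation, a minimal sketch of what enabling FP8 looks like with the Transformer Engine PyTorch API (argument and class names follow the TE documentation at the time of writing; treat the exact signatures as something to verify against the current docs):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID = E4M3 for forward tensors, E5M2 for gradients, matching the split described above.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(32, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)        # GEMM inputs cast to FP8 E4M3 at the kernel boundary
y.sum().backward()      # backward GEMM uses E5M2 for the gradient operand
```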
For inference quantisation: use FP8 for near-lossless compression, INT8 for wider hardware support, and INT4 (GPTQ/AWQ) for aggressive compression with acceptable quality. Always benchmark accuracy on your target task before shipping.
Keep FP32 for optimizer state (Adam m and v), master weight copies, loss scalar, and any operation where precision directly affects convergence. These are usually a small fraction of total memory.
| Tensor | BF16 training | FP8 training (H100) | INT4 inference |
|---|---|---|---|
| Weights (stored) | BF16 | BF16 | INT4 |
| Weights (for GEMM) | BF16 | FP8 E4M3 | INT4 → BF16 dequant |
| Activations | BF16 | FP8 E4M3 | FP16 / BF16 |
| GEMM accumulation | FP32 | FP32 | FP16 / FP32 |
| Gradients | BF16 | FP8 E5M2 | N/A |
| Optimizer state | FP32 | FP32 | N/A |
| KV cache (long context) | BF16 | BF16 | INT4 / FP8 |
| All-reduce comm | BF16 | BF16 | N/A |
v2 — expanded with bit-layout diagrams, mixed-precision pipeline, throughput benchmarks, MFU waterfall, interactive calculator, and FP4/Blackwell section.