
MFU, BF16, FP8,
and AI Numeric Formats

How modern AI training efficiency is measured, what low-precision formats actually mean at the bit level, why Hopper changed the economics of large-model training, and how to read utilisation claims without being fooled by the denominator. Now with diagrams, an interactive MFU calculator, and Blackwell FP4.

01

Why These Numbers Matter

At scale, even a few percentage points of training efficiency can move total cost by millions of dollars. A 1000-GPU cluster running at 55% MFU instead of 40% delivers roughly 37% more effective capacity, or equivalently needs about 27% less hardware to train the same model in the same time.

That is why utilisation metrics attract intense scrutiny. But the metrics are frequently abused, misreported, or simply computed differently by different teams. The confusion usually comes from three layers that get collapsed into one number:

Layer 1: Model Math

The theoretical FLOPs the model must compute — dominated by GEMMs in attention and FFN layers. This is fixed by architecture.

Layer 2: Hardware Peak

The vendor-advertised peak TFLOPS at a specific precision. Different for FP32, BF16, FP8, FP4 — often 2–4× apart.

Layer 3: System Reality

What the full training loop actually achieves, including communication, memory movement, overhead, and idle time.

Short version
BF16 is the robust 16-bit workhorse. FP8 is the aggressive low-precision path Hopper made practical. MFU only means something when you know exactly which FLOPs are in the numerator and what precision peak is in the denominator.
02

What MFU Actually Means

Model FLOPs Utilisation (MFU) was defined clearly in the PaLM paper as the ratio between observed throughput and the theoretical throughput you would achieve if the hardware ran at peak the entire time:

Definition (PaLM, 2022)
MFU = (observed tokens/sec × model FLOPs/token) / (hardware peak FLOPs/sec)

= actual throughput / theoretical peak throughput

This is not the same as "GPU utilisation" as reported by nvidia-smi. A GPU can show 98% utilisation while doing mostly memory traffic or small inefficient kernels — neither of which appears in the numerator of MFU. High GPU activity ≠ high MFU.

Computing model FLOPs

For a transformer, the dominant cost is the matrix multiplications. A useful approximation: for a model with P parameters, one forward pass costs roughly 2P FLOPs per token, and a full forward+backward pass costs roughly 6P FLOPs per token (Chinchilla convention includes the backward at ~2× forward).

Practical MFU formula
FLOPs per token ≈ 6 × N_params (forward + backward, dense)

MFU = (tokens_per_sec × 6 × N_params) / peak_FLOPS

// Example: 70B model, 1200 tok/s, H100 BF16 peak = 989 TFLOPS
MFU = (1200 × 6 × 70×10⁹) / (989×10¹²) ≈ 51.0%
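The same arithmetic as a runnable check, in Python. This is a minimal sketch: 989 TFLOPS is the dense BF16 Tensor Core peak quoted for H100 throughout this article, and the 6N FLOPs-per-token approximation ignores attention, so treat the result as an estimate.

# Quick check of the worked example above (one H100, BF16 denominator)
tokens_per_sec = 1200          # observed training throughput
n_params       = 70e9          # dense parameter count
peak_flops     = 989e12        # H100 dense BF16 Tensor Core peak, FLOPs/sec

mfu = tokens_per_sec * 6 * n_params / peak_flops
print(f"MFU = {mfu:.1%}")      # -> MFU = 51.0%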
The denominator is everything
MFU is only comparable across systems when the FLOPs counting method, precision mode, and denominator peak are aligned. Two teams reporting "50% MFU" may mean very different things if one uses FP8 peak and the other uses BF16 peak.
03

Bit Layouts: What the Formats Actually Look Like

Every floating-point format divides its bits between three fields: a sign bit, exponent bits (determining numeric range), and mantissa bits (determining precision within a range). The tradeoffs between formats are entirely about how those bits are allocated.

| Format | Sign | Exponent bits | Mantissa bits | Total width |
|---|---|---|---|---|
| FP32 | 1 | 8 | 23 | 32 bits |
| TF32 | 1 | 8 | 10 | 19 bits effective (13 bits unused in Tensor Core math) |
| BF16 | 1 | 8 | 7 | 16 bits |
| FP16 | 1 | 5 | 10 | 16 bits |
| FP8 E4M3 | 1 | 4 | 3 | 8 bits |
| FP8 E5M2 | 1 | 5 | 2 | 8 bits |
| FP4 (E2M1) | 1 | 2 | 1 | 4 bits |
Fig 1: Floating-point bit layouts to scale. Width proportional to bit count. BF16's wide exponent (8 bits, matching FP32) is why it avoids overflow/underflow that plagued FP16 training. FP8 E4M3 trades range for precision; E5M2 trades precision for range. FP4 has almost no mantissa — just 4 representable magnitude levels.

What the exponent width controls

The exponent determines the numeric range — how large or small a number can be before it overflows to infinity or underflows to zero. An 8-bit exponent (like FP32 and BF16) can represent values from roughly 10⁻³⁸ to 10³⁸. A 5-bit exponent (like FP16) caps out around ±65,504 — which is why FP16 gradients frequently overflow during large-model training.

What the mantissa width controls

The mantissa determines precision — how finely you can distinguish between two nearby numbers. FP32's 23 mantissa bits give about 7 significant decimal digits. BF16's 7 bits give roughly 2–3 significant digits. FP8 E4M3's 3 bits give 8 distinct magnitude levels per exponent interval. FP4's single mantissa bit is essentially: this number is either in the lower or upper half of its exponent bucket — nothing more.

Dynamic range comparison (log scale): FP32 and BF16 both span roughly 10⁻³⁸ to 10³⁸; FP16 tops out at ±65,504 (the overflow zone for large gradients); FP8 E4M3 tops out at ±448; FP4 (E2M1) at ±6.
Fig 2: Dynamic range comparison (log-scale). BF16 and FP32 share the same exponent width — they cover the same numeric range. FP16's narrower exponent is why large-model gradients, which can temporarily reach high magnitudes, overflow so easily. FP8 and FP4 have very limited range and require careful scaling.
04

BF16: The Stable Workhorse

BF16 (bfloat16) was designed at Google Brain specifically for machine learning. The key design choice: keep FP32's 8-bit exponent, halve the mantissa to 7 bits. You give up precision; you keep range. For training, range matters more.

Why BF16 beat FP16 for training

Gradient magnitudes during training can vary wildly. FP16's narrow exponent (±65,504 max) means gradients frequently overflow to infinity — requiring careful loss scaling and constant vigilance. BF16 eliminates this almost entirely. Most large model training stacks dropped FP16 in favour of BF16 around 2020–2021.

The precision sacrifice

BF16's 7-bit mantissa means you can only distinguish about 2–3 significant decimal digits. For weights and activations in large models, this is usually fine — neural networks are surprisingly tolerant of numerical noise. The exceptions: accumulations in matrix multiplies (always done in FP32) and optimizer state (usually kept in FP32 or FP64).
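To make the precision loss concrete, here is a minimal Python sketch that produces a bfloat16-valued number by keeping only the top 16 bits of the FP32 bit pattern. Real hardware rounds to nearest even rather than truncating, so this slightly overstates the error, but the effect is the same in kind.

import struct

def bf16_truncate(x: float) -> float:
    # bfloat16 is the top 16 bits of the float32 layout: sign, 8-bit exponent,
    # and the first 7 mantissa bits. Zeroing the low 16 bits simulates the cast.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

print(bf16_truncate(3.14159))   # 3.140625: about 3 significant digits survive
print(bf16_truncate(131008.0))  # 130560.0: far above FP16's ±65,504 ceiling, yet no overflow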

Where BF16 lives in a training step

In a typical BF16 training recipe, the precision assignments look like this:

// Standard BF16 mixed-precision training
weights:          BF16   // stored and computed in 16-bit
activations:      BF16   // forward pass outputs
GEMM compute:     BF16   // tensor core input
GEMM accumulate:  FP32   // tensor core internal accumulator
gradients:        BF16   // backward pass
optimizer state:  FP32   // Adam m, v — must be high precision
master weights:   FP32   // copied down to BF16 each step

The "mixed" in mixed-precision means these different tensors live in different formats simultaneously. The GEMM accumulates in FP32 internally even when inputs are BF16 — this is a hardware feature of Tensor Cores, not a software choice.

05

FP8: Why Hopper Changed the Discussion

FP8 is not one format — it is two, with different tradeoffs between range and precision. The FP8 Formats for Deep Learning paper introduced both:

E4M3 — more precision

4 exponent bits, 3 mantissa bits. Range: roughly ±448. Used for forward pass activations and weights where you want finer numerical resolution within a moderate range.

E5M2 — more range

5 exponent bits, 2 mantissa bits. Range: roughly ±57,344. Used for backward pass gradients, which can have larger magnitudes and need the wider range more than they need precision.

The scaling problem FP8 introduces

FP8 has so few representable values (~256 total per variant) that without careful management, most tensors would be quantised to a handful of magnitude levels and lose all useful information. The solution is per-tensor scaling factors: before casting to FP8, you rescale the tensor so its values span the representable range efficiently, then store the inverse scale to dequantise later.
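A minimal NumPy sketch of the idea. The helper names are hypothetical (this is not Transformer Engine's API), and the FP8 cast itself is simulated by clipping, since NumPy has no FP8 dtype.

import numpy as np

E4M3_MAX = 448.0   # largest finite magnitude representable in FP8 E4M3

def scale_for_fp8(x: np.ndarray):
    # Rescale so the tensor's largest magnitude lands at the top of the
    # representable range, and keep the inverse scale for dequantisation.
    amax = np.abs(x).max()
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    x_fp8_input = np.clip(x * scale, -E4M3_MAX, E4M3_MAX)   # what the FP8 kernel would see
    return x_fp8_input, 1.0 / scale                         # inverse scale travels with the tensor

def dequantise(x_fp8_output: np.ndarray, inv_scale: float) -> np.ndarray:
    return x_fp8_output * inv_scale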

Transformer Engine's approach
NVIDIA's Transformer Engine maintains a rolling history of tensor maximum values to compute scaling factors automatically. It uses delayed scaling — the scale for the current step is computed from the maximum of the previous few steps. This amortises the overhead of finding per-tensor maxima across steps.

What FP8 actually speeds up

FP8 Tensor Core throughput on H100 is 2× BF16 throughput. But FP8 only applies to the compute-bound operations — primarily GEMMs. Everything else (softmax, layernorm, residual adds, attention score masking) stays in BF16 or FP32. In practice, the wall-clock training speedup from FP8 vs BF16 is typically 1.3–1.5× end-to-end, not 2×, because GEMMs are not the only thing consuming time.
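The gap between a 2× kernel speedup and a ~1.4× wall-clock speedup is just Amdahl's law. A rough estimate, assuming GEMMs take 60–70% of a BF16 step (that fraction is an assumption; measure your own profile):

def fp8_step_speedup(gemm_fraction: float, gemm_gain: float = 2.0) -> float:
    # Only the GEMM share of the step gets faster; everything else is unchanged.
    return 1.0 / ((1.0 - gemm_fraction) + gemm_fraction / gemm_gain)

print(round(fp8_step_speedup(0.6), 2))   # 1.43
print(round(fp8_step_speedup(0.7), 2))   # 1.54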

// FP8 mixed-precision training stack (Transformer Engine)
weight storage:        BF16   // master copy
weight for GEMM:       FP8 E4M3 // cast at kernel boundary + scale
activation storage:    BF16
activation for GEMM:   FP8 E4M3
gradient for GEMM:     FP8 E5M2 // wider range for backward
GEMM accumulation:     FP32   // hardware accumulator, always
optimizer state:       FP32   // unchanged
all-reduce comm:       BF16   // FP8 all-reduce still uncommon
06

FP4 and Blackwell: The Next Frontier

NVIDIA's Blackwell architecture (GB200, B100, B200) introduced native FP4 Tensor Core support. This is genuinely new territory: FP4 has only 16 bit patterns (15 distinct values, since +0 and −0 coincide), which puts it at the extreme edge of what can meaningfully be called floating point.

NVFP4 format

1 sign + 2 exponent + 1 mantissa bit. Representable magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4 and 6, each non-zero value available with either sign. Yes, that is the complete list. Maximum representable value: 6 (in the E2M1 variant).
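You can enumerate the complete value set directly from the bit fields (a small sketch; E2M1 uses an exponent bias of 1 and subnormals when the exponent field is zero):

values = set()
for sign in (1.0, -1.0):
    for e in range(4):        # 2-bit exponent field
        for m in range(2):    # 1-bit mantissa field
            if e == 0:
                mag = m * 0.5                          # subnormal: no implicit leading 1
            else:
                mag = (1 + m * 0.5) * 2.0 ** (e - 1)   # normal: implicit 1, bias 1
            values.add(sign * mag)

print(sorted(values))
# [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]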

Blackwell throughput

FP4 Tensor Core throughput on Blackwell is roughly 2× FP8 and 4× BF16. B200 peak: ~9 PFLOPS dense FP4. This makes FP4-native inference kernels dramatically faster, when accuracy tolerates it.

Is FP4 useful for training?
Almost certainly not for the foreseeable future. With only 16 representable values, gradient descent becomes numerically unstable — the quantisation error introduces gradient noise that overwhelms the signal for most model sizes. FP4 is currently a pure inference play: useful for loading weights of already-trained models at extreme compression, not for the training process itself.

The more interesting Blackwell story for training is MX formats (Microscaling): a block-quantisation scheme in which groups of 32 elements share a single 8-bit power-of-two scale factor. This allows FP4 mantissas to be used effectively because the shared scale adapts to local magnitude variation within the block.
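A NumPy sketch of the block idea. It is hypothetical and simplified relative to the OCP MX specification: each element is snapped to the nearest FP4 (E2M1) magnitude after a power-of-two scale is chosen per 32-element block.

import numpy as np

FP4_MAGS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # E2M1 magnitudes

def mx_quantise_block(block: np.ndarray):
    assert block.size == 32
    amax = np.abs(block).max()
    # Power-of-two scale chosen so the block's largest element fits under FP4's max (6)
    scale = 2.0 ** np.ceil(np.log2(amax / 6.0)) if amax > 0 else 1.0
    scaled = block / scale
    nearest = np.abs(np.abs(scaled)[:, None] - FP4_MAGS[None, :]).argmin(axis=1)
    quantised = np.sign(scaled) * FP4_MAGS[nearest]
    return quantised, scale          # dequantised value = quantised * scale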

07

The Mixed-Precision Pipeline: One Full Training Step

Here is what a single training step actually looks like when you trace data through the precision stack. This is the picture that "trains in FP8" abstracts over.

[Figure: one training step traced through the precision stack. Forward pass: FP32 master weights are cast to BF16/FP8, GEMMs take FP8 E4M3 inputs and accumulate in FP32, activations are stored in BF16, softmax/layernorm run in BF16 or FP32. Backward pass: the FP32 loss yields BF16 gradients, cast to FP8 E5M2 for the gradient GEMMs, weight gradients land in BF16. Optimizer step: Adam m and v in FP32, weight update on the FP32 master copy, cast back to BF16 for the next forward.]
Fig 3: One full training step — precision flow. At no point does a single precision format handle everything. FP8 appears only at GEMM boundaries. FP32 persists throughout optimizer state and accumulation. "Trains in FP8" means the expensive GEMMs use FP8 inputs; everything else remains at higher precision.
The loss scaling question
In FP16 training, explicit loss scaling was required to prevent gradient underflow (gradients too small for FP16 to represent). With BF16's wider range, loss scaling is usually unnecessary. With FP8, a similar problem returns — FP8 E5M2 gradients can underflow at small magnitudes — so modern FP8 stacks use per-tensor dynamic scaling (Transformer Engine's delayed scaling approach) rather than the global loss scaling of the FP16 era.
08

Hardware Throughput Across Precision Tiers

The advertised TFLOPS numbers are the denominator of MFU. Getting them wrong by even one precision level produces MFU figures that are not comparable to anyone else's.

NVIDIA A100 SXM (80 GB), dense peaks:

| Precision | Peak |
|---|---|
| FP32 (CUDA cores) | 19.5 TFLOPS |
| TF32 (Tensor Core) | 156 TFLOPS |
| BF16 (Tensor Core) | 312 TFLOPS |
| FP16 (Tensor Core) | 312 TFLOPS |
| FP8 | no native FP8 Tensor Cores (runs via the BF16/FP16 path at ~312 TFLOPS) |
| INT8 (Tensor Core) | 624 TOPS |

NVIDIA H100 SXM (80 GB), dense peaks:

| Precision | Peak |
|---|---|
| FP32 (CUDA cores) | 67 TFLOPS |
| TF32 (Tensor Core) | 494 TFLOPS (989 with 2:4 sparsity) |
| BF16 (Tensor Core) | 989 TFLOPS |
| FP8 (Tensor Core) | 1979 TFLOPS (2× BF16) |
| INT8 (Tensor Core) | 1979 TOPS |

NVIDIA B200 (Blackwell), estimated dense peaks:

| Precision | Peak |
|---|---|
| BF16 | ~2.25 PFLOPS |
| FP8 | ~4.5 PFLOPS |
| FP4 | ~9 PFLOPS (2× FP8) |
The denominator trap in numbers
H100 FP8 peak is ~2000 TFLOPS. H100 BF16 peak is ~989 TFLOPS. If your workload runs 75% of FP8 peak (1500 TFLOPS actual), reporting MFU against BF16 peak gives 152% — a physically impossible number that just means you used the wrong denominator. Always match precision of numerator to precision of denominator.
09

The MFU Waterfall: Where Efficiency Bleeds Away

Starting from 100% theoretical peak, here is how a real training system loses efficiency at each layer. These are representative numbers for a large transformer on H100s at 100B+ parameter scale.

MFU waterfall: H100 BF16 training, 100B-parameter transformer.

  Hardware peak TFLOPS                                         100%
  After counting Tensor Core ops only (non-TC ops excluded)    −25% → ~75%
  After memory bandwidth limits (attention, layernorm)         −10% → ~65%
  After all-reduce / all-gather communication                  −10% → ~55%
  After activation recomputation overhead                      −5%  → ~50%
  After data loading, logging, misc overhead                    −5%  → ~45%

  Megatron-LM public H100 benchmarks: 38–52% end-to-end MFU
Fig 4: MFU waterfall. Starting from theoretical peak, each layer of system reality takes a cut. Top-tier public benchmarks from Megatron-LM on H100 clusters land in the 38–52% range for end-to-end training MFU. Claims significantly above this require extraordinary justification — or a different denominator.

This is why experts immediately challenge numbers in the high 70s or 90s for end-to-end training. Hitting 90% would mean losing only 10% total to all communication, memory movement, non-TC operations, overhead, and scheduling combined. That is not consistent with any published large-scale training system.

The best publicly documented training runs on H100 clusters report MFU in the high 40s. A claim of 90% MFU for end-to-end distributed training should provoke the question: 90% of which peak?
10

The BF16 vs FP8 Denominator Trap

This is the single most common source of inflated MFU claims. Hopper FP8 peak is roughly 2× BF16 peak. If a workload runs its heavy GEMMs in FP8 but MFU is computed against the BF16 denominator, the number exceeds 100% — which is physically impossible and entirely meaningless.

// The "150% MFU" scenario
H100 BF16 peak = 989 TFLOPS
H100 FP8 peak  = 1979 TFLOPS

Actual FP8-heavy useful throughput = 1500 TFLOPS

// Correct: FP8-relative MFU
1500 / 1979 = 75.8%   ← honest

// Wrong: BF16-relative MFU
1500 / 989 = 151.7%  ← nobody beat physics
The sparse Tensor Core multiplier
H100 peak numbers are sometimes listed with "sparse" variants — e.g., 1979 TFLOPS dense vs 3958 TFLOPS with structured sparsity (2:4 pattern). Sparse throughput requires the weight matrix to have exactly 50% zeros in a specific pattern, and most models are not trained with this sparsity. Using the sparse peak as the denominator doubles the theoretical max and makes an otherwise reasonable MFU look half as good; conversely, quoting a workload that genuinely exploits 2:4 sparsity against the dense peak makes its MFU look twice as good.

How to read any performance claim

Every time you see an MFU headline:

  1. Which precision dominates the heavy GEMMs? The denominator must match.
  2. Sparse or dense peak? If sparse: does the workload actually use 2:4 sparsity?
  3. End-to-end or kernel-only? A single GEMM benchmark is not MFU.
  4. What model and sequence length? Dense long-sequence workloads look worse than short-sequence ones.
  5. Does throughput in tok/s and time-to-train agree? MFU should be consistent with observable wall-clock training speed.
11

MFU Calculator

Plug in your own numbers. The sketch below computes MFU against the peak that matches the dominant GEMM precision, and flags the most common mismatch: a result above 100%.

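A minimal self-contained sketch. The peak table reuses the dense figures quoted in section 08 (the Blackwell entries are estimates), and any GPU/precision pair you add is your own assumption.

PEAK_TFLOPS = {                       # dense per-GPU peaks, TFLOPS
    ("A100", "BF16"): 312,
    ("H100", "BF16"): 989,
    ("H100", "FP8"):  1979,
    ("B200", "BF16"): 2250,           # estimated
    ("B200", "FP8"):  4500,           # estimated
    ("B200", "FP4"):  9000,           # estimated
}

def mfu(tokens_per_sec: float, params_billion: float,
        gpu: str, gemm_precision: str, n_gpus: int = 1) -> float:
    # Numerator: model FLOPs actually delivered (6N per token, forward + backward, dense).
    achieved = tokens_per_sec * 6 * params_billion * 1e9
    # Denominator: peak at the precision that dominates the heavy GEMMs.
    peak = PEAK_TFLOPS[(gpu, gemm_precision)] * 1e12 * n_gpus
    value = achieved / peak
    if value > 1.0:
        raise ValueError("MFU above 100%: almost certainly the wrong denominator")
    return value

# Reproduces the worked example from section 02: one H100, BF16 GEMMs
print(f"{mfu(1200, 70, 'H100', 'BF16'):.1%}")    # 51.0%

Here tokens_per_sec is the aggregate throughput across all n_gpus; extend PEAK_TFLOPS with whichever dense peak matches your hardware and dominant GEMM precision.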
12

AI Numeric Formats — Cheat Sheet

Real systems use multiple datatypes simultaneously. This table shows where each format typically appears in a modern training or inference stack.

| Format | Bits | Exp / Man | Range | Typical use | Strength | Weakness | Status |
|---|---|---|---|---|---|---|---|
| FP32 | 32 | 8 / 23 | ~±3.4×10³⁸ | Optimizer state, master weights, accumulators | Full precision, wide range | 4× memory vs BF16, slow | Essential backstop |
| TF32 | 19 eff. | 8 / 10 | ~±3.4×10³⁸ | Tensor Core compute mode for FP32-workflow speedup | FP32 range with faster TC math | Not a storage format | Ampere/Hopper default TC mode |
| BF16 | 16 | 8 / 7 | ~±3.4×10³⁸ | Weights, activations, gradients in training | FP32 range, 2× speed vs FP32 | Only 2–3 significant digits | Modern training default |
| FP16 | 16 | 5 / 10 | ~±65,504 | Older mixed-precision, some inference | More mantissa bits than BF16 | Narrow range → overflow risk | Mostly superseded by BF16 |
| FP8 E4M3 | 8 | 4 / 3 | ~±448 | Forward-pass GEMMs (weights × activations) | 2× TC throughput vs BF16 on H100 | Needs per-tensor scaling | Hopper production standard |
| FP8 E5M2 | 8 | 5 / 2 | ~±57,344 | Backward-pass GEMMs (gradient tensors) | Wider range suits gradient magnitudes | Very low precision (4 levels per interval) | Hopper production standard |
| INT8 | 8 | n/a | −128 to 127 | Quantised inference (post-training) | Efficient, widely supported | Training harder; PTQ accuracy risk | Inference staple |
| FP4 / NVFP4 | 4 | 2 / 1 | ~±6 | Blackwell inference, experimental training | 4× TC speed vs BF16 on Blackwell | 16 representable values total | Emerging / Blackwell-specific |
| INT4 | 4 | n/a | −8 to 7 | Aggressive inference quantisation (GPTQ, AWQ) | Tiny footprint, 4× KV compression | Visible quality degradation on some tasks | Common in compressed inference |

Why BF16 beat FP16 for training

Same exponent width as FP32 (8 bits) means the same dynamic range. Large-model gradients, which regularly reach values close to FP16's ±65,504 ceiling, simply don't overflow in BF16. The precision tradeoff (fewer mantissa bits) turned out not to matter for most training workloads.

Why "trains in FP8" is a simplification

When someone says the model "trains in FP8," they almost never mean every tensor everywhere is 8-bit. They mean GEMM inputs are cast to FP8 at kernel boundaries, while accumulation, optimizer state, and some activations remain in BF16/FP32. Mixed-precision training is always a multi-format story.

13

Which Format Should You Use?

Stable large-model training

Start with BF16. It is the default for most modern large-model stacks (Megatron-LM, NeMo, Llama reference implementations). Loss scaling is not needed. Optimizer state stays in FP32.

Maximum training throughput on H100

Investigate FP8 via NVIDIA Transformer Engine. Expect ~1.3–1.5× end-to-end speedup vs BF16. Requires careful validation — some models (particularly those with unusual activation distributions) need tuning of scaling factors.

Efficient deployment / inference

Use FP8 for near-lossless compression, INT8 for wider hardware support, INT4 (GPTQ/AWQ) for aggressive compression with acceptable quality. Always benchmark accuracy on your target task before shipping.

Numerically sensitive operations

Keep FP32 for optimizer state (Adam m and v), master weight copies, loss scalar, and any operation where precision directly affects convergence. These are usually a small fraction of total memory.

A typical modern mixed-precision stack

| Tensor | BF16 training | FP8 training (H100) | INT4 inference |
|---|---|---|---|
| Weights (stored) | BF16 | BF16 | INT4 |
| Weights (for GEMM) | BF16 | FP8 E4M3 | INT4 → BF16 dequant |
| Activations | BF16 | FP8 E4M3 | FP16 / BF16 |
| GEMM accumulation | FP32 | FP32 | FP16 / FP32 |
| Gradients | BF16 | FP8 E5M2 | N/A |
| Optimizer state | FP32 | FP32 | N/A |
| KV cache (long context) | BF16 | BF16 | INT4 / FP8 |
| All-reduce comm | BF16 | BF16 | N/A |
The fundamental insight
Modern ML training is not a single-precision computation. It is a carefully engineered pipeline where each tensor is assigned the lowest precision that does not harm convergence or final model quality — and high precision is reserved for the numerically sensitive operations that genuinely need it. The art is knowing which operations those are.

Sources

Chowdhery et al. (2022), "PaLM: Scaling Language Modeling with Pathways" (the MFU definition used in section 02).

Micikevicius et al. (2022), "FP8 Formats for Deep Learning" (the E4M3 and E5M2 formats described in section 05).