How modern AI training efficiency is measured, what low-precision formats actually mean at the bit level, why Hopper changed the economics of large-model training, and how to read utilisation claims without being fooled by the denominator. Now with diagrams, an interactive MFU calculator, and Blackwell FP4.
At scale, even a few percentage points of training efficiency can move total cost by millions of dollars. A 1000-GPU cluster running at 55% MFU instead of 40% represents a 37.5% increase in effective capacity (0.55 / 0.40 = 1.375), or equivalently needs about 27% less hardware to train the same model in the same time.
That is why utilisation metrics attract intense scrutiny. But the metrics are frequently abused, misreported, or simply computed differently by different teams. The confusion usually comes from three layers that get collapsed into one number:
1. Model FLOPs: the theoretical FLOPs the model must compute, dominated by GEMMs in attention and FFN layers. This is fixed by the architecture.
2. Hardware peak: the vendor-advertised peak TFLOPS at a specific precision. Different for FP32, BF16, FP8 and FP4, often 2–4× apart.
3. Achieved throughput: what the full training loop actually achieves, including communication, memory movement, overhead, and idle time.
BF16 is the robust 16-bit workhorse. FP8 is the aggressive low-precision path Hopper made practical. MFU only means something when you know exactly which FLOPs are in the numerator and what precision peak is in the denominator.
Model FLOPs Utilisation (MFU) was defined clearly in the PaLM paper as the ratio between observed throughput and the theoretical throughput you would achieve if the hardware ran at peak the entire time:
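In symbols (a restatement of that definition, with "model FLOPs per token" being the approximation discussed just below):

$$\text{MFU} \;=\; \frac{\text{achieved tokens/s} \times \text{model FLOPs per token}}{N_{\text{GPUs}} \times \text{peak FLOPS per GPU at the matching precision}}$$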
This is not the same as "GPU utilisation" as reported by nvidia-smi. A GPU can show 98% utilisation while doing mostly memory traffic or small inefficient kernels — neither of which appears in the numerator of MFU. High GPU activity ≠ high MFU.
For a transformer, the dominant cost is the matrix multiplications. A useful approximation: for a model with P parameters, one forward pass costs roughly 2P FLOPs per token, and a full forward+backward pass costs roughly 6P FLOPs per token (Chinchilla convention includes the backward at ~2× forward).
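As a quick worked instance of that rule, using a hypothetical dense 70B-parameter model:

$$6P = 6 \times 70 \times 10^{9} \approx 4.2 \times 10^{11}\ \text{FLOPs per trained token}$$

so at 1,000 tokens/s per GPU the model FLOPs alone come to roughly 420 TFLOPS per GPU.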
Every floating-point format divides its bits between three fields: a sign bit, exponent bits (determining numeric range), and mantissa bits (determining precision within a range). The tradeoffs between formats are entirely about how those bits are allocated.
The exponent determines the numeric range — how large or small a number can be before it overflows to infinity or underflows to zero. An 8-bit exponent (like FP32 and BF16) can represent values from roughly 10⁻³⁸ to 10³⁸. A 5-bit exponent (like FP16) caps out around ±65,504 — which is why FP16 gradients frequently overflow during large-model training.
The mantissa determines precision — how finely you can distinguish between two nearby numbers. FP32's 23 mantissa bits give about 7 significant decimal digits. BF16's 7 bits give roughly 2–3 significant digits. FP8 E4M3's 3 bits give 8 distinct magnitude levels per exponent interval. FP4's single mantissa bit is essentially: this number is either in the lower or upper half of its exponent bucket — nothing more.
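To make those range numbers concrete, here is a small Python sketch that computes the largest finite value of an IEEE-style format from its field widths. The function name is mine, and FP8 E4M3 is deliberately left out because it bends the IEEE convention (it reuses the top exponent code and sacrifices one mantissa code to NaN, landing at 448 rather than 480).

```python
def max_finite(exp_bits: int, man_bits: int, ieee_inf_nan: bool = True) -> float:
    """Largest finite value of a binary float with the given exponent/mantissa widths.

    Assumes an IEEE-754-style layout: bias = 2**(exp_bits-1) - 1 and, by default,
    the all-ones exponent code reserved for inf/NaN. FP4 E2M1 keeps that code for
    normal values, so it is called with ieee_inf_nan=False below.
    """
    bias = 2 ** (exp_bits - 1) - 1
    top_code = 2 ** exp_bits - (2 if ieee_inf_nan else 1)
    max_exponent = top_code - bias
    max_mantissa = 2 - 2 ** (-man_bits)            # 1.111...1 in binary
    return max_mantissa * 2.0 ** max_exponent

print(max_finite(8, 23))                     # FP32      -> ~3.4e38
print(max_finite(8, 7))                      # BF16      -> ~3.4e38 (same range, less precision)
print(max_finite(5, 10))                     # FP16      -> 65504.0
print(max_finite(5, 2))                      # FP8 E5M2  -> 57344.0
print(max_finite(2, 1, ieee_inf_nan=False))  # FP4 E2M1  -> 6.0
```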
BF16 (bfloat16) was designed at Google Brain specifically for machine learning. The key design choice: keep FP32's 8-bit exponent, halve the mantissa to 7 bits. You give up precision; you keep range. For training, range matters more.
Gradient magnitudes during training can vary wildly. FP16's narrow exponent (±65,504 max) means gradients frequently overflow to infinity — requiring careful loss scaling and constant vigilance. BF16 eliminates this almost entirely. Most large model training stacks dropped FP16 in favour of BF16 around 2020–2021.
BF16's 7-bit mantissa means you can only distinguish about 2–3 significant decimal digits. For weights and activations in large models, this is usually fine — neural networks are surprisingly tolerant of numerical noise. The exceptions: accumulations in matrix multiplies (always done in FP32) and optimizer state (usually kept in FP32 or FP64).
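A concrete way to see the "keep the exponent, halve the mantissa" choice: a BF16 value is simply the top 16 bits of the corresponding FP32 bit pattern. A minimal numpy sketch (the function names are mine, and it truncates rather than doing the round-to-nearest-even that real hardware uses):

```python
import numpy as np

def fp32_to_bf16_bits(x) -> np.ndarray:
    """Keep only the top 16 bits of each FP32 value: sign + 8 exponent + 7 mantissa bits."""
    bits32 = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits32 >> 16).astype(np.uint16)

def bf16_bits_to_fp32(bits16: np.ndarray) -> np.ndarray:
    """Widen BF16 bit patterns back to FP32 by zero-filling the 16 dropped mantissa bits."""
    return (bits16.astype(np.uint32) << 16).view(np.float32)

x = np.array([3.14159265, 1e30, -2.7e-20], dtype=np.float32)
print(bf16_bits_to_fp32(fp32_to_bf16_bits(x)))
# The huge and tiny values survive intact in range; pi keeps only ~2-3 significant digits.
```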
In a typical BF16 training recipe, the precision assignments look like this:
```
// Standard BF16 mixed-precision training
weights:         BF16   // stored and computed in 16-bit
activations:     BF16   // forward pass outputs
GEMM compute:    BF16   // tensor core input
GEMM accumulate: FP32   // tensor core internal accumulator
gradients:       BF16   // backward pass
optimizer state: FP32   // Adam m, v — must be high precision
master weights:  FP32   // copied down to BF16 each step
```
The "mixed" in mixed-precision means these different tensors live in different formats simultaneously. The GEMM accumulates in FP32 internally even when inputs are BF16 — this is a hardware feature of Tensor Cores, not a software choice.
FP8 is not one format — it is two, with different tradeoffs between range and precision. The FP8 Formats for Deep Learning paper introduced both:
E4M3: 4 exponent bits, 3 mantissa bits. Range: roughly ±448. Used for forward pass activations and weights, where you want finer numerical resolution within a moderate range.
E5M2: 5 exponent bits, 2 mantissa bits. Range: roughly ±57,344. Used for backward pass gradients, which can have larger magnitudes and need the wider range more than they need precision.
FP8 has so few representable values (~256 total per variant) that without careful management, most tensors would be quantised to a handful of magnitude levels and lose all useful information. The solution is per-tensor scaling factors: before casting to FP8, you rescale the tensor so its values span the representable range efficiently, then store the inverse scale to dequantise later.
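A simplified numpy sketch of that bookkeeping (the names are mine; only the scaling and clipping are simulated here, since the actual snap onto the 8-bit E4M3 grid happens inside the FP8 GEMM kernel in a real stack such as Transformer Engine):

```python
import numpy as np

E4M3_MAX = 448.0   # largest finite FP8 E4M3 value

def scale_for_fp8(x: np.ndarray):
    """Per-tensor scaling: map the tensor's largest magnitude onto the FP8 range.

    Returns the scaled, clipped tensor (ready to be cast to E4M3) and the
    inverse scale needed to dequantise results later.
    """
    amax = np.abs(x).max()
    scale = E4M3_MAX / max(float(amax), 1e-12)     # guard against all-zero tensors
    x_scaled = np.clip(x * scale, -E4M3_MAX, E4M3_MAX)
    return x_scaled, 1.0 / scale

def dequantise(x_scaled: np.ndarray, inv_scale: float) -> np.ndarray:
    return x_scaled * inv_scale

w = (np.random.randn(1024, 1024) * 0.02).astype(np.float32)
w_scaled, inv_scale = scale_for_fp8(w)
w_roundtrip = dequantise(w_scaled, inv_scale)   # ≈ w, up to the FP8 rounding not simulated here
```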
FP8 Tensor Core throughput on H100 is 2× BF16 throughput. But FP8 only applies to the compute-bound operations — primarily GEMMs. Everything else (softmax, layernorm, residual adds, attention score masking) stays in BF16 or FP32. In practice, the wall-clock training speedup from FP8 vs BF16 is typically 1.3–1.5× end-to-end, not 2×, because GEMMs are not the only thing consuming time.
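That gap between 2× kernel throughput and ~1.4× wall-clock is just Amdahl's law. Assuming, purely for illustration, that GEMMs account for 70% of BF16 step time:

$$\text{speedup} = \frac{1}{(1 - 0.7) + \frac{0.7}{2}} = \frac{1}{0.65} \approx 1.5\times$$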
```
// FP8 mixed-precision training stack (Transformer Engine)
weight storage:      BF16       // master copy
weight for GEMM:     FP8 E4M3   // cast at kernel boundary + scale
activation storage:  BF16
activation for GEMM: FP8 E4M3
gradient for GEMM:   FP8 E5M2   // wider range for backward
GEMM accumulation:   FP32       // hardware accumulator, always
optimizer state:     FP32       // unchanged
all-reduce comm:     BF16       // FP8 all-reduce still uncommon
```
NVIDIA's Blackwell architecture (GB200, B100, B200) introduced native FP4 Tensor Core support. This is genuinely new territory — FP4 has only 16 bit patterns in total, which means it's operating at the extreme edge of what "floating-point" meaningfully means.
1 sign + 2 exponent + 1 mantissa bit. Representable values: 0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6. Yes, that's the complete list. Maximum representable value: 6.
FP4 Tensor Core throughput on Blackwell is roughly 2× FP8, 4× BF16. B200 peak: ~9 PFLOPS dense FP4. This makes FP4-native inference kernels dramatically faster — when accuracy tolerates it.
The more interesting Blackwell story for training is MX formats (Microscaling) — a block-quantisation scheme where groups of 32 elements share a single 8-bit scale factor. This allows FP4 mantissas to be used effectively because the shared scale adapts to local magnitude variation within the block.
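A rough numpy sketch of the microscaling idea, under assumptions of my own choosing (block size 32, a power-of-two shared scale, round-to-nearest onto the E2M1 magnitude grid; real MX hardware fixes details like rounding and saturation that are glossed over here):

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
FP4_EMAX = 2                                                    # floor(log2(6))

def mx_quantise_block(block: np.ndarray):
    """Quantise one block of 32 values: one shared power-of-two scale + FP4 elements."""
    amax = np.abs(block).max()
    # Shared scale chosen so the block's largest magnitude falls inside the FP4 range.
    scale = 2.0 ** (np.floor(np.log2(max(float(amax), 1e-12))) - FP4_EMAX)
    scaled = block / scale
    # Snap each scaled magnitude to the nearest representable FP4 magnitude.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    dequantised = np.sign(scaled) * FP4_GRID[idx] * scale
    return dequantised, scale

x = np.random.randn(32).astype(np.float32)
x_hat, shared_scale = mx_quantise_block(x)
print("max abs error in this block:", np.abs(x - x_hat).max())
```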
Here is what a single training step actually looks like when you trace data through the precision stack. This is the picture that "trains in FP8" abstracts over.
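A schematic version of the same trace, written as annotated Python (the single linear layer and all numbers are made up; the arithmetic runs in FP32 numpy, and the comments mark the precision each tensor would actually use on H100 with an FP8 recipe):

```python
import numpy as np

# One linear layer, one step. The comments mark the precision each tensor would
# use in a real FP8 recipe; the numpy math itself is all FP32 and purely schematic.

W = (np.random.randn(512, 512) * 0.02).astype(np.float32)   # FP32 master copy (BF16 working copy)
x = np.random.randn(32, 512).astype(np.float32)             # activation arrives in BF16
m, v = np.zeros_like(W), np.zeros_like(W)                    # Adam state, FP32

# --- forward ---
# W and x are cast (with per-tensor scales) to FP8 E4M3 at the kernel boundary.
y = x @ W.T                     # GEMM: FP8 E4M3 inputs, FP32 Tensor Core accumulate
# y is written out as BF16; softmax / layernorm / residuals downstream stay in BF16/FP32.

# --- backward ---
dy = (y - np.random.randn(*y.shape).astype(np.float32)) / y.size   # stand-in loss gradient, BF16
dW = dy.T @ x                   # gradient GEMM: FP8 E5M2 inputs, FP32 accumulate
# dW is all-reduced across data-parallel ranks in BF16.

# --- optimizer step (all FP32; Adam bias correction omitted for brevity) ---
lr, beta1, beta2, eps = 3e-4, 0.9, 0.999, 1e-8
m = beta1 * m + (1 - beta1) * dW
v = beta2 * v + (1 - beta2) * dW ** 2
W -= lr * m / (np.sqrt(v) + eps)   # update FP32 master, recast to BF16 for the next step
```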
The advertised TFLOPS numbers are the denominator of MFU. Getting them wrong by even one precision level produces MFU figures that are not comparable to anyone else's.
Starting from 100% theoretical peak, here is how a real training system loses efficiency at each layer. These are representative numbers for a large transformer on H100s at 100B+ parameter scale.
This is why experts immediately challenge numbers in the high 70s or above for end-to-end training. Hitting 90% would mean losing only 10% total to all communication, memory movement, non-Tensor-Core operations, overhead, and scheduling combined. That is not consistent with any published large-scale training system.
This is the single most common source of inflated MFU claims. Hopper FP8 peak is roughly 2× BF16 peak. If a workload runs its heavy GEMMs in FP8 but MFU is computed against the BF16 denominator, the number exceeds 100% — which is physically impossible and entirely meaningless.
// The "150% MFU" scenario H100 BF16 peak = 989 TFLOPS H100 FP8 peak = 1979 TFLOPS Actual FP8-heavy useful throughput = 1500 TFLOPS // Correct: FP8-relative MFU 1500 / 1979 = 75.8% ← honest // Wrong: BF16-relative MFU 1500 / 989 = 151.7% ← nobody beat physics
Every time you see an MFU headline, ask two questions: which FLOPs are being counted in the numerator, and which precision's peak is sitting in the denominator.
Plug in your own numbers. The calculator computes MFU correctly matched to the right denominator, and flags common mismatches.
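A script equivalent of the same check, using the 6P FLOPs-per-token approximation from earlier; the model size, cluster size and throughput below are placeholders to replace with your own numbers:

```python
def mfu(params: float, tokens_per_second: float, num_gpus: int,
        peak_tflops_per_gpu: float) -> float:
    """Model FLOPs Utilisation, PaLM-style, using the 6P FLOPs-per-token rule.

    peak_tflops_per_gpu must match the precision the heavy GEMMs actually run in,
    otherwise the result is not comparable to anyone else's.
    """
    model_flops_per_second = 6 * params * tokens_per_second
    peak_flops_per_second = num_gpus * peak_tflops_per_gpu * 1e12
    return model_flops_per_second / peak_flops_per_second

# Placeholder scenario: a hypothetical dense 70B model on 1,024 H100s.
tokens_per_s = 1_000_000   # whole-cluster training throughput (made-up number)
print(f"vs FP8 peak (1979 TFLOPS): {mfu(70e9, tokens_per_s, 1024, 1979):.1%}")
print(f"vs BF16 peak (989 TFLOPS): {mfu(70e9, tokens_per_s, 1024, 989):.1%}")
# Only the line whose denominator matches the GEMM precision is an honest MFU figure.
```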
Real systems use multiple datatypes simultaneously. This table shows where each format typically appears in a modern training or inference stack.
| Format | Bits | Exp / Man | Range | Typical Use | Strength | Weakness | Status |
|---|---|---|---|---|---|---|---|
| FP32 | 32 | 8 / 23 | ~±3.4×10³⁸ | Optimizer state, master weights, accumulators | Full precision, wide range | 4× memory vs BF16, slow | Essential backstop |
| TF32 | 19 eff. | 8 / 10 | ~±3.4×10³⁸ | Tensor Core compute mode for FP32-workflow speedup | FP32 range with faster TC math | Not a storage format | Ampere/Hopper default TC mode |
| BF16 | 16 | 8 / 7 | ~±3.4×10³⁸ | Weights, activations, gradients in training | FP32 range, 2× speed vs FP32 | Only 2–3 significant digits | Modern training default |
| FP16 | 16 | 5 / 10 | ~±65,504 | Older mixed-precision, some inference | More mantissa bits than BF16 | Narrow range → overflow risk | Mostly superseded by BF16 |
| FP8 E4M3 | 8 | 4 / 3 | ~±448 | Forward pass GEMMs (weights × activations) | 2× TC throughput vs BF16 on H100 | Needs per-tensor scaling | Hopper production standard |
| FP8 E5M2 | 8 | 5 / 2 | ~±57,344 | Backward pass GEMMs (gradient tensors) | Wider range suits gradient magnitudes | Very low precision (4 levels/interval) | Hopper production standard |
| INT8 | 8 | — | −128 to 127 | Quantised inference (post-training) | Efficient, widely supported | Training harder; PTQ accuracy risk | Inference staple |
| FP4 / NVFP4 | 4 | 2 / 1 | ~±6 | Blackwell inference, experimental training | 4× TC speed vs BF16 on Blackwell | 16 representable values total | Emerging / Blackwell-specific |
| INT4 | 4 | — | −8 to 7 | Aggressive inference quantisation (GPTQ, AWQ) | Tiny footprint, 4× KV compression | Visible quality degradation on some tasks | Common in compressed inference |
BF16 keeps the same exponent width as FP32 (8 bits), which means the same dynamic range. Large-model gradients, which regularly reach values close to FP16's ±65,504 ceiling, simply don't overflow. The precision tradeoff (fewer mantissa bits) turned out not to matter for most training workloads.
When someone says the model "trains in FP8," they almost never mean every tensor everywhere is 8-bit. They mean GEMM inputs are cast to FP8 at kernel boundaries, while accumulation, optimizer state, and some activations remain in BF16/FP32. Mixed-precision training is always a multi-format story.
Start with BF16. It is the default for most modern large-model stacks (Megatron-LM, NeMo, Llama reference implementations). Loss scaling is not needed. Optimizer state stays in FP32.
Investigate FP8 via NVIDIA Transformer Engine. Expect ~1.3–1.5× end-to-end speedup vs BF16. Requires careful validation — some models (particularly those with unusual activation distributions) need tuning of scaling factors.
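For orientation, a minimal sketch of what enabling FP8 looks like with the Transformer Engine PyTorch API (argument and class names follow the TE documentation at the time of writing; treat the exact signatures as something to verify against the current docs):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID = E4M3 for forward tensors, E5M2 for gradients, matching the split described above.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(32, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)        # GEMM inputs cast to FP8 E4M3 at the kernel boundary
y.sum().backward()      # backward GEMM uses E5M2 for the gradient operand
```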
For inference quantisation: use FP8 for near-lossless compression, INT8 for wider hardware support, and INT4 (GPTQ/AWQ) for aggressive compression with acceptable quality. Always benchmark accuracy on your target task before shipping.
Keep FP32 for optimizer state (Adam m and v), master weight copies, loss scalar, and any operation where precision directly affects convergence. These are usually a small fraction of total memory.
| Tensor | BF16 training | FP8 training (H100) | INT4 inference |
|---|---|---|---|
| Weights (stored) | BF16 | BF16 | INT4 |
| Weights (for GEMM) | BF16 | FP8 E4M3 | INT4 → BF16 dequant |
| Activations | BF16 | FP8 E4M3 | FP16 / BF16 |
| GEMM accumulation | FP32 | FP32 | FP16 / FP32 |
| Gradients | BF16 | FP8 E5M2 | N/A |
| Optimizer state | FP32 | FP32 | N/A |
| KV cache (long context) | BF16 | BF16 | INT4 / FP8 |
| All-reduce comm | BF16 | BF16 | N/A |
v2 — expanded with bit-layout diagrams, mixed-precision pipeline, throughput benchmarks, MFU waterfall, interactive calculator, and FP4/Blackwell section.