AI is not powered by magic. It is powered by billions of approximate numbers, stored in carefully chosen formats, pushed through specialized hardware at absurd scale. Once you understand floating point, a lot of AI hardware, quantization, and training stability suddenly clicks.
Modern AI is mostly matrix multiplication over floating point numbers. The exact format of those numbers—fp32, fp16, bf16, fp8—changes the cost, speed, memory footprint, and sometimes the stability of the entire system.
That is why AI hardware is built around floating point throughput. CPUs, GPUs, TPUs, and Tensor Cores are all different answers to the same question: how do we do enormous amounts of approximate real-number math, fast enough and cheaply enough, without the model falling apart?
The “number format” is not a low-level detail. It is one of the main levers that determines whether a model trains, whether it fits in memory, and how much infra you need to pay for.
A floating point number is the computer’s way of storing real numbers across a large range. Instead of storing every decimal digit exactly, it stores a sign, a scale, and a precision-limited payload.
The easiest mental model is scientific notation. In decimal, you might write 6.022 × 10^23: a sign, a handful of significant digits, and a power of ten that sets the scale.
Binary floating point does the same thing, but with powers of two instead of powers of ten. In IEEE-style float32, the bits are split into three fields:
| Field | Bits in float32 | What it does |
|---|---|---|
| Sign | 1 | Positive or negative |
| Exponent | 8 | Controls dynamic range: very large vs very tiny values |
| Fraction / significand | 23 | Controls precision: how finely values can be represented |
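You can see these three fields directly by reinterpreting a float's bits. A minimal sketch using Python's `struct` module (the helper name `float32_bits` is illustrative, not a standard API):

```python
import struct

def float32_bits(x):
    """Unpack an IEEE-754 float32 into its sign, exponent, and fraction fields."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    sign = raw >> 31                 # 1 bit
    exponent = (raw >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    fraction = raw & 0x7FFFFF        # 23 bits of the significand
    return sign, exponent, fraction

# -6.25 = -1.5625 * 2^2, so the stored exponent is 2 + 127 = 129
print(float32_bits(-6.25))  # (1, 129, 4718592)
```

The exponent bias (127 for float32) is what lets the same unsigned field represent both very large and very tiny scales.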
People often say “mantissa,” but for normalized IEEE floats the more precise modern term is significand; having mentioned both once, the rest of this article sticks with significand.
The practical catch is that you do not get exact real arithmetic. You get a finite approximation. That is why code like this surprises people:
```python
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False
```
That weirdness is not a bug. It is the normal consequence of representing decimal fractions in binary with limited precision.
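The standard remedy is tolerance-based comparison rather than `==`. In Python, the stdlib's `math.isclose` is one option:

```python
import math

print(0.1 + 0.2 == 0.3)              # False: bit-for-bit comparison
print(math.isclose(0.1 + 0.2, 0.3))  # True: comparison within a relative tolerance
```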
FPU stands for Floating Point Unit: the execution hardware that performs floating point operations like add, multiply, fused multiply-add, and conversions between formats.
On CPUs, floating point execution lives alongside other execution resources. On GPUs, the machine is scaled around massive parallel floating point throughput. On AI accelerators, the design is often pushed even further toward matrix-oriented datapaths.
| Hardware | Best mental model | Why AI cares |
|---|---|---|
| CPU | Flexible general-purpose engine | Runs orchestration, preprocessing, control flow, some small-batch inference |
| GPU | Massively parallel throughput machine | Excellent for the repeated matrix multiplications inside training and inference |
| TPU / AI accelerator | Even more specialized matrix engine | Pushes utilization and efficiency for large dense tensor operations |
| Tensor Core / matrix unit | Specialized fused matrix datapath | Turns the dominant AI primitive, GEMM, into a first-class hardware fast path |
The important idea is not the exact marketing label. It is that AI workloads are dominated by regular dense math, so hardware gets built to accelerate that exact shape of work.
A neural network is mostly repeated evaluation of expressions like y = activation(W·x + b), layer after layer.
Those weights, activations, gradients, optimizer states, and intermediate buffers all live as numeric arrays. In large models, those arrays become the cost center.
| Format | Bytes per weight | 8B parameters | Why it matters |
|---|---|---|---|
| fp32 | 4 | 32 GB | Safe baseline, but expensive |
| fp16 / bf16 | 2 | 16 GB | Halves model memory footprint |
| int8 / fp8 | 1 | 8 GB | Can dramatically improve serving density |
Whether a model fits on one GPU, needs tensor parallelism, or can run on a laptop is often decided first by number format, then by algorithmic tricks built on top of it.
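The table's arithmetic is simply parameters × bytes per weight. A quick sketch (using GB = 10^9 bytes, as in the table, and counting raw weight storage only):

```python
def model_memory_gb(n_params: float, bytes_per_weight: int) -> float:
    """Raw weight storage only: ignores activations, KV cache, optimizer state."""
    return n_params * bytes_per_weight / 1e9

for fmt, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8/fp8", 1)]:
    print(f"{fmt}: {model_memory_gb(8e9, nbytes):.0f} GB")  # 32, 16, 8 GB
```

In practice training needs several multiples of this (gradients, optimizer moments, activations), which is why format choices compound so quickly.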
| Format | Range profile | Precision profile | Typical use | Main tradeoff |
|---|---|---|---|---|
| fp32 | High | High | Master weights, accumulation, CPU work, some training paths | Reliable but memory-hungry |
| fp16 | Much narrower exponent range | Better precision than bf16 for same bit budget | Inference, mixed precision training | Can overflow or underflow more easily |
| bf16 | Same exponent range as fp32 | Lower precision than fp16 | Modern training default on recent hardware | Safer dynamic range, noisier low-bit precision |
| fp8 | Very limited; variant-dependent | Very limited | Aggressive training/inference optimization | Needs scaling and careful kernels |
| int8 / int4 | None (integer range) | Fixed step size | Pure inference density | Drops the exponent entirely, trading dynamic range for density and memory savings |
The most important comparison for modern training is usually fp16 versus bf16. The big reason bf16 won so much mindshare is simple: it preserves float32-like exponent range, which makes training far less fragile in the face of spikes.
```python
import torch

big_num = torch.tensor(50000.0)
f16 = big_num.to(torch.float16)
bf16 = big_num.to(torch.bfloat16)

print(f16)       # tensor(50000., dtype=torch.float16)
print(bf16)      # tensor(50000., dtype=torch.bfloat16)

print(f16 * 2)   # tensor(inf, dtype=torch.float16): past fp16's max of ~65504
print(bf16 * 2)  # tensor(99840., dtype=torch.bfloat16): finite, but approximate
```
Note that the bf16 result is finite but not exact: the true product is 100000, but at this magnitude bf16's 8-bit significand can only step in increments of 512, so the value lands on 99840.
If updates become too small to represent, they round to zero and learning stalls. This is one reason mixed-precision training often accumulates into higher precision and uses loss scaling.
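You can watch an update vanish with a stdlib sketch, round-tripping values through `struct`'s half-precision format as a stand-in for fp16 storage (the helper name and the scale of 1024 are illustrative):

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE-754 half precision."""
    return struct.unpack(">e", struct.pack(">e", x))[0]

grad = 1e-8                            # below fp16's smallest subnormal (~6e-8)
print(to_fp16(grad))                   # 0.0: the update silently rounds away
scale = 1024.0                         # hypothetical loss scale
print(to_fp16(grad * scale) / scale)   # ~1e-8: the scaled value survives the trip
```

This is exactly what loss scaling does: multiply before the lossy format, divide after.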
Attention computes exponentials, and exponentials grow fast: e^100 ≈ 2.7×10^43, far beyond fp16's maximum of ~65504 (and even fp32's ~3.4×10^38), so a naive softmax abruptly blows up into inf or NaN.
By subtracting the maximum value from all inputs before exponentiating, the highest value becomes 0 (since e^0 = 1), safely anchoring the entire distribution below the overflow limit. This is the small numerical trick that quietly makes modern transformers possible.
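A minimal version of the trick in plain Python (calling `math.exp(1000.0)` directly would overflow):

```python
import math

def stable_softmax(xs):
    # Shift so the largest input maps to e^0 = 1; every other term is <= 1
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = stable_softmax([1000.0, 1001.0, 1002.0])
print(probs)       # finite probabilities
print(sum(probs))  # 1.0 (up to rounding)
```

The shift changes nothing mathematically, since the softmax of x and of x − m are identical; it only changes which intermediate values the hardware has to represent.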
Floating point addition is not truly associative. Reordering a large reduction can change the last few bits, and over billions of operations those tiny differences can propagate. That is why “deterministic training” is always more fragile and slower than people expect.
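The effect shows up with as few as three terms; the same numbers grouped differently produce different bits:

```python
a = (0.1 + 0.2) + 0.3  # left-to-right reduction
b = 0.1 + (0.2 + 0.3)  # same terms, different grouping
print(a == b)          # False
print(a, b)            # 0.6000000000000001 0.6
```

Now scale that to a GPU reduction whose thread scheduling changes run to run, and bitwise reproducibility becomes something you must engineer for, not something you get for free.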
Lower-bit formats save memory and boost throughput, but they do so by snapping values onto a coarser grid. The art of quantization is deciding where that loss is acceptable and where it is not.
Smaller numeric formats reduce memory traffic, register pressure, and silicon cost per arithmetic lane. That usually means more parallel compute can fit into the same chip area and more data can move through the system per second.
This is the deep reason hardware roadmaps care so much about formats. A format change is not only a software detail. It can reshape the architecture of the silicon itself.
There is actually no single fp8 format. Modern hardware like NVIDIA's Hopper architecture uses two complementary variants: E4M3 (4 exponent bits, 3 significand bits) for forward passes, where precision matters more, and E5M2 (5 exponent bits, 2 significand bits) for backward passes, where gradient underflow demands a wider dynamic range.
Lower-precision formats only work if values are mapped into the range the hardware can represent. That is where scaling comes in. Instead of storing raw values directly, we often store a scale factor plus compressed values.
```python
import torch

def quantize_to_fp8_e4m3(tensor):
    # Map the tensor's observed range onto fp8 E4M3's representable max of 448
    abs_max = tensor.abs().max()
    scale = abs_max / 448.0
    q_tensor = (tensor / scale).to(torch.float8_e4m3fn)
    return q_tensor, scale

def dequantize_from_fp8(q_tensor, scale):
    # Undo the scaling in float32
    return q_tensor.to(torch.float32) * scale
```
The big idea is simple: one extra scale value lets the low-bit tensor use far more of its available representational space. In practice, production systems usually go further with per-channel scaling, delayed scaling, or stochastic rounding.
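The same scale-plus-payload idea applies to integer formats too. Here is a hypothetical absmax int8 sketch in plain Python, no fp8 hardware required (function names are illustrative):

```python
def quantize_int8(values):
    # Absmax scaling: map the largest magnitude onto int8's limit of 127
    scale = max(abs(v) for v in values) / 127.0
    return [round(v / scale) for v in values], scale

def dequantize_int8(q, scale):
    return [x * scale for x in q]

vals = [0.02, -1.27, 0.6, 0.001]
q, scale = quantize_int8(vals)
print(q)                          # [2, -127, 60, 0]
print(dequantize_int8(q, scale))  # close to the originals; 0.001 collapsed to 0.0
```

Notice how the smallest value rounds to zero: the fixed step size is exactly the coarser grid the previous paragraph describes, and per-channel scaling exists to shrink that grid where it hurts most.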
Never compare floats with == unless you truly mean bitwise identity.

Floats define the economics of AI. Memory footprint, throughput, serving density, and training stability are all downstream of the number format.
Hardware is co-designed with numerics. Tensor Cores, TPUs, mixed-precision kernels, and quantization stacks exist because the format of the numbers is one of the main bottlenecks.
A lot of “AI systems work” is really numerical systems work. Mixed precision, FlashAttention, QLoRA, AWQ, activation scaling, gradient scaling—these are all different ways of preserving useful behavior while spending fewer bits.
Once you see AI as a problem of moving, storing, and multiplying approximate numbers at scale, the design of the entire stack starts to make sense—from kernels and compilers all the way up to model serving economics.