AI is not powered by magic. It is powered by billions of approximate numbers, stored in carefully chosen formats, pushed through specialized hardware at absurd scale. Once you understand floating point, a lot of AI hardware, quantization, and training stability suddenly clicks.
Modern AI is mostly matrix multiplication over floating point numbers. The exact format of those numbers—fp32, fp16, bf16, fp8—changes the cost, speed, memory footprint, and sometimes the stability of the entire system.
That is why AI hardware is built around floating point throughput. CPUs, GPUs, TPUs, and Tensor Cores are all different answers to the same question: how do we do enormous amounts of approximate real-number math, fast enough and cheaply enough, without the model falling apart?
The “number format” is not a low-level detail. It is one of the main levers that determines whether a model trains, whether it fits in memory, and how much infra you need to pay for.
A floating point number is the computer’s way of storing real numbers across a large range. Instead of storing every decimal digit exactly, it stores a sign, a scale, and a precision-limited payload.
The easiest mental model is scientific notation. In decimal, you might write 6.022 × 10^23: a sign, a handful of significant digits, and a power of ten that sets the scale.
Binary floating point does the same thing, but with powers of two instead of powers of ten. In IEEE-style float32, the bits are split into three fields:
| Field | Bits in float32 | What it does |
|---|---|---|
| Sign | 1 | Positive or negative |
| Exponent | 8 | Controls dynamic range: very large vs very tiny values |
| Fraction / significand | 23 | Controls precision: how finely values can be represented |
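You can see these three fields directly by reinterpreting a float's bits. A minimal sketch using Python's `struct` module (the helper name `float32_bits` is illustrative, not a standard API):

```python
import struct

def float32_bits(x):
    """Unpack an IEEE-754 float32 into its sign, exponent, and fraction fields."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    sign = raw >> 31                 # 1 bit
    exponent = (raw >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    fraction = raw & 0x7FFFFF        # 23 bits of the significand
    return sign, exponent, fraction

# -6.25 = -1.5625 * 2^2, so the stored exponent is 2 + 127 = 129
print(float32_bits(-6.25))  # (1, 129, 4718592)
```

The exponent bias (127 for float32) is what lets the same unsigned field represent both very large and very tiny scales.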
People often say “mantissa,” but for normalized IEEE floats the more precise modern term is significand; having mentioned both once, the rest of this article sticks with significand.
The practical catch is that you do not get exact real arithmetic. You get a finite approximation. That is why code like this surprises people:
```python
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False
```
That weirdness is not a bug. It is the normal consequence of representing decimal fractions in binary with limited precision.
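The standard remedy is tolerance-based comparison rather than `==`. In Python, the stdlib's `math.isclose` is one option:

```python
import math

print(0.1 + 0.2 == 0.3)              # False: bit-for-bit comparison
print(math.isclose(0.1 + 0.2, 0.3))  # True: comparison within a relative tolerance
```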
FPU stands for Floating Point Unit: the execution hardware that performs floating point operations like add, multiply, fused multiply-add, and conversions between formats.
On CPUs, floating point execution lives alongside other execution resources. On GPUs, the machine is scaled around massive parallel floating point throughput. On AI accelerators, the design is often pushed even further toward matrix-oriented datapaths.
| Hardware | Best mental model | Why AI cares |
|---|---|---|
| CPU | Flexible general-purpose engine | Runs orchestration, preprocessing, control flow, some small-batch inference |
| GPU | Massively parallel throughput machine | Excellent for the repeated matrix multiplications inside training and inference |
| TPU / AI accelerator | Even more specialized matrix engine | Pushes utilization and efficiency for large dense tensor operations |
| Tensor Core / matrix unit | Specialized fused matrix datapath | Turns the dominant AI primitive, GEMM, into a first-class hardware fast path |
The important idea is not the exact marketing label. It is that AI workloads are dominated by regular dense math, so hardware gets built to accelerate that exact shape of work.
A neural network is mostly repeated evaluation of expressions like y = activation(W·x + b), layer after layer.
Those weights, activations, gradients, optimizer states, and intermediate buffers all live as numeric arrays. In large models, those arrays become the cost center.
| Format | Bytes per weight | 8B parameters | Why it matters |
|---|---|---|---|
| fp32 | 4 | 32 GB | Safe baseline, but expensive |
| fp16 / bf16 | 2 | 16 GB | Halves model memory footprint |
| int8 / fp8 | 1 | 8 GB | Can dramatically improve serving density |
Whether a model fits on one GPU, needs tensor parallelism, or can run on a laptop is often decided first by number format, then by algorithmic tricks built on top of it.
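The table's arithmetic is simply parameters × bytes per weight. A quick sketch (using GB = 10^9 bytes, as in the table, and counting raw weight storage only):

```python
def model_memory_gb(n_params: float, bytes_per_weight: int) -> float:
    """Raw weight storage only: ignores activations, KV cache, optimizer state."""
    return n_params * bytes_per_weight / 1e9

for fmt, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8/fp8", 1)]:
    print(f"{fmt}: {model_memory_gb(8e9, nbytes):.0f} GB")  # 32, 16, 8 GB
```

In practice training needs several multiples of this (gradients, optimizer moments, activations), which is why format choices compound so quickly.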
| Format | Range profile | Precision profile | Typical use | Main tradeoff |
|---|---|---|---|---|
| fp32 | High | High | Master weights, accumulation, CPU work, some training paths | Reliable but memory-hungry |
| fp16 | Much narrower exponent range | Better precision than bf16 for same bit budget | Inference, mixed precision training | Can overflow or underflow more easily |
| bf16 | Same exponent range as fp32 | Lower precision than fp16 | Modern training default on recent hardware | Safer dynamic range, noisier low-bit precision |
| fp8 | Very limited; variant-dependent | Very limited | Aggressive training/inference optimization | Needs scaling and careful kernels |
| int8 / int4 | None (integer range) | Fixed step size | Pure inference density | Drops the exponent entirely, trading dynamic range for density and memory savings |
The most important comparison for modern training is usually fp16 versus bf16. The big reason bf16 won so much mindshare is simple: it preserves float32-like exponent range, which makes training far less fragile in the face of spikes.
```python
import torch

big_num = torch.tensor(50000.0)
f16 = big_num.to(torch.float16)
bf16 = big_num.to(torch.bfloat16)

print(f16)       # tensor(50000., dtype=torch.float16)
print(bf16)      # tensor(50000., dtype=torch.bfloat16)

print(f16 * 2)   # tensor(inf, dtype=torch.float16): past fp16's max of ~65504
print(bf16 * 2)  # tensor(99840., dtype=torch.bfloat16): finite, but approximate
```
Note that the bf16 result is finite but not exact: the true product is 100000, but at this magnitude bf16's 8-bit significand can only step in increments of 512, so the value lands on 99840.
If updates become too small to represent, they round to zero and learning stalls. This is one reason mixed-precision training often accumulates into higher precision and uses loss scaling.
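You can watch an update vanish with a stdlib sketch, round-tripping values through `struct`'s half-precision format as a stand-in for fp16 storage (the helper name and the scale of 1024 are illustrative):

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE-754 half precision."""
    return struct.unpack(">e", struct.pack(">e", x))[0]

grad = 1e-8                            # below fp16's smallest subnormal (~6e-8)
print(to_fp16(grad))                   # 0.0: the update silently rounds away
scale = 1024.0                         # hypothetical loss scale
print(to_fp16(grad * scale) / scale)   # ~1e-8: the scaled value survives the trip
```

This is exactly what loss scaling does: multiply before the lossy format, divide after.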
Attention computes exponentials, and exponentials grow fast: e^100 ≈ 2.7×10^43, far beyond fp16's maximum of ~65504 (and even fp32's ~3.4×10^38), so a naive softmax abruptly blows up into inf or NaN.
By subtracting the maximum value from all inputs before exponentiating, the highest value becomes 0 (since e^0 = 1), safely anchoring the entire distribution below the overflow limit. This is the small numerical trick that quietly makes modern transformers possible.
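A minimal version of the trick in plain Python (calling `math.exp(1000.0)` directly would overflow):

```python
import math

def stable_softmax(xs):
    # Shift so the largest input maps to e^0 = 1; every other term is <= 1
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = stable_softmax([1000.0, 1001.0, 1002.0])
print(probs)       # finite probabilities
print(sum(probs))  # 1.0 (up to rounding)
```

The shift changes nothing mathematically, since the softmax of x and of x − m are identical; it only changes which intermediate values the hardware has to represent.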
Floating point addition is not truly associative. Reordering a large reduction can change the last few bits, and over billions of operations those tiny differences can propagate. That is why “deterministic training” is always more fragile and slower than people expect.
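The effect shows up with as few as three terms; the same numbers grouped differently produce different bits:

```python
a = (0.1 + 0.2) + 0.3  # left-to-right reduction
b = 0.1 + (0.2 + 0.3)  # same terms, different grouping
print(a == b)          # False
print(a, b)            # 0.6000000000000001 0.6
```

Now scale that to a GPU reduction whose thread scheduling changes run to run, and bitwise reproducibility becomes something you must engineer for, not something you get for free.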
Lower-bit formats save memory and boost throughput, but they do so by snapping values onto a coarser grid. The art of quantization is deciding where that loss is acceptable and where it is not.
Smaller numeric formats reduce memory traffic, register pressure, and silicon cost per arithmetic lane. That usually means more parallel compute can fit into the same chip area and more data can move through the system per second.
This is the deep reason hardware roadmaps care so much about formats. A format change is not only a software detail. It can reshape the architecture of the silicon itself.
There is actually no single fp8 format. Modern hardware like NVIDIA's Hopper architecture uses two complementary variants: E4M3 (4 exponent bits, 3 significand bits) for forward passes, where precision matters more, and E5M2 (5 exponent bits, 2 significand bits) for backward passes, where gradient underflow demands a wider dynamic range.
Lower-precision formats only work if values are mapped into the range the hardware can represent. That is where scaling comes in. Instead of storing raw values directly, we often store a scale factor plus compressed values.
```python
import torch

def quantize_to_fp8_e4m3(tensor):
    # Map the tensor's observed range onto fp8 E4M3's representable max of 448
    abs_max = tensor.abs().max()
    scale = abs_max / 448.0
    q_tensor = (tensor / scale).to(torch.float8_e4m3fn)
    return q_tensor, scale

def dequantize_from_fp8(q_tensor, scale):
    # Undo the scaling in float32
    return q_tensor.to(torch.float32) * scale
```
The big idea is simple: one extra scale value lets the low-bit tensor use far more of its available representational space. In practice, production systems usually go further with per-channel scaling, delayed scaling, or stochastic rounding.
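The same scale-plus-payload idea applies to integer formats too. Here is a hypothetical absmax int8 sketch in plain Python, no fp8 hardware required (function names are illustrative):

```python
def quantize_int8(values):
    # Absmax scaling: map the largest magnitude onto int8's limit of 127
    scale = max(abs(v) for v in values) / 127.0
    return [round(v / scale) for v in values], scale

def dequantize_int8(q, scale):
    return [x * scale for x in q]

vals = [0.02, -1.27, 0.6, 0.001]
q, scale = quantize_int8(vals)
print(q)                          # [2, -127, 60, 0]
print(dequantize_int8(q, scale))  # close to the originals; 0.001 collapsed to 0.0
```

Notice how the smallest value rounds to zero: the fixed step size is exactly the coarser grid the previous paragraph describes, and per-channel scaling exists to shrink that grid where it hurts most.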
Never compare floats with == unless you truly mean bitwise identity.

Floats define the economics of AI. Memory footprint, throughput, serving density, and training stability are all downstream of the number format.
Hardware is co-designed with numerics. Tensor Cores, TPUs, mixed-precision kernels, and quantization stacks exist because the format of the numbers is one of the main bottlenecks.
A lot of “AI systems work” is really numerical systems work. Mixed precision, FlashAttention, QLoRA, AWQ, activation scaling, gradient scaling—these are all different ways of preserving useful behavior while spending fewer bits.
Once you see AI as a problem of moving, storing, and multiplying approximate numbers at scale, the design of the entire stack starts to make sense—from kernels and compilers all the way up to model serving economics.