
Floating Point in AI:
The Hidden Numbers Running Your Models

Published · 9 min read

AI is not powered by magic. It is powered by billions of approximate numbers, stored in carefully chosen formats, pushed through specialized hardware at absurd scale. Once you understand floating point, a lot of AI hardware, quantization, and training stability suddenly clicks.


The 30-second summary

Modern AI is mostly matrix multiplication over floating point numbers. The exact format of those numbers—fp32, fp16, bf16, fp8—changes the cost, speed, memory footprint, and sometimes the stability of the entire system.

That is why AI hardware is built around floating point throughput. CPUs, GPUs, TPUs, and Tensor Cores are all different answers to the same question: how do we do enormous amounts of approximate real-number math, fast enough and cheaply enough, without the model falling apart?

Core idea

The “number format” is not a low-level detail. It is one of the main levers that determines whether a model trains, whether it fits in memory, and how much infra you need to pay for.


1. Ground up: what is a floating point number?

A floating point number is the computer’s way of storing real numbers across a large range. Instead of storing every decimal digit exactly, it stores a sign, a scale, and a precision-limited payload.

The easiest mental model is scientific notation. In decimal, you might write:

−1.23 × 10⁴

Binary floating point does the same thing, but with powers of two instead of powers of ten. In IEEE-style float32, the bits are split into three fields:

Field | Bits in float32 | What it does
Sign | 1 | Positive or negative
Exponent | 8 | Controls dynamic range: very large vs very tiny values
Fraction / significand | 23 | Controls precision: how finely values can be represented

Layout: [ S (1 bit) | Exponent (8 bits) | Fraction / Significand (23 bits) ]
A note on terminology: the fraction field is often called the "mantissa," but for normalized IEEE floats the more precise modern term is significand. This article uses significand from here on.
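To make the three fields concrete, here is a small sketch that pulls them out of a float's raw bits using Python's standard `struct` module (the helper name `float32_fields` is mine, not a standard API):

```python
import struct

def float32_fields(x):
    # Reinterpret the float's 32 bits as an unsigned integer (big-endian).
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, stored with a bias of 127
    significand = bits & 0x7FFFFF      # 23 bits of fraction
    return sign, exponent, significand

print(float32_fields(-1.0))  # (1, 127, 0)
```

The stored exponent is biased by 127, so an exponent field of 127 means a scale of 2⁰.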

The practical catch is that you do not get exact real arithmetic. You get a finite approximation. That is why code like this surprises people:

print(0.1 + 0.2)      # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False

That weirdness is not a bug. It is the normal consequence of representing decimal fractions in binary with limited precision.
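Because of this, numeric code compares floats within a tolerance instead of with exact equality; Python's built-in `math.isclose` is one standard way:

```python
import math

print(0.1 + 0.2 == 0.3)              # False: exact bit equality is too strict
print(math.isclose(0.1 + 0.2, 0.3))  # True: equal within a relative tolerance
```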

2. What is an FPU, and where does it sit?

FPU stands for Floating Point Unit: the execution hardware that performs floating point operations like add, multiply, fused multiply-add, and conversions between formats.

On CPUs, floating point execution lives alongside other execution resources. On GPUs, the machine is scaled around massive parallel floating point throughput. On AI accelerators, the design is often pushed even further toward matrix-oriented datapaths.

[Diagram: a CPU pairs a few cores (control logic, caches, FP/INT units) with a large shared cache, optimized for latency and branchy code; a GPU packs many parallel lanes tuned for throughput, excellent for matrix-heavy workloads.]

Hardware | Best mental model | Why AI cares
CPU | Flexible general-purpose engine | Runs orchestration, preprocessing, control flow, some small-batch inference
GPU | Massively parallel throughput machine | Excellent for the repeated matrix multiplications inside training and inference
TPU / AI accelerator | Even more specialized matrix engine | Pushes utilization and efficiency for large dense tensor operations
Tensor Core / matrix unit | Specialized fused matrix datapath | Turns the dominant AI primitive, GEMM, into a first-class hardware fast path

The important idea is not the exact marketing label. It is that AI workloads are dominated by regular dense math, so hardware gets built to accelerate that exact shape of work.

3. Why AI lives on floats

A neural network is mostly repeated evaluation of expressions like:

output = activation(weights × input + bias)

Those weights, activations, gradients, optimizer states, and intermediate buffers all live as numeric arrays. In large models, those arrays become the cost center.

Format | Bytes per weight | 8B parameters | Why it matters
fp32 | 4 | 32 GB | Safe baseline, but expensive
fp16 / bf16 | 2 | 16 GB | Halves the model memory footprint
int8 / fp8 | 1 | 8 GB | Can dramatically improve serving density

This is the economics layer.

Whether a model fits on one GPU, needs tensor parallelism, or can run on a laptop is often decided first by number format, then by algorithmic tricks built on top of it.
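The table's numbers are plain bytes-per-parameter arithmetic, and it is worth being able to sanity-check them yourself (the helper below is a sketch; real deployments also budget for activations, optimizer state, and KV cache):

```python
def model_memory_gb(n_params, bytes_per_param):
    # Weight storage only; using 1 GB = 1e9 bytes.
    return n_params * bytes_per_param / 1e9

for fmt, nbytes in [("fp32", 4), ("fp16 / bf16", 2), ("int8 / fp8", 1)]:
    print(f"{fmt}: {model_memory_gb(8e9, nbytes):.0f} GB")
# fp32: 32 GB, fp16 / bf16: 16 GB, int8 / fp8: 8 GB
```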

4. The formats that matter in AI

Format | Range profile | Precision profile | Typical use | Main tradeoff
fp32 | High | High | Master weights, accumulation, CPU work, some training paths | Reliable but memory-hungry
fp16 | Much narrower exponent range | Better precision than bf16 for the same bit budget | Inference, mixed precision training | Overflows and underflows more easily
bf16 | Same exponent range as fp32 | Lower precision than fp16 | Modern training default on newer hardware | Safer dynamic range, noisier low-bit precision
fp8 | Very limited; variant-dependent | Very limited | Aggressive training/inference optimization | Needs scaling and careful kernels
int8 / int4 | None (integer range) | Fixed step size | Pure inference density | Drops the exponent entirely; dynamic range must come from scale factors

The most important comparison for modern training is usually fp16 versus bf16. The big reason bf16 won so much mindshare is simple: it preserves float32-like exponent range, which makes training far less fragile in the face of spikes.

import torch

big_num = torch.tensor(50000.0)

f16 = big_num.to(torch.float16)
bf16 = big_num.to(torch.bfloat16)

print(f16)   # tensor(49984., dtype=torch.float16): 50000 is already rounded
print(bf16)  # tensor(49920., dtype=torch.bfloat16): even coarser rounding

print(f16 * 2)   # tensor(inf, dtype=torch.float16): 99968 exceeds fp16's max of ~65504
print(bf16 * 2)  # tensor(99840., dtype=torch.bfloat16): approximate, but finite

Notice that neither format stores 50000 exactly, and the bf16 result after doubling is finite but approximate, not exact: bf16's wide exponent range comes at the cost of very low precision.

5. Where floats break AI

Gradient underflow

If updates become too small to represent, they round to zero and learning stalls. This is one reason mixed-precision training accumulates into higher precision and uses loss scaling: gradients are multiplied by a large constant before the low-precision step, then divided back out.
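A toy illustration of why loss scaling helps, using Python's `struct` module to round-trip a value through IEEE half precision as a stand-in for fp16 hardware (the `to_fp16` helper and the scale of 2^16 are illustrative choices, not a library API):

```python
import struct

def to_fp16(x):
    # Round-trip through IEEE half precision via struct's "e" format.
    return struct.unpack("e", struct.pack("e", x))[0]

grad = 1e-8
print(to_fp16(grad))  # 0.0: below fp16's smallest subnormal (~6e-8), the update is lost

scale = 2.0 ** 16     # scale values up into fp16's representable range
recovered = to_fp16(grad * scale) / scale
print(recovered)      # ~1e-8: survives the round trip, up to fp16 rounding error
```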

Softmax overflow

Attention computes exponentials, and exponentials grow fast: e^100 ≈ 2.7 × 10^43, vastly beyond the maximum finite value of an fp16 float (~65504), so a naive implementation abruptly blows up into inf or NaN.

softmax(x_i) = e^(x_i − max(x)) / Σ_j e^(x_j − max(x))

By subtracting the maximum value from all inputs before exponentiating, the highest value becomes 0 (since e^0 = 1), safely anchoring the entire distribution below the overflow limit. This is the small numerical trick that quietly makes modern transformers possible.
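A minimal pure-Python sketch of the same trick (function names are mine):

```python
import math

def naive_softmax(xs):
    exps = [math.exp(x) for x in xs]       # overflows for large inputs
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(xs):
    m = max(xs)                            # shift so the largest exponent is 0
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(stable_softmax([1000.0, 1000.0]))    # [0.5, 0.5]
# naive_softmax([1000.0, 1000.0]) raises OverflowError even in float64
```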

Reduction order and reproducibility

Floating point addition is not associative. Reordering a large reduction can change the last few bits, and over billions of operations those tiny differences propagate. That is why "deterministic training" is more fragile and slower than people expect.
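You can see the non-associativity with three ordinary doubles:

```python
a = (0.1 + 0.2) + 0.3   # groups left
b = 0.1 + (0.2 + 0.3)   # groups right
print(a)        # 0.6000000000000001
print(b)        # 0.6
print(a == b)   # False: same numbers, different rounding along the way
```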

Quantization error

Lower-bit formats save memory and boost throughput, but they do so by snapping values onto a coarser grid. The art of quantization is deciding where that loss is acceptable and where it is not.
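A sketch of that "coarser grid" in its simplest form, symmetric int8 quantization with a fixed scale (the function name and numbers are illustrative, not a library API):

```python
def quantize_int8(x, scale):
    # Snap x onto the 256-level int8 grid, then map back to a real value.
    q = max(-128, min(127, round(x / scale)))
    return q * scale

scale = 0.1
for x in [0.337, 1.004, 12.9, 100.0]:
    print(x, "->", quantize_int8(x, scale))
```

Values inside the grid move by at most half a step (scale / 2); values beyond ±127 × scale clip entirely, which is why real systems pick per-tensor or per-channel scales from the data's actual range.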

6. Why smaller formats make hardware faster

Smaller numeric formats reduce memory traffic, register pressure, and silicon cost per arithmetic lane. That usually means more parallel compute can fit into the same chip area and more data can move through the system per second.

Same chip area, different arithmetic density fp32: fewer, larger units lower precision: more parallel units

This is the deep reason hardware roadmaps care so much about formats. A format change is not only a software detail. It can reshape the architecture of the silicon itself.

7. fp8 and quantization: why scaling matters

There is no single fp8 format. Modern hardware such as NVIDIA's Hopper architecture uses two: E4M3 (4 exponent bits, 3 significand bits) for forward passes, where precision matters more, and E5M2 (5 exponent bits, 2 significand bits) for backward passes, where gradients need the wider dynamic range to avoid underflow.

Lower-precision formats only work if values are mapped into the range the hardware can represent. That is where scaling comes in. Instead of storing raw values directly, we often store a scale factor plus compressed values.

import torch

def quantize_to_fp8_e4m3(tensor):
    # 448 is the largest finite value representable in float8_e4m3fn.
    abs_max = tensor.abs().max()
    scale = abs_max.clamp(min=1e-12) / 448.0  # guard against all-zero tensors
    q_tensor = (tensor / scale).to(torch.float8_e4m3fn)
    return q_tensor, scale

def dequantize_from_fp8(q_tensor, scale):
    # fp8 tensors do not support most arithmetic directly; widen first, then rescale.
    return q_tensor.to(torch.float32) * scale

The big idea is simple: one extra scale value lets the low-bit tensor use far more of its available representational space. In practice, production systems usually go further with per-channel scaling, delayed scaling, or stochastic rounding.

8. Practical rules for surviving floats in AI

The bottom line

Floats define the economics of AI. Memory footprint, throughput, serving density, and training stability are all downstream of the number format.

Hardware is co-designed with numerics. Tensor Cores, TPUs, mixed-precision kernels, and quantization stacks exist because the format of the numbers is one of the main bottlenecks.

A lot of “AI systems work” is really numerical systems work. Mixed precision, FlashAttention, QLoRA, AWQ, activation scaling, gradient scaling—these are all different ways of preserving useful behavior while spending fewer bits.


Once you see AI as a problem of moving, storing, and multiplying approximate numbers at scale, the design of the entire stack starts to make sense—from kernels and compilers all the way up to model serving economics.