The GPU is the central hardware artefact of the AI era, yet it is almost always treated as a black box — a thing that runs matrix multiplies and produces tokens. This essay opens the box: what lives inside an SM, why warps exist, how tensor cores achieve their throughput, and what the architecture's design choices mean for AI workloads specifically.
A modern CPU core — an Intel Raptor Cove, an ARM Cortex-X4 — devotes the majority of its transistor budget to managing complexity: out-of-order execution buffers, branch predictors, large L1/L2/L3 caches, instruction prefetch queues, speculative execution machinery. These are not compute engines — they are mechanisms to make one thread of sequential code run as fast as possible by predicting what it will do next and pre-executing it.
A GPU core makes the opposite trade. It eliminates nearly all complexity-management hardware. No deep out-of-order buffers. No branch prediction. Minimal caches. In its place: an enormous number of simple execution units and a warp-based threading model designed to tolerate memory latency by switching between hundreds of in-flight threads instead of predicting around it.
This is not the GPU being unsophisticated. It is the GPU making a deliberate trade optimised for a specific class of workload: tasks with massive data parallelism, where the same operation is applied to thousands of independent data elements simultaneously, and where memory latency is unavoidable but can be hidden by keeping enough work in flight.
The CPU optimises for latency of a single thread. The GPU optimises for throughput of thousands of threads. Both are correct — they serve different workloads. The key is understanding which one you have.
The Streaming Multiprocessor (SM) is the fundamental building block of the GPU. Every GPU is a collection of SMs connected to an HBM interface and an L2 cache. The H100 SXM5 has 132 SMs. The H200 SXM5 has the same 132 SMs with more HBM.
Each SM is itself a complete, independent compute system with its own instruction scheduler, register file, CUDA cores, tensor cores, special function units, shared memory, and L1 cache. SMs run independently and can execute entirely different programs simultaneously — this is what makes GPUs suitable for running different kernels concurrently.
A warp is a group of 32 threads that execute together in lockstep — the same instruction, applied to 32 different data elements simultaneously. This is Single Instruction Multiple Threads (SIMT) execution. The warp is the atomic unit of GPU scheduling: you cannot schedule part of a warp, and all 32 threads in a warp execute the same instruction at the same time.
Why 32? This reflects the width of the execution units. A GPU's CUDA cores are grouped in sets of 32 (a "lane" per thread), so one warp's worth of FP32 work — 32 simultaneous FMAs — fills an execution unit group naturally. It also reflects the register file organisation: registers are banked in 32-thread groups for simultaneous access.
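As a toy model (pure Python, not hardware), SIMT execution means one issued instruction drives all 32 lanes of a warp in the same cycle:

```python
WARP_SIZE = 32

def warp_issue_fma(a, b, c):
    """One issued instruction: every lane i computes a[i] * b[i] + c[i]
    simultaneously. The list comprehension stands in for 32 parallel
    hardware lanes executing in lockstep."""
    assert len(a) == len(b) == len(c) == WARP_SIZE
    return [a[i] * b[i] + c[i] for i in range(WARP_SIZE)]

# One instruction, 32 results: 2.0 * 3.0 + 1.0 in every lane.
lanes = warp_issue_fma([2.0] * WARP_SIZE, [3.0] * WARP_SIZE, [1.0] * WARP_SIZE)
print(lanes[0], len(lanes))  # 7.0 32
```

The point of the model is the granularity: there is no way to issue the FMA to fewer than 32 lanes, which is exactly the "atomic unit of scheduling" property described above.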
The critical insight is what warps do when they encounter a memory access — a load from HBM or L2 that takes hundreds of clock cycles to return. On a CPU, this causes a cache miss stall: the pipeline freezes and waits. A GPU does something entirely different: the warp scheduler marks the stalled warp as waiting and, on the very next cycle, issues an instruction from another resident warp that is ready to run. When the data arrives, the stalled warp becomes eligible again.
This is called latency hiding through thread-level parallelism, and it is the GPU's fundamental answer to slow memory. A CPU attacks latency directly (caches and prefetch); a GPU simply keeps so many threads in flight that memory latency becomes invisible.
The register file is the fastest memory in the GPU — faster than L1 cache, faster than shared memory, faster than anything else on the chip. It is also enormous by cache standards: the H100 SM has a 256 KB register file, providing 65,536 individual 32-bit registers. These registers are the working memory for all the warps resident in the SM.
The reason the register file is so large is that all resident warps share it. If the SM has 64 warps resident and each of a warp's 32 threads uses 32 registers: 64 × 32 × 32 = 65,536 registers — exactly the SM's capacity. This is not a coincidence. The register file is sized to hold the register state of all possible resident warps simultaneously, so that context-switching between warps costs zero cycles — there is no save/restore, no register spill. The warp's state is always live in the register file.
This zero-cost context switching is what makes warp-based latency hiding efficient. Unlike CPU thread switching, which requires saving registers to memory (hundreds of cycles), GPU warp switching takes one cycle — just a pointer to a different warp's register file bank.
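The resource arithmetic can be sketched in a few lines. The per-SM figures are the H100 values quoted above; the per-thread register counts are hypothetical kernel examples:

```python
# H100 per-SM figures as quoted in the text.
REGISTERS_PER_SM = 65_536   # 256 KB register file / 4 bytes per 32-bit register
THREADS_PER_WARP = 32
MAX_WARPS_PER_SM = 64

def resident_warps(regs_per_thread):
    """Warps whose register state fits in the register file,
    capped by the scheduler's 64-warp limit."""
    fit = REGISTERS_PER_SM // (regs_per_thread * THREADS_PER_WARP)
    return min(fit, MAX_WARPS_PER_SM)

# Hypothetical per-thread register budgets for four different kernels.
for regs in (32, 64, 128, 255):
    warps = resident_warps(regs)
    print(f"{regs:>3} regs/thread -> {warps:>2} resident warps "
          f"({warps / MAX_WARPS_PER_SM:.0%} of max)")
```

At 32 registers per thread the full 64 warps fit; at 128 only 16 do — a register-hungry kernel directly reduces the pool of warps available for latency hiding.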
The H100 SM has 256 KB of on-chip SRAM that serves as a unified pool for two distinct purposes: L1 cache (managed automatically by hardware) and shared memory (managed explicitly by the programmer or compiler). The split between them is configurable per kernel — a programmer can allocate up to 228 KB as shared memory, leaving the remainder as L1 cache.
Shared memory is the closest thing a GPU has to a software-managed scratchpad. It allows threads within the same thread block (a group of up to 1,024 threads spanning multiple warps on the same SM) to communicate through fast on-chip SRAM without going to HBM. For tiled matrix multiplication — the core operation of transformer attention and FFN layers — shared memory is used to load tiles of the A and B matrices from HBM into SRAM, perform the multiply-accumulate, and then move to the next tile. This tile-based approach is what allows GPU matrix multiply to reuse data in SRAM for multiple operations rather than reading it fresh from HBM each time.
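The tiling idea can be sketched in plain Python — a stand-in for what a CUDA kernel does per thread block, with the copy-to-shared-memory step reduced to a comment:

```python
TILE = 4  # toy size; real kernels use tiles like 64x64 or 128x128

def tiled_matmul(A, B, M, K, N):
    """Multiply A (M x K) by B (K x N) one TILE x TILE block at a time,
    mirroring how a GPU kernel stages tiles in shared memory."""
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, TILE):              # tile row of C
        for j0 in range(0, N, TILE):          # tile column of C
            for k0 in range(0, K, TILE):      # walk tiles along K
                # In CUDA, this is where the A and B tiles would be copied
                # from HBM into shared memory; each loaded element is then
                # reused TILE times instead of re-read from HBM.
                for i in range(i0, min(i0 + TILE, M)):
                    for j in range(j0, min(j0 + TILE, N)):
                        acc = 0.0
                        for k in range(k0, min(k0 + TILE, K)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] += acc
    return C
```

The result is identical to a naive triple loop; the only thing tiling changes is the memory access pattern — which, on a GPU, is everything.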
CUDA cores are the general-purpose arithmetic units of the SM — each one can perform one 32-bit floating-point multiply-add (FMA) per clock cycle. With 128 FP32 CUDA cores per SM and 132 SMs in the H100, that is 16,896 FMA units; at a boost clock of roughly 1.98 GHz, counting each FMA as two floating-point operations, the peak is about 67 TFLOP/s of FP32 throughput.
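Spelled out, assuming the commonly quoted ~1.98 GHz boost clock:

```python
CUDA_CORES_PER_SM = 128
SMS = 132
BOOST_CLOCK_HZ = 1.98e9   # assumed round figure for the H100 SXM5 boost clock
FLOPS_PER_FMA = 2         # one fused multiply-add = 1 multiply + 1 add

peak_fp32 = CUDA_CORES_PER_SM * SMS * BOOST_CLOCK_HZ * FLOPS_PER_FMA
print(f"{peak_fp32 / 1e12:.1f} TFLOP/s")  # 66.9 TFLOP/s
```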
CUDA cores are general-purpose: they can execute any FP32 or INT32 instruction, including non-matrix operations like element-wise activations (ReLU, GELU), normalisation, and control flow. They are the fallback for anything that doesn't fit the tensor core's specific matrix-multiply-accumulate operation.
Tensor cores are specialised compute units that perform one operation that CUDA cores cannot: a matrix multiply-accumulate (MMA) on entire tiles of matrices in a single instruction. A 4th-generation Hopper tensor core performs the operation:
D = A × B + C
where A, B, C, D are small matrices — for FP8, the tile size is 16×16×16 (a 16×16 result from a 16×16 × 16×16 multiplication). In one clock cycle, one tensor core unit computes 16 × 16 × 16 = 4,096 multiply-adds, or 8,192 floating-point operations. Compare this to a CUDA core, which performs 2 FP32 operations (one FMA) per cycle.
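The tile arithmetic, as a check:

```python
# Operation count for one 16x16x16 tile MMA (D = A @ B + C), per the
# tile size quoted in the text.
M = N = K = 16
macs = M * N * K               # multiply-accumulate cells engaged
flops = macs * 2               # multiply + add per cell
cuda_core_flops_per_cycle = 2  # one scalar FMA per cycle

print(macs, flops)                            # 4096 8192
print(flops // cuda_core_flops_per_cycle)     # 4096 -- per-cycle, per-unit gap
```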
The key distinction: a CUDA core is a scalar unit. It adds two numbers or multiplies two numbers. A tensor core is a matrix unit — it operates on an entire tile of a matrix simultaneously. The silicon is organised differently: instead of a general-purpose ALU, a tensor core contains a fixed-function array of multiply-accumulate cells arranged in a matrix pattern, with dedicated wiring to fan operands across the array in parallel.
The accumulator precision deserves attention. Although the input matrices A and B can be FP8 (8-bit floating point), the accumulator C and result D are kept in FP32. This is deliberate: 8-bit arithmetic has limited range and loses precision quickly when summing many products. By accumulating in 32-bit, the tensor core spends extra silicon in the accumulation step in exchange for numerical stability in the final result. The data movement is FP8 (fewer bytes to read from memory), while the accumulation arithmetic is FP32.
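A rough illustration of why this matters, simulating reduced precision by rounding the running sum to a fixed number of mantissa bits after every add — a crude model, not real FP8 semantics (the 3 mantissa bits of E4M3 are the assumed figure):

```python
import math

def quantize(x, mantissa_bits):
    """Round x to `mantissa_bits` significant binary digits — a crude
    stand-in for storing x in a low-precision float format."""
    if x == 0.0:
        return 0.0
    e = math.floor(math.log2(abs(x)))
    scale = 2.0 ** (e - mantissa_bits + 1)
    return round(x / scale) * scale

# Sum 4,096 small products -- one output element of a K=4096 dot product.
products = [0.01 * (1 + (i % 7) * 0.001) for i in range(4096)]

exact = sum(products)          # float64 reference
fp32_like = 0.0
fp8_like = 0.0
for p in products:
    fp32_like = quantize(fp32_like + p, 23)  # FP32 keeps 23 mantissa bits
    fp8_like = quantize(fp8_like + p, 3)     # FP8 E4M3 keeps only 3

print(f"exact={exact:.4f}  fp32-acc err={abs(fp32_like - exact):.2e}  "
      f"fp8-acc err={abs(fp8_like - exact):.2e}")
```

With only 3 mantissa bits, the running sum quickly grows larger than the rounding step can represent, and further additions are lost entirely; the 23-bit accumulator stays within a tiny error of the exact sum. This is exactly the failure mode FP32 accumulation prevents.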
Before tensor cores (pre-Volta, 2017), running a neural network on a GPU meant using CUDA cores to do scalar matrix multiplication — one element at a time. A matrix multiply of two 4096×4096 matrices required 4096³ = 68.7 billion FMA operations, each executed serially through the CUDA core pipeline. The throughput was limited by the number of CUDA cores and their clock rate.
With tensor cores, the same matrix multiply is decomposed into tiles of 16×16 matrices, each tile computed in a single tensor core operation. The arithmetic throughput per clock cycle of each unit jumps by over 4,000×. The H100's 989 TFLOP/s FP8 peak is almost entirely tensor core throughput — CUDA cores contribute only ~67 TFLOP/s.
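The decomposition count, under the 16×16×16 tile size used above:

```python
# Decomposing a 4096x4096x4096 GEMM into 16x16x16 tensor-core tile ops.
N = 4096
TILE = 16

scalar_fmas = N ** 3               # 68.7 billion scalar FMAs
tile_ops = (N // TILE) ** 3        # tile MMA operations needed

print(f"{scalar_fmas:,} scalar FMAs vs {tile_ops:,} tile MMAs")
# Sanity check: every scalar FMA is covered by exactly one tile op.
assert tile_ops * TILE ** 3 == scalar_fmas
```

68.7 billion scalar operations collapse into about 16.8 million tile instructions — the same arithmetic, issued at matrix granularity instead of scalar granularity.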
This is why the H100's AI benchmark numbers are so much higher than its FP32 FLOP count suggests. The FP32 FLOP count reflects CUDA core capacity. The AI throughput reflects tensor core capacity. They are different silicon doing different work.
| GPU | Architecture | FP32 CUDA cores (TFLOP/s) | FP16 Tensor (TFLOP/s) | FP8 Tensor (TFLOP/s) | TC / CUDA ratio |
|---|---|---|---|---|---|
| V100 | Volta (2017) | 14 | 112 | — | 8× |
| A100 | Ampere (2020) | 19.5 | 312 | — | 16× |
| H100 | Hopper (2022) | 66.9 | 494 | 989 | 14.8× |
| B200 | Blackwell (2024) | ~90 | ~1,800 | ~4,500 | ~50× |
GPU utilisation as reported by nvidia-smi is often misleading for AI workloads. A GPU can report 100% utilisation while its tensor cores are actually idle — the utilisation metric reflects whether the GPU is executing any instructions, not whether its most powerful execution units are busy.
The more meaningful metric is occupancy — the ratio of active warps to the maximum possible warps per SM. Low occupancy means the SM has fewer warps in flight than it needs to hide memory latency, and stalls become visible. High occupancy means memory latency is hidden and execution units are kept busy.
Occupancy is limited by three per-SM resources: registers (every resident thread's registers must fit in the register file), shared memory (each thread block's allocation is carved from the same on-chip SRAM pool), and the hard cap of 64 warp slots. The most constrained of the three determines the ceiling.
The practical consequence for AI: matrix multiply kernels (FlashAttention, cuBLAS GEMM) are carefully hand-tuned to maximise occupancy by controlling tile sizes, register usage, and shared memory allocation. A poorly tuned kernel can achieve 30% of peak throughput on the same hardware as an optimised kernel achieving 85%.
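A simplified occupancy model, taking the per-SM limits quoted in this essay and a hypothetical kernel's parameters (real occupancy also depends on block-granularity and allocation-rounding rules not modelled here):

```python
# H100 per-SM limits as quoted in the text.
MAX_WARPS = 64
REGISTERS_PER_SM = 65_536
SHARED_MEM_PER_SM = 228 * 1024  # max shared-memory carve-out, bytes

def occupancy(regs_per_thread, smem_per_block, warps_per_block):
    """Fraction of the 64-warp ceiling a kernel can occupy: the most
    constrained of the three resources sets the number of resident blocks."""
    threads_per_block = warps_per_block * 32
    blocks_by_regs = REGISTERS_PER_SM // (regs_per_thread * threads_per_block)
    blocks_by_smem = (SHARED_MEM_PER_SM // smem_per_block
                      if smem_per_block else 10 ** 9)
    blocks_by_warps = MAX_WARPS // warps_per_block
    blocks = min(blocks_by_regs, blocks_by_smem, blocks_by_warps)
    return blocks * warps_per_block / MAX_WARPS

# Hypothetical kernel: 128 regs/thread, 48 KB smem/block, 4 warps/block.
print(f"{occupancy(128, 48 * 1024, 4):.0%}")  # 25%
```

Here registers and shared memory both cap the SM at 4 blocks (16 warps), so the kernel runs at 25% occupancy; cutting register usage to 32 per thread lifts it to 100%.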
| Resource | H100 SXM5 | Per SM | AI implication |
|---|---|---|---|
| SMs | 132 | — | 132 independent compute units; run different thread blocks simultaneously |
| CUDA cores (FP32) | 16,896 | 128 | Scalar ops: activations, normalisation, non-matrix work |
| Tensor cores (4th gen) | 528 | 4 | Matrix tiles: attention GEMM, FFN GEMM — 989 TFLOP/s FP8 |
| Register file | ~33.5 MB | 256 KB | Zero-latency, zero-cost warp context switching |
| L1 / Shared memory | ~33.5 MB | 256 KB | Tile staging for matrix ops; inter-thread communication |
| L2 cache | 50 MB | ~380 KB | Catches repeated weight reads; reduces HBM traffic |
| Max resident warps | 8,448 | 64 | Latency hiding capacity |
| Max resident threads | 270,336 | 2,048 | 64 warps × 32 threads |
| Warp schedulers | 528 | 4 | 4 instructions issued per SM per cycle |
The GPU's architecture is extraordinarily well-matched to one phase of AI inference: prefill. Prefill — processing the full input prompt — is a large batch matrix multiply: a matrix of input activations against the model's weight matrices. Tensor cores exist precisely for this operation. Prefill at 989 TFLOP/s is the GPU operating in its designed regime.
Decode — generating tokens one at a time — is the opposite. At batch size 1, each decode step is a matrix-vector multiply, not a matrix-matrix multiply. A matrix-vector multiply does not fill the tensor cores efficiently: the "B matrix" has only one column, so a 16×16 tile has real work in just one of its 16 columns; 15/16 of the tile is wasted. Arithmetic intensity collapses to ~1.6 FLOP/byte. The GPU reverts to waiting for HBM at 3.35 TB/s.
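A back-of-envelope intensity model, counting only weight traffic for an N×N FP8 weight matrix (activations ignored for simplicity; the simplification lands in the same range as the ~1.6 FLOP/byte figure above):

```python
def intensity(batch, n, bytes_per_weight=1):
    """FLOP per byte moved when multiplying `batch` token vectors
    against an n x n weight matrix. FP8 weights = 1 byte each."""
    flops = 2 * batch * n * n              # multiply + add per weight, per token
    bytes_moved = n * n * bytes_per_weight # each weight read from HBM once
    return flops / bytes_moved

print(intensity(1, 4096))    # decode, batch 1:  2.0 FLOP/byte
print(intensity(256, 4096))  # batched prefill:  512.0 FLOP/byte
```

The model makes the batching lever explicit: every weight byte fetched is reused once per token in the batch, so decode at batch 1 is memory-bound by construction while a batched prefill pushes intensity into the compute-bound regime.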
This is the fundamental architectural mismatch between GPU design and autoregressive inference: the tensor cores that give the GPU its headline throughput numbers are structurally underutilised during the token generation phase that dominates real serving workloads. The SM, the warp scheduler, the register file — all of this infrastructure exists to support massive parallelism, but decode is an inherently serial workload where the next token depends on the previous one.
The GPU's tensor cores are a remarkable engineering achievement — and they are nearly irrelevant during autoregressive decode. Prefill uses them well. Decode doesn't. And decode is where users wait. This architectural reality is the deepest motivation for every alternative inference accelerator architecture: wafer-scale SRAM chips, near-memory compute, and specialised decode accelerators that trade tensor core throughput for memory bandwidth.