GPU Architecture · AI Hardware · Tensor Cores · Warp Scheduling

Inside the GPU: SMs, Warps, Tensor Cores, and Why the Architecture Looks the Way It Does

The GPU is the central hardware artefact of the AI era, yet it is almost always treated as a black box — a thing that runs matrix multiplies and produces tokens. This essay opens the box: what lives inside an SM, why warps exist, how tensor cores achieve their throughput, and what the architecture's design choices mean for AI workloads specifically.

By Manish KL · April 2026 · ~19 min read · Hardware Essay
Abstract. A modern NVIDIA GPU like the H100 contains 132 Streaming Multiprocessors (SMs), each of which is itself a significant compute system. This essay builds up the GPU architecture from first principles: starting with the problem that led to massively parallel design, explaining how warps solve the memory latency problem through latency hiding, how tensor cores achieve 10–20× the throughput of CUDA cores for matrix operations, and what the register file, shared memory, and L1 cache inside each SM are actually used for. The goal is to make the GPU's architecture comprehensible — not as a list of specifications, but as a set of design decisions that follow logically from the workload.
132 SMs in H100 SXM5 — each an independent multiprocessor with its own schedulers, cores, and SRAM
32 threads in a warp — the atomic unit of GPU execution, always dispatched and executed together
989 TFLOP/s — H100 FP8 tensor core peak, vs. ~60 TFLOP/s for FP32 CUDA cores on the same chip
256 KB register file per SM in H100 — the largest, fastest on-chip storage, and most architecturally critical
Contents
  1. Why the GPU looks nothing like a CPU
  2. The Streaming Multiprocessor: the GPU's unit of compute
  3. Warps: how the GPU hides memory latency
  4. The register file: the fastest memory you've never thought about
  5. Shared memory and L1 cache: programmer-visible SRAM
  6. CUDA cores: FP32 and INT32 execution units
  7. Tensor cores: what they are and how they achieve 16× throughput
  8. Why tensor cores were the architectural unlock for AI
  9. Occupancy: the real metric of GPU utilisation
  10. The H100 SM by the numbers
  11. Where GPU architecture hits its limits for AI inference

1. Why the GPU looks nothing like a CPU

A modern CPU core — an Intel Raptor Cove, an ARM Cortex-X4 — devotes the majority of its transistor budget to managing complexity: out-of-order execution buffers, branch predictors, large L1/L2/L3 caches, instruction prefetch queues, speculative execution machinery. These are not compute engines — they are mechanisms to make one thread of sequential code run as fast as possible by predicting what it will do next and pre-executing it.

A GPU core makes the opposite trade. It eliminates nearly all complexity-management hardware. No deep out-of-order buffers. No branch prediction. Minimal caches. In its place: an enormous number of simple execution units and a warp-based threading model designed to tolerate memory latency by switching between hundreds of in-flight threads instead of predicting around it.

This is not the GPU being unsophisticated. It is the GPU making a deliberate trade optimised for a specific class of workload: tasks with massive data parallelism, where the same operation is applied to thousands of independent data elements simultaneously, and where memory latency is unavoidable but can be hidden by keeping enough work in flight.

The CPU optimises for latency of a single thread. The GPU optimises for throughput of thousands of threads. Both are correct — they serve different workloads. The key is understanding which one you have.

2. The Streaming Multiprocessor: the GPU's unit of compute

The Streaming Multiprocessor (SM) is the fundamental building block of the GPU. Every GPU is a collection of SMs connected to an HBM interface and an L2 cache. The H100 SXM5 has 132 SMs. The H200 SXM5 has the same 132 SMs with more HBM.

Each SM is itself a complete, independent compute system with its own instruction scheduler, register file, CUDA cores, tensor cores, special function units, shared memory, and L1 cache. SMs run independently and can execute entirely different programs simultaneously — this is what makes GPUs suitable for running different kernels concurrently.

Fig 1. Internal structure of one H100 SM (Hopper architecture). The SM contains four warp schedulers, 128 FP32 CUDA cores, four 4th-generation tensor core units, 16 SFUs, 32 LD/ST units, a 256 KB register file, 256 KB of configurable L1/shared memory, and a Tensor Memory Accelerator for async HBM→shared copies. The L2 cache (50 MB total, ~380 KB per SM) is shared across all SMs.

3. Warps: how the GPU hides memory latency

A warp is a group of 32 threads that execute together in lockstep — the same instruction, applied to 32 different data elements simultaneously. This is Single Instruction Multiple Threads (SIMT) execution. The warp is the atomic unit of GPU scheduling: you cannot schedule part of a warp, and all 32 threads in a warp execute the same instruction at the same time.

Why 32? This reflects the width of the execution units. A GPU's CUDA cores are grouped in sets of 32 (a "lane" per thread), so one warp's worth of FP32 operations — 32 simultaneous FMAs — fills an execution group naturally. It also reflects the register file organisation: registers are banked in 32-thread groups for simultaneous access.

The critical insight is what warps do when they encounter a memory access — a load from HBM or L2 that takes hundreds of clock cycles to return. On a CPU, this causes a cache miss stall: the pipeline freezes and waits. A GPU does something entirely different:

1. Warp issues a memory load instruction. The load is dispatched to the memory subsystem. The warp cannot proceed to the next instruction until the data returns (typically 200–800 cycles for an HBM miss).
2. The warp scheduler does not wait. Instead of stalling the SM, the scheduler marks this warp as "waiting for memory" and immediately selects another warp that has all its operands ready. That warp's next instruction is dispatched in the very next cycle.
3. Latency is hidden behind other work. If enough warps are resident in the SM (a condition called high occupancy), by the time the scheduler has cycled through all ready warps, the original memory load has returned and the stalled warp is ready to continue. The SM never actually stalls — it always has something to do.

This is called latency hiding through thread-level parallelism, and it is the GPU's fundamental answer to slow memory. A CPU uses a different strategy (caches and prefetching); the GPU simply keeps so many threads in flight that memory latency becomes invisible.

Warp occupancy and latency hiding — H100 SM
HBM access latency: ~700 cycles (round trip)
SM warp execution rate: 4 warps issued per cycle (4 schedulers)
If each warp can issue ~100 cycles of independent work before stalling on a load (an illustrative figure),
each scheduler needs 700 / 100 = 7 warps in reserve to cover a miss:
7 warps × 4 schedulers ≈ 28 warps needed to keep the SM fully occupied during an HBM miss

H100 SM supports: up to 64 resident warps
→ With 64 warps, even at ~700 cycle HBM latency, the SM can hide all memory stalls.
→ With fewer warps (low occupancy), stalls become visible.
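The arithmetic above can be packaged as a tiny model. This is a back-of-envelope sketch: the latency, scheduler count, and warp limit are the figures used in this essay, while the ~100 cycles of independent work per warp is an illustrative assumption, not a measured value.

```python
# Back-of-envelope latency-hiding model for one H100 SM.
HBM_LATENCY_CYCLES = 700    # round-trip HBM miss latency (approximate)
SCHEDULERS_PER_SM = 4       # each issues 1 instruction per cycle
WORK_PER_WARP_CYCLES = 100  # assumed issuable work per warp before it stalls
MAX_RESIDENT_WARPS = 64     # H100 SM limit

def warps_needed(latency, work_per_warp, schedulers):
    """Warps required so every scheduler has a ready warp during a miss."""
    per_scheduler = latency / work_per_warp  # warps each scheduler cycles through
    return per_scheduler * schedulers

needed = warps_needed(HBM_LATENCY_CYCLES, WORK_PER_WARP_CYCLES, SCHEDULERS_PER_SM)
print(f"warps needed ≈ {needed:.0f}, SM supports {MAX_RESIDENT_WARPS}")
# needed (28) is well under the 64-warp limit, so the miss can be hidden
```

Note how sensitive the result is to the per-warp work figure: a kernel whose warps stall every 10 cycles would need far more resident warps than the SM can hold, which is exactly the low-occupancy failure mode described later.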

4. The register file: the fastest memory you've never thought about

The register file is the fastest memory in the GPU — faster than L1 cache, faster than shared memory, faster than anything else on the chip. It is also enormous by cache standards: the H100 SM has a 256 KB register file, providing 65,536 individual 32-bit registers. These registers are the working memory for all the warps resident in the SM.

The reason the register file is so large is that all resident warps share it. If the SM has 64 warps resident and each of a warp's 32 threads uses 32 registers: 64 × 32 × 32 = 65,536 registers — exactly the SM's capacity. This is not a coincidence. The register file is sized to hold the register state of all resident warps simultaneously, so that context-switching between warps costs zero cycles — there is no save/restore, no register spill. The warp's state is always live in the register file. (A kernel that uses more registers per thread simply fits fewer resident warps.)

This zero-cost context switching is what makes warp-based latency hiding efficient. Unlike CPU thread switching, which requires saving registers to memory (hundreds of cycles), GPU warp switching takes one cycle — just a pointer to a different warp's register file bank.

5. Shared memory and L1 cache: programmer-visible SRAM

The H100 SM has 256 KB of on-chip SRAM that serves as a unified pool for two distinct purposes: L1 cache (managed automatically by hardware) and shared memory (managed explicitly by the programmer or compiler). The split between them is configurable per kernel — a programmer can allocate up to 228 KB as shared memory, leaving the remainder as L1 cache.

Shared memory is the closest thing a GPU has to a software-managed scratchpad. It allows threads within the same thread block (a group of up to 1,024 threads spanning multiple warps on the same SM) to communicate through fast on-chip SRAM without going to HBM. For tiled matrix multiplication — the core operation of transformer attention and FFN layers — shared memory is used to load tiles of the A and B matrices from HBM into SRAM, perform the multiply-accumulate, and then move to the next tile. This tile-based approach is what allows GPU matrix multiply to reuse data in SRAM for multiple operations rather than reading it fresh from HBM each time.
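The tiling loop described above can be sketched in NumPy. This is a toy model of the access pattern, not GPU code: each tile of A and B is staged once (standing in for a shared-memory load) and then reused for a whole tile's worth of multiply-accumulates, and the tile size of 16 is chosen to echo the tensor core tile.

```python
import numpy as np

TILE = 16  # tile edge; mirrors the tensor core tile used later in the essay

def tiled_matmul(A, B):
    """Tiled matrix multiply with the staging pattern a GPU kernel uses."""
    n = A.shape[0]  # assumes square matrices, n divisible by TILE
    C = np.zeros((n, n), dtype=np.float32)
    for i in range(0, n, TILE):
        for j in range(0, n, TILE):
            acc = np.zeros((TILE, TILE), dtype=np.float32)  # block accumulator
            for k in range(0, n, TILE):
                a_tile = A[i:i+TILE, k:k+TILE]  # "load A tile into shared mem"
                b_tile = B[k:k+TILE, j:j+TILE]  # "load B tile into shared mem"
                acc += a_tile @ b_tile          # reuse each tile TILE times
            C[i:i+TILE, j:j+TILE] = acc
    return C

A = np.random.rand(64, 64).astype(np.float32)
B = np.random.rand(64, 64).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```

The point of the pattern is data reuse: each staged tile participates in TILE multiply-accumulate rows/columns before being evicted, so HBM traffic per output element drops by a factor of the tile size.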

6. CUDA cores: FP32 and INT32 execution units

CUDA cores are the general-purpose arithmetic units of the SM — each one can perform one 32-bit floating-point multiply-add (FMA) per clock cycle. With 128 FP32 CUDA cores per SM and 132 SMs in the H100:

H100 FP32 CUDA core peak throughput
FP32 CUDA cores = 128 per SM × 132 SMs = 16,896 CUDA cores
2 ops per FMA (multiply + add) × 16,896 cores × 1,979 MHz clock
= ~66.9 TFLOP/s FP32 (CUDA cores only)

Compare: H100 FP8 tensor cores: ~989 TFLOP/s
→ Tensor cores deliver ~14.8× more throughput than CUDA cores for matrix operations.

CUDA cores are general-purpose: they can execute any FP32 or INT32 instruction, including non-matrix operations like element-wise activations (ReLU, GELU), normalisation, and control flow. They are the fallback for anything that doesn't fit the tensor core's specific matrix-multiply-accumulate operation.
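The peak-throughput arithmetic above can be reproduced in a few lines, using the figures from this section (the 1,979 MHz boost clock is the published H100 SXM5 number):

```python
# Peak FP32 throughput from CUDA cores alone, per the figures in the text.
SMS = 132                 # Streaming Multiprocessors in H100 SXM5
FP32_CORES_PER_SM = 128   # FP32 CUDA cores per SM
BOOST_CLOCK_HZ = 1.979e9  # 1,979 MHz boost clock
OPS_PER_FMA = 2           # one fused multiply-add counts as 2 FLOPs

cores = SMS * FP32_CORES_PER_SM                    # 16,896 CUDA cores
peak_flops = cores * OPS_PER_FMA * BOOST_CLOCK_HZ  # FLOP/s
print(f"{cores} cores → {peak_flops / 1e12:.1f} TFLOP/s FP32")  # ≈ 66.9
```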

7. Tensor cores: what they are and how they achieve 16× throughput

Tensor cores are specialised compute units that perform one operation that CUDA cores cannot: a matrix multiply-accumulate (MMA) on entire tiles of matrices in a single instruction. A 4th-generation Hopper tensor core performs the operation:

D = A × B + C

where A, B, C, D are small matrices — for FP8, the tile size is 16×16×16 (a 16×16 result from a 16×16 × 16×16 multiplication). In one clock cycle, one tensor core unit computes 16 × 16 × 16 = 4,096 fused multiply-adds, i.e. 16 × 16 × 16 × 2 = 8,192 FP8 floating-point operations. Compare this to a CUDA core, which performs 2 FP32 operations (one FMA) per cycle.

Fig 2. Tensor core MMA operation: D = A×B + C, where all matrices are 16×16 tiles. One tensor core unit performs 8,192 FP8 multiply-add operations per clock cycle by executing the full tile multiply in dedicated silicon rather than computing one element at a time. Accumulation is done in higher-precision FP32 to avoid numerical degradation.

The key distinction: a CUDA core is a scalar unit. It adds two numbers or multiplies two numbers. A tensor core is a matrix unit — it operates on an entire tile of a matrix simultaneously. The silicon is organised differently: instead of a general-purpose ALU, a tensor core contains a fixed-function array of multiply-accumulate cells arranged in a matrix pattern, with dedicated wiring to fan operands across the array in parallel.

The accumulator precision deserves attention. Although the input matrices A and B can be FP8 (8-bit floating point), the accumulator C and result D are kept in FP32. This is deliberate: 8-bit arithmetic has limited range and loses precision quickly when summing many products. By accumulating in 32-bit, the tensor core gives up a little throughput in the accumulation step in exchange for numerical stability in the final result. The data movement is FP8 (fewer bytes to read from memory), while the accumulation arithmetic is FP32.
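A quick way to see why the accumulator precision matters is to sum the same low-precision products in two accumulator widths. This NumPy sketch uses FP16 as a stand-in for FP8 (NumPy has no FP8 dtype); the effect is the same in kind.

```python
import numpy as np

# Sum 4,096 low-precision products with a low-precision accumulator vs. an
# FP32 accumulator, and compare both against an FP64 reference.
rng = np.random.default_rng(0)
a = rng.standard_normal(4096).astype(np.float16)
b = rng.standard_normal(4096).astype(np.float16)

acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for x, y in zip(a, b):
    p = x * y                                  # product in low precision
    acc16 = np.float16(acc16 + p)              # FP16 accumulate: error grows
    acc32 = acc32 + np.float32(p)              # FP32 accumulate: stays stable

exact = np.dot(a.astype(np.float64), b.astype(np.float64))
print(abs(acc16 - exact), abs(acc32 - exact))  # FP16 error ≫ FP32 error
```

The low-precision accumulator loses bits on every addition once the running sum is much larger than the individual products, which is exactly the regime a long dot product lives in.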

8. Why tensor cores were the architectural unlock for AI

Before tensor cores (pre-Volta, 2017), running a neural network on a GPU meant using CUDA cores to do scalar matrix multiplication — one element at a time. A matrix multiply of two 4096×4096 matrices required 4096³ = 68.7 billion FMA operations, each executed serially through the CUDA core pipeline. The throughput was limited by the number of CUDA cores and their clock rate.

With tensor cores, the same matrix multiply is decomposed into 16×16 tiles, each computed in a single tensor core operation. The per-unit throughput per clock cycle jumps by roughly 4,000×. The H100's 989 TFLOP/s FP8 peak is almost entirely tensor core throughput — CUDA cores contribute only ~67 TFLOP/s.
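The decomposition can be checked by counting, using the 4096×4096 example and the 16×16×16 tile from the previous section:

```python
# Count how many tensor-core tile operations replace the scalar FMAs of a
# 4096×4096 matrix multiply.
N = 4096
TILE = 16

fmas = N ** 3                    # 68.7 billion scalar FMA operations
tile_ops = (N // TILE) ** 3      # MMA tile operations needed
fmas_per_tile = TILE ** 3        # 4,096 FMAs folded into each tile op
assert tile_ops * fmas_per_tile == fmas

print(f"{fmas / 1e9:.1f}B FMAs → {tile_ops / 1e6:.2f}M tile operations")
```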

This is why the H100's AI benchmark numbers are so much higher than its FP32 FLOP count suggests. The FP32 FLOP count reflects CUDA core capacity. The AI throughput reflects tensor core capacity. They are different silicon doing different work.

GPU     Architecture        FP32 CUDA (TFLOP/s)    FP16 Tensor (TFLOP/s)    FP8 Tensor (TFLOP/s)    TC / CUDA ratio
V100    Volta (2017)        14                     112                      —                       8×
A100    Ampere (2020)       19.5                   312                      —                       16×
H100    Hopper (2022)       66.9                   494                      989                     14.8×
B200    Blackwell (2025)    ~90                    ~1,800                   ~4,500                  ~50×

9. Occupancy: the real metric of GPU utilisation

GPU utilisation as reported by nvidia-smi is often misleading for AI workloads. A GPU can report 100% utilisation while its tensor cores are actually idle — the utilisation metric reflects whether the GPU is executing any instructions, not whether its most powerful execution units are busy.

The more meaningful metric is occupancy — the ratio of active warps to the maximum possible warps per SM. Low occupancy means the SM has fewer warps in flight than it needs to hide memory latency, and stalls become visible. High occupancy means memory latency is hidden and execution units are kept busy.

Occupancy is limited by three resources, and the most constrained of the three determines the ceiling:

1. Register file pressure. If a kernel uses many registers per thread, fewer warps fit in the register file. A kernel using 32 registers per thread at full 64-warp occupancy needs 64 × 32 × 32 = 65,536 registers — exactly the H100 SM's limit. A kernel using 128 registers per thread fits only 16 warps (25% occupancy) before the register file is full.
2. Shared memory allocation. If a kernel allocates a large shared memory buffer per thread block, fewer thread blocks fit per SM simultaneously. A kernel using 200 KB of shared memory can only run one thread block per SM (since the SM has 256 KB total), limiting occupancy to however many warps that one block contains.
3. Thread block configuration. The SM can host up to 32 thread blocks simultaneously, but they must collectively stay within the register and shared memory limits. The threads-per-block configuration is a tunable parameter that affects how these resources are partitioned.

The practical consequence for AI: matrix multiply kernels (FlashAttention, cuBLAS GEMM) are carefully hand-tuned to maximise occupancy by controlling tile sizes, register usage, and shared memory allocation. A poorly tuned kernel can achieve 30% of peak throughput on the same hardware as an optimised kernel achieving 85%.
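The three limits above compose into a simple estimator: the most constrained resource sets the ceiling. This is a toy model using the limits quoted in this essay; real kernels should consult the CUDA occupancy API (e.g. cudaOccupancyMaxActiveBlocksPerMultiprocessor), which accounts for allocation granularity this sketch ignores.

```python
# Toy occupancy estimator for one H100 SM.
REGISTERS_PER_SM = 65_536  # 256 KB register file / 4 B per register
SMEM_PER_SM_KB = 228       # max shared-memory carve-out per SM
MAX_WARPS = 64             # resident warp limit
MAX_BLOCKS = 32            # resident thread-block limit
WARP = 32                  # threads per warp

def occupancy(threads_per_block, regs_per_thread, smem_per_block_kb):
    """Fraction of the 64-warp limit achievable; the tightest limit wins."""
    warps_per_block = threads_per_block // WARP
    limit_regs = REGISTERS_PER_SM // (regs_per_thread * WARP)
    blocks_by_smem = (SMEM_PER_SM_KB // smem_per_block_kb
                      if smem_per_block_kb else MAX_BLOCKS)
    limit_smem = min(blocks_by_smem, MAX_BLOCKS) * warps_per_block
    limit_blocks = MAX_BLOCKS * warps_per_block
    warps = min(limit_regs, limit_smem, limit_blocks, MAX_WARPS)
    return warps / MAX_WARPS

print(occupancy(256, 32, 8))     # register-light kernel: full occupancy
print(occupancy(256, 128, 100))  # register- and smem-heavy kernel: 25%
```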

10. The H100 SM by the numbers

Resource                  H100 SXM5    Per SM     AI implication
SMs                       132          —          Independent compute units; run different thread blocks simultaneously
CUDA cores (FP32)         16,896       128        Scalar ops: activations, normalisation, non-matrix work
Tensor cores (4th gen)    528          4          Matrix tiles: attention GEMM, FFN GEMM — 989 TFLOP/s FP8
Register file             ~33.5 MB     256 KB     Zero-latency, zero-cost warp context switching
L1 / Shared memory        ~33.5 MB     256 KB     Tile staging for matrix ops; inter-thread communication
L2 cache                  50 MB        ~380 KB    Catches repeated weight reads; reduces HBM traffic
Max resident warps        8,448        64         Latency hiding capacity
Max resident threads      270,336      2,048      64 warps × 32 threads
Warp schedulers           528          4          4 instructions issued per SM per cycle

11. Where GPU architecture hits its limits for AI inference

The GPU's architecture is extraordinarily well-matched to one phase of AI inference: prefill. Prefill — processing the full input prompt — is a large batch matrix multiply: a matrix of input activations against the model's weight matrices. Tensor cores exist precisely for this operation. Prefill at 989 TFLOP/s is the GPU operating in its designed regime.

Decode — generating tokens one at a time — is the opposite. At batch size 1, each decode step is a matrix-vector multiply, not a matrix-matrix multiply. A matrix-vector multiply does not fill the tensor cores efficiently: the "B matrix" has only one column. The tensor core tile is 16×16, but the activations fill only one column of it — 15 of the 16 tile columns do no useful work. Arithmetic intensity collapses to ~1.6 FLOP/byte. The GPU reverts to waiting for HBM at 3.35 TB/s.
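A roofline-style sketch makes the mismatch concrete. The bandwidth and peak figures are the H100 numbers used throughout this essay; the 70B-parameter FP8 model is an illustrative assumption, as is the ~2 FLOPs-per-parameter-per-token estimate.

```python
# Batch-1 decode roofline: each step must stream the weights once, so the
# floor on step time is memory time, not compute time.
HBM_BW = 3.35e12           # bytes/s, H100 SXM5 HBM bandwidth
FP8_PEAK = 989e12          # FLOP/s, H100 FP8 tensor core peak

params = 70e9              # assumed model size (70B parameters)
weight_bytes = params      # FP8: 1 byte per parameter
flops_per_token = 2 * params  # ~2 FLOPs per parameter per generated token

t_memory = weight_bytes / HBM_BW         # time to stream weights once
t_compute = flops_per_token / FP8_PEAK   # time if purely compute-bound
print(f"memory-bound: {t_memory * 1e3:.1f} ms/token, "
      f"compute-bound: {t_compute * 1e3:.3f} ms/token")
# memory time dwarfs compute time: the tensor cores sit mostly idle
```

At batch 1 the memory term is roughly two orders of magnitude larger than the compute term, which is the quantitative form of the "tensor cores are nearly irrelevant during decode" claim below.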

This is the fundamental architectural mismatch between GPU design and autoregressive inference: the tensor cores that give the GPU its headline throughput numbers are structurally underutilised during the token generation phase that dominates real serving workloads. The SM, the warp scheduler, the register file — all of this infrastructure exists to support massive parallelism, but decode is an inherently serial workload where the next token depends on the previous one.

The GPU's tensor cores are a remarkable engineering achievement — and they are nearly irrelevant during autoregressive decode. Prefill uses them well. Decode doesn't. And decode is where users wait. This architectural reality is the deepest motivation for every alternative inference accelerator architecture: wafer-scale SRAM chips, near-memory compute, and specialised decode accelerators that trade tensor core throughput for memory bandwidth.

What this essay does not cover
This essay covers the H100 Hopper SM architecture. It does not cover: the Blackwell architecture (B100/B200) which introduces 5th-generation tensor cores and a new FP4 precision mode; the specific microcode and instruction encoding of GPU ISAs (PTX and SASS); multi-instance GPU (MIG) partitioning; NVLink topology and the NVSwitch fabric; or the software model (CUDA programming model, thread blocks, grid dimensions) in detail. Each of those deserves its own treatment.