Warp & Thread Groups: Field Notes on GPU Execution

Sheet 01 — Scale

Three Levels of Parallelism

Ask someone what makes a GPU fast and they'll say "thousands of cores." True, but misleading — a GPU doesn't run thousands of independent threads. It runs them in tightly coordinated squads, and that organization is the difference between code that merely runs on a GPU and code that actually uses one.

NVIDIA calls the basic execution unit a warp. AMD calls it a wavefront. HLSL and DirectCompute call the layer above it a thread group; CUDA calls the same idea a thread block. Different vocabulary, same hierarchy:

Grid / Dispatch

Thread Group / Block — programmer-defined

Warp / Wavefront — hardware-fixed, 32–64 threads

Thread — one lane, one instruction stream shared

Grid → independent work spread across the whole chip. Thread group → your scheduling unit, shares memory and barriers. Warp → the hardware's true atomic unit of execution.

Sheet 02 — Definition

What Exactly Is a Warp?

A warp is NVIDIA's unit of SIMT execution — Single Instruction, Multiple Threads. All 32 threads in a warp share one instruction stream. Every cycle, the scheduler issues the same instruction to all 32 lanes; each lane just applies it to its own data.

This is the key mental model: a warp is not 32 independent threads that happen to run nearby. It's one instruction stream wearing 32 hats. If you've used SIMD on CPUs — SSE, AVX, NEON — the analogy is exact, except a GPU runs hundreds of these lane-groups concurrently and hides memory latency by switching between them.

32NVIDIA warp width

64AMD GCN wavefront

32/64AMD RDNA, configurable

Portable code shouldn't hardcode the number — query it. WaveGetLaneCount() in HLSL, warpSize in CUDA.

Sheet 03 — Definition

What Exactly Is a Thread Group?

A thread group is the unit you define when writing a compute shader or kernel. The hardware doesn't know or care about your chosen shape — it flattens it into a sequence of threads and slices that into warps of 32 (or wavefronts of 64).

Examplecompute.hlsl

[numthreads(8, 8, 1)]
void CSMain(
    uint3 groupID      : SV_GroupID,
    uint3 groupThreadID : SV_GroupThreadID,
    uint3 dispatchID    : SV_DispatchThreadID)
{
    // 8 * 8 * 1 = 64 threads / group
    // → flattened into 2 warps of 32
}

Examplekernel.cu

dim3 blockDim(8, 8, 1);   // 64 threads / block
dim3 gridDim(16, 16, 1);

myKernel<<<gridDim, blockDim>>>(data);

// inside the kernel:
int tid = threadIdx.x + blockIdx.x * blockDim.x;

What a thread group guarantees that a lone warp does not:

Shared memory — a fast on-chip scratchpad (groupshared / __shared__) visible to every thread in the group.
Group-wide synchronization — a barrier where every thread waits for the others, so shared-memory writes are visible before anyone reads them.

Warps in different thread groups generally can't talk to each other — there's no hardware guarantee about when other groups even run. That isolation is intentional: it's what lets the GPU schedule thousands of groups without a global coordination bottleneck.

Sheet 04 — Composition

How Warps Compose a Group

Request a 256-thread group and the hardware doesn't hand you 256 independent lanes — it hands you 8 warps that share one shared-memory allocation and one barrier.

Thread Group — 256 threads

Warp 0

Warp 1

Warp 2

Warp 3

Warp 4

Warp 5

Warp 6

Warp 7

Each warp is scheduled independently by the streaming multiprocessor. Warp 3 might stall on a memory load while Warp 6 keeps computing — that interleaving is how the GPU hides memory latency: when one warp stalls, the scheduler swaps in another that's ready, at essentially zero cost, since each warp owns its own register file.

More resident warps means more places to hide a stall behind.

Sheet 05 — Failure Mode

Divergence: When Warps Misbehave

Because every thread in a warp executes the same instruction, trouble starts the moment threads in a warp disagree about which instruction to run — typically an if/else on per-thread data.

Exampledivergent.hlsl

if (threadID % 2 == 0) {
    DoExpensiveWorkA();   // even lanes
} else {
    DoExpensiveWorkB();   // odd lanes
}

A warp can't actually split into two streams. The hardware executes both branches serially, masking off the threads that don't apply to each one. The warp pays the combined cost of A and B — correctness is fine, throughput quietly halves.

Warp lanes

Step 1 — branch A

Step 2 — branch B

Masked lanes still occupy a cycle — they just don't write a result. Total cost ≈ cost(A) + cost(B), even though each lane only needed one.

The fix isn't "never branch" — it's awareness of where the boundary falls relative to warp boundaries. A branch that's uniform across the warp (e.g. based on groupID rather than per-thread data) costs nothing extra, because the whole warp takes the same path together.

Sheet 06 — Coordination

Synchronization & Shared Memory

The barrier at the thread-group level exists specifically because warps run asynchronously relative to each other. Load a tile into shared memory, then have every thread read from it — that requires a barrier in between.

Exampletile_matmul.cu

__shared__ float tile[16][16];

tile[ty][tx] = globalData[...];   // each thread loads one element
__syncthreads();              // wait for the WHOLE group — all warps

// now safe to read any element, not just your own
sum += tile[ty][k] * tile[k][tx];

Without the barrier, a fast warp could start reading the tile before a slow warp finishes writing its portion — a race condition that fails intermittently, which makes it miserable to debug, since it depends on scheduling rather than logic. Note what the barrier doesn't do: it never synchronizes across thread groups. There's no cheap, general way for Group 5 to wait on Group 12, by design.

Sheet 07 — Cross-Reference

Warp-Level Primitives

Because a warp already moves in lockstep, vendors expose intrinsics that let threads inside it exchange data without touching shared memory or paying a barrier — far cheaper than a full group sync.

Cross-reference — vendor terminology
Operation	CUDA	HLSL (SM6+)
Sum across the warp	__reduce_add_sync / shuffle reduction	WaveActiveSum
Broadcast a value	__shfl_sync	WaveReadLaneFirst
Any / all lanes true	__any_sync / __all_sync	WaveActiveAnyTrue / WaveActiveAllTrue
Bitmask of active lanes	__ballot_sync	WaveActiveBallot
Execution unit, NVIDIA	warp (32 threads)	wave (32 lanes)
Cooperative container	thread block	thread group

If you find yourself writing a tree-reduction over shared memory that only spans 32 elements, that's almost always better expressed as a single warp-level shuffle reduction instead.

Sheet 08 — Specification

Choosing a Thread Group Size

This is the practical question every GPU programmer eventually answers, and the warp relationship is exactly what should drive the decision.

Item	Rule	Rationale
01	Size as a multiple of warp width	100 threads round up to 4 warps (128 lanes) — 28 lanes wasted on every instruction. 128 or 256 wastes nothing.
02	Don't go smaller than ~64	A 32-thread group is exactly one warp — correct, but gives the scheduler nothing to interleave for latency hiding.
03	Don't oversize carelessly	Bigger groups raise shared-memory and register pressure, which can shrink how many groups fit per SM — lowering occupancy.
04	Query the warp size, don't hardcode it	Code that assumes 32 wastes a quarter of every lane group on 64-wide AMD hardware.

Common sweet spots: 64 · 128 · 256 · 512 — the right one depends on your shared-memory footprint and register usage. Profile for occupancy; don't guess.

Sheet 09 — Closing

Why the Hierarchy Exists

It's worth asking why GPU vendors didn't just give every thread a fully independent instruction stream, the way a CPU core does. The answer is area and power: an independent fetch/decode unit per thread would be enormously expensive to replicate thousands of times over.

By sharing one instruction stream across 32 or 64 threads, the warp amortizes that cost across many execution lanes, freeing up silicon for more ALUs instead of more control logic. The thread group, in turn, is the software contract that lets you exploit locality without requiring the hardware to coordinate across the entire chip.

The constraints that feel limiting at first — lockstep execution, divergence penalties, no cross-group sync — are the same constraints that make GPUs so much more power-efficient than CPUs at the regular, data-parallel work they're built for. Think in warps and thread groups, and GPU performance stops feeling mysterious.

The warp is the GPU's true atomic unit of execution. The thread group is the cooperative unit you build on top of it.

The next time a compute shader runs slower than expected, ask two questions first: is this thread group sized as a clean multiple of the warp width, and is there a branch where threads in the same warp disagree? More often than not, the answer to one of those explains the gap.

Warp & ThreadGroups