CUDA Study Library

PMPP CUDA Study

150 CUDA examples: 100 core PMPP-style examples + 50 advanced studies.

Run, validate, inspect, optimize.

Built as a study library rather than a sample dump: deterministic inputs, CPU reference checks, comments that explain thread mapping and memory behavior, and advanced follow-on studies that build on the core patterns instead of replacing them.

Core track: 001-100
Advanced studies: 101-150
Current maturity: 110 of 150 implemented

Why This Repo

Learn CUDA as a progression of patterns

Foundations first

Start with indexing, simple transforms, stencils, reduction, scan, histograms, and dense kernels before jumping into more irregular work.

Correctness before speed

Examples are built to compile, validate, and explain. CPU references, deterministic inputs, and explicit PASS or FAIL output keep the study path grounded.
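The validate step can be sketched as a small host-side check; the function name and tolerance below are illustrative, not the repo's actual code:

```cuda
#include <cmath>
#include <cstdio>

// Compare a GPU result against a CPU reference and print an explicit
// PASS or FAIL, in the spirit of the study examples. The tolerance is
// an assumption; a real check may scale it with problem size.
bool validate(const float* gpu, const float* ref, int n, float tol = 1e-5f) {
    for (int i = 0; i < n; ++i) {
        if (fabsf(gpu[i] - ref[i]) > tol) {
            printf("FAIL at %d: got %f, expected %f\n", i, gpu[i], ref[i]);
            return false;
        }
    }
    printf("PASS\n");
    return true;
}
```

With deterministic inputs (a fixed seed), a FAIL reproduces identically on every run, which is what makes the study path debuggable.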

Advanced studies with structure

The `101-150` range is a second track for warp-level programming, memory studies, sparse / graph kernels, simulation, and practical ML operators.

Learning Path

Core curriculum, then advanced studies

001-020

Foundations

Kernel launches, indexing, simple transforms, stencils, and early shared-memory structure.
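The launch-and-indexing pattern this range opens with can be sketched as follows (kernel name and launch shape are illustrative):

```cuda
// Each thread computes a global index from its block and thread IDs,
// then strides by the total thread count so any n is handled correctly
// regardless of grid size (the "grid-stride loop" idiom).
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (; i < n; i += gridDim.x * blockDim.x)
        c[i] = a[i] + b[i];
}

// Typical launch: vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
```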

021-040

Parallel Patterns

Reduction, scan, histogram, compaction, gather, scatter, sorting, and search-oriented kernels.
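The shared-memory tree reduction at the heart of this range looks roughly like this; a sketch, not the repo's exact kernel:

```cuda
// Each block reduces its slice of the input to one partial sum.
// Shared memory is sized at launch: blockReduce<<<blocks, threads,
// threads * sizeof(float)>>>(in, out, n);
__global__ void blockReduce(const float* in, float* out, int n) {
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    // Halve the active threads each step; every step needs a barrier.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0];  // one partial sum per block
}
```

The per-block partials are then reduced again by a second launch or on the host.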

041-060

Linear Algebra

Matrix-vector multiply, naive and tiled GEMM, convolution, sparse operators, and iterative methods.
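The naive-to-tiled GEMM step is the canonical reuse lesson; a minimal sketch of the tiled version, assuming for brevity that n is a multiple of the tile size:

```cuda
#define TILE 16

// Each block computes a TILE x TILE patch of C. Tiles of A and B are
// staged in shared memory so every global load is reused TILE times.
__global__ void gemmTiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE], Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                 // tiles fully loaded before use
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                 // done reading before next load
    }
    C[row * n + col] = acc;
}
```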

061-075

Image / Signal

Resize, filtering, edge detection, signal processing, and encoding-style kernels.

076-100

Advanced Workloads

Simulation, rendering, graph traversal, clustering, and ML-oriented kernels in the core track.

101-150

Advanced Studies

Warp primitives, memory studies, sparse / irregular kernels, imaging, simulation, and practical ML operators.

Advanced Track

Five focused groups in `101-150`

Warp / atomics / scan

`101-110` covers segmented primitives, warp aggregation, and filtering-style kernels.

3 implemented / 10 total
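Warp aggregation, for example, replaces one atomic per thread with one per warp; a sketch of the slot-reservation idiom used in filtering-style kernels (helper name is illustrative, and a 1D block is assumed):

```cuda
// The warp's leader performs a single atomicAdd on behalf of all active
// lanes, then broadcasts the base offset with a shuffle. Each lane
// derives a distinct output slot from its rank among the active lanes.
__device__ int reserveSlot(int* counter) {
    unsigned mask = __activemask();
    int lane = threadIdx.x & 31;
    int leader = __ffs(mask) - 1;          // lowest active lane
    int base = 0;
    if (lane == leader) base = atomicAdd(counter, __popc(mask));
    base = __shfl_sync(mask, base, leader);
    return base + __popc(mask & ((1u << lane) - 1));  // active lanes below me
}
```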

Memory / tiling / optimization

`111-120` studies warp shuffle, bank conflicts, coalescing, transpose ladders, and halo tiles.

5 implemented / 10 total
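The warp shuffle study, for instance, moves the tail of a reduction from shared memory into registers; a minimal helper, assuming a full 32-lane warp:

```cuda
// Butterfly reduction within one warp: registers only, no shared
// memory and no __syncthreads(). After the loop, lane 0 holds the
// warp's total; other lanes hold partial values.
__device__ float warpSum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}
```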

Sparse / graph / irregular

`121-130` is reserved for sparse formats, graph frontiers, hashing, and irregular access patterns.

0 implemented / 10 total

Imaging / simulation

`131-140` extends the core image and simulation path with more optimization-heavy studies.

2 implemented / 10 total

ML / practical kernels

`141-150` turns core reduction and tiling ideas into normalization, softmax, pricing, and inference studies.

4 implemented / 10 total
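The "stable" in stable softmax refers to subtracting the row max before exponentiating, which keeps expf() in range for large inputs. A minimal one-thread-per-row sketch; the actual study likely assigns a block per row:

```cuda
// Three passes per row: find the max, compute exp(x - max) and its sum,
// then normalize. Subtracting the max does not change the result but
// prevents overflow.
__global__ void softmaxRows(const float* x, float* y, int rows, int cols) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;
    const float* xi = x + (size_t)r * cols;
    float*       yi = y + (size_t)r * cols;
    float m = xi[0];
    for (int c = 1; c < cols; ++c) m = fmaxf(m, xi[c]);
    float sum = 0.0f;
    for (int c = 0; c < cols; ++c) {
        yi[c] = expf(xi[c] - m);
        sum += yi[c];
    }
    for (int c = 0; c < cols; ++c) yi[c] /= sum;
}
```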

Best Starting Points

Three useful entry routes

Best first 5 examples

  • 002 Vector Addition
  • 020 Matrix Transpose with Shared Memory
  • 023 Sum Reduction
  • 043 Tiled Matrix Multiply
  • 111 Warp Shuffle Reduction

Best for optimization study

  • 026 / 027 Naive vs work-efficient scan
  • 028 / 029 Global atomics vs privatization
  • 042 / 043 Naive vs tiled matrix multiply
  • 116 Bank Conflict Study
  • 117 Coalescing Study
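The privatization contrast behind 028/029 can be sketched as a histogram that accumulates in shared memory and touches global atomics only once per bin per block (the bin count is an assumption):

```cuda
#define BINS 256

// Each block builds a private histogram in shared memory, then flushes
// it to the global histogram. Contention on global atomics drops from
// one atomicAdd per element to one per bin per block.
__global__ void histPrivatized(const unsigned char* data, int n,
                               unsigned int* hist) {
    __shared__ unsigned int local[BINS];
    for (int b = threadIdx.x; b < BINS; b += blockDim.x) local[b] = 0;
    __syncthreads();
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[data[i]], 1u);
    __syncthreads();
    for (int b = threadIdx.x; b < BINS; b += blockDim.x)
        atomicAdd(&hist[b], local[b]);
}
```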

Best for interview prep / CUDA patterns

  • 023 Reduction and synchronization
  • 027 Scan tree reasoning
  • 029 Privatization and contention
  • 111 Warp-synchronous reduction
  • 142 Stable softmax

How To Study This Repo

A practical workflow for each example

1. Run

Build the baseline and execute the example with deterministic inputs.

2. Validate

Read the CPU reference path and confirm the PASS or FAIL output before moving on.

3. Inspect Mapping

Identify which thread owns each output and where synchronization is required.

4. Compare Versions

Study what the stronger version changes in reuse, memory traffic, divergence, or contention.

5. Benchmark

Only then look at timing, throughput, and scaling behavior.