Foundations first
Start with indexing, simple transforms, stencils, reduction, scan, histograms, and dense kernels before jumping into more irregular work.
CUDA Study Library
150 CUDA examples: 100 core PMPP-style examples + 50 advanced studies.
Run, validate, inspect, optimize.
Built as a study library rather than a sample dump: deterministic inputs, CPU reference checks, comments around mapping and memory behavior, and advanced follow-on studies that build on the core patterns instead of replacing them.
Why This Repo
Examples are built to compile, validate, and explain. CPU references, deterministic inputs, and explicit PASS or FAIL output keep the study path grounded.
The `101-150` range is a second track for warp-level programming, memory studies, sparse / graph kernels, simulation, and practical ML operators.
Learning Path
Kernel launches, indexing, simple transforms, stencils, and early shared-memory structure.
Reduction, scan, histogram, compaction, gather, scatter, sorting, and search-oriented kernels.
Matrix-vector multiply, naive and tiled GEMM, convolution, sparse operators, and iterative methods.
Resize, filtering, edge detection, signal processing, and encoding-style kernels.
Simulation, rendering, graph traversal, clustering, and ML-oriented kernels in the core track.
Warp primitives, memory studies, sparse / irregular kernels, imaging, simulation, and practical ML operators.
Featured Examples
Clean host-device workflow with minimal algorithmic noise.
Why it matters: the clearest first CUDA example in the repo.
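As a hedged sketch of what such a first example typically looks like (a vector-add-style kernel is assumed here; names, sizes, and the managed-memory choice are illustrative, not the repo's actual code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per element, guarded against overrun.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // bounds guard
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int block = 256;
    int grid = (n + block - 1) / block;   // round up to cover all elements
    vecAdd<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();

    // Deterministic inputs and an explicit check, in the repo's PASS/FAIL style.
    bool ok = true;
    for (int i = 0; i < n; ++i) if (c[i] != 3.0f) { ok = false; break; }
    printf(ok ? "PASS\n" : "FAIL\n");
    cudaFree(a); cudaFree(b); cudaFree(c);
    return ok ? 0 : 1;
}
```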
Interleaved and lower-divergence reductions side by side.
Why it matters: reduction is the pattern that keeps returning later.
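A sketch of the two block-level schemes such a comparison usually puts side by side (assumed shapes, not the repo's exact code; both expect the shared-memory size passed at launch):

```cuda
// Interleaved addressing: the modulo test scatters active threads
// across warps, so every warp stays partially active and diverges.
__global__ void reduceInterleaved(const float* in, float* out, int n) {
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        if (tid % (2 * stride) == 0)        // divergent predicate
            s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0];
}

// Sequential addressing: active threads stay contiguous, so whole
// warps retire early instead of diverging.
__global__ void reduceSequential(const float* in, float* out, int n) {
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0];
}
```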
Per-block privatization reduces global atomic pressure before the flush.
Why it matters: contention tradeoffs become visible instead of abstract.
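The privatization idea can be sketched as follows (a 256-bin byte histogram is assumed; the global histogram must be zeroed before launch):

```cuda
// Each block accumulates into a shared-memory copy with cheap shared
// atomics, then flushes once per bin with a single global atomic.
__global__ void histPrivatized(const unsigned char* data, int n,
                               unsigned int* globalHist) {
    __shared__ unsigned int localHist[256];
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        localHist[b] = 0;                        // zero the private copy
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&localHist[data[i]], 1u);      // shared-memory atomic
    __syncthreads();

    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        atomicAdd(&globalHist[b], localHist[b]); // one global atomic per bin
}
```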
Shared-memory tiles turn repeated global loads into reusable local data.
Why it matters: a classic example of reuse changing performance shape.
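The tiled pattern, in a hedged sketch (TILE and the divisibility assumption are illustrative; the repo's kernel may differ):

```cuda
#define TILE 16
// Each thread stages one element of A and one of B per phase, then the
// whole block reuses the staged tiles from shared memory.
__global__ void gemmTiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {        // assumes N % TILE == 0
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                        // tile fully staged
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                        // done reading before overwrite
    }
    C[row * N + col] = acc;
}
```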
Registers and warp shuffles replace part of the shared-memory tree.
Why it matters: a strong bridge from block-level to warp-level thinking.
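A sketch of the shuffle-based stage (assumes blockDim.x is a multiple of 32 and `*out` is zeroed before launch; not the repo's exact code):

```cuda
// Intra-warp tree in registers: no shared memory, no __syncthreads.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;  // lane 0 holds the warp's sum
}

__global__ void reduceWarpShuffle(const float* in, float* out, int n) {
    __shared__ float warpSums[32];              // one slot per warp
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (i < n) ? in[i] : 0.0f;
    val = warpReduceSum(val);
    int lane = threadIdx.x & 31, warp = threadIdx.x >> 5;
    if (lane == 0) warpSums[warp] = val;
    __syncthreads();
    if (warp == 0) {                            // first warp reduces the sums
        val = (lane < blockDim.x / 32) ? warpSums[lane] : 0.0f;
        val = warpReduceSum(val);
        if (lane == 0) atomicAdd(out, val);
    }
}
```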
Compares a conflict-prone transpose tile with a padded shared-memory layout.
Why it matters: the cost of the wrong layout is easy to reason about.
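The padded variant can be sketched like this (a square matrix with width divisible by the tile size is assumed):

```cuda
#define TDIM 32
// The +1 column shifts each tile row into a different bank, so the
// column-wise read in the second phase no longer serializes.
__global__ void transposePadded(const float* in, float* out, int width) {
    __shared__ float tile[TDIM][TDIM + 1];      // +1 breaks bank conflicts
    int x = blockIdx.x * TDIM + threadIdx.x;
    int y = blockIdx.y * TDIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read
    __syncthreads();
    x = blockIdx.y * TDIM + threadIdx.x;        // swap block coordinates
    y = blockIdx.x * TDIM + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```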
A tiled 2D stencil that stages center data and halo cells cooperatively.
Why it matters: shows shared-memory staging outside matrix multiply.
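A hedged sketch of the halo-staging shape (a 5-point stencil with radius 1 and clamped borders is assumed; coefficients are illustrative):

```cuda
#define RADIUS 1
#define BDIM 16
__global__ void stencil2d(const float* in, float* out, int w, int h) {
    __shared__ float tile[BDIM + 2 * RADIUS][BDIM + 2 * RADIUS];
    int gx = blockIdx.x * BDIM + threadIdx.x;
    int gy = blockIdx.y * BDIM + threadIdx.y;
    // Cooperative load: each thread may stage more than one cell,
    // so the halo ring is filled without a separate code path.
    for (int ly = threadIdx.y; ly < BDIM + 2 * RADIUS; ly += BDIM)
        for (int lx = threadIdx.x; lx < BDIM + 2 * RADIUS; lx += BDIM) {
            int sx = blockIdx.x * BDIM + lx - RADIUS;
            int sy = blockIdx.y * BDIM + ly - RADIUS;
            sx = min(max(sx, 0), w - 1);        // clamp at the border
            sy = min(max(sy, 0), h - 1);
            tile[ly][lx] = in[sy * w + sx];
        }
    __syncthreads();
    if (gx < w && gy < h) {
        int lx = threadIdx.x + RADIUS, ly = threadIdx.y + RADIUS;
        out[gy * w + gx] = 0.2f * (tile[ly][lx] + tile[ly][lx - 1] +
                                   tile[ly][lx + 1] + tile[ly - 1][lx] +
                                   tile[ly + 1][lx]);
    }
}
```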
A tiled edge detector with halo loads and explicit neighborhood reuse.
Why it matters: a practical imaging kernel with readable optimization structure.
A row-wise normalization kernel built from shared reduction primitives.
Why it matters: familiar CUDA patterns reappear inside modern ML code.
Stable row-wise softmax with a max reduction followed by denominator reduction.
Why it matters: numerical stability and synchronization both matter here.
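A sketch of the three-pass structure (one block per row and a shared-memory reduction buffer are assumed; not the repo's exact code):

```cuda
__global__ void softmaxRow(const float* in, float* out, int cols) {
    extern __shared__ float s[];                // blockDim.x floats at launch
    const float* row = in + blockIdx.x * cols;
    float* orow = out + blockIdx.x * cols;
    int tid = threadIdx.x;

    // Pass 1: row max, for numerical stability.
    float m = -INFINITY;
    for (int c = tid; c < cols; c += blockDim.x) m = fmaxf(m, row[c]);
    s[tid] = m; __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] = fmaxf(s[tid], s[tid + stride]);
        __syncthreads();
    }
    m = s[0]; __syncthreads();                  // everyone reads before reuse

    // Pass 2: denominator, sum of exp(x - max).
    float sum = 0.0f;
    for (int c = tid; c < cols; c += blockDim.x) sum += expf(row[c] - m);
    s[tid] = sum; __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    sum = s[0];

    // Pass 3: normalize.
    for (int c = tid; c < cols; c += blockDim.x)
        orow[c] = expf(row[c] - m) / sum;
}
```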
Two small dense layers, ReLU, and softmax composed into one readable forward pass.
Why it matters: the repo now ends with a compact end-to-end kernel sequence.
Advanced Track
`101-110` covers segmented primitives, warp aggregation, and filtering-style kernels. (3 implemented / 10 total)
`111-120` studies warp shuffle, bank conflicts, coalescing, transpose ladders, and halo tiles. (5 implemented / 10 total)
`121-130` is reserved for sparse formats, graph frontiers, hashing, and irregular access patterns. (0 implemented / 10 total)
`131-140` extends the core image and simulation path with more optimization-heavy studies. (2 implemented / 10 total)
`141-150` turns core reduction and tiling ideas into normalization, softmax, pricing, and inference studies. (4 implemented / 10 total)
Best Starting Points
Repo Maturity
The core path is nearly complete. The remaining scaffolds are the factorization-heavy `057-060` block.
The advanced track now has its first implemented subset, with the rest scaffolded so the curriculum is visible before every topic is filled in.
How To Study This Repo
Build the baseline and execute the example with deterministic inputs.
Read the CPU reference path and confirm the PASS or FAIL output first.
Identify which thread owns each output and where synchronization is required.
Study what the stronger version changes in reuse, memory traffic, divergence, or contention.
Only then look at timing, throughput, and scaling behavior.
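The ownership question in the steps above can be sketched for a 2D kernel (names are illustrative): each thread derives exactly one output coordinate from its block and thread indices, and threads past the edge own nothing.

```cuda
__global__ void ownership2d(float* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column this thread owns
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row this thread owns
    if (x < w && y < h)            // guard: out-of-range threads write nothing
        out[y * w + x] = (float)(y * w + x);
}
```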