Foundations first
Start with indexing, simple transforms, stencils, reduction, scan, histograms, and dense kernels before jumping into more irregular work.
CUDA Study Library
150 CUDA examples: 100 core PMPP-style examples + 50 advanced studies.
Run, validate, inspect, optimize.
Built as a study library rather than a sample dump: deterministic inputs, CPU reference checks, comments around mapping and memory behavior, and advanced follow-on studies that build on the core patterns instead of replacing them.
Why This Repo
Examples are built to compile, validate, and explain. CPU references, deterministic inputs, and explicit PASS or FAIL output keep the study path grounded.
The `101-150` range is a second track for warp-level programming, memory studies, sparse / graph kernels, simulation, and practical ML operators.
Learning Path
Kernel launches, indexing, simple transforms, stencils, and early shared-memory structure.
Reduction, scan, histogram, compaction, gather, scatter, sorting, and search-oriented kernels.
Matrix-vector multiply, naive and tiled GEMM, convolution, sparse operators, and iterative methods.
Resize, filtering, edge detection, signal processing, and encoding-style kernels.
Simulation, rendering, graph traversal, clustering, and ML-oriented kernels in the core track.
Warp primitives, memory studies, sparse / irregular kernels, imaging, simulation, and practical ML operators.
Featured Examples
Clean host-device workflow with minimal algorithmic noise.
Why it matters: the clearest first CUDA example in the repo.
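As a hedged sketch of what such a first example typically looks like (a vector-add-style kernel is assumed here; names, sizes, and the managed-memory choice are illustrative, not the repo's actual code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per element, guarded against overrun.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // bounds guard
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int block = 256;
    int grid = (n + block - 1) / block;   // round up to cover all elements
    vecAdd<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();

    // Deterministic inputs and an explicit check, in the repo's PASS/FAIL style.
    bool ok = true;
    for (int i = 0; i < n; ++i) if (c[i] != 3.0f) { ok = false; break; }
    printf(ok ? "PASS\n" : "FAIL\n");
    cudaFree(a); cudaFree(b); cudaFree(c);
    return ok ? 0 : 1;
}
```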
Interleaved and lower-divergence reductions side by side.
Why it matters: reduction is the pattern that keeps returning later.
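A sketch of the two block-level schemes such a comparison usually puts side by side (assumed shapes, not the repo's exact code; both expect the shared-memory size passed at launch):

```cuda
// Interleaved addressing: the modulo test scatters active threads
// across warps, so every warp stays partially active and diverges.
__global__ void reduceInterleaved(const float* in, float* out, int n) {
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        if (tid % (2 * stride) == 0)        // divergent predicate
            s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0];
}

// Sequential addressing: active threads stay contiguous, so whole
// warps retire early instead of diverging.
__global__ void reduceSequential(const float* in, float* out, int n) {
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0];
}
```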
Per-block privatization reduces global atomic pressure before the flush.
Why it matters: contention tradeoffs become visible instead of abstract.
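The privatization idea can be sketched as follows (a 256-bin byte histogram is assumed; the global histogram must be zeroed before launch):

```cuda
// Each block accumulates into a shared-memory copy with cheap shared
// atomics, then flushes once per bin with a single global atomic.
__global__ void histPrivatized(const unsigned char* data, int n,
                               unsigned int* globalHist) {
    __shared__ unsigned int localHist[256];
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        localHist[b] = 0;                        // zero the private copy
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&localHist[data[i]], 1u);      // shared-memory atomic
    __syncthreads();

    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        atomicAdd(&globalHist[b], localHist[b]); // one global atomic per bin
}
```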
Shared-memory tiles turn repeated global loads into reusable local data.
Why it matters: a classic example of reuse changing performance shape.
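The tiled pattern, in a hedged sketch (TILE and the divisibility assumption are illustrative; the repo's kernel may differ):

```cuda
#define TILE 16
// Each thread stages one element of A and one of B per phase, then the
// whole block reuses the staged tiles from shared memory.
__global__ void gemmTiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {        // assumes N % TILE == 0
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                        // tile fully staged
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                        // done reading before overwrite
    }
    C[row * N + col] = acc;
}
```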
Registers and warp shuffles replace part of the shared-memory tree.
Why it matters: a strong bridge from block-level to warp-level thinking.
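A sketch of the shuffle-based stage (assumes blockDim.x is a multiple of 32 and `*out` is zeroed before launch; not the repo's exact code):

```cuda
// Intra-warp tree in registers: no shared memory, no __syncthreads.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;  // lane 0 holds the warp's sum
}

__global__ void reduceWarpShuffle(const float* in, float* out, int n) {
    __shared__ float warpSums[32];              // one slot per warp
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (i < n) ? in[i] : 0.0f;
    val = warpReduceSum(val);
    int lane = threadIdx.x & 31, warp = threadIdx.x >> 5;
    if (lane == 0) warpSums[warp] = val;
    __syncthreads();
    if (warp == 0) {                            // first warp reduces the sums
        val = (lane < blockDim.x / 32) ? warpSums[lane] : 0.0f;
        val = warpReduceSum(val);
        if (lane == 0) atomicAdd(out, val);
    }
}
```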
Compares a conflict-prone transpose tile with a padded shared-memory layout.
Why it matters: the cost of the wrong layout is easy to reason about.
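The padded variant can be sketched like this (a square matrix with width divisible by the tile size is assumed):

```cuda
#define TDIM 32
// The +1 column shifts each tile row into a different bank, so the
// column-wise read in the second phase no longer serializes.
__global__ void transposePadded(const float* in, float* out, int width) {
    __shared__ float tile[TDIM][TDIM + 1];      // +1 breaks bank conflicts
    int x = blockIdx.x * TDIM + threadIdx.x;
    int y = blockIdx.y * TDIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read
    __syncthreads();
    x = blockIdx.y * TDIM + threadIdx.x;        // swap block coordinates
    y = blockIdx.x * TDIM + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```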
A tiled 2D stencil that stages center data and halo cells cooperatively.
Why it matters: shows shared-memory staging outside matrix multiply.
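A hedged sketch of the halo-staging shape (a 5-point stencil with radius 1 and clamped borders is assumed; coefficients are illustrative):

```cuda
#define RADIUS 1
#define BDIM 16
__global__ void stencil2d(const float* in, float* out, int w, int h) {
    __shared__ float tile[BDIM + 2 * RADIUS][BDIM + 2 * RADIUS];
    int gx = blockIdx.x * BDIM + threadIdx.x;
    int gy = blockIdx.y * BDIM + threadIdx.y;
    // Cooperative load: each thread may stage more than one cell,
    // so the halo ring is filled without a separate code path.
    for (int ly = threadIdx.y; ly < BDIM + 2 * RADIUS; ly += BDIM)
        for (int lx = threadIdx.x; lx < BDIM + 2 * RADIUS; lx += BDIM) {
            int sx = blockIdx.x * BDIM + lx - RADIUS;
            int sy = blockIdx.y * BDIM + ly - RADIUS;
            sx = min(max(sx, 0), w - 1);        // clamp at the border
            sy = min(max(sy, 0), h - 1);
            tile[ly][lx] = in[sy * w + sx];
        }
    __syncthreads();
    if (gx < w && gy < h) {
        int lx = threadIdx.x + RADIUS, ly = threadIdx.y + RADIUS;
        out[gy * w + gx] = 0.2f * (tile[ly][lx] + tile[ly][lx - 1] +
                                   tile[ly][lx + 1] + tile[ly - 1][lx] +
                                   tile[ly + 1][lx]);
    }
}
```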
A tiled edge detector with halo loads and explicit neighborhood reuse.
Why it matters: a practical imaging kernel with readable optimization structure.
A row-wise normalization kernel built from shared reduction primitives.
Why it matters: familiar CUDA patterns reappear inside modern ML code.
Stable row-wise softmax with a max reduction followed by denominator reduction.
Why it matters: numerical stability and synchronization both matter here.
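A sketch of the three-pass structure (one block per row and a shared-memory reduction buffer are assumed; not the repo's exact code):

```cuda
__global__ void softmaxRow(const float* in, float* out, int cols) {
    extern __shared__ float s[];                // blockDim.x floats at launch
    const float* row = in + blockIdx.x * cols;
    float* orow = out + blockIdx.x * cols;
    int tid = threadIdx.x;

    // Pass 1: row max, for numerical stability.
    float m = -INFINITY;
    for (int c = tid; c < cols; c += blockDim.x) m = fmaxf(m, row[c]);
    s[tid] = m; __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] = fmaxf(s[tid], s[tid + stride]);
        __syncthreads();
    }
    m = s[0]; __syncthreads();                  // everyone reads before reuse

    // Pass 2: denominator, sum of exp(x - max).
    float sum = 0.0f;
    for (int c = tid; c < cols; c += blockDim.x) sum += expf(row[c] - m);
    s[tid] = sum; __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    sum = s[0];

    // Pass 3: normalize.
    for (int c = tid; c < cols; c += blockDim.x)
        orow[c] = expf(row[c] - m) / sum;
}
```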
Two small dense layers, ReLU, and softmax composed into one readable forward pass.
Why it matters: the repo now ends with a compact end-to-end kernel sequence.
Advanced Track
`101-110` covers segmented primitives, warp aggregation, and filtering-style kernels. (3 implemented / 10 total)
`111-120` studies warp shuffle, bank conflicts, coalescing, transpose ladders, and halo tiles. (5 implemented / 10 total)
`121-130` is reserved for sparse formats, graph frontiers, hashing, and irregular access patterns. (0 implemented / 10 total)
`131-140` extends the core image and simulation path with more optimization-heavy studies. (2 implemented / 10 total)
`141-150` turns core reduction and tiling ideas into normalization, softmax, pricing, and inference studies. (4 implemented / 10 total)
Best Starting Points
Repo Maturity
The core path is nearly complete. The remaining scaffolds are the factorization-heavy `057-060` block.
The advanced track now has its first implemented subset, with the rest scaffolded so the curriculum is visible before every topic is filled in.
How To Study This Repo
Build the baseline and execute the example with deterministic inputs.
Read the CPU reference path and confirm the PASS or FAIL output first.
Identify which thread owns each output and where synchronization is required.
Study what the stronger version changes in reuse, memory traffic, divergence, or contention.
Only then look at timing, throughput, and scaling behavior.
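The ownership question in the steps above can be sketched for a 2D kernel (names are illustrative): each thread derives exactly one output coordinate from its block and thread indices, and threads past the edge own nothing.

```cuda
__global__ void ownership2d(float* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column this thread owns
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row this thread owns
    if (x < w && y < h)            // guard: out-of-range threads write nothing
        out[y * w + x] = (float)(y * w + x);
}
```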