GPU Practice Codes
Practice GPU kernels with CUDA for common operations.
| Code | Preprocessor Flags | Brief description |
|---|---|---|
| matrix multiplication | NAIVE | simplest, high latency |
| | TILED | 2D tiling |
| | TILED_COARSENED | 2D tiling + thread coarsening |
| convolution | NAIVE | simplest, high latency |
| | CONSTMEM | constant memory for filter |
| | TILED_CONSTANTMEM_TYPE_1 | CONSTMEM + 2D tiling with more threads than output tile |
| | TILED_CONSTANTMEM_TYPE_2 | CONSTMEM + 2D tiling with fewer threads than input tile |
| | TILED_CONSTANTMEM_CACHEHALO | TILED_CONSTANTMEM_TYPE_2 with halo cells loaded from memory (L2 cache) |
| transpose | NAIVE | simplest, thread coarsening, high latency |
| | CORNER_TURNING | thread coarsening + tiling with corner turning |
| | NO_BANK_CONFLICT | CORNER_TURNING without bank conflicts |
| image rotation | NAIVE | simplest, thread coarsening, high latency |
| | CORNER_TURNING | thread coarsening + similar to corner turning without bank conflicts |
| stencil (needs cleanup) | NAIVE | simplest, high latency |
| | TILED | 2D tiling |
| | THREADCOARSENING | 2D tiling + thread coarsening |
| | REGISTERTILING_THREADCOARSENING | 2D tiling + thread coarsening with smart use of registers |
| streams | - | overlapping kernel execution with cudaMemcpyAsync |
| histogram | NAIVE | simplest, high contention |
| | PRIVATIZATION | privatization to reduce contention |
| | THREADCOARSENING_CONTIGUOUS | PRIVATIZATION + thread coarsening with contiguous accesses per thread |
| | THREADCOARSENING_INTERLEAVED | PRIVATIZATION + thread coarsening with interleaved accesses for better memory coalescing |
| | AGGREGATION | THREADCOARSENING_INTERLEAVED + accumulator (best for non-uniform distribution of data) |
| reduction | NAIVE | simplest, single block, control divergence |
| | CONVERGENT | single block, no control divergence |
| | SHAREDMEM | CONVERGENT + use of shared memory |
| | HIERARCHICAL | SHAREDMEM + multiple blocks per grid |
| | THREADCOARSENING | HIERARCHICAL + thread coarsening |
| prefix sum | KOGG_STONE | Kogge-Stone algorithm |
| | KOGG_STONE_DOUBLE_BUFFER | Kogge-Stone with double buffering |
| | BRENT_KUNG | Brent-Kung algorithm |
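As a taste of what the TILED matrix-multiplication variant looks like, here is a minimal sketch of the standard shared-memory tiling pattern (the kernel name, the `TILE` size, and the assumption that `N` is a multiple of `TILE` are illustrative, not taken from the repo):

```cuda
#define TILE 16

// Tiled matrix multiply: C = A * B, all square N x N, N a multiple of TILE.
// Each block computes one TILE x TILE tile of C, staging tiles of A and B
// through shared memory so each global element is loaded once per block
// instead of once per output element.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Cooperative load: one element of A and one of B per thread.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // don't overwrite tiles still being read
    }
    C[row * N + col] = acc;
}
```

The TILED_COARSENED variant extends this by having each thread produce several output elements, reusing the staged `As` tile across them.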
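The CONSTMEM convolution variant places the (small, read-only) filter in constant memory, which is cached and broadcast efficiently when all threads read the same coefficient. A sketch, with illustrative names and radius (the host would fill `d_filter` via `cudaMemcpyToSymbol`):

```cuda
#define FILTER_RADIUS 2

// Filter in constant memory: cached on chip and broadcast to all threads.
__constant__ float d_filter[2 * FILTER_RADIUS + 1][2 * FILTER_RADIUS + 1];

// 2D convolution; out-of-bounds (ghost) cells are treated as zero.
__global__ void conv2d_constmem(const float *in, float *out,
                                int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= height || col >= width) return;

    float acc = 0.0f;
    for (int fy = -FILTER_RADIUS; fy <= FILTER_RADIUS; ++fy)
        for (int fx = -FILTER_RADIUS; fx <= FILTER_RADIUS; ++fx) {
            int r = row + fy, c = col + fx;
            if (r >= 0 && r < height && c >= 0 && c < width)
                acc += d_filter[fy + FILTER_RADIUS][fx + FILTER_RADIUS]
                       * in[r * width + c];
        }
    out[row * width + col] = acc;
}
```

The TILED_CONSTANTMEM variants add shared-memory tiling of the input on top of this, differing in whether the thread block matches the input tile or the output tile.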
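For transpose, the NO_BANK_CONFLICT idea is a one-line change on top of corner turning: pad the shared-memory tile by one column so column-wise reads land in different banks. A sketch in the style of the classic NVIDIA transpose example (names and tile sizes are illustrative):

```cuda
#define TILE_DIM   32
#define BLOCK_ROWS 8   // thread coarsening: each thread copies 4 rows

// Corner turning: read a tile row-wise (coalesced), write it out row-wise in
// the transposed position (also coalesced); the re-indexing happens in shared
// memory. The "+ 1" padding shifts each tile row to a different bank, so the
// column-wise reads at the end cause no bank conflicts.
__global__ void transpose_no_conflict(const float *in, float *out,
                                      int width, int height) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < width && (y + j) < height)
            tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * width + x];
    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;   // swapped block offsets
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < height && (y + j) < width)
            out[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
}
```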
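The streams entry refers to the usual copy/compute pipelining pattern on the host side. A hedged fragment (variable names, chunk count, and `my_kernel` are placeholders; it assumes `h_in`/`h_out` are pinned host buffers from `cudaMallocHost` and `n` divides evenly):

```cuda
// Pipeline H2D copy, kernel, and D2H copy across two streams so transfers
// overlap with compute. Overlap requires pinned (page-locked) host memory.
const int NSTREAMS = 2;
cudaStream_t streams[NSTREAMS];
for (int i = 0; i < NSTREAMS; ++i) cudaStreamCreate(&streams[i]);

int chunk = n / NSTREAMS;
for (int i = 0; i < NSTREAMS; ++i) {
    int off = i * chunk;
    cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[i]);
    my_kernel<<<chunk / 256, 256, 0, streams[i]>>>(d_in + off, d_out + off, chunk);
    cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[i]);
}
for (int i = 0; i < NSTREAMS; ++i) cudaStreamSynchronize(streams[i]);
```

Because each chunk's copy-in, kernel, and copy-out sit in the same stream, ordering within a chunk is preserved while different chunks proceed concurrently.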
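The histogram PRIVATIZATION variant gives each block a private shared-memory histogram so most atomics hit fast shared memory rather than contending in global memory. A sketch (bin count, names, and the interleaved grid-stride loop are illustrative):

```cuda
#define NUM_BINS 256

// Each block accumulates into a private shared-memory histogram, then merges
// its partial counts into the global one: one global atomicAdd per bin per
// block instead of one per input element.
__global__ void histogram_private(const unsigned char *data,
                                  unsigned int *hist, int n) {
    __shared__ unsigned int local[NUM_BINS];
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) local[b] = 0;
    __syncthreads();

    int stride = gridDim.x * blockDim.x;   // interleaved -> coalesced loads
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&local[data[i]], 1u);
    __syncthreads();

    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        if (local[b] > 0) atomicAdd(&hist[b], local[b]);
}
```

The AGGREGATION variant goes one step further: a thread keeps a running count while consecutive inputs fall in the same bin and only issues the atomic when the bin changes, which pays off on heavily skewed data.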
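For reduction, the SHAREDMEM variant combines the convergent access pattern (active threads stay contiguous, so whole warps retire together) with a shared-memory scratchpad. A sketch for one block; the per-block partial sums would be combined in a second pass or, as in the HIERARCHICAL variant, with an atomic add (names are illustrative, `blockDim.x` assumed a power of two):

```cuda
// Convergent shared-memory reduction. Each block reduces 2*blockDim.x inputs;
// launch with shared memory size blockDim.x * sizeof(float).
__global__ void reduce_shared(const float *in, float *partial, int n) {
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x * 2 + threadIdx.x;

    float v = 0.0f;                                   // identity for out-of-range
    if (i < n)               v  = in[i];
    if (i + blockDim.x < n)  v += in[i + blockDim.x]; // first add during load
    sdata[tid] = v;
    __syncthreads();

    // Halve the active range each step; threads [0, s) stay contiguous,
    // so there is no control divergence until s drops below warp size.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = sdata[0];
}
```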
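Finally, a sketch of the KOGG_STONE prefix-sum pattern for a single block (section size and names are illustrative). Without double buffering, a second `__syncthreads()` is needed inside the loop so that no thread overwrites a value another thread still has to read at the current stride; the double-buffer variant trades that barrier for a second shared array:

```cuda
#define SECTION_SIZE 256   // one block scans one SECTION_SIZE-element section

// Kogge-Stone inclusive scan: after step with stride s, each element holds
// the sum of itself and the s elements before it.
__global__ void koggestone_scan(const float *in, float *out, int n) {
    __shared__ float buf[SECTION_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();                         // reads see last step's writes
        float tmp = 0.0f;
        if (threadIdx.x >= stride) tmp = buf[threadIdx.x - stride];
        __syncthreads();                         // all reads done before writes
        if (threadIdx.x >= stride) buf[threadIdx.x] += tmp;
    }
    if (i < n) out[i] = buf[threadIdx.x];
}
```

Brent-Kung (BRENT_KUNG) does fewer additions overall via an up-sweep/down-sweep tree, at the cost of more steps.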
More examples will be added over time.
References
- Programming Massively Parallel Processors: A Hands-On Approach (4th edn.) by Wen-mei W. Hwu, David B. Kirk, Izzat El Hajj
- CUDA Training Series organized by NVIDIA and Oak Ridge National Laboratory
- NVIDIA blogs