GPU Practice Codes

Practice GPU kernels with CUDA for common operations.

Code                     Preprocessor flag                Brief description
matrix multiplication    NAIVE                            simplest, high latency
                         TILED                            2D tiling
                         TILED_COARSENED                  2D tiling + thread coarsening
convolution              NAIVE                            simplest, high latency
                         CONSTMEM                         constant memory for the filter
                         TILED_CONSTANTMEM_TYPE_1         CONSTMEM + 2D tiling with more threads than the output tile
                         TILED_CONSTANTMEM_TYPE_2         CONSTMEM + 2D tiling with fewer threads than the input tile
                         TILED_CONSTANTMEM_CACHEHALO      TILED_CONSTANTMEM_TYPE_2 with halo cells read from global memory (relying on the L2 cache)
transpose                NAIVE                            simplest, thread coarsening, high latency
                         CORNER_TURNING                   thread coarsening + tiling with corner turning
                         NO_BANK_CONFLICT                 CORNER_TURNING without shared-memory bank conflicts
image rotation           NAIVE                            simplest, thread coarsening, high latency
                         CORNER_TURNING                   thread coarsening + a scheme similar to corner turning, without bank conflicts
stencil (needs cleanup)  NAIVE                            simplest, high latency
                         TILED                            2D tiling
                         THREADCOARSENING                 2D tiling + thread coarsening
                         REGISTERTILING_THREADCOARSENING  2D tiling with register tiling (smart use of registers)
streams                  -                                overlapping kernel execution with cudaMemcpyAsync transfers
histogram                NAIVE                            simplest, high contention
                         PRIVATIZATION                    privatization to reduce contention
                         THREADCOARSENING_CONTIGUOUS      PRIVATIZATION + thread coarsening with contiguous accesses per thread
                         THREADCOARSENING_INTERLEAVED     PRIVATIZATION + thread coarsening with interleaved accesses for better memory coalescing
                         AGGREGATION                      THREADCOARSENING_INTERLEAVED + an accumulator (best for non-uniform data distributions)
reduction                NAIVE                            simplest, single block, control divergence
                         CONVERGENT                       single block, no control divergence
                         SHAREDMEM                        CONVERGENT + use of shared memory
                         HIERARCHICAL                     SHAREDMEM + multiple blocks per grid
                         THREADCOARSENING                 HIERARCHICAL + thread coarsening
prefix sum               KOGG_STONE                       Kogge-Stone algorithm
                         KOGG_STONE_DOUBLE_BUFFER         KOGG_STONE with double buffering
                         BRENT_KUNG                       Brent-Kung algorithm


More examples will be added over time.
