GPU Practice Codes | Saurabh S. Sawant

Practice GPU kernels with CUDA for common operations.

Code	Preprocessor Flags	Brief description
matrix multiplication	`NAIVE`	simplest, high latency
	`TILED`	2D tiling
	`TILED_COARSENED`	2D tiling + thread coarsening
convolution	`NAIVE`	simplest, high latency
	`CONSTMEM`	constant memory for filter
	`TILED_CONSTANTMEM_TYPE_1`	`CONSTMEM` + 2D tiling with more threads than output tile
	`TILED_CONSTANTMEM_TYPE_2`	`CONSTMEM` + 2D tiling with fewer threads than input tile
	`TILED_CONSTANTMEM_CACHEHALO`	`TILED_CONSTANTMEM_TYPE_2` with halo cells read from global memory (relying on the L2 cache)
transpose	`NAIVE`	simplest, thread coarsening, high latency
	`CORNER_TURNING`	thread coarsening + tiling with corner turning
	`NO_BANK_CONFLICT`	`CORNER_TURNING` without bank conflicts
image rotation	`NAIVE`	simplest, thread coarsening, high latency
	`CORNER_TURNING`	thread coarsening + tiling similar to corner turning, without bank conflicts
stencil (needs cleanup)	`NAIVE`	simplest, high latency
	`TILED`	2D tiling
	`THREADCOARSENING`	2D tiling + thread coarsening
	`REGISTERTILING_THREADCOARSENING`	2D tiling + thread coarsening + register tiling
streams	-	overlapping kernel execution with `cudaMemcpyAsync` transfers
histogram	`NAIVE`	simplest, high contention
	`PRIVATIZATION`	privatization to reduce contention
	`THREADCOARSENING_CONTIGUOUS`	`PRIVATIZATION` + thread coarsening with contiguous accesses per thread
	`THREADCOARSENING_INTERLEAVED`	`PRIVATIZATION` + thread coarsening with interleaved accesses for better memory coalescing
	`AGGREGATION`	`THREADCOARSENING_INTERLEAVED` + accumulator (best for non-uniform distribution of data)
reduction	`NAIVE`	simplest, single block, control divergence
	`CONVERGENT`	single block, no control divergence
	`SHAREDMEM`	`CONVERGENT` + use of shared memory
	`HIERARCHICAL`	`SHAREDMEM` + multiple blocks per grid
	`THREADCOARSENING`	`HIERARCHICAL` + thread coarsening
prefix sum	`KOGG_STONE`	Kogge-Stone algorithm
	`KOGG_STONE_DOUBLE_BUFFER`	`KOGG_STONE` with double buffering technique
	`BRENT_KUNG`	Brent-Kung algorithm
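
To give a flavor of what the `TILED` matrix multiplication entry refers to, here is a minimal sketch of 2D shared-memory tiling. It is not the code from the repository: the names `matmul_tiled` and `matmul_host`, the tile width `TILE = 16`, and the restriction to square row-major matrices are all assumptions made for illustration, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 16;  // assumed tile width; a real kernel would tune this

// C = A * B for square n x n row-major matrices, one output element per thread.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Walk over the tiles of A (along a row) and B (along a column).
    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        // Each thread loads one element of each tile; out-of-range entries become 0.
        As[threadIdx.y][threadIdx.x] = (row < n && a_col < n) ? A[row * n + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < n && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();  // tiles fully loaded before anyone reads them

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the next load overwrites
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host helper: copies inputs to the device, launches the kernel,
// and returns the result. Return codes are ignored here for brevity.
std::vector<float> matmul_host(const std::vector<float>& A,
                               const std::vector<float>& B, int n) {
    size_t bytes = static_cast<size_t>(n) * n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmul_tiled<<<grid, block>>>(dA, dB, dC, n);

    std::vector<float> C(static_cast<size_t>(n) * n);
    // A blocking cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

Compared with a naive kernel, each element of `A` and `B` is read from global memory once per tile rather than once per multiply-add, which is the main point of the tiling. Calling `matmul_host(A, B, n)` with two flattened `n * n` vectors returns the flattened product.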

More examples will be added over time.

References

Programming Massively Parallel Processors: A Hands-On Approach (4th ed.) by Wen-mei W. Hwu, David B. Kirk, and Izzat El Hajj
CUDA Training Series organized by NVIDIA and Oak Ridge National Laboratory
NVIDIA blogs
