High Performance Computing: GPU programming using CUDA
Synopsis High_Performance_Computing_2009_2010.pdf
Course project will be assigned towards the end of the course.
Master official page at Universitat Pompeu Fabra
Setting up
The first part of the course will be carried out on your laptops using device emulation. The instructions below are for a Linux system (they also apply, with minor modifications, to Mac OS X). Please install the following items from http://www.nvidia.com/object/cuda_get.html:
- If you have a CUDA-compatible graphics card (see the list at http://www.nvidia.com/object/cuda_learn_products.html), install the CUDA 2.3 driver
- Install the CUDA 2.3 toolkit
- Install the CUDA 2.3 SDK code samples
You need to add the CUDA compilers to PATH and the CUDA shared libraries to LD_LIBRARY_PATH (as indicated during installation of the toolkit). If you do not have a CUDA-compatible Nvidia card, build in device emulation mode with
export emu=1; make
Note to Mac OS X users: if you do not already have the gcc compiler, you will need to install the Xcode Tools package from the Mac OS X Install DVD.
Note to Ubuntu 9.10 users: follow the steps on this page or on this other page (section Installing the CUDA SDK and Compiling the Example Programs).
CUDA Programming Guide
The C Programming Language, http://en.wikipedia.org/wiki/The_C_Programming_Language_(book)
NVIDIA_CUDA_Programming_Guide_2.3.pdf
NVIDIA_CUDA_BestPracticesGuide_2.3.pdf
Course Material
Program 2009-2010
- Class 1 (2h) - Introduction
- Presentation of the course and the course material
- Setting up compilers
- Setting up the CUDA environment
- Practising with C: hello world, compilation, linking
- Class 2 (2h) - Introduction to C, part I INTRO_C_POINTERS.pdf
- variables, conditionals, for and while loops
- data types, operation between integers and floats
- functions
- Exercises from the Book
- Class 3 (2h) - Introduction to C, part II
- Pointers
- Arrays, strings
- Memory allocation malloc/free
- Ex. function to swap two variables, compute length of a string, copy a string, copy a string in reverse order into another
- Ex. Implement matrix.c from matrix.h interface
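The pointer and string exercises of Class 3 could look roughly like the sketch below. The function names (swap_int, str_length, str_reverse_copy) are made up for illustration and are not taken from the course material; the caller is assumed to provide a destination buffer that is large enough.

```c
#include <stddef.h>

/* Swap two ints through pointers; the caller passes the addresses. */
void swap_int(int *a, int *b)
{
    int tmp = *a;
    *a = *b;
    *b = tmp;
}

/* Length of a NUL-terminated string, not counting the terminator. */
size_t str_length(const char *s)
{
    const char *p = s;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}

/* Copy src into dst in reverse order. dst must have room for
   str_length(src) + 1 bytes and must not overlap src. */
void str_reverse_copy(char *dst, const char *src)
{
    size_t n = str_length(src);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[n - 1 - i];
    dst[n] = '\0';
}
```

Passing &x and &y to swap_int is what lets the function modify the caller's variables; passing x and y by value would only swap local copies.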
- Class 4 (2h) - Parallel environments INTRO_MULTICORES.pdf
- CPU vs GPU peak performance
- CPU vs GPU architectures
- Data and memory centrality for performance (cache, mem bandwidth)
- Compendium of parallel programming: Amdahl's Law, communication, speed-ups.
- Performing a benchmark, timing routines in C (clock).
- Ex. Partial correction of matrix.c
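As a small illustration of two Class 4 topics, the sketch below evaluates Amdahl's Law, S(n) = 1 / ((1 - p) + p / n), and times a routine with clock(). The helper names amdahl_speedup and time_cpu_seconds are hypothetical. Note that clock() reports CPU time consumed by the process, not wall-clock time.

```c
#include <time.h>

/* Amdahl's Law: speed-up on n processors when a fraction p
   (0 <= p <= 1) of the run time can be parallelized. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* CPU time in seconds consumed by fn(arg), measured with clock(). */
double time_cpu_seconds(void (*fn)(void *), void *arg)
{
    clock_t start = clock();
    fn(arg);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

For example, if half of the run time can be parallelized (p = 0.5), two processors give a speed-up of only about 1.33, and no number of processors can give more than 2.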
- Class 5 (2h) - GPU CUDA programming: "Hello world GPUs!" HelloGPUs.pdf
- CUDA memory host-device transfers (cudaMalloc,cudaFree,cudaMemcpy)
- CUDA kernels (kernels and synchronization barriers)
- CUDA cutil for error checking
- Debugging using printf in device emulation
- Ex. vector_add. Source code: vector_add.tgz
- Class 6 (2h) - Threading and data spaces
- CUDA threading (grid, block, warp).
- unidimensional block/grid partitions
- two-dimensional block/grid partitions
- Ex. Matrix multiplication
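To make the one-dimensional block/grid partition concrete, here is a CPU-only sketch in plain C that mimics how a 1D launch maps threads to array elements. The function names are hypothetical, and the two nested loops merely stand in for the blocks and threads that a GPU would run in parallel; it is not the course's vector_add code.

```c
/* Number of blocks needed to cover n elements with block_dim threads
   per block (integer ceiling division). */
int blocks_for(int n, int block_dim)
{
    return (n + block_dim - 1) / block_dim;
}

/* Global index of a thread in a one-dimensional launch; in a CUDA kernel
   this is blockIdx.x * blockDim.x + threadIdx.x. */
int global_index_1d(int block_idx, int block_dim, int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}

/* CPU emulation of a 1D vector-add launch: the two loops stand in for
   the blocks and threads that the GPU would run in parallel. */
void vector_add_emulated(const float *a, const float *b, float *c,
                         int n, int block_dim)
{
    int grid_dim = blocks_for(n, block_dim);
    for (int block = 0; block < grid_dim; block++) {
        for (int thread = 0; thread < block_dim; thread++) {
            int i = global_index_1d(block, block_dim, thread);
            if (i < n)          /* guard: the last block may overshoot n */
                c[i] = a[i] + b[i];
        }
    }
}
```

With n = 10 and 4 threads per block, three blocks are needed, and the last two threads of the third block fall outside the array; the i < n guard is what keeps them from writing out of bounds.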
- Class 7 (2h) - Running and profiling (TG) hpc_class_7.ppt
- access to the test board, environment setup and compilation
- generating cubin summary for vector_add
- CUDA_PROFILE and its output
- grid/block size limits
- comparing the profiled occupancy with the calculator
- vector_add: block vs. thread versions
- Class 8 (2h) - Profiling matrix multiplication (TG) hpc_class_8.pdf matrix_template.tgz
- review of linear array indexing
- profiling vector_add: memory-bound vs cpu-bound
- review of matrix multiply in global memory
- using and addressing shared memory for matrix multiply
Program 2008-2009
- Class 1 - Multiprocessor environments
- CPU vs GPU peak performance
- CPU vs GPU architectures
- Data and memory centrality for performance (cache, mem bandwidth)
- Performing a benchmark, timing routines
- Ex. Implement matrix.c from matrix.h interface
- Class 2 - GPU Computing and CUDA API
- CUDA threading (grid, block, warp)
- CUDA memory host-device transfers (cudaMalloc,cudaFree,cudaMemcpy)
- CUDA kernels (kernels and synchronization barriers)
- CUDA cutil for error checking
- Class 3 - Threading space
- Unidimensional block/grid partitions
- Debugging using printf
- Ex. vec_add
- Class 4 - Threading space
- Bidimensional block/grid partitions
- Ex. Matrix multiplication
- Class 5 - Shared memory
- Basic use of shared memory
- Ex. Matrix multiplication with shmem
- Class 6 - Memory access optimizations
- Measuring memory bandwidth
- Coalescence of memory operations (warp and memory banks)
- Use of CUDA data types (inside kernels and for memory access)
- Ex. Optimize matrix multiplication
- Class 7 - Synchronization problems
- Atomic operations
- Ex. histogram
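The histogram exercise shows why atomic operations are needed: several threads may try to increment the same bin at once. The sketch below expresses the idea on the CPU with C11 atomics (stdatomic.h) as a stand-in; in a CUDA kernel the corresponding increment would use atomicAdd. The function name histogram_accumulate and the choice to skip out-of-range values are assumptions made for this example.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Add each input byte to its bin. The atomic increment keeps the count
   correct even if several threads call this on the same bins at once.
   Values >= nbins are skipped (a choice made for this sketch). */
void histogram_accumulate(atomic_uint *bins, unsigned nbins,
                          const unsigned char *data, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (data[i] < nbins)
            atomic_fetch_add(&bins[data[i]], 1u);
    }
}
```

A plain bins[v]++ is a read-modify-write sequence, so two threads hitting the same bin could both read the old value and one increment would be lost; the atomic version makes the update indivisible.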
- Class 8 - Optimization
- Use of cuda_profile
- Occupancy optimization (shared memory, register space, warp)
- Thread divergence
- Class 9 - Project design evaluation I
- Class 10 - Project design evaluation II
Interesting links
Another CUDA course: http://courses.ece.uiuc.edu/ece498/al/.
Scratch space
//matrix.h
#ifndef MATRIX_H
#define MATRIX_H
/* Dense matrix of floats stored in row-major order. */
typedef struct {
    float *m;   /* N*M elements, stored row after row */
    int N;      /* number of rows */
    int M;      /* number of columns */
} matrix;
extern matrix* mat_create(int N, int M);
extern void mat_init(matrix* mat, float val);
extern void mat_destroy(matrix* mat);
extern int mat_mult(matrix* a, matrix* b, matrix* c);
/* Address of the element in row i, column j. */
static inline float* mat_idx(matrix* a, int i, int j) {
    return ((a->m)+i*(a->M)+j);
}
extern void mat_print(matrix* t);
#endif
//matrix.c
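One possible way to implement the matrix.h interface is sketched below. It is not the course's reference solution. The struct and mat_idx are repeated so that the block compiles on its own, and two conventions are assumptions: mat_create returns NULL when allocation fails, and mat_mult returns 0 on success and -1 when the shapes do not match.

```c
#include <stdio.h>
#include <stdlib.h>

/* Same layout as the matrix type declared in matrix.h:
   N rows, M columns, row-major storage in m. */
typedef struct {
    float *m;
    int N;
    int M;
} matrix;

static inline float *mat_idx(matrix *a, int i, int j)
{
    return a->m + i * a->M + j;
}

/* Allocate an N x M matrix; returns NULL if allocation fails. */
matrix *mat_create(int N, int M)
{
    matrix *mat = malloc(sizeof *mat);
    if (mat == NULL)
        return NULL;
    mat->m = malloc((size_t)N * (size_t)M * sizeof *mat->m);
    if (mat->m == NULL) {
        free(mat);
        return NULL;
    }
    mat->N = N;
    mat->M = M;
    return mat;
}

/* Set every element to val. */
void mat_init(matrix *mat, float val)
{
    for (int i = 0; i < mat->N * mat->M; i++)
        mat->m[i] = val;
}

/* Release the element storage and the struct itself. */
void mat_destroy(matrix *mat)
{
    if (mat != NULL) {
        free(mat->m);
        free(mat);
    }
}

/* c = a * b. Returns 0 on success, -1 if the shapes do not match
   (this return convention is an assumption). */
int mat_mult(matrix *a, matrix *b, matrix *c)
{
    if (a->M != b->N || c->N != a->N || c->M != b->M)
        return -1;
    for (int i = 0; i < a->N; i++) {
        for (int j = 0; j < b->M; j++) {
            float sum = 0.0f;
            for (int k = 0; k < a->M; k++)
                sum += *mat_idx(a, i, k) * *mat_idx(b, k, j);
            *mat_idx(c, i, j) = sum;
        }
    }
    return 0;
}

/* Print the matrix one row per line. */
void mat_print(matrix *t)
{
    for (int i = 0; i < t->N; i++) {
        for (int j = 0; j < t->M; j++)
            printf("%g ", *mat_idx(t, i, j));
        printf("\n");
    }
}
```

Multiplying a 2 x 3 matrix filled with 1 by a 3 x 2 matrix filled with 2 gives a 2 x 2 result in which every element is 6.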
Best course project 2009 - CUDA GAME OF LIFE
This is an implementation of the popular Conway's Game of Life using parallel computing and NVIDIA CUDA. http://en.wikipedia.org/wiki/Conway%27s_Game_of_Life.
Conway's Game of Life Rules
These are the classic rules, as originally devised by the mathematician John Conway in 1970.
The game board is a 2D matrix. Each cell in the matrix has two possible states, alive or dead, and the state of each cell depends on the cells that surround it. The rules that determine the fate of each cell at each iteration are the following:
1. Any live cell with fewer than two live neighbours dies, as if caused by underpopulation.
2. Any live cell with more than three live neighbours dies, as if by overcrowding.
3. Any live cell with two or three live neighbours lives, unchanged, to the next generation.
4. Any dead cell with exactly three live neighbours becomes a live cell.
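The four rules can be sketched as a sequential C routine that computes one generation. This is only an illustration of the rules, not the project's CUDA code: the names live_neighbours and life_step are hypothetical, the board is assumed to be stored row-major with 1 for alive and 0 for dead, and cells outside the board are assumed to be dead.

```c
/* Count live neighbours of cell (r, c) on a rows x cols board stored
   row-major (1 = alive, 0 = dead). Cells outside the board count as
   dead; that boundary choice is an assumption. */
static int live_neighbours(const unsigned char *board, int rows, int cols,
                           int r, int c)
{
    int count = 0;
    for (int dr = -1; dr <= 1; dr++) {
        for (int dc = -1; dc <= 1; dc++) {
            int rr = r + dr, cc = c + dc;
            if ((dr != 0 || dc != 0) &&
                rr >= 0 && rr < rows && cc >= 0 && cc < cols)
                count += board[rr * cols + cc];
        }
    }
    return count;
}

/* Compute one generation: read from cur, write to next (distinct buffers). */
void life_step(const unsigned char *cur, unsigned char *next,
               int rows, int cols)
{
    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++) {
            int n = live_neighbours(cur, rows, cols, r, c);
            int alive = cur[r * cols + c];
            /* Rules 1-3: a live cell survives only with 2 or 3 neighbours.
               Rule 4: a dead cell with exactly 3 neighbours becomes alive. */
            next[r * cols + c] = alive ? (n == 2 || n == 3) : (n == 3);
        }
    }
}
```

Each cell of the next generation depends only on the current generation, which is why two separate buffers are used and why the update parallelizes naturally, for instance with one thread per cell.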
Game Controls
Zoom in: “+” or Mouse wheel up
Zoom out: “-” or Mouse wheel down
Up: W or mouse click
Down: S or mouse click
Left: A or mouse click
Right: D or mouse click
Run: Hold any other key.
Source code and executable: life_project.tgz