High Performance Computing: GPU programming using CUDA

Universitat Pompeu Fabra

Synopsis: High_Performance_Computing_2009_2010.pdf

The course project will be assigned towards the end of the course.

Master official page at Universitat Pompeu Fabra

Setting up

The first part of the course will be carried out on your laptops using device emulation. The instructions below are for a Linux system (they also apply, with minor modifications, to Mac OS X). Please install the required items from http://www.nvidia.com/object/cuda_get.html.

You need to add the CUDA compilers to PATH and the CUDA shared libraries to LD_LIBRARY_PATH (as indicated during installation of the toolkit). If you do not have a CUDA-compatible NVIDIA card, build in device emulation mode with:

export emu=1;
make

Note to Mac OS X users: if you do not already have the gcc compiler, install the Xcode Tools package from the Mac OS X Install DVD.

Note to Ubuntu 9.10 users: follow the steps on this page or on this other page (section Installing the CUDA SDK and Compiling the Example Programs).

Course Material

The C Programming Language, http://en.wikipedia.org/wiki/The_C_Programming_Language_(book)

NVIDIA_CUDA_Programming_Guide_2.3.pdf

NVIDIA_CUDA_BestPracticesGuide_2.3.pdf

Interesting links: another CUDA course, http://courses.ece.uiuc.edu/ece498/al/

Lectures

Program 2009-2010

  • Class 1 (2h) - Introduction
    • Presentation of the course and the course material
    • Setting up compilers
    • Setting up the CUDA environment
    • Practising with C: hello world, compilation, linking
  • Class 2 (2h) - Introduction to C, part I INTRO_C_POINTERS.pdf

    • variables, conditionals, for and while loops
    • data types, operation between integers and floats
    • functions
    • Exercises from the book
  • Class 3 (2h) - Introduction to C, part II
    • Pointers
    • Arrays, strings
    • Memory allocation malloc/free
    • Ex. function to swap two variables, compute length of a string, copy a string, copy a string in reverse order into another
    • Ex. Implement matrix.c from matrix.h interface
  • Class 4 (2h) - Parallel environments INTRO_MULTICORES.pdf

    • CPU vs GPU peak performance
    • CPU vs GPU architectures
    • Data and memory centrality for performance (cache, mem bandwidth)
    • Concept of a parallel program: threads
    • Compendium of parallel programming: Amdahl's law, communication, speed-ups.
    • Performing a benchmark, timing routines in C (clock).
    • Ex. Partial correction of matrix.c
  • Class 5 (2h) - GPU CUDA programming: "Hello world GPUs!" HelloGPUs.pdf

    • CUDA memory host-device transfers (cudaMalloc, cudaFree, cudaMemcpy)
    • CUDA kernels (kernels and synchronization barriers)
    • CUDA cutil for error checking
    • Debugging using printf in device emulation
    • Ex. vector_add. Source code: vector_add.tgz

  • Class 6 (2h) - Threading and data spaces
    • CUDA threading (grid, block, warp).
    • unidimensional block/grid partitions
    • two-dimensional block/grid partitions
    • Ex. Matrix multiplication
  • Class 7 (2h) - Running and profiling (TG) hpc_class_7.ppt

    • access to the test board, environment setup and compilation
    • generating cubin summary for vector_add
    • CUDA_PROFILE and its output
    • grid/block size limits
    • comparing the profiled occupancy with the calculator
    • vector_add: block vs. thread versions
  • Class 8 (2h) - Profiling matrix multiplication (TG) hpc_class_8.pdf matrix_template.tgz

    • review of linear array indexing
    • profiling vector_add: memory-bound vs cpu-bound
    • review of matrix multiply in global memory
    • using and addressing shared memory for matrix multiply
  • Class 9 (2h)
    • Hands-on
  • Class 10 (2h) - More advanced optimizations
    • Measuring memory bandwidth
    • Coalescing of memory operations (warps and memory banks)
    • Thread divergence
    • Synchronization, atomic operations
    • Ex. histogram
  • Class 11 (2h) - Practical tests and matrix assignment
    • Assignment, matrix multiplication:
      • Matrix1 (1 thread/block), Matrix2 (64 threads/block), Matrix3 (64 threads/block and use of shared memory)
      • Provide a description of the cost of each implementation as reported by CUDA_PROFILE, and the registers, shared memory and occupancy for each.
    • Test: 1 hour, on paper only. You may bring only the CUDA manual; laptops are not allowed (25%)
  • Class 12 (2h) - Project design (50%)
  • Class 13 (2h) - Project hands-on
  • Class 14 (2h) - Project hands-on
  • Class 15 (2h) - Project hands-on
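
The Class 4 items on Amdahl's law and on timing routines with clock can be tried out with a small sketch like the one below. It is only an illustration: the names amdahl_speedup, saxpy and time_saxpy are hypothetical and are not part of the course material, and the workload is an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Amdahl's law: speed-up on n processors when a fraction p of the
   run time can be parallelized and the rest (1 - p) stays serial. */
double amdahl_speedup(double p, int n) {
  assert(p >= 0.0 && p <= 1.0 && n >= 1);
  return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Hypothetical workload to benchmark: y[i] = a * x[i] + y[i]. */
void saxpy(size_t len, float a, const float *x, float *y) {
  for (size_t i = 0; i < len; i++)
    y[i] = a * x[i] + y[i];
}

/* Processor time, in seconds, spent in `reps` calls to saxpy,
   measured with the standard clock() routine. */
double time_saxpy(size_t len, float a, const float *x, float *y, int reps) {
  clock_t start = clock();
  for (int r = 0; r < reps; r++)
    saxpy(len, a, x, y);
  clock_t stop = clock();
  return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

For instance, amdahl_speedup(0.9, 8) is about 4.7: with 10% of the run time serial, eight processors give well under an eightfold gain. Note that clock reports processor time used by the program, so repeating the measured call several times (the reps argument) helps when a single call is too short to resolve.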

Scratch space

//matrix.h
#ifndef MATRIX_H
#define MATRIX_H

// Dense matrix of floats stored row-major in one contiguous array.
typedef struct {
  float *m;  // N*M elements
  int N;     // number of rows (index i)
  int M;     // number of columns (index j)
} matrix;

extern matrix* mat_create(int N, int M);
extern void mat_init(matrix* mat, float val);
extern void mat_destroy(matrix* mat);
extern int mat_mult(matrix* a, matrix* b, matrix* c);
// Address of element (i, j) in the linear array.
static inline float* mat_idx(matrix* a, int i, int j) {
  return a->m + i * a->M + j;
}
extern void mat_print(matrix* t);

#endif

//matrix.c
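
One possible starting point for matrix.c is sketched below. It is not the reference solution: it assumes that N is the number of rows and M the number of columns (as the indexing in mat_idx suggests), that mat_mult computes c = a * b, and that it returns 0 on success and -1 on a shape mismatch. The struct and mat_idx are repeated from matrix.h only so that the sketch compiles on its own.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Repeated from matrix.h so this sketch is self-contained;
   in the exercise you would write #include "matrix.h" instead. */
typedef struct {
  float *m;
  int N; /* rows (assumed) */
  int M; /* columns (assumed) */
} matrix;

static inline float *mat_idx(matrix *a, int i, int j) {
  return a->m + i * a->M + j;
}

/* Allocate an N x M matrix; returns NULL if allocation fails. */
matrix *mat_create(int N, int M) {
  assert(N > 0 && M > 0);
  matrix *mat = malloc(sizeof *mat);
  if (mat == NULL) return NULL;
  mat->m = malloc((size_t)N * (size_t)M * sizeof *mat->m);
  if (mat->m == NULL) { free(mat); return NULL; }
  mat->N = N;
  mat->M = M;
  return mat;
}

/* Set every element to val. */
void mat_init(matrix *mat, float val) {
  for (int k = 0; k < mat->N * mat->M; k++)
    mat->m[k] = val;
}

/* Release the element array and the struct itself. */
void mat_destroy(matrix *mat) {
  if (mat != NULL) { free(mat->m); free(mat); }
}

/* c = a * b. Returns 0 on success, -1 if the shapes do not match
   (the return convention is an assumption of this sketch). */
int mat_mult(matrix *a, matrix *b, matrix *c) {
  if (a->M != b->N || c->N != a->N || c->M != b->M) return -1;
  for (int i = 0; i < a->N; i++)
    for (int j = 0; j < b->M; j++) {
      float sum = 0.0f;
      for (int k = 0; k < a->M; k++)
        sum += *mat_idx(a, i, k) * *mat_idx(b, k, j);
      *mat_idx(c, i, j) = sum;
    }
  return 0;
}

/* Print one row per line. */
void mat_print(matrix *t) {
  for (int i = 0; i < t->N; i++) {
    for (int j = 0; j < t->M; j++)
      printf("%g ", *mat_idx(t, i, j));
    printf("\n");
  }
}
```

The triple loop is the plain CPU version; the same linear index i * M + j is what a CUDA kernel would use to address the matrix in global memory.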

Projects 2010

  • Project A - Bayesian reconstruction of gene regulatory networks on GPU
    • David, Manu
  • Project B - Molecular dynamics code for Lennard-Jones fluids on GPU
    • Oscar, Xavier, Amadis
  • Project C - Partial least squares regression on GPU
    • Leonor, Oriol, Marta

Best course project 2009 - CUDA GAME OF LIFE

This is an implementation of Conway's Game of Life using parallel computing and NVIDIA CUDA. http://en.wikipedia.org/wiki/Conway%27s_Game_of_Life.

Conway's Game of Life Rules

These are the classic rules, as originally designed by the mathematician John Conway in 1970.

The game board is a 2D matrix. Each cell has two possible states, alive or dead, and its next state depends on the eight cells that surround it. At each iteration the fate of each cell is determined by the following rules:
  1. Any live cell with fewer than two live neighbours dies, as if by underpopulation.
  2. Any live cell with more than three live neighbours dies, as if by overcrowding.
  3. Any live cell with two or three live neighbours lives, unchanged, to the next generation.
  4. Any dead cell with exactly three live neighbours becomes a live cell.
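
The four rules above can be written down in a few lines of plain C. The sketch below is a CPU illustration only, not the project's CUDA code: the names life_neighbours and life_step are hypothetical, the board is assumed to be stored row-major with 1 for alive and 0 for dead, and cells beyond the edge are assumed to count as dead.

```c
#include <assert.h>

/* Count the live neighbours of cell (r, c) on a rows x cols board
   stored row-major (1 = alive, 0 = dead). Cells beyond the edge
   count as dead, which is an assumption of this sketch. */
int life_neighbours(const unsigned char *board, int rows, int cols,
                    int r, int c) {
  int count = 0;
  for (int dr = -1; dr <= 1; dr++)
    for (int dc = -1; dc <= 1; dc++) {
      int rr = r + dr, cc = c + dc;
      if ((dr != 0 || dc != 0) &&
          rr >= 0 && rr < rows && cc >= 0 && cc < cols)
        count += board[rr * cols + cc];
    }
  return count;
}

/* Compute one generation from `cur` into `next` (distinct buffers). */
void life_step(const unsigned char *cur, unsigned char *next,
               int rows, int cols) {
  assert(cur != next);
  for (int r = 0; r < rows; r++)
    for (int c = 0; c < cols; c++) {
      int n = life_neighbours(cur, rows, cols, r, c);
      int alive = cur[r * cols + c];
      /* Rules 1-3: a live cell survives only with 2 or 3 neighbours.
         Rule 4: a dead cell with exactly 3 neighbours becomes alive. */
      next[r * cols + c] = alive ? (n == 2 || n == 3) : (n == 3);
    }
}
```

Each cell's next state depends only on the current generation, so every cell can be updated independently; this is one reason the game maps naturally onto one GPU thread per cell.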

Game Controls

Zoom in: “+” or Mouse wheel up

Zoom out: “-” or Mouse wheel down

Up: W or mouse click

Down: S or mouse click

Left: A or mouse click

Right: D or mouse click

Run: Hold any other key.

Source code and executable: life_project.tgz

Copyright 2008-2009. All rights reserved.