# Lec 3 CUDA Software Abstraction Tonghua Su School of Software Harbin Institute of Technology #### **Outline** - Review Lec1 & 2 - Multithreading - 3 CUDA Abstraction - **Warp Scheduling** - **5** Lab 2 - **6** Software Layers in CUDA #### **Outline** - 1 Review Lec1 & 2 - Multithreading - 3 CUDA Abstraction - **Warp Scheduling** - **5** Lab 2 - **6** Software Layers in CUDA ``` Hello CUDA: Vector Sum addKernel (int * const a, const int * const int * const c) global 线程ID,同时索引数据元素 const unsigned int i = threadIdx.x; void main(){ int *dev a, *dev b, *dev c; // Allocate GPU buffers for three vectors (two input, one output) 分配显存 cudaMalloc((void**)&dev c, 128* sizeof(int)); // Copy input vectors from host memory to GPU buffers. 数据从主机复制到 cudaMemcpy(dev_a, a, 128* sizeof(int), cudaMemcpyHostToDevice); GPU cudaMemcpy(dev_b, b, 128* sizeof(int), cudaMemcpyHostToDevice); // Launch a kernel on the GPU with one thread for each element. 调用内核函数 addKernel addKernel<<<1, 128>>>(dev_c, dev_a, dev_b); 数据从GPU复制回 // Copy output vector from GPU buffer to host memory. 主机 cudaMemcpy(c, dev_c, 128* sizeof(int), cudaMemcpyDeviceToHost); cudaFree(dev_c); 释放显存 ``` Amdahl's Law $$Speedup = \frac{1}{r}$$ $$r_{s} + \frac{p}{N}$$ - Kepler GPU Architecture - building block is a "streaming multiprocessor" (SMX): - √ 192 cores and 64k registers - √ 64KB of shared memory / L1 cache - ✓ 8KB cache for constants - √ 48KB texture cache for read-only arrays - ✓ up to 2K threads per SMX Fermi GPU Architecture SM SM SMSMSM SM SML2 cache SM SM L1 cache / SM SM SMSM SM shared memory - Fermi GPU Architecture - older Fermi GPU has SM "streaming multiprocessor": - ✓ 32 cores and 32k registers - ✓ 64KB of shared memory / L1 cache - ✓ 8KB cache for constants - ✓ up to 1536 threads per SM #### **Outline** - Review Lec1 & 2 - **2** Multithreading - 3 CUDA Abstraction - **Warp Scheduling** - **5** Lab 2 - **6** Software Layers in CUDA # **Multithreading** - Key hardware feature is that the cores in an SMX are SIMT (Single Instruction Multiple Threads) cores: - ✓ all cores execute the same instructions simultaneously, but with different data - ✓ similar to vector computing on CRAY supercomputers - ✓ minimum of 32 threads all doing the same thing at (almost) the same time - ✓ natural for graphics processing and much scientific computing - ✓ SIMT is also a natural choice for many-core chips to simplify each core # **Multithreading** - Lots of active threads is the key to high performance: - ✓ no "context switching": each thread has its own registers (which limits the number of active threads) - ✓ threads on each SMX execute in groups of 32 called "warps" execution alternates between "active" warps, with warps becoming temporarily "inactive" when waiting for data ## **Multithreading** for each thread, one operation completes long before the next starts – avoids the complexity of pipeline overlaps which can limit the performance of modern processors memory access from device memory has a delay of 400-600 cycles; with 40 threads this is equivalent to 10-15 operations, so hopefully there's enough computation to hide the latency #### **Outline** - Review Lec1 & 2 - 2 Multithreading - **3 CUDA Abstraction** - Warp Scheduling - **5** Lab 2 - **6** Software Layers in CUDA #### **CUDA** - CUDA (Compute Unified Device Architecture) is NVIDIA's program development environment: - ✓ based on C with some extensions - ✓ extensive C++ support - ✓ FORTRAN support provided by PGI compiles lots of example code and good documentation - ✓ 2-4 week learning curve for those with experience of OpenMP and MPI programming - ✓ large user community on NVIDIA forums - CUDA virtualizes the physical hardware - ✓ thread is a virtualized scalar processor (registers, PC, state) - ✓ block is a virtualized multiprocessor (threads, shared mem.) - Scheduled onto physical hardware without pre-emption - √ threads/blocks launch & run to completion/suspension - ✓ blocks should be independent **Global Memory** - Key Parallel Abstractions in CUDA - ✓ Hierarchy of concurrent threads - ✓ Shared memory model for cooperating threads - ✓ Lightweight synchronization primitives #### Key Parallel Abstractions in CUDA ✓ Hierarchy of concurrent threads Tonghua Su, School of Software, Harbin Institute of Technology, China - Key Parallel Abstractions in CUDA - ✓ Hierarchy of concurrent threads - ✓ Shared memory model for cooperating threads Host #### • Each thread can: - ✓ Read/write per-thread registers - ✓ Read/write per-thread local memory - ✓ Read/write per-block shared memory - ✓ Read/write per-grid global memory - ✓ Read/only per-grid constant memory - Key Parallel Abstractions in CUDA - ✓ Hierarchy of concurrent threads - ✓ Shared memory model for cooperating threads - **✓** Lightweight synchronization primitives - Global Synchronization - ✓ Finish a kernel and start a new one - ✓ All writes from all threads complete before a kernel finishes ``` step1<<<grid1,blk1>>>(...); // The system ensures that all writes from step1 complete. step2<<<grid2,blk2>>>(...); ``` **✓** Would need to decompose kernels into before and after parts #### Threads Synchronization - ✓ To ensure the threads visit the shared memory in order - ✓ \_\_syncthreads() ``` global void adj diff(int *result, int *input) int tx = threadIdx.x; // allocate a shared array, one element per thread shared int s data[BLOCK SIZE]; // each thread reads one element to s data unsigned int i = blockDim.x * blockIdx.x + tx; s_data[tx] = input[i]; // avoid race condition: ensure all loads complete before continuing syncthreads(); if(tx > 0) result[i] = s_data[tx] - s_data[tx-1]; else if(i > 0) // handle thread block boundary result[i] = s data[tx] - input[i-1]; ``` #### Race Conditions - $\checkmark$ What is the value of a in thread 0? - $\checkmark$ What is the value of a in thread 127? ✓ CUDA provides atomic operations to deal with this problem #### Atomics - ✓ An atomic operation guarantees that only a single thread has access to a piece of memory while an operation completes - ✓ Different types of atomic instructions: - atomic{Add, Sub, Exch, Min, Max, Inc, Dec, CAS, And, Or, Xor} - ✓ Atomics are slower than normal load/store - ✓ You can have the whole machine queuing on a single location in memory. - ✓ More types in Fermi - ✓ Atomics unavailable on G80! #### Atomics #### **Outline** - Review Lec1 & 2 - 2 Multithreading - 3 CUDA Abstraction - **Warp Scheduling** - **5** Lab 2 - **6** Software Layers in CUDA • In its simplest form it looks like: kernel\_routine<<<gridDim, blockDim>>>(args); - ✓ gridDim is the number of instances of the kernel (the "grid" size) - ✓ blockDim is the number of threads within each instance (the "block" size) - ✓ args is a limited number of arguments, usually mainly pointers to arrays in graphics memory, and some constants which get copied by value - ✓ The more general form allows gridDim and blockDim to be 2D or 3D to simplify application programs 2D block and 2D grid 3D block and 3D grid • How to calculate global block ID and thread ID? - At a lower level, within the GPU: - ✓ each block of the execution kernel executes on an SMX - ✓ if the number of blocks exceeds the number of SMXs, then more than one will run at a time on each SMX if there are enough registers and shared memory, and the others will wait in a queue and execute later - ✓ all threads within one block can access local shared memory but can't see what the other block are doing (even if they are on the same SMX) - ✓ there are no guarantees on the order in which the blocks execute - Block Scheduling - ✓ Execute in warps of 32 threads Warp 1 Warp 2 Warp 3 Ready (Theads 32 (Theads 64 (Theads 96 Queue to 63) to 95) to 127) Warp 0 Executing (Theads 0 to 31) Suspended Memory Request Pending Scheduling Cycle 0 - Block Scheduling - ✓ Execute in warps of 32 threads - Block Scheduling - ✓ Execute in warps of 32 threads Ready Queue Executing Warp 3 Warp 2 Warp 1 Warp 0 Suspended (Theads 64 (Theads 96 (Theads 32 (Theads 0 to to 63) to 95) to 127) 31) Memory Address 0 to Address 32 Address 64 Address 96 Request to 127 31 to 63 to 95 Pending Scheduling Cycle 8 #### Block Scheduling ✓ Execute in warps of 32 threads Warp 1 Ready (Theads 32 Queue to 63) Warp 0 Executing (Theads 0 to 31) Warp 2 Warp 3 Suspended (Theads 64 (Theads 96 to 95) to 127) Memory Address 64 Address 96 Request to 95 to 127 Pending Scheduling Cycle 9 #### **Outline** - Review Lec1 & 2 - 2 Multithreading - 3 CUDA Abstraction - Warp Scheduling - **Lab 2.1** - **6** Software Layers in CUDA # **Lab 2.1 Warp Scheduling** #### ● 理解线程束的调度机制 - ✓ 验证warp的线程数量 - ✓ 加入计时功能,对warp的调度时间进行输出,并绘出散点图进行分析 - ✓ 变大block和grid的大小会如何? - ✔ 给出对线程束调度机制的理解 - ✓ 参见COOK 5.3 和WILT 7.3.3 #### **Outline** - Review Lec1 & 2 - 2 Multithreading - 3 CUDA Abstraction - **Warp Scheduling** - **5** Lab 2 - **6** Software Layers in CUDA Software Layer Tonghua Su, School of Software, Harbin Institute of Technology, China nvcc compiling