HPC COMPUTING WITH
CUDA AND TESLA HARDWARE
  Timothy Lanfear, NVIDIA
WHAT IS GPU COMPUTING?



What is GPU Computing?




[Diagram: x86 CPU connected to the GPU over the PCIe bus]

Computing with CPU + GPU
Heterogeneous Computing
Low Latency or High Throughput?

CPU
  Optimised for low-latency access to cached data sets
  Control logic for out-of-order and speculative execution
GPU
  Optimised for data-parallel, throughput computation
  Architecture tolerant of memory latency
  More transistors dedicated to computation

[Diagram: CPU die dominated by control logic and cache with a few ALUs; GPU die dominated by ALUs; both attached to DRAM]
Tesla C1060 Computing Processor
Processor:                  1 x Tesla T10
Number of cores:            240
Core clock:                 1.296 GHz
Floating-point performance: 933 GFLOPS single precision, 78 GFLOPS double precision
On-board memory:            4.0 GB
Memory bandwidth:           102 GB/sec peak
Memory I/O:                 512-bit, 800 MHz GDDR3
Form factor:                Full ATX, 4.736" x 10.5", dual slot wide
System I/O:                 PCIe x16 Gen2
Typical power:              160 W
Tesla Streaming Multiprocessor (SM)
SM has 8 Thread Processors (SP)
  IEEE 754 32-bit floating point
  32-bit and 64-bit integer
  16K 32-bit registers
SM has 2 Special Function Units (SFU)
SM has 1 Double Precision Unit (DP)
  IEEE 754 64-bit floating point
  Fused multiply-add
Multithreaded Instruction Unit
  1024 threads, hardware multithreaded
  32 single-instruction, multi-thread warps of 32 threads
  Independent thread execution
  Hardware thread scheduling
16 KB Shared Memory
  Concurrent threads share data
  Low latency load/store
Computing with Tesla
240 SP processors at 1.5 GHz: 1 TFLOPS peak
128 threads per processor: 30,720 threads total
Tesla PCI-e board: C1060 (1 GPU)
1U server: S1070 (4 GPUs)

[Diagram: Tesla T10 block diagram. The host CPU and system memory connect through a bridge to the GPU's work distribution unit and its SMs (each SM: I-cache, MT issue, C-cache, 8 SPs, 2 SFUs, a DP unit and shared memory); SMs reach DRAM through the interconnection network and 8 ROP/L2 partitions]
CUDA ARCHITECTURE



CUDA Parallel Computing Architecture

Parallel computing architecture and programming model

Includes a C compiler plus support for OpenCL and DirectCompute

Architected to natively support multiple computational interfaces (standard languages and APIs)
NVIDIA CUDA C and OpenCL


CUDA C: entry point for developers who prefer high-level C
OpenCL: entry point for developers who want a low-level API
Both feed a shared back-end compiler and optimization technology (PTX) targeting the GPU
CUDA PROGRAMMING MODEL



GPU is a Multi-threaded Coprocessor

The GPU is a compute device that:
  is a coprocessor to the CPU (host)
  has its own memory space
  runs many threads in parallel
Data-parallel portions of the application are executed on the device as multi-threaded kernels
GPU threads are very lightweight; thousands are needed for full efficiency
Heterogeneous Execution


Serial Code (CPU)
Parallel Kernel (GPU):  KernelA<<< nBlk, nTid >>>(args);
Serial Code (CPU)
Parallel Kernel (GPU):  KernelB<<< nBlk, nTid >>>(args);
Processing Flow
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory

[Diagram: the host CPU and CPU memory connect over the PCI bus to the GPU's work distribution unit and SM clusters (I-cache, MT issue, C-cache, SPs, SFUs, shared memory, texture units), which reach DRAM through the interconnection network and ROP/L2 partitions]
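The three steps map directly onto runtime calls. A minimal sketch of the flow in CUDA C (the scale kernel, the size N and the scale factor are illustrative, not from the deck):

#include <stdio.h>
#include <cuda_runtime.h>

// Illustrative kernel: scale each element in place
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main(void)
{
    const int N = 1024;
    size_t size = N * sizeof(float);
    float h_data[N];
    for (int i = 0; i < N; ++i) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc((void**)&d_data, size);

    // 1. Copy input data from CPU memory to GPU memory
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);

    // 2. Load GPU program and execute
    scale<<<(N + 255) / 256, 256>>>(d_data, 2.0f, N);

    // 3. Copy results from GPU memory to CPU memory
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    printf("h_data[1] = %f\n", h_data[1]);
    return 0;
}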
Execution Model
A kernel is executed by a grid, which contains blocks; these blocks contain threads

A thread block is a batch of threads that can cooperate:
  Sharing data through shared memory
  Synchronizing their execution

Threads from different blocks operate independently

[Diagram: the host launches Kernel 1 on Grid 1 (3 x 2 blocks) and Kernel 2 on Grid 2; Block (1,1) is expanded into a 5 x 3 array of threads]
Thread Blocks: Scalable Cooperation

           Divide monolithic thread array into multiple blocks
                            Threads within a block cooperate via shared memory
                            Threads in different blocks cannot cooperate
           Enables programs to transparently scale to any number of processors
Thread Block 0, Thread Block 1, ..., Thread Block N - 1 each run the same code over their own threadIDs 0..7:

  float x = input[threadID];
  float y = func(x);
  output[threadID] = y;
Transparent Scalability

  Hardware is free to schedule thread blocks on any processor


[Diagram: the same kernel grid of Blocks 0-7 runs as four rows of two blocks on a two-SM device, or as two rows of four blocks on a four-SM device]
Reason for blocks: GPU scalability
[Die photos: G84 (32 cores), G80 (128 cores), GT200 (240 SP cores)]
Hierarchy of Threads
Thread
  Computes result elements
  Thread id number
Thread Block
  Runs on one SM, shared memory
  1 to 512 threads per block
  Block id number
Grid of Blocks
  Holds complete computation task
  One to many blocks per grid
Sequential Grids
  Compute sequential problem steps
CUDA Memory
Thread: private local memory
Block: shared memory, with a local barrier for the block's threads
Sequential grids in time: global memory, with a global barrier between grids
Memory Spaces


   Memory                   Location   Cached   Access   Scope                    Lifetime
   Register                 On-chip    N/A      R/W      One thread               Thread
   Local                    Off-chip   No       R/W      One thread               Thread
   Shared                   On-chip    N/A      R/W      All threads in a block   Block
   Global                   Off-chip   No       R/W      All threads + host       Application
   Constant                 Off-chip   Yes      R        All threads + host       Application
   Texture                  Off-chip   Yes      R        All threads + host       Application


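To make the table concrete, a sketch of how these spaces typically appear in CUDA C (the names spaces, coeffs, table, tile and scratch are illustrative; texture memory is omitted since it is bound through the host API rather than declared in the kernel):

__constant__ float coeffs[16];      // constant memory: cached, read-only on the GPU
__device__   float table[256];      // global memory: visible to all threads and the host

// Assumes a launch of <<<1, 64>>> for simplicity
__global__ void spaces(float *out)  // out points to global memory
{
    __shared__ float tile[64];      // shared memory: one copy per block, on-chip
    int i = threadIdx.x;            // scalar locals live in registers
    float scratch[32];              // large or indexed local arrays may be
                                    // placed in off-chip local memory
    tile[i] = table[i];             // global -> shared
    __syncthreads();
    scratch[i % 32] = tile[i] + coeffs[i % 16];
    out[i] = scratch[i % 32];
}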
Compiling CUDA C
NVCC separates a combined CPU-GPU source file (the CUDA C application) into two streams:
  CUDA C kernels, compiled by the CUDA compiler into CUDA object files
  The rest of the C application, compiled by the host CPU compiler into CPU object files
The linker combines both sets of object files into a single CPU-GPU executable
Compilation Commands

nvcc <filename>.cu [-o <executable>]
  Builds release mode
nvcc -g <filename>.cu
  Builds debug (device) mode
  Can debug host code but not device code (which runs on the GPU)
nvcc -deviceemu <filename>.cu
  Builds device emulation mode
  All code runs on CPU, but no debug symbols
nvcc -deviceemu -g <filename>.cu
  Builds debug device emulation mode
  All code runs on CPU, with debug symbols
  Debug using gdb or another Linux debugger
Exercise 0: Run a Simple Program
Log on to the test system
Compile and run the pre-written CUDA program deviceQuery:

CUDA Device Query (Runtime API) version (CUDART static linking)
There are 2 devices supporting CUDA

Device 0: "Quadro FX 570M"
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         1
  Total amount of global memory:                 268107776 bytes
  Number of multiprocessors:                     4
  Number of cores:                               32
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    0.95 GHz
  Concurrent copy and execution:                 Yes

Device 1: "Tesla C1060"
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         3
  Total amount of global memory:                 4294705152 bytes
  Number of multiprocessors:                     30
  Number of cores:                               240
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.44 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Exclusive (only one host thread at a time can use this device)

Test PASSED
Press ENTER to exit...
CUDA C PROGRAMMING LANGUAGE



CUDA C — C with Runtime Extensions
           Device management:
           cudaGetDeviceCount(), cudaGetDeviceProperties()

           Device memory management:
           cudaMalloc(), cudaFree(), cudaMemcpy()

           Texture management:
           cudaBindTexture(), cudaBindTextureToArray()

           Graphics interoperability:
           cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer()

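A short sketch using the device-management calls above (error checking omitted for brevity):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("%d CUDA device(s) found\n", count);

    // Print name and compute capability of each device
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: \"%s\", compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}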
CUDA C — C with Language Extensions
   Function qualifiers
         __global__ void MyKernel() {}        // call from host, execute on GPU
         __device__ float MyDeviceFunc() {}   // call from GPU, execute on GPU
         __host__   int HostFunc() {}         // call from host, execute on host

   Variable qualifiers
         __device__   float MyGPUArray[32];    // in GPU memory space
         __constant__ float MyConstArray[32]; // write by host; read by GPU
         __shared__   float MySharedArray[32]; // shared within thread block

   Built-in vector types
         int1, int2, int3, int4
         float1, float2, float3, float4
         double1, double2
         etc.



CUDA C — C with Language Extensions
   Execution configuration
         dim3 dimGrid(100, 50);          // 5000 thread blocks
         dim3 dimBlock(4, 8, 8);         // 256 threads per block
         MyKernel <<< dimGrid, dimBlock >>> (...); // Launch kernel

   Built-in variables and functions valid in device code:
         dim3               gridDim;           //   Grid dimension
         dim3               blockDim;          //   Block dimension
         dim3               blockIdx;          //   Block index
         dim3               threadIdx;         //   Thread index
         void               __syncthreads();   //   Thread synchronization




SAXPY: Device Code

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];                    // Standard C code
}

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];             // Parallel CUDA C code
}
SAXPY: Host Code
// Allocate two N-vectors h_x and h_y
int size = N * sizeof(float);
float* h_x = (float*)malloc(size);
float* h_y = (float*)malloc(size);
// Initialize them...

// Allocate device memory
float* d_x; float* d_y;
cudaMalloc((void**)&d_x, size);
cudaMalloc((void**)&d_y, size);

// Copy host memory to device memory
cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice);

// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (N + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(N, 2.0f, d_x, d_y);

// Copy result back from device memory to host memory
cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost);
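For completeness, the device and host allocations would normally be released once the result is back on the host; a minimal sketch:

// Free device and host memory
cudaFree(d_x);
cudaFree(d_y);
free(h_x);
free(h_y);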
Exercise 1: Move Data between Host and GPU

Start from the “cudaMallocAndMemcpy” template.
Part 1: Allocate memory for pointers d_a and d_b on the device.
Part 2: Copy h_a on the host to d_a on the device.
Part 3: Do a device-to-device copy from d_a to d_b.
Part 4: Copy d_b on the device back to h_a on the host.
Part 5: Free d_a and d_b on the device.
Bonus: Experiment with cudaMallocHost in place of malloc for allocating h_a.
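One possible solution sketch, assuming an int array of 256 elements (the size and element type are illustrative, not part of the template; error checking omitted):

#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const int N = 256;                     // illustrative size
    size_t size = N * sizeof(int);
    int *h_a = (int*)malloc(size);
    for (int i = 0; i < N; ++i) h_a[i] = i;

    int *d_a, *d_b;
    cudaMalloc((void**)&d_a, size);        // Part 1
    cudaMalloc((void**)&d_b, size);

    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);   // Part 2
    cudaMemcpy(d_b, d_a, size, cudaMemcpyDeviceToDevice); // Part 3
    cudaMemcpy(h_a, d_b, size, cudaMemcpyDeviceToHost);   // Part 4

    cudaFree(d_a);                         // Part 5
    cudaFree(d_b);
    free(h_a);
    return 0;
}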
Launching a Kernel
Call a kernel with
  Func <<<Dg, Db, Ns, S>>> (params);
  dim3 Dg(mx, my, 1);   // grid spec
  dim3 Db(nx, ny, nz);  // block spec
  size_t Ns;            // shared memory
  cudaStream_t S;       // CUDA stream

Execution configuration is passed to the kernel through built-in variables
  dim3 gridDim, blockDim, blockIdx, threadIdx;
Extract components with threadIdx.x, threadIdx.y, threadIdx.z, etc.

[Diagram: a grid of 3 x 2 blocks; Block (1,1) is expanded into a 5 x 3 array of threads]
Exercise 2: Launching Kernels

Start from the “myFirstKernel” template.
Part 1: Allocate device memory for the result of the kernel using pointer d_a.
Part 2: Configure and launch the kernel using a 1-D grid of 1-D thread blocks.
Part 3: Have each thread set an element of d_a as follows:
  idx = blockIdx.x*blockDim.x + threadIdx.x
  d_a[idx] = 1000*blockIdx.x + threadIdx.x
Part 4: Copy the result in d_a back to the host pointer h_a.
Part 5: Verify that the result is correct.
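One possible solution sketch (the grid and block sizes are illustrative, and naming the kernel after the template is an assumption):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void myFirstKernel(int *d_a)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;  // Part 3
    d_a[idx] = 1000*blockIdx.x + threadIdx.x;
}

int main(void)
{
    const int numBlocks = 8, numThreads = 64;       // illustrative configuration
    const int N = numBlocks * numThreads;
    size_t size = N * sizeof(int);

    int *d_a;
    cudaMalloc((void**)&d_a, size);                 // Part 1
    myFirstKernel<<<numBlocks, numThreads>>>(d_a);  // Part 2

    int *h_a = (int*)malloc(size);
    cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);  // Part 4

    for (int i = 0; i < N; ++i)                     // Part 5
        if (h_a[i] != 1000*(i/numThreads) + i%numThreads)
            printf("Mismatch at %d\n", i);

    cudaFree(d_a); free(h_a);
    return 0;
}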
Exercise 3: Reverse Array, Single Block

Given an input array {a0, a1, ..., an-1} in pointer d_a, store the reversed array {an-1, an-2, ..., a0} in pointer d_b
Start from the “reverseArray_singleblock” template
Only one thread block is launched, to reverse an array of size N = numThreads = 256 elements
Part 1 (of 1): all you have to do is implement the body of the kernel reverseArrayBlock()
Each thread moves a single element to its reversed position
  Read input from the d_a pointer
  Store output in the reversed location in the d_b pointer
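One possible sketch of the kernel body (the argument order d_b, d_a is an assumption about the template):

// Single-block reverse: thread i writes element i of d_a
// to position N-1-i of d_b.
__global__ void reverseArrayBlock(int *d_b, int *d_a)
{
    int in  = threadIdx.x;
    int out = blockDim.x - 1 - threadIdx.x;
    d_b[out] = d_a[in];
}

// Launch: reverseArrayBlock<<<1, 256>>>(d_b, d_a);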
Exercise 4: Reverse Array, Multi-Block

Given an input array {a0, a1, ..., an-1} in pointer d_a, store the reversed array {an-1, an-2, ..., a0} in pointer d_b
Start from the “reverseArray_multiblock” template
Multiple 256-thread blocks are launched; to reverse an array of size N, N/256 blocks
Part 1: Compute the number of blocks to launch
Part 2: Implement the kernel reverseArrayBlock()
Note that now you must compute both
  The reversed location within the block
  The reversed offset to the start of the block
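One possible sketch of the multi-block kernel (again assuming the template passes d_b then d_a, and that N is a multiple of 256):

// Multi-block reverse: block b reads its tile from the front of d_a
// and writes it, element-reversed, to the mirrored tile of d_b.
__global__ void reverseArrayBlock(int *d_b, int *d_a)
{
    int inOffset  = blockDim.x * blockIdx.x;                  // start of this block's tile
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x); // mirrored tile
    int in  = inOffset  + threadIdx.x;
    int out = outOffset + (blockDim.x - 1 - threadIdx.x);      // reversed within the tile
    d_b[out] = d_a[in];
}

// Part 1: numBlocks = N / 256, then
// reverseArrayBlock<<<N / 256, 256>>>(d_b, d_a);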
PERFORMANCE CONSIDERATIONS



Single-Instruction, Multiple-Thread Execution
Warp: a set of 32 parallel threads that execute together in single-instruction, multiple-thread (SIMT) mode on a streaming multiprocessor (SM)
SM hardware implements zero-overhead warp and thread scheduling
Threads can execute independently
A SIMT warp diverges and reconverges when threads branch independently
Best efficiency and performance when the threads of a warp execute together: there is no penalty if all threads in a warp take the same execution path
Each SM executes up to 1024 concurrent threads, as 32 SIMT warps of 32 threads
Global Memory


Off-chip global memory is not cached

[Diagram: Tesla T10 block diagram, as before; SMs reach DRAM through the interconnection network and ROP/L2 partitions]
Efficient Access to Global Memory
A single memory transaction (coalescing) serves certain memory addressing patterns:
  Linear pattern
  Not all threads need participate
  Any address within the 128-byte segment is OK

[Diagram: 16 threads (a half-warp) accessing a 128-byte block of global memory]
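A sketch contrasting a coalesced with a strided access pattern (illustrative kernels, not from the deck):

// Coalesced: consecutive threads read consecutive words, so a half-warp's
// loads fall in one 128-byte segment and become one transaction.
__global__ void copyCoalesced(float *out, const float *in, int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads touch addresses `stride` words apart, so
// each half-warp spans many segments and issues many transactions.
__global__ void copyStrided(float *out, const float *in, int n, int stride)
{
    int i = (blockIdx.x*blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}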
Shared Memory

More than 1 TB/sec aggregate memory bandwidth
Use it
  As a cache
  To reorganize global memory accesses into a coalesced pattern
  To share data between threads
16 KB per SM
Shared Memory Bank Conflicts

Successive 32-bit words are assigned to different banks
Simultaneous access to the same bank by threads in a half-warp causes a conflict and serializes access

Conflict-free: linear access pattern, any permutation, broadcast (all threads read one address)
Conflict: e.g. stride-8 access, where threads 0..15 map only to banks 0 and 8

[Diagram: threads 0-15 connected to banks 0-15]
Matrix Transpose
             Access columns of a tile in shared memory to write
             contiguous data to global memory
             Requires __syncthreads() since threads write data read by
             other threads
             Pad shared memory array to avoid bank conflicts


[Diagram: a tile of idata is loaded into shared memory, then written transposed to odata]
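A sketch of such a tiled transpose kernel; the +1 padding column and the __syncthreads() follow the advice above, while TILE_DIM and the bounds checks are illustrative choices:

#define TILE_DIM 16

// Transpose a width x height matrix in 16x16 tiles. The extra column
// (+1) pads the shared array so column accesses hit different banks.
__global__ void transpose(float *odata, const float *idata,
                          int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = idata[y * width + x];

    __syncthreads();   // all writes to the tile must finish before reads

    // Swap block indices so writes to odata are contiguous (coalesced)
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < height && y < width)
        odata[y * height + x] = tile[threadIdx.x][threadIdx.y];
}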
Matrix Transpose

           There are further optimisations: see the New Matrix Transpose
           SDK example.




OTHER GPU MEMORIES



Texture Memory

Texture is an object for reading data
Data is cached
Host actions
  Allocate memory on the GPU
  Create a texture memory reference object
  Bind the texture object to memory
  Clean up after use
GPU actions
  Fetch using texture references: tex1Dfetch(), tex1D(), tex2D(), tex3D()
Constant Memory

Written by the host, read by the GPU
Data is cached
Useful for tables of constants
EXECUTION CONFIGURATION



Execution Configuration
vectorAdd <<< BLOCKS, THREADS_PER_BLOCK >>> (N, 2.0, d_x, d_y);

How many blocks?
  At least one block per SM to keep every SM occupied
  At least two blocks per SM so something can run if a block is waiting for a synchronization to complete
  Many blocks for scalability to larger and future GPUs
How many threads?
  At least 192 threads per SM to hide the read-after-write latency of 11 cycles, not necessarily in the same block (e.g. z = x + 3 cannot issue until x = y + 5 has completed)
  Use many threads to hide global memory latency
  Too many threads exhausts registers and shared memory
  Thread count should be a multiple of the warp size
  Typically, between 64 and 256 threads per block
Occupancy Calculator

occupancy = (blocks per SM × threads per block) / (maximum threads per SM)

For example, 4 resident blocks of 256 threads on an SM that supports a maximum of 1024 threads gives an occupancy of 100%.

The occupancy calculator shows trade-offs between thread count, register use and shared memory use
Low occupancy is bad
Increasing occupancy doesn't always help
DEBUGGING AND PROFILING



Debugging

nvcc flags
  --debug (-g)
    Generate debug information for host code
  --device-debug <level> (-G <level>)
    Generate debug information for device code, and also specify the optimisation level for the device code to control its 'debuggability'. Allowed values: 0, 1, 2, 3
Debug with
  cuda-gdb a.out
Usual gdb commands are available
Debugging

           Additional commands in cuda-gdb
                            thread — Display the current host and CUDA thread of focus.
                            thread <<<(TX,TY,TZ)>>> — Switch to the CUDA thread at specified
                            coordinates
                            thread <<<(BX,BY),(TX,TY,TZ)>>> — Switch to the CUDA block and thread at
                            specified coordinates
                            info cuda threads — Display a summary of all CUDA threads that are
                            currently resident on the GPU
                            info cuda threads all — Display a list of each CUDA thread that is currently
                            resident on the GPU
                            info cuda state — Display information about the current CUDA state.
next and step advance all threads in a warp, except at __syncthreads(), where all warps continue to an implicit barrier following the sync
CUDA Visual Profiler

           cudaprof
           Documentation in $CUDA/cudaprof/doc/cudaprof.html




CUDA Visual Profiler

           Open a new project
           Select session settings through dialogue
           Execute CUDA program by clicking Start button
           Various views of collected data available
           Results of different runs stored in sessions for easy comparison
           Project can be saved




MISCELLANEOUS TOPICS



Expensive Operations

32-bit multiply; __mul24() and __umul24() are fast 24-bit multiplies
sin(), exp(), etc.; faster, less accurate intrinsic versions are __sinf(), __expf(), etc.
Integer division and modulo; avoid if possible; replace with bit-shift operations for powers of 2
Branching where threads of a warp take differing paths of control flow
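A short sketch of these replacements (illustrative kernel; single precision assumed):

// Illustrative kernel contrasting expensive operations with cheaper forms
__global__ void fastOps(float *out, const float *in, int n)
{
    // 24-bit multiply is faster than a full 32-bit multiply on Tesla GPUs
    int i = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
    if (i < n) {
        int q = i >> 4;                 // instead of i / 16
        int r = i & 15;                 // instead of i % 16
        // __sinf/__expf: fast, less accurate intrinsics
        out[i] = __sinf(in[i]) + __expf(in[i]) + (float)(q + r);
    }
}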
Host to GPU Data Transfers

PCI Express Gen2, 8 GB/sec peak
Use page-locked (pinned) memory for maximum bandwidth between GPU and host
Host-to-GPU and GPU-to-host transfers can overlap with computation on both the host and the GPU
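A minimal sketch of a pinned allocation (the buffer size is illustrative; the asynchronous copy is shown as a comment because it needs a device buffer and stream):

#include <cuda_runtime.h>

int main(void)
{
    float *h_buf;
    size_t size = 1 << 20;   // 1 MB, illustrative

    // Page-locked (pinned) host allocation: enables maximum PCIe
    // bandwidth and asynchronous copies that overlap with computation
    cudaMallocHost((void**)&h_buf, size);

    // ... fill h_buf, then e.g.
    // cudaMemcpyAsync(d_buf, h_buf, size, cudaMemcpyHostToDevice, stream);

    cudaFreeHost(h_buf);
    return 0;
}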
Application Software (written in C)
  CUDA libraries: cuFFT, cuBLAS, cuDPP
  CUDA compiler: C, Fortran
  CUDA tools: debugger, profiler
  Hardware: CPU (4 cores), 1U PCI-E switch, GPU (240 cores)
On-line Course

           Programming Massively Parallel Processors, Wen-Mei Hwu,
           University of Illinois at Urbana-Champaign
           http://courses.ece.illinois.edu/ece498/al/
           PowerPoint slides, MP3 recordings of lectures, draft of textbook
           by Wen-Mei Hwu and David Kirk (NVIDIA)




CUDA Zone: www.nvidia.com/CUDA
CUDA Toolkit
  Compiler
  Libraries

CUDA SDK
  Code samples

CUDA Profiler

Forums

Resources for CUDA developers

 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 

Cuda tutorial

  • 1. HPC COMPUTING WITH CUDA AND TESLA HARDWARE Timothy Lanfear, NVIDIA
  • 2. WHAT IS GPU COMPUTING? © NVIDIA Corporation 2009
  • 3. What is GPU Computing? x86 PCIe bus GPU Computing with CPU + GPU Heterogeneous Computing © NVIDIA Corporation 2009
  • 4. Low Latency or High Throughput? CPU: optimised for low-latency access to cached data sets; control logic for out-of-order and speculative execution. GPU: optimised for data-parallel, throughput computation; architecture tolerant of memory latency; more transistors dedicated to computation. [Diagram: CPU die dominated by control logic and cache over DRAM; GPU die dominated by ALUs over DRAM.] © NVIDIA Corporation 2009
  • 5. Tesla C1060 Computing Processor: Processor: 1 x Tesla T10; Number of cores: 240; Core clock: 1.296 GHz; Floating-point performance: 933 GFlops single precision, 78 GFlops double precision; On-board memory: 4.0 GB; Memory bandwidth: 102 GB/sec peak; Memory I/O: 512-bit, 800 MHz GDDR3; Form factor: full ATX, 4.736" x 10.5", dual slot wide; System I/O: PCIe x16 Gen2; Typical power: 160 W. © NVIDIA Corporation 2009
  • 6. Tesla Streaming Multiprocessor (SM) SM has 8 Thread Processors (SP) IEEE 754 32-bit floating point 32-bit and 64-bit integer 16K 32-bit registers SM has 2 Special Function Units (SFU) SM has 1 Double Precision Unit (DP) IEEE 754 64-bit floating point Fused multiply-add Multithreaded Instruction Unit 1024 threads, hardware multithreaded 32 single-instruction, multi-thread warps of 32 threads Independent thread execution Hardware thread scheduling 16KB Shared Memory Concurrent threads share data Low latency load/store © NVIDIA Corporation 2009
  • 7. Computing with Tesla: 240 SP processors at 1.5 GHz, 1 TFLOPS peak; 128 threads per processor, 30,720 threads total. Tesla PCI-e board: C1060 (1 GPU); 1U server: S1070 (4 GPUs). [Block diagram: host CPU and system memory connect through a bridge to the Tesla T10; work distribution feeds the SMs (I-cache, MT issue, C-cache, SPs, SFUs, DP unit, shared memory); an interconnection network links them to ROP/L2 units and DRAM channels.] © NVIDIA Corporation 2009
  • 8. CUDA ARCHITECTURE © NVIDIA Corporation 2009
  • 9. CUDA Parallel Computing Architecture Parallel computing architecture and programming model Includes a C compiler plus support for OpenCL and DirectCompute Architected to natively support multiple computational interfaces (standard languages and APIs) © NVIDIA Corporation 2009
  • 10. NVIDIA CUDA C and OpenCL: CUDA C is the entry point for developers who prefer high-level C; OpenCL is the entry point for developers who want a low-level API. Both share the same back-end compiler and optimisation technology, targeting PTX for the GPU. © NVIDIA Corporation 2009
  • 11. CUDA PROGRAMMING MODEL © NVIDIA Corporation 2009
  • 12. GPU is a Multi-threaded Coprocessor. The GPU is a compute device that is a coprocessor to the CPU (host), has its own memory space, and runs many threads in parallel. Data-parallel portions of the application are executed on the device as multi-threaded kernels. GPU threads are very lightweight; thousands are needed for full efficiency. © NVIDIA Corporation 2009
  • 13. Heterogeneous Execution Serial Code (CPU) Parallel Kernel (GPU) ... KernelA<<< nBlk, nTid >>>(args); Serial Code (CPU) Parallel Kernel (GPU) KernelB<<< nBlk, nTid >>>(args); ... © NVIDIA Corporation 2009
  • 14. Processing Flow: 1. Copy input data from CPU memory to GPU memory (over the PCI bus). 2. Load GPU program and execute, caching data on chip for performance. 3. Copy results from GPU memory to CPU memory. [Diagram: host CPU and CPU memory connect via a bridge and the PCI bus to the GPU's work distribution unit; SMCs with geometry controllers feed SMs (I-cache, MT issue, C-cache, SPs, SFUs, shared memory) and texture units with L1 caches; an interconnection network links them to ROP/L2 and DRAM.] © NVIDIA Corporation 2009
  • 15. Execution Model: A kernel is executed by a grid, which contains blocks; these blocks contain threads. A thread block is a batch of threads that can cooperate, sharing data through shared memory and synchronizing their execution. Threads from different blocks operate independently. [Diagram: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is a 2-D array of blocks, and each block, e.g. Block (1, 1), holds a 5 x 3 array of threads.] © NVIDIA Corporation 2009
  • 16. Thread Blocks: Scalable Cooperation. Divide a monolithic thread array into multiple blocks. Threads within a block cooperate via shared memory; threads in different blocks cannot cooperate. This enables programs to transparently scale to any number of processors. [Diagram: Thread Block 0, Thread Block 1, ..., Thread Block N - 1, each with threadID 0..7 running the same body: float x = input[threadID]; float y = func(x); output[threadID] = y;] © NVIDIA Corporation 2009
  • 17. Transparent Scalability: Hardware is free to schedule thread blocks on any processor. [Diagram: the same kernel grid of Blocks 0-7 runs on a two-SM device as four waves of two blocks, and on a four-SM device as two waves of four blocks.] © NVIDIA Corporation 2009
  • 18. Reason for blocks: GPU scalability G80: 128 Cores G84: 32 Cores GT200: 240 SP Cores © NVIDIA Corporation 2009
  • 19. Hierarchy of Threads. Thread: computes result elements; has a thread id number. Thread block (threads t0, t1, t2, ..., tm): runs on one SM with shared memory; 1 to 512 threads per block; has a block id number. Grid of blocks (Bl. 0, Bl. 1, Bl. 2, ..., Bl. n): holds the complete computation task; one to many blocks per grid. Sequential grids: compute sequential problem steps. © NVIDIA Corporation 2009
  • 20. CUDA Memory. [Diagram: each thread has local memory; each block has shared memory with a local barrier; sequential grids in time (Grid 0, Grid 1, ...) communicate through global memory, with a global barrier between grids.] © NVIDIA Corporation 2009
  • 21. Memory Spaces
Memory   | Location | Cached | Access | Scope                  | Lifetime
Register | On-chip  | N/A    | R/W    | One thread             | Thread
Local    | Off-chip | No     | R/W    | One thread             | Thread
Shared   | On-chip  | N/A    | R/W    | All threads in a block | Block
Global   | Off-chip | No     | R/W    | All threads + host     | Application
Constant | Off-chip | Yes    | R      | All threads + host     | Application
Texture  | Off-chip | Yes    | R      | All threads + host     | Application
© NVIDIA Corporation 2009
  • 22. Compiling CUDA C. [Diagram: NVCC splits a combined CPU-GPU application into CUDA C kernels and the rest of the C application; CUDACC compiles the kernels to CUDA object files, the CPU compiler produces CPU object files, and the linker combines both into a CPU-GPU executable.] © NVIDIA Corporation 2009
  • 23. Compilation Commands.
nvcc <filename>.cu [-o <executable>] : builds release mode.
nvcc -g <filename>.cu : builds debug (device) mode; can debug host code but not device code (which runs on the GPU).
nvcc -deviceemu <filename>.cu : builds device emulation mode; all code runs on the CPU, but no debug symbols.
nvcc -deviceemu -g <filename>.cu : builds debug device emulation mode; all code runs on the CPU with debug symbols; debug using gdb or another Linux debugger.
© NVIDIA Corporation 2009
  • 24. Exercise 0: Run a Simple Program. Log on to the test system; compile and run the pre-written CUDA program deviceQuery. [The slide interleaves two sample outputs of the CUDA Device Query (runtime API, CUDART static linking): a Quadro FX 570M (CUDA capability 1.1, 268107776 bytes global memory, 4 multiprocessors / 32 cores, 8192 registers per block, clock rate 0.95 GHz) and a Tesla C1060 (CUDA capability 1.3, 4294705152 bytes global memory, 30 multiprocessors / 240 cores, 16384 registers per block, clock rate 1.44 GHz, run time limit on kernels: no, compute mode: exclusive). Both report 65536 bytes constant memory, 16384 bytes shared memory per block, warp size 32, maximum 512 threads per block, maximum block dimensions 512 x 512 x 64, maximum grid dimensions 65535 x 65535 x 1, memory pitch 262144 bytes, texture alignment 256 bytes, concurrent copy and execution: yes, and Test PASSED.] © NVIDIA Corporation 2009
  • 25. CUDA C PROGRAMMING LANGUAGE © NVIDIA Corporation 2009
  • 26. CUDA C — C with Runtime Extensions Device management: cudaGetDeviceCount(), cudaGetDeviceProperties() Device memory management: cudaMalloc(), cudaFree(), cudaMemcpy() Texture management: cudaBindTexture(), cudaBindTextureToArray() Graphics interoperability: cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer() © NVIDIA Corporation 2009
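As a sketch of the device-management calls above (all standard CUDA runtime API; the printed fields are illustrative), a minimal program that enumerates the CUDA devices:

    #include <stdio.h>

    int main(void)
    {
        int deviceCount;
        cudaGetDeviceCount(&deviceCount);          // how many CUDA devices?
        for (int dev = 0; dev < deviceCount; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);   // fill in the property struct
            printf("Device %d: %s with %d multiprocessors\n",
                   dev, prop.name, prop.multiProcessorCount);
        }
        return 0;
    }

Compile with nvcc and run; the output is a cut-down version of what deviceQuery prints in Exercise 0.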
  • 27. CUDA C — C with Language Extensions Function qualifiers __global__ void MyKernel() {} // call from host, execute on GPU __device__ float MyDeviceFunc() {} // call from GPU, execute on GPU __host__ int HostFunc() {} // call from host, execute on host Variable qualifiers __device__ float MyGPUArray[32]; // in GPU memory space __constant__ float MyConstArray[32]; // write by host; read by GPU __shared__ float MySharedArray[32]; // shared within thread block Built-in vector types int1, int2, int3, int4 float1, float2, float3, float4 double1, double2 etc. © NVIDIA Corporation 2009
  • 28. CUDA C — C with Language Extensions Execution configuration dim3 dimGrid(100, 50); // 5000 thread blocks dim3 dimBlock(4, 8, 8); // 256 threads per block MyKernel <<< dimGrid, dimBlock >>> (...); // Launch kernel Built-in variables and functions valid in device code: dim3 gridDim; // Grid dimension dim3 blockDim; // Block dimension dim3 blockIdx; // Block index dim3 threadIdx; // Thread index void __syncthreads(); // Thread synchronization © NVIDIA Corporation 2009
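Putting slides 27 and 28 together, a minimal sketch of a kernel that uses the qualifiers, the execution configuration, and the built-in variables (the kernel body, the pointer d_data, and the index flattening are illustrative):

    __global__ void MyKernel(float *data)
    {
        // Flatten the 2-D grid and 3-D block indices into one array index
        int block  = blockIdx.y * gridDim.x + blockIdx.x;
        int thread = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x
                   + threadIdx.x;
        int idx = block * (blockDim.x * blockDim.y * blockDim.z) + thread;
        data[idx] = (float)idx;                // each thread writes its own element
    }

    // Host side: 5000 blocks of 256 threads, so d_data holds 1,280,000 floats
    dim3 dimGrid(100, 50);
    dim3 dimBlock(4, 8, 8);
    MyKernel<<<dimGrid, dimBlock>>>(d_data);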
  • 29. SAXPY: Device Code
    // Standard C code
    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Parallel CUDA C code
    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }
© NVIDIA Corporation 2009
  • 30. SAXPY: Host Code
    // Allocate two N-vectors h_x and h_y
    int size = N * sizeof(float);
    float* h_x = (float*)malloc(size);
    float* h_y = (float*)malloc(size);
    // Initialize them...

    // Allocate device memory
    float* d_x; float* d_y;
    cudaMalloc((void**)&d_x, size);
    cudaMalloc((void**)&d_y, size);

    // Copy host memory to device memory
    cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice);

    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (N + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(N, 2.0, d_x, d_y);

    // Copy result back from device memory to host memory
    cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost);
© NVIDIA Corporation 2009
  • 31. Exercise 1: Move Data between Host and GPU Start from the “cudaMallocAndMemcpy” template. Part 1: Allocate memory for pointers d_a and d_b on the device. Part 2: Copy h_a on the host to d_a on the device. Part 3: Do a device to device copy from d_a to d_b. Part 4: Copy d_b on the device back to h_a on the host. Part 5: Free d_a and d_b on the host. Bonus: Experiment with cudaMallocHost in place of malloc for allocating h_a. © NVIDIA Corporation 2009
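A sketch of one possible solution, assuming N elements, a host array h_a that is already initialised, and the pointer names from the template:

    int size = N * sizeof(float);
    float *d_a, *d_b;
    cudaMalloc((void**)&d_a, size);                        // Part 1
    cudaMalloc((void**)&d_b, size);
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);    // Part 2
    cudaMemcpy(d_b, d_a, size, cudaMemcpyDeviceToDevice);  // Part 3
    cudaMemcpy(h_a, d_b, size, cudaMemcpyDeviceToHost);    // Part 4
    cudaFree(d_a);                                         // Part 5
    cudaFree(d_b);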
  • 32. Launching a Kernel. Call a kernel with Func<<<Dg, Db, Ns, S>>>(params); where
    dim3 Dg(mx, my, 1);   // grid spec
    dim3 Db(nx, ny, nz);  // block spec
    size_t Ns;            // shared memory
    cudaStream_t S;       // CUDA stream
The execution configuration is passed to the kernel in the built-in variables dim3 gridDim, blockDim, blockIdx, threadIdx; extract components with threadIdx.x, threadIdx.y, threadIdx.z, etc. [Diagram: grid of blocks and block of threads, as on slide 15.] © NVIDIA Corporation 2009
  • 33. Exercise 2: Launching Kernels Start from the “myFirstKernel” template. Part1: Allocate device memory for the result of the kernel using pointer d_a. Part2: Configure and launch the kernel using a 1-D grid of 1-D thread blocks. Part3: Have each thread set an element of d_a as follows: idx = blockIdx.x*blockDim.x + threadIdx.x d_a[idx] = 1000*blockIdx.x + threadIdx.x Part4: Copy the result in d_a back to the host pointer h_a. Part5: Verify that the result is correct. © NVIDIA Corporation 2009
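One possible kernel and launch for this exercise (a sketch; numBlocks and numThreads are assumed to be defined by the template):

    __global__ void myFirstKernel(int *d_a)
    {
        int idx = blockIdx.x*blockDim.x + threadIdx.x;
        d_a[idx] = 1000*blockIdx.x + threadIdx.x;   // Part 3
    }

    // Host side, Parts 1 and 2:
    int *d_a;
    cudaMalloc((void**)&d_a, numBlocks*numThreads*sizeof(int));
    myFirstKernel<<<numBlocks, numThreads>>>(d_a);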
  • 34. Exercise 3: Reverse Array, Single Block Given an input array {a0, a1, …, an-1} in pointer d_a, store the reversed array {an-1, an-2, …, a0} in pointer d_b Start from the “reverseArray_singleblock” template Only one thread block launched, to reverse an array of size N = numThreads = 256 elements Part 1 (of 1): All you have to do is implement the body of the kernel “reverseArrayBlock()” Each thread moves a single element to reversed position Read input from d_a pointer Store output in reversed location in d_b pointer © NVIDIA Corporation 2009
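A sketch of the kernel body for the single-block case, where blockDim.x equals the array length N:

    __global__ void reverseArrayBlock(int *d_b, int *d_a)
    {
        int in  = threadIdx.x;                    // this thread's element
        int out = blockDim.x - 1 - threadIdx.x;   // its reversed position
        d_b[out] = d_a[in];
    }

    // Launched with a single block of N = 256 threads:
    reverseArrayBlock<<<1, 256>>>(d_b, d_a);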
  • 35. Exercise 4: Reverse Array, Multi-Block Given an input array {a0, a1, …, an-1} in pointer d_a, store the reversed array {an-1, an-2, …, a0} in pointer d_b Start from the “reverseArray_multiblock” template Multiple 256-thread blocks launched To reverse an array of size N, N/256 blocks Part 1: Compute the number of blocks to launch Part 2: Implement the kernel reverseArrayBlock() Note that now you must compute both The reversed location within the block The reversed offset to the start of the block © NVIDIA Corporation 2009
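A sketch of the multi-block kernel; the reversed index combines the reversed offset to the start of the block with the reversed location within the block (N is assumed to be a multiple of 256):

    __global__ void reverseArrayBlock(int *d_b, int *d_a)
    {
        int in  = blockIdx.x*blockDim.x + threadIdx.x;
        int out = (gridDim.x  - 1 - blockIdx.x)*blockDim.x   // reversed block offset
                + (blockDim.x - 1 - threadIdx.x);            // reversed position in block
        d_b[out] = d_a[in];
    }

    // Part 1: number of blocks
    int numBlocks = N / 256;
    reverseArrayBlock<<<numBlocks, 256>>>(d_b, d_a);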
  • 37. Single-Instruction, Multiple-Thread Execution Warp: set of 32 parallel threads that execute together in single-instruction, multiple-thread mode (SIMT) on a streaming multiprocessor (SM) SM hardware implements zero-overhead warp and thread scheduling Threads can execute independently SIMT warp diverges and converges when threads branch independently Best efficiency and performance when threads of a warp execute together, so no penalty if all threads in a warp take same path of execution Each SM executes up to 1024 concurrent threads, as 32 SIMT warps of 32 threads © NVIDIA Corporation 2009
  • 38. Global Memory: Off-chip global memory is not cached. [Block diagram of the Tesla T10 memory system, as on slide 7: the SMs connect through the interconnection network to ROP/L2 units and DRAM channels.] © NVIDIA Corporation 2009
  • 39. Efficient Access to Global Memory: a single memory transaction (coalescing) is issued for some memory addressing patterns. The 16 threads of a half-warp access a 128-byte segment of global memory: a linear pattern coalesces; not all threads need participate; any address within the segment is OK. © NVIDIA Corporation 2009
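To illustrate the difference, two sketch kernels: in the first, consecutive threads read consecutive addresses (one transaction per half-warp); in the second, each thread of a half-warp lands in a different 128-byte segment (the input array is assumed large enough for the strided reads):

    __global__ void goodCopy(float *out, const float *in)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        out[i] = in[i];        // linear pattern: coalesces
    }

    __global__ void badCopy(float *out, const float *in)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        out[i] = in[32*i];     // stride of 128 bytes: 16 transactions per half-warp
    }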
  • 40. Shared Memory: more than 1 Tbyte/sec aggregate memory bandwidth; 16 kbytes per SM. Use it as a cache, to reorganize global memory accesses into a coalesced pattern, and to share data between threads. © NVIDIA Corporation 2009
  • 41. Shared Memory Bank Conflicts: Successive 32-bit words are assigned to different banks. Simultaneous access to the same bank by threads in a half-warp causes a conflict and serializes access. Conflict-free: a linear access pattern, a permutation, or a broadcast (from one address). Conflicting: e.g. a stride-8 pattern. [Diagram: threads 0-15 mapped onto banks 0-15.] © NVIDIA Corporation 2009
  • 42. Matrix Transpose Access columns of a tile in shared memory to write contiguous data to global memory Requires __syncthreads() since threads write data read by other threads Pad shared memory array to avoid bank conflicts idata odata tile © NVIDIA Corporation 2009
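A sketch along the lines of the SDK example, assuming square 16 x 16 thread blocks (TILE_DIM equal to blockDim.x and blockDim.y) and matrix dimensions that are multiples of the tile size:

    #define TILE_DIM 16

    __global__ void transpose(float *odata, const float *idata,
                              int width, int height)
    {
        // Extra column of padding avoids shared memory bank conflicts
        __shared__ float tile[TILE_DIM][TILE_DIM + 1];

        int x = blockIdx.x*TILE_DIM + threadIdx.x;
        int y = blockIdx.y*TILE_DIM + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = idata[y*width + x];   // coalesced read

        __syncthreads();   // data written above is read by other threads below

        x = blockIdx.y*TILE_DIM + threadIdx.x;   // transposed block offset
        y = blockIdx.x*TILE_DIM + threadIdx.y;
        odata[y*height + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
    }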
  • 43. Matrix Transpose There are further optimisations: see the New Matrix Transpose SDK example. © NVIDIA Corporation 2009
  • 44. OTHER GPU MEMORIES © NVIDIA Corporation 2009
  • 45. Texture Memory. Texture is an object for reading data; data is cached. Host actions: allocate memory on the GPU; create a texture memory reference object; bind the texture object to memory; clean up after use. GPU actions: fetch using texture references with tex1Dfetch(), tex1D(), tex2D(), tex3D(). © NVIDIA Corporation 2009
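A minimal sketch of the host and GPU steps above, using the texture-reference API of the CUDA releases this deck covers, and assuming a device array d_data of n floats and an output array d_out:

    texture<float, 1, cudaReadModeElementType> texRef;   // reference object, file scope

    __global__ void readViaTexture(float *out, int n)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) out[i] = tex1Dfetch(texRef, i);       // cached fetch
    }

    // Host side:
    cudaBindTexture(NULL, texRef, d_data, n*sizeof(float));   // bind
    readViaTexture<<<(n + 255)/256, 256>>>(d_out, n);
    cudaUnbindTexture(texRef);                                // clean up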
  • 46. Constant Memory Write by host, read by GPU Data is cached Useful for tables of constants © NVIDIA Corporation 2009
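A sketch of such a table of constants: the host writes it with cudaMemcpyToSymbol before launch, and every thread reads it through the constant cache (the table size and kernel are illustrative):

    __constant__ float coeffs[16];            // lives in cached constant memory

    __global__ void applyCoeffs(float *data, int n)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) data[i] *= coeffs[i & 15]; // i % 16, done as a cheap bit mask
    }

    // Host side:
    float h_coeffs[16] = { /* fill in 16 values */ };
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));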
  • 48. Execution Configuration: vectorAdd<<<BLOCKS, THREADS_PER_BLOCK>>>(N, 2.0, d_x, d_y); How many blocks? At least one block per SM to keep every SM occupied; at least two blocks per SM so something can run if a block is waiting for a synchronization to complete; many blocks for scalability to larger and future GPUs. How many threads? At least 192 threads per SM to hide the read-after-write latency of 11 cycles (e.g. between x = y + 5; and z = x + 3;), not necessarily in the same block; use many threads to hide global memory latency; too many threads exhausts registers and shared memory; thread count should be a multiple of the warp size; typically between 64 and 256 threads per block. © NVIDIA Corporation 2009
  • 49. Occupancy Calculator: occupancy = (blocks per SM x threads per block) / (maximum threads per SM). The occupancy calculator shows the trade-offs between thread count, register use, and shared memory use. Low occupancy is bad; increasing occupancy doesn't always help. © NVIDIA Corporation 2009
  • 50. DEBUGGING AND PROFILING © NVIDIA Corporation 2009
  • 51. Debugging. nvcc flags: --debug (-g) generates debug information for host code; --device-debug <level> (-G <level>) generates debug information for device code and also specifies the optimisation level for the device code, to control its 'debuggability' (allowed values: 0, 1, 2, 3). Debug with cuda-gdb a.out; the usual gdb commands are available. © NVIDIA Corporation 2009
  • 52. Debugging: additional commands in cuda-gdb. thread : display the current host and CUDA thread of focus. thread <<<(TX,TY,TZ)>>> : switch to the CUDA thread at the specified coordinates. thread <<<(BX,BY),(TX,TY,TZ)>>> : switch to the CUDA block and thread at the specified coordinates. info cuda threads : display a summary of all CUDA threads currently resident on the GPU. info cuda threads all : display a list of each CUDA thread currently resident on the GPU. info cuda state : display information about the current CUDA state. next and step advance all threads in a warp, except at __syncthreads(), where all warps continue to an implicit barrier following the sync. © NVIDIA Corporation 2009
  • 53. CUDA Visual Profiler cudaprof Documentation in $CUDA/cudaprof/doc/cudaprof.html © NVIDIA Corporation 2009
  • 54. CUDA Visual Profiler Open a new project Select session settings through dialogue Execute CUDA program by clicking Start button Various views of collected data available Results of different runs stored in sessions for easy comparison Project can be saved © NVIDIA Corporation 2009
  • 55. MISCELLANEOUS TOPICS © NVIDIA Corporation 2009
  • 56. Expensive Operations: 32-bit multiply (__mul24() and __umul24() are fast 24-bit multiplies); sin(), exp() etc. (faster, less accurate versions are __sinf(), __expf() etc.); integer division and modulo (avoid if possible; replace with bit-shift operations for powers of 2); branching where threads of a warp take differing paths of control flow. © NVIDIA Corporation 2009
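For example, a power-of-two divisor can be replaced by shifts and masks, and a small multiply by the 24-bit intrinsic (a sketch; correct for non-negative integers and operands that fit in 24 bits):

    int quotient  = i >> 4;        // i / 16 without integer division
    int remainder = i & 15;        // i % 16 without modulo
    int product   = __mul24(a, b); // fast 24-bit a * b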
  • 57. Host to GPU Data Transfers PCI Express Gen2, 8 Gbytes/sec peak Use page-locked (pinned) memory for maximum bandwidth between GPU and host Data transfer host-GPU and GPU-host can overlap with computation both on host and GPU © NVIDIA Corporation 2009
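A sketch of a pinned-memory transfer overlapped with kernel execution in a CUDA stream; 'process', d_buf, size, nBlocks, and nThreads are assumed to be defined elsewhere:

    float *h_buf;
    cudaMallocHost((void**)&h_buf, size);   // page-locked (pinned) host memory

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // The asynchronous copy returns immediately; the kernel in the same
    // stream starts once the copy completes, while the host continues.
    cudaMemcpyAsync(d_buf, h_buf, size, cudaMemcpyHostToDevice, stream);
    process<<<nBlocks, nThreads, 0, stream>>>(d_buf);

    cudaStreamSynchronize(stream);          // wait for the copy and the kernel
    cudaFreeHost(h_buf);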
  • 58. [CUDA platform diagram: application software written in C calls the CUDA libraries (cuFFT, cuBLAS, cuDPP); the CUDA compiler (C, Fortran) and CUDA tools (debugger, profiler) target both the CPU hardware (4 cores) and a Tesla 1U system behind a PCI-E switch (240 cores).] © NVIDIA Corporation 2009
  • 59. On-line Course Programming Massively Parallel Processors, Wen-Mei Hwu, University of Illinois at Urbana-Champaign http://courses.ece.illinois.edu/ece498/al/ PowerPoint slides, MP3 recordings of lectures, draft of textbook by Wen-Mei Hwu and David Kirk (NVIDIA) © NVIDIA Corporation 2009
  • 60. CUDA Zone: www.nvidia.com/CUDA, resources for CUDA developers: CUDA Toolkit (compiler, libraries), CUDA SDK (code samples), CUDA Profiler, and forums. © NVIDIA Corporation 2009