Massively Parallel Computing
                        CS 264 / CSCI E-292
Lecture #5: Advanced CUDA | February 22nd, 2011




               Nicolas Pinto (MIT, Harvard)
                      pinto@mit.edu
Administrivia
• HW2: out, due Mon 3/14/11 (not Fri 3/11/11)
• Projects: think about it, consult the staff (*),
  proposals due ~ Fri 3/25/11
• Guest lectures:
 • schedule coming soon
 • on Fridays 7.35-9.35pm (March, April) ?
During this course,
we’ll try to “ ... ”
and use existing material ;-)
                (adapted for CS264)
Today
yey!!
Outline

1. Hardware Review
2. Memory/Communication Optimizations
3. Threading/Execution Optimizations
1. Hardware Review
10-Series Architecture

         240 thread processors execute kernel threads
         30 multiprocessors, each contains
              8 thread processors
              One double-precision unit
              Shared memory enables thread cooperation

         [Figure: one multiprocessor with its thread processors,
          double-precision unit, and shared memory]
© NVIDIA Corporation 2008                                                         8
Threading Hierarchy
Execution Model

  Software         Hardware

  Thread           Thread Processor
                      Threads are executed by thread processors

  Thread Block     Multiprocessor
                      Thread blocks are executed on multiprocessors
                      Thread blocks do not migrate
                      Several concurrent thread blocks can reside on one
                      multiprocessor – limited by multiprocessor resources
                      (shared memory and register file)

  Grid             Device
                      A kernel is launched as a grid of thread blocks
                      Only one kernel can execute on a device at one time
                                             © 2008 NVIDIA Corporation.
Warps and Half Warps

          A thread block consists of 32-thread warps
          A warp is executed physically in parallel (SIMD) on a multiprocessor

          A half-warp of 16 threads can coordinate global memory accesses
          into a single transaction

          [Figure: thread block split into warps of 32 threads on a multiprocessor;
           half-warps of 16 threads accessing global/local memory in device DRAM]
© NVIDIA Corporation 2008                                                              10
Memory Architecture

   [Figure: Host (CPU, chipset, DRAM) connected to Device (GPU with several
    multiprocessors, each with registers and shared memory, plus constant and
    texture caches; device DRAM holds local, global, constant, and texture memory)]
© NVIDIA Corporation 2008                                               11
Kernel Memory Access

        Per-thread
                  Registers        On-chip
                  Local Memory     Off-chip, uncached

        Per-block
                  Shared Memory    On-chip, small, fast

        Per-device
                  Global Memory    Off-chip, large, uncached,
                                   persistent across kernel launches (kernel I/O)
Global Memory

   • Different types of “global memory”:
      • Linear Memory
      • Texture Memory
      • Constant Memory
Memory Architecture



   Memory                   Location   Cached   Access   Scope                 Lifetime
   Register                 On-chip    N/A      R/W      One thread            Thread
   Local                    Off-chip   No       R/W      One thread            Thread
   Shared                   On-chip    N/A      R/W      All threads in a block Block
   Global                   Off-chip   No       R/W      All threads + host    Application
   Constant                 Off-chip   Yes      R        All threads + host    Application
   Texture                  Off-chip   Yes      R        All threads + host    Application




© NVIDIA Corporation 2008                                                                 12
2. Memory/Communication
      Optimizations
Review



2.1 Host/Device Transfer
        Optimizations
Review
               PC Architecture

   CPU —(FrontSide Bus)— Northbridge —(PCI-Express Bus, 8 GB/s)— Graphics Card (CUDA GPU)
   Northbridge —(Memory Bus, 25+ GB/s)— DRAM
   Southbridge — SATA, Ethernet (3+ Gb/s)
   On the graphics card: 160+ GB/s to VRAM

                                                                    modified from Matthew Bolitho
Review
   The PCI-“not-so”-e Bus
• PCIe bus is slow
• Try to minimize/group transfers
• Use pinned memory on host whenever possible
• Try to perform copies asynchronously (e.g. Streams)
• Use “Zero-Copy” when appropriate
• Examples in the SDK (e.g. bandwidthTest)
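
A minimal host-side sketch of the pinned-memory and asynchronous-copy bullets above (buffer and kernel names are illustrative, not from the SDK; error checking omitted):

    #include <cuda_runtime.h>

    __global__ void scale(float *d, float a)          // hypothetical kernel
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        d[i] *= a;
    }

    int main()
    {
        const int N = 1 << 20;
        const size_t bytes = N * sizeof(float);
        float *h_data, *d_data;

        cudaMallocHost((void**)&h_data, bytes);       // pinned host memory, not malloc()
        cudaMalloc((void**)&d_data, bytes);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Copies and kernel are queued on the stream; the host is free meanwhile
        cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);
        scale<<<N / 256, 256, 0, stream>>>(d_data, 2.0f);
        cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);

        cudaStreamDestroy(stream);
        cudaFreeHost(h_data);
        cudaFree(d_data);
        return 0;
    }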
2.2 Device Memory
     Optimizations
Definitions
• gmem: global memory
• smem: shared memory
• tmem: texture memory
• cmem: constant memory
• bmem: binary code (cubin) memory ?!?
  (covered next week)
Performance Analysis
e.g. Matrix Transpose
Matrix Transpose

        Transpose 2048x2048 matrix of floats
        Performed out-of-place
                  Separate input and output matrices
        Use tile of 32x32 elements, block of 32x8 threads
                  Each thread processes 4 matrix elements
                  In general tile and block size are fair game for
                  optimization
        Process
                  Get the right answer
                  Measure effective bandwidth (relative to theoretical or
                  reference case)
                  Address global memory coalescing, shared memory bank
                  conflicts, and partition camping while repeating above
                  steps
© NVIDIA Corporation 2008                                               22
Theoretical Bandwidth


        Device Bandwidth of GTX 280

                  1107 × 10^6 (memory clock, Hz) × (512 / 8) (memory interface, bytes) × 2 (DDR) / 1024^3 = 131.9 GB/s


                  Specs report 141 GB/s
                            Use 10^9 B/GB conversion rather than 1024^3
                            Whichever you use, be consistent



© NVIDIA Corporation 2008                                                 23
Effective Bandwidth


        Transpose Effective Bandwidth

                  2048^2 (matrix elements) × 4 B/element / 1024^3 × 2 (read and write) / (time in secs) = bandwidth in GB/s



        Reference Case - Matrix Copy
                  Transpose operates on tiles - need better comparison
                  than raw device bandwidth
                  Look at effective bandwidth of copy that uses tiles


© NVIDIA Corporation 2008                                                     24
Matrix Copy Kernel
__global__ void copy(float *odata, float *idata, int width,
                     int height)
{
  int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
  int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
  int index = xIndex + width*yIndex;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
      odata[index+i*width] = idata[index+i*width];
    }
}

                  TILE_DIM = 32, BLOCK_ROWS = 8
                  32x32 tile, 32x8 thread block
                  idata and odata in global memory
                  Elements copied by a half-warp of threads
© NVIDIA Corporation 2008                                                  25
Matrix Copy Kernel Timing
        Measure elapsed time over loop
        Looping/timing done in two ways:
                  Over kernel launches (nreps = 1)
                            Includes launch/indexing overhead
                  Within the kernel over loads/stores (nreps > 1)
                            Amortizes launch/indexing overhead
         __global__ void copy(float *odata, float* idata, int width,
                              int height, int nreps)
         {
           int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
           int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
           int index = xIndex + width*yIndex;

             for (int r = 0; r < nreps; r++) {
               for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
                 odata[index+i*width] = idata[index+i*width];
               }
             }
         }
© NVIDIA Corporation 2008                                              26
Naïve Transpose
        Similar to copy
                  Input and output matrices have different indices
    __global__ void transposeNaive(float *odata, float* idata, int width,
                                   int height, int nreps)
    {
      int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
      int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;

        int index_in = xIndex + width * yIndex;
        int index_out = yIndex + height * xIndex;

        for (int r=0; r < nreps; r++) {
          for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
            odata[index_out+i] = idata[index_in+i*width];
          }
        }
    }




© NVIDIA Corporation 2008                                                   27
Effective Bandwidth



                             Effective Bandwidth (GB/s) — 2048x2048, GTX 280

                                               Loop over kernel     Loop in kernel
                       Simple Copy                   96.9               81.6
                       Naïve Transpose                2.2                2.2



© NVIDIA Corporation 2008                                              28
gmem coalescing
Memory Coalescing

GPU memory controller granularity is 64 or 128 bytes
  Must also be 64 or 128 byte aligned
Suppose thread loads a float (4 bytes)
  Controller loads 64 bytes, throws 60 bytes away
Memory Coalescing

  Memory controller actually more intelligent
  Consider half-warp (16 threads)
    Suppose each thread reads consecutive float
    Memory controller will perform one 64 byte load
  This is known as coalescing


Make threads read consecutive locations
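
To make the contrast concrete, two hypothetical kernels (not from the slides): in the first, each half-warp touches one contiguous 64B segment; in the second, a large stride forces one transaction per thread.

    // Coalesced: thread i reads element i — one 64B transaction per half-warp
    __global__ void copyCoalesced(float *out, const float *in)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }

    // Strided: thread i reads element i*stride — up to 16 transactions per half-warp
    __global__ void copyStrided(float *out, const float *in, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i * stride];
    }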
Coalescing
        Global memory access of 32, 64, or 128-bit words by a half-
        warp of threads can result in as few as one (or two)
        transaction(s) if certain access requirements are met
        Depends on compute capability
                  1.0 and 1.1 have stricter access requirements


       Examples – float (32-bit) data

                  [Figure: a half-warp of 16 threads reading floats from global memory;
                   64B aligned segment = 16 floats, 128B aligned segment = 32 floats]
© NVIDIA Corporation 2008                                                      30
Coalescing
Compute capability 1.0 and 1.1
        The k-th thread must access the k-th word in the segment (or the k-th word in 2
        contiguous 128B segments for 128-bit words); not all threads need to
        participate

       Coalesces – 1 transaction




Out of sequence – 16 transactions            Misaligned – 16 transactions




© NVIDIA Corporation 2008                                                   31
Memory Coalescing

GT200 has hardware coalescer
Inspects memory requests from each half-warp
Determines minimum set of transactions which are
  64 or 128 bytes long
  64 or 128 byte aligned
Coalescing
   Compute capability 1.2 and higher                 (e.g. GT200 like the C1060)
          Coalescing is achieved for any pattern of addresses that fits into a
          segment of size: 32B for 8-bit words, 64B for 16-bit words, 128B for
          32- and 64-bit words
          Smaller transactions may be issued to avoid wasted bandwidth due
          to unused words



                                       1 transaction - 64B segment



                                                  1 transaction - 128B segment
2 transactions - 64B and 32B segments




  © NVIDIA Corporation 2008                                                        32
Coalescing
          Compute capability 2.0 (Fermi, Tesla C2050)
                 Memory transactions handled per warp (32 threads)
                 L1 cache ON:
                 Always issues 128B segment transactions
                 and caches them in the 16kB or 48kB L1 cache per multiprocessor
        2 transactions - 2 x 128B segment - but next warp probably only 1 extra transaction, due to L1 cache.




                L1 cache OFF:
                Always issues 32B segment transactions
                E.g. an advantage for widely scattered thread accesses
                            32 transactions - 32 x 32B segments, instead of 32 x 128B segments.




© NVIDIA Corporation 2010
Coalescing Summary

Coalescing dramatically speeds global memory access
Strive for perfect coalescing:
  Align starting address (may require padding)
  A warp should access within a contiguous region
Coalescing in Transpose

        Naïve transpose coalesces reads, but not writes


                            idata                          odata




                        Elements transposed by a half-warp of threads


                 Q: How to coalesce writes ?
© NVIDIA Corporation 2008                                               33
smem as a cache
Shared Memory

SMs can access gmem at 80+ GiB/sec
but have hundreds of cycles of latency
Each SM has 16 kiB ‘shared’ memory
  Essentially user-managed cache
  Speed comparable to registers
  Accessible to all threads in a block
Reduces load/stores to device memory
Shared Memory

       ~Hundred times faster than global memory

       Cache data to reduce global memory accesses

       Threads can cooperate via shared memory

       Use it to avoid non-coalesced access
                 Stage loads and stores in shared memory to re-order non-
                 coalesceable addressing




© NVIDIA Corporation 2008                                               34
A Common Programming Strategy

   • Partition data into subsets that fit into shared memory
   • Handle each data subset with one thread block
   • Load the subset from global memory to shared memory, using multiple
     threads to exploit memory-level parallelism
   • Perform the computation on the subset from shared memory
   • Copy the result from shared memory back to global memory
     (a minimal sketch follows below)

 © 2008 NVIDIA Corporation
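
A minimal sketch of that five-step pattern on a toy problem (reversing each 256-element chunk of an array; kernel and names are illustrative, not from the deck) — the transpose kernel on the following slides has exactly the same structure:

    #define BLOCK 256                                 // assumes array length is a multiple of BLOCK

    __global__ void reverseBlocks(float *odata, const float *idata)
    {
        __shared__ float tile[BLOCK];                 // 1. subset fits in shared memory

        int idx = blockIdx.x * BLOCK + threadIdx.x;   // 2. one thread block per subset
        tile[threadIdx.x] = idata[idx];               // 3. coalesced load into shared memory

        __syncthreads();                              //    wait until the whole tile is loaded

        odata[idx] = tile[BLOCK - 1 - threadIdx.x];   // 4./5. compute on the tile, coalesced store
    }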
Coalescing through shared memory

        Access columns of a tile in shared memory to write
        contiguous data to global memory
        Requires __syncthreads() since threads write data
        read by other threads

                                idata                         odata
                                                 tile




                            Elements transposed by a half-warp of threads

© NVIDIA Corporation 2008                                                   35
Coalescing through shared memory
 __global__ void transposeCoalesced(float *odata, float *idata, int width,
                                    int height, int nreps)
 {
   __shared__ float tile[TILE_DIM][TILE_DIM];

     int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
     int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
     int index_in = xIndex + (yIndex)*width;

     xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
     yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
     int index_out = xIndex + (yIndex)*height;

     for (int r=0; r < nreps; r++) {
       for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
         tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*width];
       }

         __syncthreads();

         for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
           odata[index_out+i*height] = tile[threadIdx.x][threadIdx.y+i];
         }
     }
 }
© NVIDIA Corporation 2008                                                    36
Effective Bandwidth


                             Effective Bandwidth (GB/s) — 2048x2048, GTX 280

                                               Loop over kernel     Loop in kernel
                       Simple Copy                   96.9               81.6
                       Shared Memory Copy            80.9               81.1
                       Naïve Transpose                2.2                2.2
                       Coalesced Transpose           16.5               17.1

                       (Shared Memory Copy uses a shared memory tile and __syncthreads())
 © NVIDIA Corporation 2008                                                  37
smem bank conflicts
Shared Memory Architecture

      Many threads accessing memory
               Therefore, memory is divided into banks
               Successive 32-bit words assigned to successive banks


      Each bank can service one address per cycle
                                                                 Bank 0
               A memory can service as many simultaneous         Bank 1
               accesses as it has banks                          Bank 2
                                                                 Bank 3
                                                                 Bank 4
      Multiple simultaneous accesses to a bank                   Bank 5
      result in a bank conflict                                  Bank 6
               Conflicting accesses are serialized               Bank 7



                                                                 Bank 15
© NVIDIA Corporation 2008                                             39
Shared Memory Banks

   Shared memory is divided into 16 banks, each 4 bytes wide
   Shared memory is (almost) as fast as registers (...)
   The exception is in case of bank conflicts

   [Figure: successive 32-bit words map to successive banks;
    words 0–15 and words 16–31 both map onto banks 0–15]
Bank Addressing Examples

        No Bank Conflicts                          No Bank Conflicts
                  Linear addressing, stride == 1             Random 1:1 permutation
                  (thread i → bank i)                        (each thread hits a distinct bank)

        [Figure: threads 0–15 mapped one-to-one onto banks 0–15]

© NVIDIA Corporation 2008                                                 40
Bank Addressing Examples

        2-way Bank Conflicts                       8-way Bank Conflicts
                  Linear addressing, stride == 2              Linear addressing, stride == 8

        [Figure: with stride 2, two threads share each bank;
         with stride 8, eight threads share each bank]

© NVIDIA Corporation 2008                                                   41
Shared memory bank conflicts
       Shared memory is ~ as fast as registers if there are no bank
       conflicts

       warp_serialize profiler signal reflects conflicts

       The fast case:
                 If all threads of a half-warp access different banks, there is no
                 bank conflict
                 If all threads of a half-warp read the identical address, there is no
                 bank conflict (broadcast)


       The slow case:
                 Bank Conflict: multiple threads in the same half-warp access the
                 same bank
                 Must serialize the accesses
                 Cost = max # of simultaneous accesses to a single bank
© NVIDIA Corporation 2008                                                            42
Bank Conflicts in Transpose

        32x32 shared memory tile of floats
                 Data in columns k and k+16 are in the same bank
                 16-way bank conflict reading half columns in the tile

                 Q: How to avoid bank conflicts?

        Solution - pad the shared memory array
                 __shared__ float tile[TILE_DIM][TILE_DIM+1];
                 Data in anti-diagonals are in the same bank

                            idata                odata
                                        tile

© NVIDIA Corporation 2008                                            43
Illustration
               Shared Memory: Avoiding Bank Conflicts
               • 32x32 SMEM array
               • Warp accesses a column:
                   – 32-way bank conflict (threads in a warp access the same bank)

               [Figure: 32x32 array laid out over banks 0–31;
                each column of the array falls entirely into one bank]

© NVIDIA 2010
Illustration
               Shared Memory: Avoiding Bank Conflicts
               • Add a column for padding:
                   – 32x33 SMEM array
               • Warp accesses a column:
                   – 32 different banks, no bank conflicts

               [Figure: with one padding element per row, consecutive rows of a
                column fall into different banks]

© NVIDIA 2010
Effective Bandwidth


                             Effective Bandwidth (GB/s) — 2048x2048, GTX 280

                                                      Loop over kernel     Loop in kernel
                       Simple Copy                          96.9               81.6
                       Shared Memory Copy                   80.9               81.1
                       Naïve Transpose                       2.2                2.2
                       Coalesced Transpose                  16.5               17.1
                       Bank Conflict Free Transpose         16.6               17.2




© NVIDIA Corporation 2008                                          44
Need a pause?
Unrelated: Thatcher Illusion
gmem partition camping
Partition Camping

        Global memory accesses go through partitions
                  6 partitions on 8-series GPUs, 8 partitions on 10-series
                  GPUs
                  Successive 256-byte regions of global memory are
                  assigned to successive partitions


        For best performance:
                  Simultaneous global memory accesses GPU-wide should
                  be distributed evenly amongst partitions


        Partition Camping occurs when global memory
        accesses at an instant use a subset of partitions
                  Directly analogous to shared memory bank conflicts, but
                  on a larger scale
© NVIDIA Corporation 2008                                                    46
Partition Camping in Transpose

       Partition width = 256 bytes = 64 floats
                  Twice width of tile
       On GTX280 (8 partitions), data 2KB apart map to
       same partition
                  2048 floats divides evenly by 2KB => columns of matrices
                  map to same partition

                                      idata                       odata
                                                                           tiles in matrices
                            0    1    2    3     4   5   0   64 128
                                                                          colors = partitions
                            64   65   66   67    68 69   1   65 129
                        128 129 130        ...           2   66 130

                                                         3   67   ...

                                                         4   68

                                                         5   69

                     blockId = gridDim.x * blockIdx.y + blockIdx.x
© NVIDIA Corporation 2008                                                                 47
Partition Camping Solutions

       Pad matrices (by two tiles)
                  In general might be expensive/prohibitive memory-wise
       Diagonally reorder blocks
                  Interpret blockIdx.y as different diagonal slices and
                  blockIdx.x as distance along a diagonal


                                      idata                        odata
                              0   64 128                  0

                                  1   65 129              64   1

                                      2    66 130         128 65    2

                                           3   67   ...        129 66   3

                                               4    68             130 67     4

                                                    5                   ...   68   5

                            blockId = gridDim.x * blockIdx.y + blockIdx.x
© NVIDIA Corporation 2008                                                              48
Diagonal Transpose
__global__ void transposeDiagonal(float *odata, float *idata, int width,
                                  int height, int nreps)
{
  __shared__ float tile[TILE_DIM][TILE_DIM+1];

  // Add lines to map diagonal to Cartesian coordinates
  int blockIdx_y = blockIdx.x;
  int blockIdx_x = (blockIdx.x+blockIdx.y)%gridDim.x;

  // Replace blockIdx.x with blockIdx_x, blockIdx.y with blockIdx_y
  int xIndex = blockIdx_x * TILE_DIM + threadIdx.x;
  int yIndex = blockIdx_y * TILE_DIM + threadIdx.y;
  int index_in = xIndex + (yIndex)*width;

  xIndex = blockIdx_y * TILE_DIM + threadIdx.x;
  yIndex = blockIdx_x * TILE_DIM + threadIdx.y;
  int index_out = xIndex + (yIndex)*height;

  for (int r=0; r < nreps; r++) {
    for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
      tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*width];
    }
    __syncthreads();
    for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
      odata[index_out+i*height] = tile[threadIdx.x][threadIdx.y+i];
    }
  }
}
© NVIDIA Corporation 2008                                                    49
Diagonal Transpose
       Previous slide for square matrices (width == height)
       More generally:


if (width == height) {
  blockIdx_y = blockIdx.x;
  blockIdx_x = (blockIdx.x+blockIdx.y)%gridDim.x;
} else {
  int bid = blockIdx.x + gridDim.x*blockIdx.y;
  blockIdx_y = bid%gridDim.y;
  blockIdx_x = ((bid/gridDim.y)+blockIdx_y)%gridDim.x;
}




© NVIDIA Corporation 2008                                     50
Effective Bandwidth

                             Effective Bandwidth (GB/s) — 2048x2048, GTX 280

                                                      Loop over kernel     Loop in kernel
                       Simple Copy                          96.9               81.6
                       Shared Memory Copy                   80.9               81.1
                       Naïve Transpose                       2.2                2.2
                       Coalesced Transpose                  16.5               17.1
                       Bank Conflict Free Transpose         16.6               17.2
                       Diagonal                             69.5               78.3




© NVIDIA Corporation 2008                                            51
Order of Optimizations

           Larger optimization issues can mask smaller ones
           The proper order of some optimization techniques is
           not known a priori
                    E.g. partition camping is problem-size dependent


          Don’t dismiss an optimization technique as
          ineffective until you know it was applied at the right
          time
                 Naïve (2.2 GB/s) → Coalescing (16.5 GB/s) → Bank Conflicts (16.6 GB/s) → Partition Camping (69.5 GB/s)
                 Naïve (2.2 GB/s) → Coalescing (16.5 GB/s) → Partition Camping (48.8 GB/s) → Bank Conflicts (69.5 GB/s)
© NVIDIA Corporation 2008                                                               52
Transpose Summary

      Coalescing and shared memory bank conflicts are
      small-scale phenomena
           Deal with memory access within half-warp
           Problem-size independent


      Partition camping is a large-scale phenomenon
           Deals with simultaneous memory accesses by warps on
           different multiprocessors
           Problem size dependent
              Wouldn’t see in (2048+32)^2 matrix


      Coalescing is generally the most critical

                             SDK Transpose Example:
   © NVIDIA Corporation 2008
http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html
                                                                      53
tmem
Textures in CUDA

       Texture is an object for reading data

       Benefits:
                 Data is cached (optimized for 2D locality)
                            Helpful when coalescing is a problem
                 Filtering
                            Linear / bilinear / trilinear
                            Dedicated hardware
                 Wrap modes (for “out-of-bounds” addresses)
                            Clamp to edge / repeat
                 Addressable in 1D, 2D, or 3D
                            Using integer or normalized coordinates

       Usage:
                 CPU code binds data to a texture object
                 Kernel reads data by calling a fetch function
© NVIDIA Corporation 2008                                             55
Other goodies
Optional “format conversion”
• {char, short, int, half (16-bit)} to float (32-bit)
• “for free”
• useful for *mem compression (see later)
Texture Addressing

   [Figure: 4x5 texel grid with sample coordinates (2.5, 0.5) and (1.0, 1.0)]

   Wrap:  out-of-bounds coordinate is wrapped (modulo arithmetic), e.g. (5.5, 1.5)
   Clamp: out-of-bounds coordinate is replaced with the closest boundary, e.g. (5.5, 1.5)

© NVIDIA Corporation 2008                                                                                   56
Two CUDA Texture Types

       Bound to linear memory
                 Global memory address is bound to a texture
                 Only 1D
                 Integer addressing
                 No filtering, no addressing modes

       Bound to CUDA arrays
                 CUDA array is bound to a texture
                 1D, 2D, or 3D
                 Float addressing (size-based or normalized)
                 Filtering
                 Addressing modes (clamping, repeat)

       Both:
                 Return either element type or normalized float
© NVIDIA Corporation 2008                                         57
CUDA Texturing Steps
      Host (CPU) code:
               Allocate/obtain memory (global linear, or CUDA array)
               Create a texture reference object
                       Currently must be at file-scope
               Bind the texture reference to memory/array
               When done:
                       Unbind the texture reference, free resources


      Device (kernel) code:
               Fetch using texture reference
               Linear memory textures:
                       tex1Dfetch()
               Array textures:
                       tex1D() or tex2D() or tex3D()

© NVIDIA Corporation 2008                                              58
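
A minimal sketch of those steps for the linear-memory case, using the file-scope texture reference API of that CUDA generation (kernel and wrapper names are illustrative):

    // File-scope texture reference (required by this API)
    texture<float, 1, cudaReadModeElementType> texRef;

    __global__ void scaleFromTexture(float *out, int n, float a)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = a * tex1Dfetch(texRef, i);    // fetch through the texture cache
    }

    void runScale(float *d_in, float *d_out, int n)  // hypothetical host wrapper
    {
        cudaBindTexture(NULL, texRef, d_in, n * sizeof(float));   // bind linear memory
        scaleFromTexture<<<(n + 255) / 256, 256>>>(d_out, n, 2.0f);
        cudaUnbindTexture(texRef);                                // unbind when done
    }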
cmem
!"#$%&#%'()*"+,
                • -.)&/'0"+'1")00212)#%$'&#.'"%3)+'.&%&'%3&%'2$'+)&.'4#20"+*/,'5,'6&+7$
                • 8&%&'2$'$%"+).'2#'9/"5&/'*)*"+,:'+)&.'%3+"493'&'1"#$%&#%;1&13)
                    – !!"#$%&'$&!!()*'+,-,./(,$(0."+'/'&,#$%
                    – 1'$(#$+2(3.(/.'0(32(456(7./$.+%
                    – 8,9,&.0(&#(:;<=
                • <)+*2'&..$'4#20"+*'&11)$$)$=
                    – <./$.+(>#,$&./('/?*9.$&()*'+,-,.0(@,&A(!"#$%
                    – 1#9>,+./(9*%&(0.&./9,$.(&A'&('++(&A/.'0%(,$('(&A/.'03+#"7 @,++(0./.-./.$".(&A.(%'9.('00/.%%
                    – B#(+,9,&(#$('//'2(%,C.D("'$(*%.('$2(?+#3'+(9.9#/2(>#,$&./
                • !"#$%&#%'1&13)'%3+"49374%='
                    – EF(3,&%(>./(@'/>(>./(F("+#"7%(>./(9*+&,>/#".%%#/
                    – G#(3.(*%.0(@A.$('++(&A/.'0%(,$('(@'/>(/.'0(&A.(%'9.('00/.%%
                         •   H./,'+,C.%(#&A./@,%.



© NVIDIA 2010
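
A minimal sketch of the __constant__ case described above (coefficient array and kernel are hypothetical); note that every thread in the warp reads the same cached address each loop iteration:

    __constant__ float c_coeffs[16];        // hypothetical coefficients, within the 64KB limit

    __global__ void polyEval(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float x = data[i], y = 0.0f;
        for (int k = 0; k < 16; ++k)
            y = y * x + c_coeffs[k];        // all threads in a warp read the same address
        data[i] = y;
    }

    // Host side: __constant__ data is written with cudaMemcpyToSymbol, e.g.
    //   float h_coeffs[16] = { ... };
    //   cudaMemcpyToSymbol(c_coeffs, h_coeffs, sizeof(h_coeffs));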
!"#$%&#%'()*"+,
                • -.)&/'0"+'1")00212)#%$'&#.'"%3)+'.&%&'%3&%'2$'+)&.'4#20"+*/,'5,'6&+7$
                • 8&%&'2$'$%"+).'2#'9/"5&/'*)*"+,:'+)&.'%3+"493'&'1"#$%&#%;1&13)
                     – !!"#$%&'$&!!()*'+,-,./(,$(0."+'/'&,#$%
                                                   !!>+#3'+!!(?#,0(7./$.+@("#$%&(-+#'&(A>!' B
                     – 1'$(#$+2(3.(/.'0(32(456(7./$.+%
                     – 8,9,&.0(&#(:;<=               C
                                                        DDD
                • <)+*2'&..$'4#20"+*'&11)$$)$=
                                                        -+#'&(O(P(>!'QRSTU(((((((((((((((((((VV(*$,-#/9
                    – <./$.+(E#,$&./('/>*9.$&()*'+,-,.0(F,&G(!"#$%
                                                        -+#'&(2(P(>!'Q3+#"7W0ODOXSTU((((VV(*$,-#/9
                    – 1#9E,+./(9*%&(0.&./9,$.(&G'&('++(&G/.'0%(,$('(&G/.'03+#"7 F,++(0./.-./.$".(&G.(%'9.('00/.%%
                                                        -+#'&(I(P(>!'Q&G/.'0W0ODOTU((((((VV($#$Y*$,-#/9
                    – H#(+,9,&(#$('//'2(%,I.J("'$(*%.('$2(>+#3'+(9.9#/2(E#,$&./
                                                        DDD
                • !"#$%&#%'1&13)'%3+"49374%=' Z
                     – KL(3,&%(E./(F'/E(E./(L("+#"7%(E./(9*+&,E/#".%%#/
                     – M#(3.(*%.0(FG.$('++(&G/.'0%(,$('(F'/E(/.'0(&G.(%'9.('00/.%%
                          •   N./,'+,I.%(#&G./F,%.



© NVIDIA 2010
!"#$%&#%'()*"+,
                    • -)+#).')/)01%)$'23-'%4+)&5$'6783'9&+:$;:)+'<('51+=#>'=%$'.=?)%=*)
                    • @..'%4+)&5$'&00)$$'%4)'$&*)'AB'9"+5
                    • C$=#>'D(E(F
                       – !"#$%&"'(%)*+#$*,%-./%01%234/%5)%67,%+'"))8#
                       – 9"#$8:;%<5"=,%(5+*:+8"<<>%&5',*%? 2.@/%<8:*A%B*'>%<8C*<>%+5%6*%*B8#+*=%D7<+8(<*%+8D*,



                                           addresses from a warp
                                                       ...


                0     32    64     96    128   160    192    224   256    288    320   352    384   416    448


© NVIDIA 2010
!"#$%&#%'()*"+,
                    • -)+#).')/)01%)$'23-'%4+)&5$'6783'9&+:$;:)+'<('51+=#>'=%$'.=?)%=*)
                    • @..'%4+)&5$'&00)$$'%4)'$&*)'AB'9"+5
                    • C$=#>'0"#$%&#%D1#=?"+*'&00)$$E
                       – !"#$%&'(#)&*+%,-+$&./&01%+$
                       – 233&4%-+#$&-"%&"5&,45$%(5%&,(,-+&67&./&01%+$&4*&08$&%#(**",
                           • 953":+31&%4&0+&+;",%+<&4;+#&:+#5+3&3"*+%"=+&> 4%-+#&34(<$&<4&54%&?4&%-#48?-&%-"$&,(,-+



                                              addresses from a warp
                                                          ...


                0     32    64      96     128     160     192     224     256     288     320     352     384    416   448


© NVIDIA 2010
*mem compression
!"#$%$&$'()*$#+),-%"./00$-'
                • 1+/')233)/30/)+20)4//')-"#$%$&/5)2'5)6/.'/3)$0)3$%$#/5)47)#+/)'8%4/.)-9)
                  47#/0)'//5/5:);-'0$5/.);-%"./00$-'
                • <"".-2;+/0=
                   – !"#$%&'"()*+,'"%-)#.))"%/01%2301%450-,# ,"#)6)*+%,+ 2%,"+#*7&#,'"%8390-,# *):7,*)+%;%
                     &'7<=)>
                   – ?@$%&'"()*+,'"%-)#.))"%A<231%A<451%A<39 ,+%'")%,"+#*7&#,'"
                       • A<23 82+B)2CD>%,+%+#'*;6)%'"=E1%"'%D;#F%,"+#*7&#,'"+
                   – G;"6)0-;+)H$
                       • I'.)*%;"H%7<<)*%=,D,#+%;*)%J)*")=%;*67D)#+
                       • K;#;%,+%;"%,"H)L%A'*%,"#)*<'=;#,'"

                • <""3$;2#$-')$')".2;#$;/=
                   – M=;*J%!"#$%& NO'=(,"6%I;##,&)%PMK%+E+#)D+%'A%):7;#,'"+%7+,"6%D,L)H%<*)&,+,'"%
                     +'=()*+%'"%Q@R+S
                   – F##<$TT;*L,(U'*6T;-+TCV22U42V2
                                                                                                             34
© NVIDIA 2010
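
A minimal sketch of the storage-only fp16 idea (hypothetical kernel): data lives in gmem as 16-bit halves, halving the bytes moved per element, and is converted to fp32 for the arithmetic.

    // fp16 used only for storage; all math is done in fp32
    __global__ void scaleHalf(unsigned short *data, int n, float a)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float x = __half2float(data[i]);       // one-instruction conversion to fp32
            data[i] = __float2half_rn(a * x);      // round-to-nearest back to fp16 storage
        }
    }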
“Accelerating GPU computation through mixed-precision methods”

                                          Michael Clark
                          Harvard-Smithsonian Center for Astrophysics
                                       Harvard University

              SC’10
... too much ?

        bank conflicts · coalescing · caching · mixed precision · partition camping ·
        clamping · broadcasting · zero-copy · streams ...
Parallel
Programming
     is Hard
   (but you’ll pick it up)
(you are not alone)
3. Threading/Execution
     Optimizations
3.1 Exec. Configuration
        Optimizations
Occupancy

       Thread instructions are executed sequentially, so
       executing other warps is the only way to hide
       latencies and keep the hardware busy

       Occupancy = Number of warps running concurrently
       on a multiprocessor divided by maximum number of
       warps that can run concurrently

       Limited by resource usage:
                 Registers
                 Shared memory



© NVIDIA Corporation 2008                                  60
Grid/Block Size Heuristics

       # of blocks > # of multiprocessors
                 So all multiprocessors have at least one block to execute


       # of blocks / # of multiprocessors > 2
                 Multiple blocks can run concurrently in a multiprocessor
                 Blocks that aren’t waiting at a __syncthreads() keep the
                 hardware busy
                 Subject to resource availability – registers, shared memory


       # of blocks > 100 to scale to future devices
                 Blocks executed in pipeline fashion
                 1000 blocks per grid will scale across multiple generations


© NVIDIA Corporation 2008                                                    61
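
A minimal host-side sketch following the heuristics above (the kernel and its arguments are placeholders, not from the slides):

    __global__ void process(float *d, int n)          // hypothetical kernel
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] += 1.0f;
    }

    void launchProcess(float *d_data, int n)
    {
        dim3 block(256);                              // multiple of warp size, at least 64 threads
        dim3 grid((n + block.x - 1) / block.x);       // many more blocks than multiprocessors; scales with n
        process<<<grid, block>>>(d_data, n);
    }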
Register Dependency

       Read-after-write register dependency
                 Instruction’s result can be read ~24 cycles later
                 Scenarios:     CUDA:                  PTX:
                                      x = y + 5;          add.f32 $f3, $f1, $f2
                                      z = x + 3;          add.f32 $f5, $f3, $f4

                                      s_data[0] += 3;     ld.shared.f32 $f3, [$r31+0]
                                                          add.f32       $f3, $f3, $f4


       To completely hide the latency:
                 Run at least 192 threads (6 warps) per multiprocessor
                            At least 25% occupancy (1.0/1.1), 18.75% (1.2/1.3)
                 Threads do not have to belong to the same thread block
© NVIDIA Corporation 2008                                                               62
Register Pressure
       Hide latency by using more threads per SM
       Limiting Factors:
                 Number of registers per kernel
                            8K/16K per SM, partitioned among concurrent threads
                 Amount of shared memory
                            16KB per SM, partitioned among concurrent threadblocks
       Compile with the --ptxas-options=-v flag
       Use the --maxrregcount=N flag to NVCC
                 N = desired maximum registers / kernel
                 At some point “spilling” into local memory may occur
                            Reduces performance – local memory is slow




© NVIDIA Corporation 2008                                                            63
Occupancy Calculator




© NVIDIA Corporation 2008   64
Optimizing threads per block
       Choose threads per block as a multiple of warp size
                 Avoid wasting computation on under-populated warps
       Want to run as many warps as possible per
       multiprocessor (hide latency)
       Multiprocessor can run up to 8 blocks at a time

       Heuristics
                 Minimum: 64 threads per block
                            Only if multiple concurrent blocks
                 192 or 256 threads a better choice
                            Usually still enough regs to compile and invoke successfully
                 This all depends on your computation, so experiment!


© NVIDIA Corporation 2008                                                             65
Occupancy != Performance


       Increasing occupancy does not necessarily increase
       performance



                                      BUT …



       Low-occupancy multiprocessors cannot adequately
       hide latency on memory-bound kernels
                 (It all comes down to arithmetic intensity and available
                 parallelism)

© NVIDIA Corporation 2008                                                   66
“Better Performance at Lower Occupancy”

                Vasily Volkov (UC Berkeley)

      GTC’10
3.2 Instruction
    Optimizations
CUDA Instruction Performance


       Instruction cycles (per warp) = sum of
                 Operand read cycles
                 Instruction execution cycles
                 Result update cycles


       Therefore instruction throughput depends on
                 Nominal instruction throughput
                 Memory latency
                 Memory bandwidth


       “Cycle” refers to the multiprocessor clock rate
                 1.3 GHz on the Tesla C1060, for example

© NVIDIA Corporation 2008                                  69
Maximizing Instruction Throughput

        Maximize use of high-bandwidth memory
                 Maximize use of shared memory
                 Minimize accesses to global memory
                 Maximize coalescing of global memory accesses


        Optimize performance by overlapping memory
        accesses with HW computation
                 High arithmetic intensity programs
                            i.e. high ratio of math to memory transactions
                 Many concurrent threads




© NVIDIA Corporation 2008                                                    70
Arithmetic Instruction Throughput

       int and float add, shift, min, max and float mul, mad:
       4 cycles per warp
                 int multiply (*) is by default 32-bit
                            requires multiple cycles / warp
                 Use __mul24() / __umul24() intrinsics for 4-cycle 24-bit int
                 multiply


       Integer divide and modulo are more expensive
                 Compiler will convert literal power-of-2 divides to shifts
                            But we have seen it miss some cases
                 Be explicit in cases where compiler can’t tell that divisor is
                 a power of 2!
                 Useful trick: foo % n == foo & (n-1) if n is a power of 2


© NVIDIA Corporation 2008                                                     71
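
A tiny hypothetical kernel illustrating the two tricks on this slide (the 24-bit multiply intrinsic and the power-of-2 modulo rewrite):

    __global__ void indexTricks(int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            int prod = __umul24(i, 3);       // 24-bit multiply: 4 cycles per warp
            out[i]   = prod & (256 - 1);     // same as prod % 256, since 256 is a power of 2
        }
    }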
Runtime Math Library


       There are two types of runtime math operations in
       single-precision
                 __funcf(): direct mapping to hardware ISA
                            Fast but lower accuracy (see prog. guide for details)
                            Examples: __sinf(x), __expf(x), __powf(x,y)
                 funcf() : compile to multiple instructions
                            Slower but higher accuracy (5 ulp or less)
                            Examples: sinf(x), expf(x), powf(x,y)


       The -use_fast_math compiler option forces every
       funcf() to compile to __funcf()


© NVIDIA Corporation 2008                                                           72
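
A hypothetical kernel showing the two flavours side by side; without -use_fast_math the first line compiles to the accurate multi-instruction versions, the second to the hardware intrinsics:

    __global__ void mathFlavours(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float accurate = sinf(in[i]) * expf(in[i]);        // slower, <= 5 ulp
            float fast     = __sinf(in[i]) * __expf(in[i]);    // hardware ISA, lower accuracy
            out[i] = accurate - fast;                          // inspect the difference
        }
    }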
GPU results may not match CPU


       Many variables: hardware, compiler, optimization
       settings

       CPU operations aren’t strictly limited to 0.5 ulp
                 Sequences of operations can be more accurate due to 80-
                 bit extended precision ALUs


       Floating-point arithmetic is not associative!




© NVIDIA Corporation 2008                                              73
FP Math is Not Associative!


       In symbolic math, (x+y)+z == x+(y+z)
       This is not necessarily true for floating-point addition
                  Try x = 10^30, y = -10^30 and z = 1 in the above equation


       When you parallelize computations, you potentially
       change the order of operations

       Parallel results may not exactly match sequential
       results
                 This is not specific to GPU or CUDA – inherent part of
                 parallel execution


© NVIDIA Corporation 2008                                                  74
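
A short host-side check of the example above (10^30 in single precision):

    #include <cstdio>

    int main()
    {
        float x = 1e30f, y = -1e30f, z = 1.0f;
        printf("(x+y)+z = %f\n", (x + y) + z);   // 1.000000
        printf("x+(y+z) = %f\n", x + (y + z));   // 0.000000  (z is absorbed by y)
        return 0;
    }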
Control Flow Instructions
       Main performance concern with branching is
       divergence
                 Threads within a single warp take different paths
                 Different execution paths must be serialized


       Avoid divergence when branch condition is a
       function of thread ID
                 Example with divergence:
                            if (threadIdx.x > 2) { }
                            Branch granularity < warp size
                 Example without divergence:
                            if (threadIdx.x / WARP_SIZE > 2) { }
                            Branch granularity is a whole multiple of warp size


© NVIDIA Corporation 2008                                                         75
Scared ?
Scared ?
           Howwwwww?!
            (do I start)
Profiler
Analysis with Profiler
                • Profiler counters:
                   – instructions_issued, instructions_executed
                       • Both incremented by 1 per warp
                       • “issued” includes replays, “executed” does not
                   – gld_request, gst_request
                       • Incremented by 1 per warp for each load/store instruction
                       • Instruction may be counted if it is “predicated out”
                   – l1_global_load_miss, l1_global_load_hit, global_store_transaction
                       • Incremented by 1 per L1 line (line is 128B)
                   – uncached_global_load_transaction
                       • Incremented by 1 per group of 1, 2, 4, or 8 transactions
                • Compare:
                   –   32 * instructions_issued             /* 32 = warp size */
                   – 128B * (global_store_transaction + l1_global_load_miss)
                                                                                          7
© NVIDIA 2010
CUDA Visual Profiler                     data for memory transfers

             Memory transfer type and direction
             (D=Device, H=Host, A=cuArray)
                e.g. H to D: Host to Device

                      Synchronous / Asynchronous

             Memory transfer size, in bytes

             Stream ID




© NVIDIA Corporation 2010
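
For reference, a minimal sketch (not from the slides) of the kind of copy the profiler would report as an asynchronous host-to-device transfer on a user-created stream; error checking is omitted and the buffer names and size are arbitrary:

#include <cuda_runtime.h>

void asyncCopyExample()
{
    float *h_buf, *d_buf;
    size_t bytes = 1 << 20;
    cudaStream_t stream;

    cudaMallocHost((void **)&h_buf, bytes);     // pinned host memory, needed for truly async copies
    cudaMalloc((void **)&d_buf, bytes);
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);              // wait for the copy before using d_buf

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
}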
CUDA Visual Profiler   data for kernels




© NVIDIA Corporation 2010
CUDA Visual Profiler                    computed data for kernels
            Instruction throughput: Ratio of achieved instruction rate to peak single issue instruction rate

            Global memory read throughput (Gigabytes/second)

            Global memory write throughput (Gigabytes/second)

            Overall global memory access throughput (Gigabytes/second)

            Global memory load efficiency

            Global memory store efficiency




© NVIDIA Corporation 2010
CUDA Visual Profiler                data analysis views
             Views:
                 Summary table
                 Kernel table
                 Memcopy table
                 Summary plot
                 GPU Time Height plot
                 GPU Time Width plot
                 Profiler counter plot
                 Profiler table column plot
                 Multi-device plot
                 Multi-stream plot
             Analyze profiler counters
             Analyze kernel occupancy

© NVIDIA Corporation 2010
CUDA Visual Profiler                      Misc.
             Multiple sessions

             Compare views for different sessions

             Comparison Summary plot

             Profiler projects   save & load

            Import/Export profiler data
            (.CSV format)




© NVIDIA Corporation 2010
Scared ?
      meh!!!! I don’t like to
              profile
Modified source code
Analysis with Modified Source Code
                • Time memory-only and math-only versions of the kernel
                  – Easier for codes that don’t have data-dependent control-flow or
                    addressing
                  – Gives you good estimates for:
                     • Time spent accessing memory
                     • Time spent in executing instructions

                • Comparing the times for modified kernels
                  – Helps decide whether the kernel is mem or math bound
                  – Shows how well memory operations are overlapped with arithmetic
                     • Compare the sum of mem-only and math-only times to full-kernel time
                                                                                             9
© NVIDIA 2010
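
A minimal sketch of the technique, assuming a trivially simple kernel (the real modification depends on your code); the names are illustrative, and the flag argument keeps the compiler from removing the arithmetic in the math-only variant:

__global__ void fullKernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i];                    // memory: load
        v = v * v + 1.0f;                   // math
        out[i] = v;                         // memory: store
    }
}

__global__ void mathOnlyKernel(float *out, float seed, int n, int flag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = seed;                     // runtime scalar replaces the load
        v = v * v + 1.0f;                   // same arithmetic as the full kernel
        if (flag) out[i] = v;               // launch with flag = 0: the store never
                                            // executes, but the math cannot be removed
    }
}

__global__ void memOnlyKernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i];                     // loads and stores only, math removed
    }
}

Time all three and compare: if mem-only plus math-only is close to the full-kernel time, there is little overlap; if the full time is close to the larger of the two, overlap is good.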
Scared ?

           I want to believe...
Some Example Scenarios
(bar charts comparing mem-only, math-only, and full-kernel times for four kernels)

          Memory-bound: good mem-math overlap, latency not a problem
          (assuming memory throughput is not low compared to HW theory)

          Math-bound: good mem-math overlap, latency not a problem
          (assuming instruction throughput is not low compared to HW theory)

          Balanced: good mem-math overlap, latency not a problem
          (assuming memory/instruction throughput is not low compared to HW theory)

          Memory and latency bound: poor mem-math overlap, latency is a problem

          Memory bound ?   Math bound ?   Latency bound ?
                                                                                                          13
© NVIDIA 2010
Argn&%#$... too many optimizations !!!
Parameterize Your Application

       Parameterization helps adaptation to different GPUs

       GPUs vary in many ways
                 # of multiprocessors
                 Memory bandwidth
                 Shared memory size
                 Register file size
                 Max. threads per block

       You can even make apps self-tuning (like FFTW and
       ATLAS)
                 “Experiment” mode discovers and saves optimal
                 configuration

© NVIDIA Corporation 2008                                        67
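
A hedged sketch of what parameterization can look like in practice (the kernel, function names, and candidate block sizes are illustrative): the block size is a compile-time template parameter, and the host picks among a few instantiations, for example after a one-off timing experiment.

template <int BLOCK_SIZE>
__global__ void scale(float *data, float alpha, int n)
{
    int i = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    if (i < n) data[i] *= alpha;
}

void launchScale(float *data, float alpha, int n, int blockSize)
{
    int grid = (n + blockSize - 1) / blockSize;
    switch (blockSize) {                    // candidate configurations to benchmark
        case 128: scale<128><<<grid, 128>>>(data, alpha, n); break;
        case 256: scale<256><<<grid, 256>>>(data, alpha, n); break;
        case 512: scale<512><<<grid, 512>>>(data, alpha, n); break;
    }
}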
More ?
•   Next week:
    GPU “Scripting”, Meta-programming, Auto-tuning
•   Thu 3/31/11:
    PyOpenCL (A. Klöckner, NYU), cl.oquence (C. Omar, CMU)
•   Tue 3/29/11:
    Algorithm Strategies (W. Hwu, UIUC)
•   Tue 4/5/11:
    Analysis-driven Optimization (C. Woolley, NVIDIA)
•   Thu 4/7/11:
    Irregular Parallelism & Efficient Data Structures (J.Owens, UCDavis)
•   Thu 4/14/11:
    Optimization for Ninjas (D. Merrill, UVirg)
•   ...
one more thing
           or two...
Life/Code Hacking #2.x
                Speed {listen,read,writ}ing




accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
Life/Code Hacking #2.2
                                                 Speed writing




accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
Life/Code Hacking #2.2
                                                       Speed writing
http://steve-yegge.blogspot.com/2008/09/programmings-dirtiest-little-secret.html




      accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
Life/Code Hacking #2.2
                                                   Speed writing
Typing tutors: gtypist, ktouch, typingweb.com, etc.




  accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
Life/Code Hacking #2.2
                                                 Speed writing




                       Kinesis Advantage (QWERTY/DVORAK)
accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
Demo
CO ME

Massively Parallel Computing CS 264 Lecture 5: Memory Optimizations

  • 1. Massively Parallel Computing CS 264 / CSCI E-292 Lecture #5: Advanced CUDA | February 22th, 2011 Nicolas Pinto (MIT, Harvard) pinto@mit.edu
  • 2. Administrivia • HW2: out, due Mon 3/14/11 (not Fri 3/11/11) • Projects: think about it, consult the staff (*), proposals due ~ Fri 3/25/11 • Guest lectures: • schedule coming soon • on Fridays 7.35-9.35pm (March, April) ?
  • 3. During this course, r CS264 adapted fo we’ll try to “ ” and use existing material ;-)
  • 5. Outline 1. Hardware Review 2. Memory/Communication Optimizations 3. Threading/Execution Optimizations
  • 7. 10-Series Architecture 240 thread processors execute kernel threads 30 multiprocessors, each contains 8 thread processors One double-precision unit Shared memory enables thread cooperation Multiprocessor Thread Processors Double Shared Memory © NVIDIA Corporation 2008 8
  • 8. Threading Hierarchy Execution Model Software Hardware Threads are executed by thread Thread processors Processor Thread Thread blocks are executed on multiprocessors Thread blocks do not migrate Several concurrent thread blocks can Thread reside on one multiprocessor - limited Block Multiprocessor by multiprocessor resources (shared memory and register file) A kernel is launched as a grid of thread blocks ... Only one kernel can execute on a Grid device at one time Device © 2008 NVIDIA Corporation.
  • 9. Warps and Half Warps 32 Threads A thread block consists of 32- thread warps ... = 32 Threads A warp is executed physically in 32 Threads parallel (SIMD) on a Thread Block Warps Multiprocessor multiprocessor DRAM A half-warp of 16 threads can coordinate global memory 16 16 Global accesses into a single transaction Half Warps Local Device Memory © NVIDIA Corporation 2008 10
  • 10. Memory Architecture Host Device GPU CPU DRAM Multiprocessor Registers Local Multiprocessor Shared Memory Registers Multiprocessor Chipset Shared Memory Registers Global Shared Memory DRAM Constant Constant and Texture Caches Texture © NVIDIA Corporation 2008 11
  • 11. Kernel Memory Access Kernel Memory Access Per-thread Registers On-chip Thread Local Memory Off-chip, uncached Per-block Shared • On-chip, small Block • Fast Memory Per-device Kernel 0 ... • Off-chip, large • Uncached Global • Persistent across Time Memory kernel launches Kernel 1 ... • Kernel I/O
  • 12. Global Memory Kernel Memory Access • Different types of “global memory” Per-thread Registers On-chip • Linear Memory Thread Local Memory Off-chip, uncached • Texture Per-block Memory • Constant Memory Block • • Shared Memory On-chip, small Fast Per-device Kernel 0 ... • Off-chip, large • Uncached Global • Persistent across Time Memory kernel launches Kernel 1 ... • Kernel I/O
  • 13. Memory Architecture Memory Location Cached Access Scope Lifetime Register On-chip N/A R/W One thread Thread Local Off-chip No R/W One thread Thread Shared On-chip N/A R/W All threads in a block Block Global Off-chip No R/W All threads + host Application Constant Off-chip Yes R All threads + host Application Texture Off-chip Yes R All threads + host Application © NVIDIA Corporation 2008 12
  • 14. 2. Memory/Communication Optimizations
  • 15. Revie w 2.1 Host/Device Transfer Optimizations
  • 16. Rev ie w PC Architecture 8 GB/s >?@ ?>L9G=2%&66"K16 J%+8#"F7(&"K16 H%'2$7,6">'%("I" A+%#$)%7(B& F+1#$)%7(B& >@C! E&.+%/"K16 ?>L"K16 3+ Gb/s CD!E F!:! G#$&%8&# ! 160+ GB/s to VRAM 25+ GB/s modified from Matthew Bolitho
  • 17. Rev ie w The PCI-“not-so”-e Bus • PCIe bus is slow • Try to minimize/group transfers • Use pinned memory on host whenever possible • Try to perform copies asynchronously (e.g. Streams) • Use “Zero-Copy” when appropriate • Examples in the SDK (e.g. bandwidthTest)
  • 18. 2.2 Device Memory Optimizations
  • 19. Definitions • gmem: global memory • smem: shared memory • tmem: texture memory • cmem: constant memory • bmem: binary code (cubin) memory ?!? (covered next week)
  • 21. Matrix Transpose Transpose 2048x2048 matrix of floats Performed out-of-place Separate input and output matrices Use tile of 32x32 elements, block of 32x8 threads Each thread processes 4 matrix elements In general tile and block size are fair game for optimization Process Get the right answer Measure effective bandwidth (relative to theoretical or reference case) Address global memory coalescing, shared memory bank conflicts, and partition camping while repeating above steps © NVIDIA Corporation 2008 22
  • 22. Theoretical Bandwidth Device Bandwidth of GTX 280 DDR 1107 * 10^6 * (512 / 8) * 2 / 1024^3 = 131.9 GB/s Memory Memory clock (Hz) interface (bytes) Specs report 141 GB/s Use 10^9 B/GB conversion rather than 1024^3 Whichever you use, be consistent © NVIDIA Corporation 2008 23
  • 23. Effective Bandwidth Transpose Effective Bandwidth 2048^2 * 4 B/element / 1024^3 * 2 / (time in secs) = GB/s Matrix size Read and (bytes) write Reference Case - Matrix Copy Transpose operates on tiles - need better comparison than raw device bandwidth Look at effective bandwidth of copy that uses tiles © NVIDIA Corporation 2008 24
  • 24. Matrix Copy Kernel __global__ void copy(float *odata, float *idata, int width, int height) { int xIndex = blockIdx.x * TILE_DIM + threadIdx.x; int yIndex = blockIdx.y * TILE_DIM + threadIdx.y; int index = xIndex + width*yIndex; for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) { odata[index+i*width] = idata[index+i*width]; } TILE_DIM = 32 } BLOCK_ROWS = 8 idata odata 32x32 tile 32x8 thread block idata and odata in global memory Elements copied by a half-warp of threads © NVIDIA Corporation 2008 25
  • 25. Matrix Copy Kernel Timing Measure elapsed time over loop Looping/timing done in two ways: Over kernel launches (nreps = 1) Includes launch/indexing overhead Within the kernel over loads/stores (nreps > 1) Amortizes launch/indexing overhead __global__ void copy(float *odata, float* idata, int width, int height, int nreps) { int xIndex = blockIdx.x * TILE_DIM + threadIdx.x; int yIndex = blockIdx.y * TILE_DIM + threadIdx.y; int index = xIndex + width*yIndex; for (int r = 0; r < nreps; r++) { for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) { odata[index+i*width] = idata[index+i*width]; } } } © NVIDIA Corporation 2008 26
  • 26. Naïve Transpose Similar to copy Input and output matrices have different indices __global__ void transposeNaive(float *odata, float* idata, int width, int height, int nreps) { int xIndex = blockIdx.x * TILE_DIM + threadIdx.x; int yIndex = blockIdx.y * TILE_DIM + threadIdx.y; int index_in = xIndex + width * yIndex; int index_out = yIndex + height * xIndex; for (int r=0; r < nreps; r++) { for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) { odata[index_out+i] = idata[index_in+i*width]; } } idata odata } © NVIDIA Corporation 2008 27
  • 27. Effective Bandwidth Effective Bandwidth (GB/s) 2048x2048, GTX 280 Loop over Loop in kernel kernel Simple Copy 96.9 81.6 Naïve 2.2 2.2 Transpose © NVIDIA Corporation 2008 28
  • 28.
  • 30. Memory Coalescing GPU memory controller granularity is 64 or 128 bytes Must also be 64 or 128 byte aligned Suppose thread loads a float (4 bytes) Controller loads 64 bytes, throws 60 bytes away
  • 31. Memory Coalescing Memory controller actually more intelligent Consider half-warp (16 threads) Suppose each thread reads consecutive float Memory controller will perform one 64 byte load This is known as coalescing Make threads read consecutive locations
  • 32. Coalescing Global memory access of 32, 64, or 128-bit words by a half- warp of threads can result in as few as one (or two) transaction(s) if certain access requirements are met Depends on compute capability 1.0 and 1.1 have stricter access requirements Examples – float (32-bit) data Global Memory } 64B aligned segment (16 floats) } 128B aligned segment (32 floats) Half-warp of threads © NVIDIA Corporation 2008 30
  • 33. Coalescing Compute capability 1.0 and 1.1 K-th thread must access k-th word in the segment (or k-th word in 2 contiguous 128B segments for 128-bit words), not all threads need to participate Coalesces – 1 transaction Out of sequence – 16 transactions Misaligned – 16 transactions © NVIDIA Corporation 2008 31
  • 34. Memory Coalescing GT200 has hardware coalescer Inspects memory requests from each half-warp Determines minimum set of transactions which are 64 or 128 bytes long 64 or 128 byte aligned
  • 35. Coalescing Compute capability 1.2 and higher (e.g. GT200 like the C1060) Coalescing is achieved for any pattern of addresses that fits into a segment of size: 32B for 8-bit words, 64B for 16-bit words, 128B for 32- and 64-bit words Smaller transactions may be issued to avoid wasted bandwidth due to unused words 1 transaction - 64B segment 1 transaction - 128B segment 2 transactions - 64B and 32B segments © NVIDIA Corporation 2008 32
  • 36. Coalescing Compute capability 2.0 (Fermi, Tesla C2050) Memory transactions handled per warp (32 threads) L1 cache ON: Issues always 128B segment transactions caches them in 16kB or 48kB L1 cache per multiprocessor 2 transactions - 2 x 128B segment - but next warp probably only 1 extra transaction, due to L1 cache. L1 cache OFF: Issues always 32B segment transactions E.g. advantage for widely scattered thread accesses 32 transactions - 32 x 32B segments, instead of 32 x 128B segments. © NVIDIA Corporation 2010
  • 37. Coalescing Summary Coalescing dramatically speeds global memory access Strive for perfect coalescing: Align starting address (may require padding) A warp should access within a contiguous region
  • 38. Coalescing in Transpose Naïve transpose coalesces reads, but not writes idata odata Elements transposed by a half-warp of threads Q: How to coalesce writes ? © NVIDIA Corporation 2008 33
  • 39. smem as a cache
  • 40. Shared Memory SMs can access gmem at 80+ GiB/sec but have hundreds of cycles of latency Each SM has 16 kiB ‘shared’ memory Essentially user-managed cache Speed comparable to registers Accessible to all threads in a block Reduces load/stores to device memory
  • 41. Shared Memory ~Hundred times faster than global memory Cache data to reduce global memory accesses Threads can cooperate via shared memory Use it to avoid non-coalesced access Stage loads and stores in shared memory to re-order non- coalesceable addressing © NVIDIA Corporation 2008 34
  • 42. Coalescing in Transpose Naïve transpose coalesces reads, but not writes idata odata Elements transposed by a half-warp of threads Q: How to coalesce writes ? © NVIDIA Corporation 2008 33
  • 43. Shared Memory ~Hundred times faster than global memory Cache data to reduce global memory accesses Threads can cooperate via shared memory Use it to avoid non-coalesced access Stage loads and stores in shared memory to re-order non- coalesceable addressing © NVIDIA Corporation 2008 34
  • 44. A Common Programming Strategy !   Partition data into subsets that fit into shared memory © 2008 NVIDIA Corporation
  • 45. A Common Programming Strategy !   Handle each data subset with one thread block © 2008 NVIDIA Corporation
  • 46. A Common Programming Strategy !   Load the subset from global memory to shared memory, using multiple threads to exploit memory- level parallelism © 2008 NVIDIA Corporation
  • 47. A Common Programming Strategy !   Perform the computation on the subset from shared memory © 2008 NVIDIA Corporation
  • 48. A Common Programming Strategy !   Copy the result from shared memory back to global memory © 2008 NVIDIA Corporation
  • 49. Coalescing through shared memory Access columns of a tile in shared memory to write contiguous data to global memory Requires __syncthreads() since threads write data read by other threads idata odata tile Elements transposed by a half-warp of threads © NVIDIA Corporation 2008 35
  • 50. Coalescing through shared memory __global__ void transposeCoalesced(float *odata, float *idata, int width, int height, int nreps) { __shared__ float tile[TILE_DIM][TILE_DIM]; int xIndex = blockIdx.x * TILE_DIM + threadIdx.x; int yIndex = blockIdx.y * TILE_DIM + threadIdx.y; int index_in = xIndex + (yIndex)*width; xIndex = blockIdx.y * TILE_DIM + threadIdx.x; yIndex = blockIdx.x * TILE_DIM + threadIdx.y; int index_out = xIndex + (yIndex)*height; for (int r=0; r < nreps; r++) { for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) { tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*width]; } __syncthreads(); for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) { odata[index_out+i*height] = tile[threadIdx.x][threadIdx.y+i]; } } } © NVIDIA Corporation 2008 36
  • 51. Effective Bandwidth Effective Bandwidth (GB/s) 2048x2048, GTX 280 Loop over kernel Loop in kernel Simple Copy 96.9 81.6 Uses shared memory tile Shared Memory Copy 80.9 81.1 and Naïve Transpose 2.2 2.2 __syncthreads() Coalesced Transpose 16.5 17.1 © NVIDIA Corporation 2008 37
  • 52.
  • 54. Shared Memory Architecture Many threads accessing memory Therefore, memory is divided into banks Successive 32-bit words assigned to successive banks Each bank can service one address per cycle Bank 0 A memory can service as many simultaneous Bank 1 accesses as it has banks Bank 2 Bank 3 Bank 4 Multiple simultaneous accesses to a bank Bank 5 result in a bank conflict Bank 6 Conflicting accesses are serialized Bank 7 Bank 15 © NVIDIA Corporation 2008 39
  • 55. Shared Memory Banks Bank 0 0 16 Bank 1 1 17 Bank 2 2 18 Shared memory divided Bank 3 Bank 4 3 4 19 20 into 16 ‘banks’ Bank 5 5 21 Bank 6 6 22 Shared memory is (almost) Bank 7 7 23 as fast as registers (...) Bank 8 8 24 Bank 9 9 25 Exception is in case of Bank 10 Bank 11 10 11 26 27 bank conflicts Bank 12 12 28 Bank 13 13 29 Bank 14 14 30 Bank 15 15 31 4 bytes
  • 56. Bank Addressing Examples No Bank Conflicts No Bank Conflicts Linear addressing Random 1:1 Permutation stride == 1 Thread 0 Bank 0 Thread 0 Bank 0 Thread 1 Bank 1 Thread 1 Bank 1 Thread 2 Bank 2 Thread 2 Bank 2 Thread 3 Bank 3 Thread 3 Bank 3 Thread 4 Bank 4 Thread 4 Bank 4 Thread 5 Bank 5 Thread 5 Bank 5 Thread 6 Bank 6 Thread 6 Bank 6 Thread 7 Bank 7 Thread 7 Bank 7 Thread 15 Bank 15 Thread 15 Bank 15 © NVIDIA Corporation 2008 40
  • 57. Bank Addressing Examples 2-way Bank Conflicts 8-way Bank Conflicts Linear addressing Linear addressing stride == 2 stride == 8 x8 Thread 0 Bank 0 Thread 0 Bank 0 Thread 1 Bank 1 Thread 1 Bank 1 Thread 2 Bank 2 Thread 2 Bank 2 Thread 3 Bank 3 Thread 3 Thread 4 Bank 4 Thread 4 Bank 5 Thread 5 Bank 7 Bank 6 Thread 6 Bank 8 Bank 7 Thread 7 Bank 9 Thread 8 x8 Thread 9 Thread 10 Thread 11 Bank 15 Thread 15 Bank 15 © NVIDIA Corporation 2008 41
  • 58. Shared memory bank conflicts Shared memory is ~ as fast as registers if there are no bank conflicts warp_serialize profiler signal reflects conflicts The fast case: If all threads of a half-warp access different banks, there is no bank conflict If all threads of a half-warp read the identical address, there is no bank conflict (broadcast) The slow case: Bank Conflict: multiple threads in the same half-warp access the same bank Must serialize the accesses Cost = max # of simultaneous accesses to a single bank © NVIDIA Corporation 2008 42
  • 59. Bank Conflicts in Transpose 32x32 shared memory tile of floats Data in columns k and k+16 are in same bank 16-way bank conflict reading half columns in tile Solution - pad shared memory array __shared__ float tile[TILE_DIM][TILE_DIM+1]; Q: How to avoid bank conflicts ? Data in anti-diagonals are in same bank idata odata tile © NVIDIA Corporation 2008 43
  • 60. Bank Conflicts in Transpose 32x32 shared memory tile of floats Data in columns k and k+16 are in same bank 16-way bank conflict reading half columns in tile Solution - pad shared memory array __shared__ float tile[TILE_DIM][TILE_DIM+1]; Data in anti-diagonals are in same bank idata odata tile © NVIDIA Corporation 2008 43
  • 61. Illustration !"#$%&'(%)*$+,'-.*/&/01'2#03'4*056/78 • :;<:; !(=('#$$#+ • >#$?'#77%99%9'#'7*6@)0, – !"#$%&'(%)*'+,)-./+01'20345%61'/)'%'$%47'%++511'035'1%85'(%)*9 warps: 0 1 2 31 0 1 2 31 Bank 0 Bank 1 0 1 2 31 … 0 1 2 31 Bank 31 0 1 2 31 NVIDIA 2010
  • 62. Illustration !"#$%&'(%)*$+,'-.*/&/01'2#03'4*056/789 • -&&'#'7*6:)0'5*$';#&&/01, – !"#!! $%&%'())(* • <#$;'#77%99%9'#'7*6:)0, – !" +,--.)./0'1(/234'/5'1(/2'65/-7,603 warps: 0 1 2 31 padding 0 1 2 31 Bank 0 Bank 1 0 1 2 31 … 0 1 2 31 Bank 31 0 1 2 31 © NVIDIA 2010
  • 63. Effective Bandwidth Effective Bandwidth (GB/s) 2048x2048, GTX 280 Loop over Loop in kernel kernel Simple Copy 96.9 81.6 Shared Memory Copy 80.9 81.1 Naïve Transpose 2.2 2.2 Coalesced Transpose 16.5 17.1 Bank Conflict Free Transpose 16.6 17.2 © NVIDIA Corporation 2008 44
  • 68. Partition Camping Global memory accesses go through partitions 6 partitions on 8-series GPUs, 8 partitions on 10-series GPUs Successive 256-byte regions of global memory are assigned to successive partitions For best performance: Simultaneous global memory accesses GPU-wide should be distributed evenly amongst partitions Partition Camping occurs when global memory accesses at an instant use a subset of partitions Directly analogous to shared memory bank conflicts, but on a larger scale © NVIDIA Corporation 2008 46
  • 69. Partition Camping in Transpose Partition width = 256 bytes = 64 floats Twice width of tile On GTX280 (8 partitions), data 2KB apart map to same partition 2048 floats divides evenly by 2KB => columns of matrices map to same partition idata odata tiles in matrices 0 1 2 3 4 5 0 64 128 colors = partitions 64 65 66 67 68 69 1 65 129 128 129 130 ... 2 66 130 3 67 ... 4 68 5 69 blockId = gridDim.x * blockIdx.y + blockIdx.x © NVIDIA Corporation 2008 47
• 70. Partition Camping Solutions. Pad matrices (by two tiles); in general this might be expensive/prohibitive memory-wise. Or diagonally reorder blocks: interpret blockIdx.y as different diagonal slices and blockIdx.x as distance along a diagonal. [diagram: diagonal block ordering across idata and odata, colors = partitions] blockId = gridDim.x * blockIdx.y + blockIdx.x © NVIDIA Corporation 2008 48
• 71. Diagonal Transpose.
    __global__ void transposeDiagonal(float *odata, float *idata,
                                      int width, int height, int nreps)
    {
      __shared__ float tile[TILE_DIM][TILE_DIM+1];

      // Add lines to map diagonal to Cartesian coordinates:
      int blockIdx_y = blockIdx.x;
      int blockIdx_x = (blockIdx.x+blockIdx.y)%gridDim.x;

      // Replace blockIdx.x with blockIdx_x, blockIdx.y with blockIdx_y:
      int xIndex = blockIdx_x * TILE_DIM + threadIdx.x;
      int yIndex = blockIdx_y * TILE_DIM + threadIdx.y;
      int index_in = xIndex + (yIndex)*width;

      xIndex = blockIdx_y * TILE_DIM + threadIdx.x;
      yIndex = blockIdx_x * TILE_DIM + threadIdx.y;
      int index_out = xIndex + (yIndex)*height;

      for (int r=0; r < nreps; r++) {
        for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
          tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*width];
        }
        __syncthreads();
        for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
          odata[index_out+i*height] = tile[threadIdx.x][threadIdx.y+i];
        }
      }
    }
© NVIDIA Corporation 2008 49
• 72. Diagonal Transpose. The previous slide is for square matrices (width == height). More generally:
    if (width == height) {
      blockIdx_y = blockIdx.x;
      blockIdx_x = (blockIdx.x+blockIdx.y)%gridDim.x;
    } else {
      int bid = blockIdx.x + gridDim.x*blockIdx.y;
      blockIdx_y = bid%gridDim.y;
      blockIdx_x = ((bid/gridDim.y)+blockIdx_y)%gridDim.x;
    }
© NVIDIA Corporation 2008 50
• 73. Effective Bandwidth (GB/s), 2048x2048, GTX 280:
                                  Loop over kernel   Loop in kernel
   Simple Copy                          96.9               81.6
   Shared Memory Copy                   80.9               81.1
   Naïve Transpose                       2.2                2.2
   Coalesced Transpose                  16.5               17.1
   Bank Conflict Free Transpose         16.6               17.2
   Diagonal                             69.5               78.3
© NVIDIA Corporation 2008 51
• 74. Order of Optimizations. Larger optimization issues can mask smaller ones, and the proper order of some optimization techniques is not known a priori (e.g. partition camping is problem-size dependent). Don't dismiss an optimization technique as ineffective until you know it was applied at the right time. [diagram: Naïve 2.2 GB/s -> Coalescing 16.5 GB/s -> Bank Conflicts 16.6 GB/s -> Partition Camping 69.5 GB/s; alternatively Coalescing -> Partition Camping 48.8 GB/s -> Bank Conflicts 69.5 GB/s] © NVIDIA Corporation 2008 52
• 75. Transpose Summary. Coalescing and shared memory bank conflicts are small-scale phenomena: they deal with memory accesses within a half-warp and are problem-size independent. Partition camping is a large-scale phenomenon: it deals with simultaneous memory accesses by warps on different multiprocessors and is problem-size dependent (you wouldn't see it in a (2048+32)^2 matrix). Coalescing is generally the most critical. SDK Transpose Example: http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html © NVIDIA Corporation 2008 53
  • 76. tmem
  • 77. Textures in CUDA Texture is an object for reading data Benefits: Data is cached (optimized for 2D locality) Helpful when coalescing is a problem Filtering Linear / bilinear / trilinear Dedicated hardware Wrap modes (for “out-of-bounds” addresses) Clamp to edge / repeat Addressable in 1D, 2D, or 3D Using integer or normalized coordinates Usage: CPU code binds data to a texture object Kernel reads data by calling a fetch function © NVIDIA Corporation 2008 55
• 78. Other goodies. Optional "format conversion": {char, short, int, half (16-bit)} to float (32-bit), "for free"; useful for *mem compression (see later).
• 79. Texture Addressing. [diagram: 4-row texture with sample coordinates (2.5, 0.5) and (1.0, 1.0)] Wrap: out-of-bounds coordinate is wrapped (modulo arithmetic). Clamp: out-of-bounds coordinate is replaced with the closest boundary. [diagrams: fetching (5.5, 1.5) under Wrap and under Clamp] © NVIDIA Corporation 2008 56
  • 80. Two CUDA Texture Types Bound to linear memory Global memory address is bound to a texture Only 1D Integer addressing No filtering, no addressing modes Bound to CUDA arrays CUDA array is bound to a texture 1D, 2D, or 3D Float addressing (size-based or normalized) Filtering Addressing modes (clamping, repeat) Both: Return either element type or normalized float © NVIDIA Corporation 2008 57
  • 81. CUDA Texturing Steps Host (CPU) code: Allocate/obtain memory (global linear, or CUDA array) Create a texture reference object Currently must be at file-scope Bind the texture reference to memory/array When done: Unbind the texture reference, free resources Device (kernel) code: Fetch using texture reference Linear memory textures: tex1Dfetch() Array textures: tex1D() or tex2D() or tex3D() © NVIDIA Corporation 2008 58
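A minimal sketch of those steps for a linear-memory texture (the kernel and names are illustrative; the calls are the standard CUDA runtime texture API of this era):

    // File scope: the texture reference.
    texture<float, 1, cudaReadModeElementType> texIn;

    __global__ void scale(float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = 2.0f * tex1Dfetch(texIn, i);   // read through the texture cache
    }

    void run(float *d_in, float *d_out, int n)
    {
        cudaBindTexture(0, texIn, d_in, n * sizeof(float));  // bind global memory
        scale<<<(n + 255) / 256, 256>>>(d_out, n);           // kernel fetches via texIn
        cudaUnbindTexture(texIn);                            // unbind when done
    }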
  • 82. cmem
• 83. Constant Memory. Ideal for coefficients and other data that is read uniformly by warps. Data is stored in global memory, read through a constant-cache: __constant__ qualifier in declarations; can only be read by GPU kernels; limited to 64KB. Fermi adds uniform accesses: kernel pointer argument qualified with const; compiler must determine that all threads in a threadblock will dereference the same address; no limit on array size, can use any global memory pointer. Constant cache throughput: 32 bits per warp per 2 clocks per multiprocessor; to be used when all threads in a warp read the same address, serializes otherwise. © NVIDIA 2010
• 84. Constant Memory (example). Same rules as the previous slide, illustrated with a const kernel argument (host-side setup for the __constant__ path is sketched below):
    __global__ void kernel( const float *g_a )
    {
      ...
      float x = g_a[15];             // uniform
      float y = g_a[blockIdx.x+5];   // uniform
      float z = g_a[threadIdx.x];    // non-uniform
      ...
    }
© NVIDIA 2010
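For the __constant__ path, the host fills the constant bank with cudaMemcpyToSymbol; a minimal sketch (the filter kernel and coeffs are illustrative):

    __constant__ float coeffs[16];               // lives in the 64KB constant space

    __global__ void filter(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n - 16) return;
        float acc = 0.0f;
        for (int k = 0; k < 16; k++)             // all threads read the same address,
            acc += coeffs[k] * in[i + k];        // so the constant cache broadcasts it
        out[i] = acc;
    }

    // Host side:
    //     float h_coeffs[16] = { /* ... */ };
    //     cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));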
• 85. Constant Memory. Kernel executes 10K threads (320 warps) per SM during its lifetime. All threads access the same 4B word. Using GMEM: each warp fetches 32B => 10KB of bus traffic; caching loads potentially worse, a 128B line is very likely to be evicted multiple times. [diagram: addresses 0, 32, 64, ..., 448 from a warp] © NVIDIA 2010
• 86. Constant Memory. Kernel executes 10K threads (320 warps) per SM during its lifetime. All threads access the same 4B word. Using constant/uniform access: first warp fetches 64 bytes; all others hit in the constant cache => 64 bytes of bus traffic; unlikely to be evicted over the kernel lifetime, since other loads do not go through this cache. [diagram: addresses 0, 32, 64, ..., 448 from a warp] © NVIDIA 2010
• 88. Optimizing with Compression. When all else has been optimized and the kernel is limited by the number of bytes needed, consider compression. Approaches: Int, conversion between 8-, 16-, 32-bit integers is 1 instruction (64-bit requires a couple); FP, conversion between fp16, fp32, fp64 is one instruction (fp16 (1s5e10m) is storage only, no math instructions); range-based, lower and upper limits are kernel arguments and data is an index for interpolation. Application in practice: Clark et al., "Solving Lattice QCD systems of equations using mixed precision solvers on GPUs", http://arxiv.org/abs/0911.3191 © NVIDIA 2010
• 89. "Accelerating GPU computation through mixed-precision methods", Michael Clark, Harvard-Smithsonian Center for Astrophysics, Harvard University. SC'10
• 90. ... too much ? bank conflicts, coalescing, caching, mixed precision, partition camping, clamping, broadcasting, zero-copy, streams
• 91. Parallel Programming is Hard (but you'll pick it up)
  • 92. (you are not alone)
  • 93. 3. Threading/Execution Optimizations
  • 94. 3.1 Exec. Configuration Optimizations
• 95. Occupancy. Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy. Occupancy = number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently (e.g. 24 resident warps out of the 32 possible on compute capability 1.3 hardware = 75% occupancy). Limited by resource usage: registers, shared memory. © NVIDIA Corporation 2008 60
  • 96. Grid/Block Size Heuristics # of blocks > # of multiprocessors So all multiprocessors have at least one block to execute # of blocks / # of multiprocessors > 2 Multiple blocks can run concurrently in a multiprocessor Blocks that aren’t waiting at a __syncthreads() keep the hardware busy Subject to resource availability – registers, shared memory # of blocks > 100 to scale to future devices Blocks executed in pipeline fashion 1000 blocks per grid will scale across multiple generations © NVIDIA Corporation 2008 61
• 97. Register Dependency. Read-after-write register dependency: an instruction's result can be read ~24 cycles later. Scenarios (CUDA -> PTX):
    x = y + 5;        ->  add.f32 $f3, $f1, $f2
    z = x + 3;        ->  add.f32 $f5, $f3, $f4
    s_data[0] += 3;   ->  ld.shared.f32 $f3, [$r31+0]
                          add.f32 $f3, $f3, $f4
To completely hide the latency: run at least 192 threads (6 warps) per multiprocessor, i.e. at least 25% occupancy (1.0/1.1) or 18.75% (1.2/1.3). The threads do not have to belong to the same thread block. © NVIDIA Corporation 2008 62
• 98. Register Pressure. Hide latency by using more threads per SM. Limiting factors: number of registers per kernel (8K/16K per SM, partitioned among concurrent threads) and amount of shared memory (16KB per SM, partitioned among concurrent threadblocks). Compile with the --ptxas-options=-v flag; use the -maxrregcount=N flag to NVCC (N = desired maximum registers / kernel). At some point "spilling" into local memory may occur, which reduces performance since local memory is slow (see the sketch below). © NVIDIA Corporation 2008 63
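A sketch of how these flags fit together (the kernel is illustrative; the exact ptxas report format varies by toolkit version):

    // Build with:
    //     nvcc --ptxas-options=-v -maxrregcount=16 saxpy.cu
    // ptxas then reports per-kernel register and shared memory usage,
    // and caps this kernel at 16 registers (spilling to lmem if needed).
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }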
  • 99. Occupancy Calculator © NVIDIA Corporation 2008 64
  • 100. Optimizing threads per block Choose threads per block as a multiple of warp size Avoid wasting computation on under-populated warps Want to run as many warps as possible per multiprocessor (hide latency) Multiprocessor can run up to 8 blocks at a time Heuristics Minimum: 64 threads per block Only if multiple concurrent blocks 192 or 256 threads a better choice Usually still enough regs to compile and invoke successfully This all depends on your computation, so experiment! © NVIDIA Corporation 2008 65
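A minimal sketch of these heuristics in a launch configuration (the kernel name, d_data and n are illustrative):

    // 256 threads per block: a multiple of the 32-thread warp size,
    // small enough to allow several concurrent blocks per multiprocessor.
    const int threadsPerBlock = 256;
    const int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // cover all n
    kernel<<<numBlocks, threadsPerBlock>>>(d_data, n);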
  • 101. Occupancy != Performance Increasing occupancy does not necessarily increase performance BUT … Low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels (It all comes down to arithmetic intensity and available parallelism) © NVIDIA Corporation 2008 66
• 102. [title slide, garbled in extraction: a GTC'10 talk on occupancy vs. performance] GTC'10
  • 103. Occupancy != Performance Increasing occupancy does not necessarily increase performance BUT … Low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels (It all comes down to arithmetic intensity and available parallelism) © NVIDIA Corporation 2008 66
  • 104. 3.2 Instruction Optimizations
  • 105. CUDA Instruction Performance Instruction cycles (per warp) = sum of Operand read cycles Instruction execution cycles Result update cycles Therefore instruction throughput depends on Nominal instruction throughput Memory latency Memory bandwidth “Cycle” refers to the multiprocessor clock rate 1.3 GHz on the Tesla C1060, for example © NVIDIA Corporation 2008 69
  • 106. Maximizing Instruction Throughput Maximize use of high-bandwidth memory Maximize use of shared memory Minimize accesses to global memory Maximize coalescing of global memory accesses Optimize performance by overlapping memory accesses with HW computation High arithmetic intensity programs i.e. high ratio of math to memory transactions Many concurrent threads © NVIDIA Corporation 2008 70
• 107. Arithmetic Instruction Throughput. int and float add, shift, min, max and float mul, mad: 4 cycles per warp. int multiply (*) is 32-bit by default and requires multiple cycles per warp; use the __mul24() / __umul24() intrinsics for 4-cycle 24-bit int multiply. Integer divide and modulo are more expensive; the compiler will convert literal power-of-2 divides to shifts, but we have seen it miss some cases, so be explicit where the compiler can't tell that the divisor is a power of 2! Useful trick: foo % n == foo & (n-1) if n is a power of 2 (see the sketch below). © NVIDIA Corporation 2008 71
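A small sketch of the two tricks above, assuming n is a power of 2 (the kernel shell is illustrative):

    __global__ void index_math(int *out, int n)   // n assumed to be a power of 2
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int scaled = __mul24(tid, 3);     // 24-bit multiply: 4 cycles per warp
        out[tid] = scaled & (n - 1);      // equivalent to scaled % n for power-of-2 n
    }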
• 108. Runtime Math Library. There are two types of runtime math operations in single-precision. __funcf() maps directly to the hardware ISA: fast but lower accuracy (see the programming guide for details); examples: __sinf(x), __expf(x), __powf(x,y). funcf() compiles to multiple instructions: slower but higher accuracy (5 ulp or less); examples: sinf(x), expf(x), powf(x,y). The -use_fast_math compiler option forces every funcf() to compile to __funcf() (see the sketch below). © NVIDIA Corporation 2008 72
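A sketch of the trade-off (x is illustrative; the accuracy figures are the ones quoted above):

    float fast   = __sinf(x) * __expf(x);  // intrinsics: direct ISA mapping, lower accuracy
    float strict = sinf(x)   * expf(x);    // multiple instructions, 5 ulp or less
    // Compiling with -use_fast_math turns the second line into the first.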
• 109. GPU results may not match CPU. Many variables: hardware, compiler, optimization settings. CPU operations aren't strictly limited to 0.5 ulp; sequences of operations can be more accurate due to 80-bit extended precision ALUs. Floating-point arithmetic is not associative! © NVIDIA Corporation 2008 73
• 110. FP Math is Not Associative! In symbolic math, (x+y)+z == x+(y+z). This is not necessarily true for floating-point addition: try x = 10^30, y = -10^30 and z = 1 in the above equation (see the sketch below). When you parallelize computations, you potentially change the order of operations, so parallel results may not exactly match sequential results. This is not specific to GPU or CUDA; it is an inherent part of parallel execution. © NVIDIA Corporation 2008 74
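The slide's example in code (a sketch; the values are the ones above):

    float x = 1e30f, y = -1e30f, z = 1.0f;
    float a = (x + y) + z;   // (1e30 - 1e30) + 1 == 1
    float b = x + (y + z);   // 1e30 + (-1e30)   == 0, the 1 is absorbed by -1e30
    // a != b: reordering a parallel reduction can change the result.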
• 111. Control Flow Instructions. The main performance concern with branching is divergence: threads within a single warp take different paths, and the different execution paths must be serialized. Avoid divergence when the branch condition is a function of the thread ID. Example with divergence: if (threadIdx.x > 2) { }, branch granularity < warp size. Example without divergence: if (threadIdx.x / WARP_SIZE > 2) { }, branch granularity is a whole multiple of warp size (see the sketch below). © NVIDIA Corporation 2008 75
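A sketch contrasting the two branch patterns (WARP_SIZE == 32; the kernel body is illustrative):

    __global__ void branches(float *data)
    {
        // Divergent: threads 0-2 and 3-31 of the same warp take different
        // paths, which are serialized.
        if (threadIdx.x > 2)
            data[threadIdx.x] *= 2.0f;

        // Non-divergent: the condition is uniform within each warp, since
        // the split falls on a warp boundary (threads 0-95 vs. 96 and up).
        if (threadIdx.x / 32 > 2)
            data[threadIdx.x] += 1.0f;
    }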
  • 113. Scared ? Howwwwww?! (do I start)
• 115. Analysis with Profiler. Profiler counters:
   - instructions_issued / instructions_executed: both incremented by 1 per warp; "issued" includes replays, "executed" does not
   - gld_request / gst_request: incremented by 1 per warp for each load/store instruction; an instruction may be counted even if it is "predicated out"
   - l1_global_load_miss / l1_global_load_hit, global_store_transaction: incremented by 1 per L1 line (line is 128B)
   - uncached_global_load_transaction: incremented by 1 per group of 1, 2, 3, or 4 transactions
Compare:
   - 32 * instructions_issued (32 = warp size)
   - 128B * (global_store_transaction + l1_global_load_miss)
© NVIDIA 2010
  • 116. CUDA Visual Profiler data for memory transfers Memory transfer type and direction (D=Device, H=Host, A=cuArray) e.g. H to D: Host to Device Synchronous / Asynchronous Memory transfer size, in bytes Stream ID © NVIDIA Corporation 2010
  • 117. CUDA Visual Profiler data for kernels © NVIDIA Corporation 2010
  • 118. CUDA Visual Profiler computed data for kernels Instruction throughput: Ratio of achieved instruction rate to peak single issue instruction rate Global memory read throughput (Gigabytes/second) Global memory write throughput (Gigabytes/second) Overall global memory access throughput (Gigabytes/second) Global memory load efficiency Global memory store efficiency © NVIDIA Corporation 2010
  • 119. CUDA Visual Profiler data analysis views Views: Summary table Kernel table Memcopy table Summary plot GPU Time Height plot GPU Time Width plot Profiler counter plot Profiler table column plot Multi-device plot Multi-stream plot Analyze profiler counters Analyze kernel occupancy © NVIDIA Corporation 2010
  • 120. CUDA Visual Profiler Misc. Multiple sessions Compare views for different sessions Comparison Summary plot Profiler projects save & load Import/Export profiler data (.CSV format) © NVIDIA Corporation 2010
  • 121. Scared ? meh!!!! I don’t like to profile
• 123. Analysis with Modified Source Code. Time memory-only and math-only versions of the kernel: easier for codes that don't have data-dependent control-flow or addressing; gives you good estimates for time spent accessing memory and time spent executing instructions. Comparing the times for modified kernels helps decide whether the kernel is mem or math bound and shows how well memory operations are overlapped with arithmetic: compare the sum of mem-only and math-only times to the full-kernel time (see the sketch below). © NVIDIA 2010
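A sketch of the source modification (illustrative; the if (flag) trick keeps the compiler from removing the math as dead code once the loads are gone):

    // Full kernel: load, compute, store.
    __global__ void kernel_full(float *out, const float *in)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v = in[i];                 // memory
        for (int k = 0; k < 64; k++)     // math
            v = v * v + 0.5f;
        out[i] = v;                      // memory
    }

    // Math-only: loads replaced by a kernel argument; flag is always 0
    // at runtime, so the store never executes but cannot be optimized away.
    __global__ void kernel_math(float *out, float seed, int flag)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v = seed;
        for (int k = 0; k < 64; k++)
            v = v * v + 0.5f;
        if (flag) out[i] = v;
    }

    // Memory-only: arithmetic removed.
    __global__ void kernel_mem(float *out, const float *in)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }

Timing all three variants then shows how close mem-only + math-only comes to the full-kernel time, i.e. how well the two overlap.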
  • 124. Scared ? I want to believe...
• 125. Some Example Scenarios. [diagram: four timelines comparing mem-only, math-only, and full-kernel times] Memory-bound: good mem-math overlap, latency not a problem (assuming memory throughput is not low compared to HW theory). Math-bound: good mem-math overlap, latency not a problem (assuming instruction throughput is not low compared to HW theory). Balanced: good mem-math overlap, latency not a problem (assuming memory/instruction throughput is not low compared to HW theory). Memory- and latency-bound: poor mem-math overlap, latency is a problem. For each case, ask: memory bound? math bound? latency bound? © NVIDIA 2010
• 128. Argh&%#$... too many optimizations !!!
  • 129. Parameterize Your Application Parameterization helps adaptation to different GPUs GPUs vary in many ways # of multiprocessors Memory bandwidth Shared memory size Register file size Max. threads per block You can even make apps self-tuning (like FFTW and ATLAS) “Experiment” mode discovers and saves optimal configuration © NVIDIA Corporation 2008 67
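One way to expose such parameters is to template kernels over the tunables, so an "experiment" harness can time each instantiation and keep the fastest; a sketch (the reduction kernel is illustrative):

    template <int BLOCK>
    __global__ void reduce_partial(const float *in, float *out, int n)
    {
        __shared__ float s[BLOCK];
        int i = blockIdx.x * BLOCK + threadIdx.x;
        s[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();
        for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                s[threadIdx.x] += s[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0) out[blockIdx.x] = s[0];
    }

    // Experiment mode: benchmark reduce_partial<64>, <128>, <256>, ...
    // on the target GPU and save the best configuration (FFTW/ATLAS-style).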
• 130. More ? • Next week: GPU "Scripting", Meta-programming, Auto-tuning • Thu 3/31/11: PyOpenCL (A. Klöckner, NYU), ahh (C. Omar, CMU) • Tue 3/29/11: Algorithm Strategies (W. Hwu, UIUC) • Tue 4/5/11: Analysis-driven Optimization (C. Woolley, NVIDIA) • Thu 4/7/11: Irregular Parallelism & Efficient Data Structures (J. Owens, UC Davis) • Thu 4/14/11: Optimization for Ninjas (D. Merrill, UVirginia) • ...
  • 131. one more thing or two...
  • 132. Life/Code Hacking #2.x Speed {listen,read,writ}ing accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • 133. Life/Code Hacking #2.2 Speed writing accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • 134. Life/Code Hacking #2.2 Speed writing http://steve-yegge.blogspot.com/2008/09/programmings-dirtiest-little-secret.html accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • 135. Life/Code Hacking #2.2 Speed writing Typing tutors: gtypist, ktouch, typingweb.com, etc. accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • 136. Life/Code Hacking #2.2 Speed writing Kinesis Advantage (QWERTY/DVORAK) accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • 137. Demo
  • 138. CO ME