Massively Parallel Computing
                        CS 264 / CSCI E-292
Lecture #7: GPU Cluster Programming | March 8th, 2011




               Nicolas Pinto (MIT, Harvard)
                      pinto@mit.edu
Administrativia
•   Homeworks: HW2 due Mon 3/14/11, HW3 out Fri 3/11/11

•   Project info: http://www.cs264.org/projects/projects.html

•   Project ideas: http://forum.cs264.org/index.php?board=6.0

•   Project proposal deadline: Fri 3/25/11
    (but you should submit well before then so you can start working on it asap)

•   Need a private repo for your project?

    Let us know! Poll on the forum:
    http://forum.cs264.org/index.php?topic=228.0
Goodies
• Guest Lectures: 14 distinguished speakers
• Schedule updated (see website)
Goodies (cont’d)
• Amazon AWS free credits coming soon
  (only for students who completed HW0+1)


• That’s more than a $14,000 donation to the class!
• Special thanks: Kurt Messersmith @ Amazon
Goodies (cont’d)
• Best Project Prize: Tesla C2070 (Fermi) Board
• That’s more than a $4,000 donation to the class!
• Special thanks:
  David Luebke & Chandra Cheij @ NVIDIA
During this course, we’ll try to

          “                         ”

and use existing material ;-)

                (adapted for CS264)
Today
yey!!
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
The Problem

                Many computational problems too big for single CPU
                         Lack of RAM
                         Lack of CPU cycles
                Want to distribute work between many CPUs




slide by Richard Edgar
Types of Parallelism

                Some computations are ‘embarrassingly parallel’
                Can do a lot of computation on minimal data
                         RC5 DES, SETI@HOME etc.
                Solution is to distribute across the Internet
                         Use TCP/IP or similar



slide by Richard Edgar
Types of Parallelism

                Some computations very tightly coupled
                Have to communicate a lot of data at each step
                         e.g. hydrodynamics
                Internet latencies much too high
                         Need a dedicated machine



slide by Richard Edgar
Tightly Coupled Computing

                Two basic approaches
                         Shared memory
                         Distributed memory
                Each has advantages and disadvantages




slide by Richard Edgar
               Some terminology

               Distributed memory: private memory for each processor, only
               accessible by that processor, so no synchronization for memory
               accesses is needed. Information is exchanged by sending data from
               one processor to another across an interconnection network using
               explicit communication operations.

               [Diagram: “distributed memory” — each processor P has its own
               memory M and the processors are linked by an interconnection
               network — versus “shared memory” — the processors P access common
               memory modules M through an interconnection network.]

               Both approaches increasingly common; now: mostly hybrid.
Shared Memory Machines

                Have lots of CPUs share the same memory banks
                         Spawn lots of threads
                         Each writes to globally shared memory
                Multicore CPUs now ubiquitous
                         Most computers now ‘shared memory machines’



slide by Richard Edgar
Shared Memory Machines




                           NASA ‘Columbia’ Computer
                         Up to 2048 cores in single system
slide by Richard Edgar
Shared Memory Machines
                Spawning lots of threads (relatively) easy
                         pthreads, OpenMP
                Don’t have to worry about data location
                Disadvantage is memory performance scaling
                         Frontside bus saturates rapidly
                Can use Non-Uniform Memory Architecture (NUMA)
                         Silicon Graphics Origin & Altix series
                         Gets expensive very fast

slide by Richard Edgar
               Some terminology (recap)

               [Same diagram as before: “distributed memory” vs. “shared memory”;
               now: mostly hybrid.]
Distributed Memory Clusters

                Alternative is a lot of cheap machines
                High-speed network between individual nodes
                         Network can cost as much as the CPUs!
                How do nodes communicate?




slide by Richard Edgar
Distributed Memory Clusters




                         NASA ‘Pleiades’ Cluster
                             51,200 cores
slide by Richard Edgar
Distributed Memory Model
                Communication is key issue
                         Each node has its own address space
                         (exclusive access, no global memory?)
                Could use TCP/IP
                         Painfully low level
                Solution: a communication protocol like message-
                passing (e.g. MPI)


slide by Richard Edgar
Distributed Memory Model


                All data must be explicitly partitioned
                Exchange of data by explicit communication




slide by Richard Edgar
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
Message Passing Interface

                MPI is a communication protocol for parallel programs
                         Language independent
                         Open standard
                Originally created by a working group at Supercomputing ’92 (SC92)
                Bindings for C, C++, Fortran, Python, etc.

             http://www.mcs.anl.gov/research/projects/mpi/
                      http://www.mpi-forum.org/
slide by Richard Edgar
Message Passing Interface

                MPI processes have independent address spaces
                Communicate by sending messages
                Means of sending messages invisible
                         Use shared memory if available! (i.e. shared memory can be
                         used behind the scenes on shared-memory architectures)
                         Operates at Level 5 (Session) and above of the OSI model


slide by Richard Edgar
OSI Model ?
Message Passing Interface

                MPI is a standard, a specification, for message-passing
                libraries
                Two major implementations of MPI
                         MPICH
                         OpenMPI
                Programs should work with either


slide by Richard Edgar
Basic Idea


                 • Usually programmed with SPMD model (single program,
                      multiple data)
                 • In MPI-1 number of tasks is static - cannot dynamically
                      spawn new tasks at runtime. Enhanced in MPI-2.
                 • No assumptions on type of interconnection network; all
                      processors can send a message to any other processor.
                 • All parallelism explicit - programmer responsible for
                      correctly identifying parallelism and implementing parallel
                      algorithms




adapted from Berger & Klöckner (NYU 2010)
Credits: James Carr (OCI)
Hello World

           #include <mpi.h>
           #include <stdio.h>

           int main(int argc, char** argv) {
             int rank, size;
             MPI_Init(&argc, &argv);

                  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
                  MPI_Comm_size(MPI_COMM_WORLD, &size);

                  printf("Hello world from %d of %dn", rank, size);

                  MPI_Finalize();
                  return 0;
           }
adapted from Berger & Klöckner (NYU 2010)
Hello World
To compile: you need to load the MPI compiler wrappers in addition to the
compiler modules (OpenMPI, MPICH, ...), e.g.
          module load mpi/openmpi/1.2.8/gnu
          module load openmpi/intel/1.3.3

Then compile with: mpicc hello.c
To run: you need to say how many processes you are requesting:
   mpiexec -n 10 a.out   (or: mpirun -np 10 a.out)




                                                   adapted from Berger & Klöckner (NYU 2010)
The beauty of data
   visualization




    http://www.youtube.com/watch?v=pLqjQ55tz-U
Example: gprof2dot
“ They’ve done studies, you
know. 60% of the time, it
works every time... ”

- Brian Fantana
(Anchorman, 2004)
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
Basic MPI

                MPI is a library of routines
                Bindings exist for many languages
                         Principal languages are C, C++ and Fortran
                         Python: mpi4py
                We will discuss C++ bindings from now on

        http://www.mcs.anl.gov/research/projects/mpi/mpi-standard/mpi-report-2.0/node287.htm




slide by Richard Edgar
Basic MPI

                MPI allows processes to exchange messages
                Processes are members of communicators
                         Communicator shared by all is   MPI::COMM_WORLD

                         In C++ API, communicators are objects
                Within a communicator, each process has unique ID



slide by Richard Edgar
A Minimal MPI Program
     #include <cstdlib>
     #include <iostream>
     using namespace std;

     #include "mpi.h"

     int main( int argc, char* argv[] ) {   // Very much a minimal program

          MPI::Init( argc, argv );

          cout << "Hello World!" << endl;   // No actual communication occurs

          MPI::Finalize();

          return( EXIT_SUCCESS );
     }


slide by Richard Edgar
A Minimal MPI Program

                To compile MPI programs use mpic++
                mpic++ -o MyProg myprog.cpp

                The mpic++ command is a wrapper for default compiler
                         Adds in libraries
                         Use mpic++   --show   to see what it does
                Will also find mpicc, mpif77 and mpif90 (usually)



slide by Richard Edgar
A Minimal MPI Program

                To run the program, use mpirun
                mpirun -np 2 ./MyProg

                The -np       2   option launches two processes
                Check documentation for your cluster
                         Number of processes might be implicit
                Program should print “Hello World” twice



slide by Richard Edgar
Communicators

                Processes are members of communicators
                A process can
                         Find the size of a given communicator
                         Determine its ID (or rank) within it
                Default communicator is MPI::COMM_WORLD



slide by Richard Edgar
Communicators

int nProcs, iMyProc;

MPI::Init( argc, argv );

// Query the COMM_WORLD communicator for...
nProcs  = MPI::COMM_WORLD.Get_size();    // ...the number of processes
iMyProc = MPI::COMM_WORLD.Get_rank();    // ...the current process rank (ID)

// Print these out (process ranks count from zero)
cout << "Hello from process ";
cout << iMyProc << " of ";
cout << nProcs << endl;

MPI::Finalize();




slide by Richard Edgar
Communicators


                By convention, process with rank 0 is master
                const int iMasterProc = 0;

                Can have more than one communicator
                         Process may have different rank within each




slide by Richard Edgar
Messages

                Haven’t sent any data yet
                Communicators have Send and Recv methods for this
                One process posts a Send
                Must be matched by Recv in the target process




slide by Richard Edgar
Sending Messages
                A sample send is as follows:
                int a[10];
                MPI::COMM_WORLD.Send( a, 10, MPI::INT, iTargetProc, iTag );

                The method prototype is
                void Comm::Send( const void* buf, int count,
                                 const Datatype& datatype,
                                 int dest, int tag) const

                MPI copies the buffer into a system buffer and returns
                         No delivery notification


slide by Richard Edgar
Receiving Messages

                Similar call to receive:
                int a[10];
                MPI::COMM_WORLD.Recv( a, 10, MPI::INT, iSrcProc, iMyTag );

                (MPI::ANY_SOURCE and MPI::ANY_TAG may be used for the source
                and tag to accept messages from anywhere)

                Function prototype is
                void Comm::Recv( void* buf, int count,
                                 const Datatype& datatype,
                                 int source, int tag) const

                Blocks until data arrives



slide by Richard Edgar
MPI Datatypes
                MPI datatypes are independent of
                         Language
                         Endianness
                The most common are listed opposite:

                     MPI Datatype    C/C++
                     MPI::CHAR       signed char
                     MPI::SHORT      signed short
                     MPI::INT        signed int
                     MPI::LONG       signed long
                     MPI::FLOAT      float
                     MPI::DOUBLE     double
                     MPI::BYTE       untyped byte data
slide by Richard Edgar
MPI Send & Receive
 if( iMyProc == iMasterProc ) {
   // Master process sends out numbers
   for( int i = 1; i < nProcs; i++ ) {
      int iMessage = 2 * i + 1;
      cout << "Sending " << iMessage <<
           " to process " << i << endl;
      MPI::COMM_WORLD.Send( &iMessage, 1,
                            MPI::INT,
                            i, iTag );
   }
 } else {
   // Worker processes print out the number received
   int iMessage;
   MPI::COMM_WORLD.Recv( &iMessage, 1,
                          MPI::INT,
                          iMasterProc, iTag );
   cout << "Process " << iMyProc <<
         " received " << iMessage << endl;
 }




slide by Richard Edgar
Six Basic MPI Routines

                Have now encountered six MPI routines
                MPI::Init(), MPI::Finalize()
                MPI::COMM_WORLD.Get_size(), MPI::COMM_WORLD.Get_rank(),
                MPI::COMM_WORLD.Send(), MPI::COMM_WORLD.Recv()

                These are enough to get started ;-)
                More sophisticated routines available...




slide by Richard Edgar
Collective Communications

                Send      and Recv are point-to-point
                         Communicate between specific processes
                Sometimes we want all processes to exchange data
                These are called collective communications




slide by Richard Edgar
Barriers

                Barriers require all processes to synchronise
                MPI::COMM_WORLD.Barrier();

                Processes wait until all processes arrive at barrier
                         Potential for deadlock
                Bad for performance
                         Only use if necessary



slide by Richard Edgar
Broadcasts

                Suppose one process has array to be shared with all
                int a[10];
                MPI::COMM_WORLD.Bcast( a, 10, MPI::INT, iSrcProc );

                If process has rank iSrcProc, it will send the array
                Other processes will receive it
                All will have a[10] identical to iSrcProc on completion



slide by Richard Edgar
MPI Broadcast


           Before:           P0: A    P1: —    P2: —    P3: —
           After Broadcast:  P0: A    P1: A    P2: A    P3: A

           MPI_Bcast(&buf, count, datatype, root, comm)
           All processes must call MPI_Bcast with the same root value.



adapted from Berger & Klöckner (NYU 2010)
Reductions

                Suppose we have a large array split across processes
                We want to sum all the elements
                Use MPI::COMM_WORLD.Reduce() with the MPI::SUM operation (see the sketch below)

                         Also MPI::COMM_WORLD.Allreduce() variant
                Can perform MAX, MIN, MAXLOC, MINLOC too



slide by Richard Edgar
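A hedged sketch of that global sum, not from the original slides: localData and nLocal are assumed names for this process's chunk, and nProcs, iMyProc, iMasterProc are set up as in the earlier Communicators example.

     // Sketch: global sum of a distributed array using Reduce
     double localSum = 0.0;
     for( int i = 0; i < nLocal; i++ ) {      // nLocal: size of this process's chunk (assumed)
         localSum += localData[i];
     }

     double globalSum = 0.0;
     MPI::COMM_WORLD.Reduce( &localSum, &globalSum, 1, MPI::DOUBLE,
                             MPI::SUM, iMasterProc );
     // Only iMasterProc holds a valid globalSum afterwards; use Allreduce
     // instead if every process needs the result.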
MPI Reduce


           Before:        P0: A    P1: B    P2: C    P3: D
           After Reduce:  P0: A op B op C op D   (result on the root only)

           Reduction operators can be min, max, sum, multiply, logical
           ops, max value and location, ... Must be associative
           (commutativity optional)


adapted from Berger & Klöckner (NYU 2010)
Scatter and Gather

                Split a large array between processes
                         Use MPI::COMM_WORLD.Scatter()
                         Each process receives part of the array
                Combine small arrays into one large one
                         Use MPI::COMM_WORLD.Gather()
                         Designated process will construct entire array
                         Has MPI::COMM_WORLD.Allgather() variant


slide by Richard Edgar
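A hedged sketch of the Scatter/Gather pattern just described (not from the slides); fullArray, nTotal and chunkSize are illustrative names, the rest follows the earlier examples, and std::vector needs <vector>.

     // Assumes: fullArray (float*) holds nTotal values on iMasterProc;
     //          nTotal, nProcs, iMyProc, iMasterProc as in earlier slides.
     const int chunkSize = nTotal / nProcs;          // assume nTotal divides evenly
     std::vector<float> chunk( chunkSize );

     // Root scatters fullArray; every process receives its own chunk
     MPI::COMM_WORLD.Scatter( fullArray, chunkSize, MPI::FLOAT,
                              &chunk[0], chunkSize, MPI::FLOAT, iMasterProc );

     // ... each process works on its own chunk ...

     // Root gathers the processed chunks back into fullArray
     MPI::COMM_WORLD.Gather( &chunk[0], chunkSize, MPI::FLOAT,
                             fullArray, chunkSize, MPI::FLOAT, iMasterProc );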
MPI Scatter/Gather




           Scatter:  P0: [A B C D]  →  P0: A    P1: B    P2: C    P3: D
           Gather:   P0: A, P1: B, P2: C, P3: D  →  P0: [A B C D]




adapted from Berger & Klöckner (NYU 2010)
MPI Allgather




           Before:           P0: A    P1: B    P2: C    P3: D
           After Allgather:  every process holds [A B C D]




adapted from Berger & Klöckner (NYU 2010)
MPI Alltoall




           Before:          P0: [A0 A1 A2 A3]   P1: [B0 B1 B2 B3]   P2: [C0 C1 C2 C3]   P3: [D0 D1 D2 D3]
           After Alltoall:  P0: [A0 B0 C0 D0]   P1: [A1 B1 C1 D1]   P2: [A2 B2 C2 D2]   P3: [A3 B3 C3 D3]
           (i.e. a transpose of the data across processes)




adapted from Berger & Klöckner (NYU 2010)
Asynchronous Messages

                An asynchronous API exists too
                         Have to allocate buffers
                         Have to check if send or receive has completed
                Can give better performance (overlapping communication with computation)
                         Trickier to use



slide by Richard Edgar
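A minimal sketch of the non-blocking calls (assumed to follow the same C++ bindings as above; iTargetProc, iSrcProc, iTag as in earlier slides). The key difference is the Request object that must be waited on before the buffers are reused.

     // Sketch: overlap communication with computation using Isend/Irecv
     int sendBuf[10] = {0}, recvBuf[10];

     MPI::Request sendReq = MPI::COMM_WORLD.Isend( sendBuf, 10, MPI::INT,
                                                   iTargetProc, iTag );
     MPI::Request recvReq = MPI::COMM_WORLD.Irecv( recvBuf, 10, MPI::INT,
                                                   iSrcProc, iTag );

     // ... do useful work here while the messages are in flight ...

     sendReq.Wait();   // safe to reuse sendBuf after this
     recvReq.Wait();   // recvBuf now contains the incoming data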
User-Defined Datatypes

                Usually have complex data structures
                Require means of distributing these
                         Can pack & unpack manually
                MPI allows us to define own datatypes for this




slide by Richard Edgar
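One hedged example of a user-defined datatype (not from the slides): the simplest case, a contiguous block of ints, built with Create_contiguous; struct-like layouts would use Create_struct instead.

     // Sketch: a derived datatype describing 10 contiguous ints,
     // sent as a single element of the new type.
     MPI::Datatype blockType = MPI::INT.Create_contiguous( 10 );
     blockType.Commit();                       // must commit before use

     int a[10];
     MPI::COMM_WORLD.Send( a, 1, blockType, iTargetProc, iTag );

     blockType.Free();                         // release when no longer needed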
MPI-2


                 • One-sided RMA (remote memory access) communication

                            • potential for greater efficiency, easier programming.
                            • Use ”windows” into memory to expose regions for access
                            • Race conditions now possible.


                 • Parallel I/O: like message passing, but to the file system
                      rather than to other processes.
                 • Allows for dynamic number of processes and
                      inter-communicators (as opposed to intra-communicators)
                 • Cleaned up MPI-1



adapted from Berger & Klöckner (NYU 2010)
RMA
                • Processors can designate portions of their address space as
                     available to other processors for read/write operations
                     (MPI_Get, MPI_Put, MPI_Accumulate).
                • RMA window objects created by collective window-creation
                     fns. (MPI_Win_create must be called by all participants)
                • Before accessing, call MPI_Win_fence (or other synchr.
                     mechanisms) to start an RMA access epoch; a fence (like a barrier)
                     separates local ops on the window from remote ops
                • RMA operations are non-blocking; separate synchronization is
                     needed to check completion. Call MPI_Win_fence again.

                                  [Diagram: Put from P0 local memory into an
                                   RMA window in P1 local memory]
adapted from Berger & Klöckner (NYU 2010)
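A hedged sketch of the fence-synchronised Put described above, using the C API names from the slide; the window size and the choice of neighbour are illustrative.

     // Sketch: expose a local buffer as an RMA window and Put into a neighbour's window
     int rank, size;
     MPI_Comm_rank( MPI_COMM_WORLD, &rank );
     MPI_Comm_size( MPI_COMM_WORLD, &size );

     double winBuf[100] = {0.0};              // memory exposed to other ranks
     MPI_Win win;
     MPI_Win_create( winBuf, 100 * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win );

     MPI_Win_fence( 0, win );                 // open the RMA access epoch

     double value = 3.14;
     int target = (rank + 1) % size;          // illustrative target rank
     MPI_Put( &value, 1, MPI_DOUBLE,          // origin buffer
              target, 0, 1, MPI_DOUBLE,       // target rank, displacement, count, type
              win );

     MPI_Win_fence( 0, win );                 // close the epoch; the Put is now complete
     MPI_Win_free( &win );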
Some MPI Bugs
                                            Sample MPI Bugs

       [Code example shown as an image on the slide; not recoverable here.]

       Only works for an even number of processors. What’s wrong?

adapted from Berger & Klöckner (NYU 2010)
                                            Sample MPI Bugs

       [Same code example, revisited.]

       Only works for an even number of processors.

adapted from Berger & Klöckner (NYU 2010)
MPI Bugs
                                            Sample MPI Bugs

           Suppose you have a local variable, e.g. energy, and you want to sum
           all the processors’ energy to find the total energy of the system.

            Recall

            MPI_Reduce(sendbuf,recvbuf,count,datatype,op,
                       root,comm)

            Using the same variable for both buffers, as in

            MPI_Reduce(energy,energy,1,MPI_REAL,MPI_SUM,
                       MPI_COMM_WORLD)

            What’s wrong?
adapted from Berger & Klöckner (NYU 2010)
Communication
  Topologies
Communication Topologies


                Some topologies very common
                         Grid, hypercube etc.
                API provided to set up communicators following these




slide by Richard Edgar
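For example, a hedged sketch using the C API: a periodic 2-D grid communicator can be created with MPI_Cart_create and queried for each process's coordinates and neighbours (the 4×4 shape is an illustrative assumption).

     // Sketch: build a 4x4 periodic Cartesian communicator (a torus)
     int dims[2]    = { 4, 4 };                // assumes 16 processes
     int periods[2] = { 1, 1 };                // wrap around in both dimensions
     MPI_Comm gridComm;
     MPI_Cart_create( MPI_COMM_WORLD, 2, dims, periods,
                      1 /* allow rank reordering */, &gridComm );

     int gridRank;
     MPI_Comm_rank( gridComm, &gridRank );     // rank may differ after reordering
     int coords[2];
     MPI_Cart_coords( gridComm, gridRank, 2, coords );

     int left, right;                          // neighbour ranks along dimension 0
     MPI_Cart_shift( gridComm, 0, 1, &left, &right );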
Parallel Performance


           Recall Amdahl’s law:
                    if T1 = serial cost + parallel cost
                    then
                      Tp = serial cost + parallel cost/p
           But really
                    Tp = serial cost + parallel cost/p + Tcommunication

           How expensive is it?




adapted from Berger & Klöckner (NYU 2010)
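A small worked example (illustrative numbers, not from the slides): with a 5% serial fraction and p = 100,

           T1 = 0.05 + 0.95          = 1
           Tp = 0.05 + 0.95/100      = 0.0595   →  speedup ≈ 16.8  (Amdahl bound: 1/0.05 = 20)
           Tp = 0.05 + 0.95/100 + Tcommunication  →  speedup strictly smaller

so any nonzero communication cost only pushes the speedup further below Amdahl’s bound.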
Network Characteristics


           Interconnection network connects nodes, transfers data
           Important qualities:
                 • Topology - the structure used to connect the nodes

                 • Routing algorithm - how messages are transmitted
                      between processors, along which path (= nodes along
                      which message transferred).
                 • Switching strategy = how message is cut into pieces and
                      assigned a path
                 • Flow control (for dealing with congestion) - stall, store data
                      in buffers, re-route data, tell source to halt, discard, etc.


adapted from Berger & Klöckner (NYU 2010)
Interconnection Network
           Represent as graph G = (V , E), V = set of nodes to be
           connected, E = direct links between the nodes. Links usually
           bidirectional - transfer msg in both directions at same time.
           Characterize network by:
                 • diameter - maximum over all pairs of nodes of the shortest
                      path between the nodes (length of path in message
                      transmission)
                 • degree - number of direct links for a node (number of direct
                      neighbors)
                 • bisection bandwidth - minimum number of edges that must
                      be removed to partition network into two parts of equal size
                      with no connection between them. (measures network
                      capacity for transmitting messages simultaneously)
                 • node/edge connectivity - numbers of node/edges that must
                      fail to disconnect the network (measure of reliability)
adapted from Berger & Klöckner (NYU 2010)
Linear Array




                 • p vertices, p − 1 links
                 • Diameter = p − 1
                 • Degree = 2
                 • Bisection bandwidth = 1
                 • Node connectivity = 1, edge connectivity = 1




adapted from Berger & Klöckner (NYU 2010)
Ring topology




                 • diameter = p/2
                 • degree = 2
                 • bisection bandwidth = 2
                 • node connectivity = 2
                      edge connectivity = 2




adapted from Berger & Klöckner (NYU 2010)
Mesh topology

                                                 • diameter = 2(√p − 1)
                                                   (a 3-d mesh has diameter 3(∛p − 1))
                                                 • degree = 4 (6 in 3d)
                                                 • bisection bandwidth = √p
                                                 • node connectivity 2
                                                   edge connectivity 2

            Route along each dimension in turn




adapted from Berger & Klöckner (NYU 2010)
Torus topology




           Diameter halved, Bisection bandwidth doubled,
           Edge and Node connectivity doubled over mesh

adapted from Berger & Klöckner (NYU 2010)
Hypercube topology
            [Figure: hypercubes of dimension 1, 2, 3 and 4, with nodes labelled
             by binary strings of length k]

                 • p = 2^k processors labelled with binary numbers of length k
                 • k-dimensional cube constructed from two (k − 1)-cubes
                 • Connect corresponding procs if labels differ in 1 bit
                      (Hamming distance d between two k-bit binary words =
                      path of length d between the 2 nodes)
adapted from Berger & Klöckner (NYU 2010)
Hypercube topology
            [Figure: the same hypercubes as on the previous slide]

                 • diameter = k (= log2 p)
                 • degree = k
                 • bisection bandwidth = p/2
                 • node connectivity k
                      edge connectivity k
adapted from Berger & Klöckner (NYU 2010)
Dynamic Networks


            The networks above were direct, or static, interconnection networks
            = processors connected directly with each other through fixed
            physical links.

           Indirect or dynamic networks = contain switches which provide
           an indirect connection between the nodes. Switches configured
           dynamically to establish a connection.
                 • bus
                 • crossbar
                 • multistage network - e.g. butterfly, omega, baseline




adapted from Berger & Klöckner (NYU 2010)
Crossbar
            [Figure: crossbar switch connecting processors P1 … Pn to memory
             modules M1 … Mm]

                 • Connecting n inputs and m outputs takes nm switches.
                      (Typically only for small numbers of processors)
                 • At each switch can either go straight or change direction.
                 • Diameter = 1, bisection bandwidth = p
adapted from Berger & Klöckner (NYU 2010)
Butterfly

            16 × 16 butterfly network:

            [Figure: 4-stage butterfly (stages 0–3) connecting inputs 000–111
             to outputs 000–111 through 2 × 2 switches]

            for p = 2^(k+1) processors: k + 1 stages, 2^k switches per stage,
            2 × 2 switches
adapted from Berger & Klöckner (NYU 2010)
Fat tree




                 • Complete binary tree
                 • Processors at leaves
                 • Increase links for higher bandwidth near root
adapted from Berger & Klöckner (NYU 2010)
Current picture



                 • Old style: mapped algorithms to topologies
                 • New style: avoid topology-specific optimizations
                            • Want code that runs on next year’s machines too.
                            • Topology awareness in vendor MPI libraries?
                             • Software topology - ease of programming, but not used for
                                   performance?




adapted from Berger & Klöckner (NYU 2010)
Should we care ?

• Old school: map algorithms to specific
  topologies

• New school: avoid topology-specific
  optimizations (the code should be optimal
  on next year’s infrastructure...)
• Meta-programming / Auto-tuning ?
                           Top500 Interconnects

            [Chart from the Top500 statistics page showing the interconnect
             family share across the Top500 systems; also available in table
             format via a direct link to the statistics page.]
                                                                             adapted from Berger & Klöckner (NYU 2010)
MPI References

                 • Lawrence Livermore tutorial
                      https://computing.llnl.gov/tutorials/mpi/

                 • Using MPI
                      Portable Parallel Programming with the Message-Passing
                      Interface
                      by Gropp, Lusk, Skjellum

                 • Using MPI-2
                      Advanced Features of the Message Passing Interface
                      by Gropp, Lusk, Thakur

                 • Lots of other on-line tutorials, books, etc.



adapted from Berger & Klöckner (NYU 2010)
Ignite: Google Trends




     http://www.youtube.com/watch?v=m0b-QX0JDXc
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
MPI with CUDA
                MPI and CUDA almost orthogonal
                Each node simply becomes faster
                Problem: matching MPI processes to GPUs (see the sketch below)
                         Use compute-exclusive mode on GPUs
                         Tell cluster environment to limit processes per node
                Have to know your cluster documentation


slide by Richard Edgar
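A hedged sketch of one common way to pair ranks with GPUs, assuming one MPI process per GPU and the CUDA runtime API; the simple modulo mapping is an assumption, and many clusters instead provide a per-node "local rank" variable that is more robust.

     // Sketch: each MPI process selects a GPU based on its rank
     #include <cuda_runtime.h>
     #include "mpi.h"

     int main( int argc, char* argv[] ) {
         MPI::Init( argc, argv );
         int rank = MPI::COMM_WORLD.Get_rank();

         int nDevices = 0;
         cudaGetDeviceCount( &nDevices );

         // Works when ranks are packed one node at a time; a per-node
         // local rank from your MPI launcher is more robust.
         cudaSetDevice( rank % nDevices );

         // ... kernels launched by this process now run on the chosen GPU ...

         MPI::Finalize();
         return 0;
     }

Compile with something like mpic++ plus the CUDA include/library paths (or nvcc with the MPI paths); the exact flags depend on your cluster.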
Data Movement

                Communication now very expensive
                GPUs can only communicate via their hosts
                         Very laborious
                Again: need to minimize communication




slide by Richard Edgar
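A hedged sketch of the staging pattern this implies: device memory is copied to the host, exchanged with MPI, and copied back to device memory on the receiving side. devBuf (a cudaMalloc'd float*), iMyProc and iTag are assumed from earlier examples; std::vector needs <vector>.

     // Sketch: exchange GPU data between ranks 0 and 1 via their hosts
     const int N = 1 << 20;
     std::vector<float> hostBuf( N );

     if( iMyProc == 0 ) {
         // GPU -> host, then send
         cudaMemcpy( &hostBuf[0], devBuf, N * sizeof(float), cudaMemcpyDeviceToHost );
         MPI::COMM_WORLD.Send( &hostBuf[0], N, MPI::FLOAT, 1, iTag );
     } else if( iMyProc == 1 ) {
         // receive, then host -> GPU
         MPI::COMM_WORLD.Recv( &hostBuf[0], N, MPI::FLOAT, 0, iTag );
         cudaMemcpy( devBuf, &hostBuf[0], N * sizeof(float), cudaMemcpyHostToDevice );
     }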
MPI Summary

                MPI provides cross-platform interprocess
                communication
                Invariably available on computer clusters
                Only need six basic commands to get started
                Much more sophistication available



slide by Richard Edgar
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
ZeroMQ
               • ‘messaging middleware’ ‘TCP on steroids’
                     ‘new layer on the networking stack’
               • not a complete messaging system
               • just a simple messaging library to be
                     used programmatically.
               • a “pimped” socket interface allowing you to
                     quickly design / build a complex
                     communication system without much effort

http://nichol.as/zeromq-an-introduction                    http://zguide.zeromq.org/page:all
ZeroMQ
               • Fastest. Messaging. Ever.
               • Excellent documentation:
                 • examples
                 • white papers for everything
               • Bindings for Ada, Basic, C, Chicken Scheme,
                      Common Lisp, C#, C++, D, Erlang*, Go*,
                      Haskell*, Java, Lua, node.js, Objective-C, ooc,
                      Perl, PHP, Python, Racket, Ruby, Tcl
http://nichol.as/zeromq-an-introduction                       http://zguide.zeromq.org/page:all
Message Patterns




http://nichol.as/zeromq-an-introduction   http://zguide.zeromq.org/page:all
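As a hedged illustration of the simplest pattern (request-reply), here is the server side of a "hello world" REQ/REP exchange, assuming the libzmq C API (version 3.2 or later) called from C++; the endpoint and reply text are illustrative, and this follows the shape of the zguide examples rather than any code from the slides.

     // Sketch: REP (reply) side of a ZeroMQ request-reply socket
     #include <zmq.h>
     #include <cstdio>

     int main() {
         void* ctx  = zmq_ctx_new();
         void* sock = zmq_socket( ctx, ZMQ_REP );      // reply socket
         zmq_bind( sock, "tcp://*:5555" );             // illustrative endpoint

         char buf[256];
         while( true ) {
             int n = zmq_recv( sock, buf, sizeof(buf) - 1, 0 );   // wait for a request
             if( n < 0 ) break;
             if( n > (int)sizeof(buf) - 1 ) n = sizeof(buf) - 1;  // message was truncated
             buf[n] = '\0';
             printf( "Received: %s\n", buf );
             zmq_send( sock, "World", 5, 0 );                     // send the reply
         }

         zmq_close( sock );
         zmq_ctx_destroy( ctx );
         return 0;
     }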
Demo: Why ZeroMQ ?




http://www.youtube.com/watch?v=_JCBphyciAs
MPI vs ZeroMQ ?
•   MPI is a specification, ZeroMQ is an implementation.
•   Design:
    •   MPI is designed for tightly-coupled compute clusters with fast and reliable
        networks.
    •   ZeroMQ is designed for large distributed systems (web-like).
•   Fault tolerance:
    •   MPI has very limited facilities for fault tolerance (the default error handling
        behavior in most implementations is a system-wide fail, ouch!).
    •   ZeroMQ is resilient to faults and network instability.
•   ZeroMQ could be a good transport layer for an MPI-like implementation.


                    http://stackoverflow.com/questions/35490/spread-vs-mpi-vs-zeromq
Fast Forward




                 CUDASA
CUDASA: Compute Unified Device Systems Architecture

[The CUDASA guest slides were garbled in this extraction. Recoverable content:
CUDASA extends the CUDA programming model beyond a single GPU, adding bus and
network layers on top of the application and GPU layers so that multi-GPU
systems and GPU clusters can be programmed with minimal changes to the original
language and a consistent developer interface. The slides cover the system
overview, the language extensions and their translation to CUDA code plus host
threading and distributed shared memory, multi-GPU and cluster matrix
multiplication (SGEMM) benchmarks, and a conclusion noting good scaling on the
bus level but high communication overhead on the network level. Credited to the
Visualization Research Institute (VISUS), University of Stuttgart.]
Fast Forward




    MultiGPU MapReduce
MapReduce




            http://m.blog.hu/dw/dwbi/image/2009/Q4/mapreduce_small.png
Why MapReduce?

•   Simple programming model

•   Parallel programming model

•   Scalable



•   Previous GPU work: neither multi-GPU nor out-of-core
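To make the programming model concrete, here is a hedged, purely serial C++ sketch of the map and reduce steps for word counting; it illustrates the model only and is not GPMR's actual API (GPU-side emits, partial reductions, etc. are not shown).

     // Sketch: the MapReduce model on one CPU — map emits (word, 1) pairs,
     // reduce sums the counts per key. Real frameworks parallelise both phases.
     #include <map>
     #include <sstream>
     #include <string>
     #include <vector>

     std::map<std::string, int> wordCount( const std::vector<std::string>& docs ) {
         std::vector< std::pair<std::string, int> > emitted;     // map phase output
         for( size_t i = 0; i < docs.size(); ++i ) {
             std::istringstream in( docs[i] );
             std::string word;
             while( in >> word )
                 emitted.push_back( std::make_pair( word, 1 ) ); // emit(word, 1)
         }

         std::map<std::string, int> counts;                      // reduce phase
         for( size_t i = 0; i < emitted.size(); ++i )
             counts[ emitted[i].first ] += emitted[i].second;    // sum per key
         return counts;
     }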
Benchmarks—Which
•   Matrix Multiplication (MM)

•   Word Occurrence (WO)

•   Sparse-Integer Occurrence (SIO)

•   Linear Regression (LR)

•   K-Means Clustering (KMC)



•   (Volume Renderer—presented 90
    minutes ago @ MapReduce ’10)
Benchmarks—Why
•   Needed to stress aspects of GPMR

    •   Unbalanced work (WO)

    •   Multiple emits/Non-uniform number of emits (LR, KMC,
        WO)

    •   Sparsity of keys (SIO)

    •   Accumulation (WO, LR, KMC)

    •   Many key-value pairs (SIO)

    •   Compute Bound Scalability (MM)
Benchmarks—Results
WO. We test GPMR against all available input sets.

Benchmarks—Results
                               MM        KMC       LR      SIO     WO
           1-GPU Speedup       162.712   2.991     1.296   1.450   11.080
           4-GPU Speedup       559.209   11.726    4.085   2.322   18.441
vs. CPU
          TABLE 2: Speedup for GPMR over Phoenix on our large (second-
          biggest) input data from our first set. The exception is MM, for which
          we use our small input set (Phoenix required almost twenty seconds
          to multiply two 1024 × 1024 matrices).

                               MM        KMC       WO
           1-GPU Speedup       2.695     37.344    3.098
           4-GPU Speedup       10.760    129.425   11.709
vs. GPU
          TABLE 3: Speedup for GPMR over Mars on 4096 × 4096 Matrix
          Multiplication, an 8M-point K-Means Clustering, and a 512 MB
          Word Occurrence. These sizes represent the largest problems that
          can meet the in-core memory requirements of Mars.
Benchmarks—Results

[Three slides of scalability plots; the figures are not recoverable from this
extraction. Each plot carried an arrow labelled “Good” indicating the
favourable direction.]
one more thing
           or two...
Life/Code Hacking #3
 The Pomodoro Technique
Life/Code Hacking #3
                       The Pomodoro Technique




http://lifehacker.com/#!5554725/the-pomodoro-technique-trains-your-brain-away-from-distractions
http://www.youtube.com/watch?v=QYyJZOHgpco

 
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...npinto
 
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...npinto
 
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...npinto
 
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...npinto
 
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...npinto
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)npinto
 
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...npinto
 
Pratical mpi programming
Pratical mpi programmingPratical mpi programming
Pratical mpi programmingunifesptk
 
Europycon2011: Implementing distributed application using ZeroMQ
Europycon2011: Implementing distributed application using ZeroMQEuropycon2011: Implementing distributed application using ZeroMQ
Europycon2011: Implementing distributed application using ZeroMQfcrippa
 
Overview of ZeroMQ
Overview of ZeroMQOverview of ZeroMQ
Overview of ZeroMQpieterh
 

En vedette (20)

Cop Cars: From Buck Boards to Buck Rodgers
Cop Cars: From Buck Boards to Buck RodgersCop Cars: From Buck Boards to Buck Rodgers
Cop Cars: From Buck Boards to Buck Rodgers
 
ZeroMQ Is The Answer
ZeroMQ Is The AnswerZeroMQ Is The Answer
ZeroMQ Is The Answer
 
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
 
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
 
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
 
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
 
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
 
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
 
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
 
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
 
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
 
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
 
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
 
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
 
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
 
ISBI MPI Tutorial
ISBI MPI TutorialISBI MPI Tutorial
ISBI MPI Tutorial
 
Pratical mpi programming
Pratical mpi programmingPratical mpi programming
Pratical mpi programming
 
Europycon2011: Implementing distributed application using ZeroMQ
Europycon2011: Implementing distributed application using ZeroMQEuropycon2011: Implementing distributed application using ZeroMQ
Europycon2011: Implementing distributed application using ZeroMQ
 
Overview of ZeroMQ
Overview of ZeroMQOverview of ZeroMQ
Overview of ZeroMQ
 

Similaire à [Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)

01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.iraminnezarat
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsTony Nguyen
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsYoung Alista
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsJames Wong
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsFraboni Ec
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsHoang Nguyen
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsLuis Goldster
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsHarry Potter
 
Simon Peyton Jones: Managing parallelism
Simon Peyton Jones: Managing parallelismSimon Peyton Jones: Managing parallelism
Simon Peyton Jones: Managing parallelismSkills Matter
 
Peyton jones-2011-parallel haskell-the_future
Peyton jones-2011-parallel haskell-the_futurePeyton jones-2011-parallel haskell-the_future
Peyton jones-2011-parallel haskell-the_futureTakayuki Muranushi
 
Parallelization using open mp
Parallelization using open mpParallelization using open mp
Parallelization using open mpranjit banshpal
 
Ceg4131 models
Ceg4131 modelsCeg4131 models
Ceg4131 modelsanandme07
 
Parallel Computing - Lec 3
Parallel Computing - Lec 3Parallel Computing - Lec 3
Parallel Computing - Lec 3Shah Zaib
 
Designing High-Performance and Scalable Middleware for HPC, AI and Data Science
Designing High-Performance and Scalable Middleware for HPC, AI and Data ScienceDesigning High-Performance and Scalable Middleware for HPC, AI and Data Science
Designing High-Performance and Scalable Middleware for HPC, AI and Data ScienceObject Automation
 
Using Eclipse and Lua for the Internet of Things - EclipseDay Googleplex 2012
Using Eclipse and Lua for the Internet of Things - EclipseDay Googleplex 2012Using Eclipse and Lua for the Internet of Things - EclipseDay Googleplex 2012
Using Eclipse and Lua for the Internet of Things - EclipseDay Googleplex 2012Benjamin Cabé
 
EuroMPI 2013 presentation: McMPI
EuroMPI 2013 presentation: McMPIEuroMPI 2013 presentation: McMPI
EuroMPI 2013 presentation: McMPIDan Holmes
 

Similaire à [Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ) (20)

01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessors
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessors
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessors
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessors
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessors
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessors
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessors
 
Simon Peyton Jones: Managing parallelism
Simon Peyton Jones: Managing parallelismSimon Peyton Jones: Managing parallelism
Simon Peyton Jones: Managing parallelism
 
Peyton jones-2011-parallel haskell-the_future
Peyton jones-2011-parallel haskell-the_futurePeyton jones-2011-parallel haskell-the_future
Peyton jones-2011-parallel haskell-the_future
 
Parallelization using open mp
Parallelization using open mpParallelization using open mp
Parallelization using open mp
 
Ceg4131 models
Ceg4131 modelsCeg4131 models
Ceg4131 models
 
Distributed Computing
Distributed ComputingDistributed Computing
Distributed Computing
 
Massively Parallel Architectures
Massively Parallel ArchitecturesMassively Parallel Architectures
Massively Parallel Architectures
 
Introduction to parallel computing
Introduction to parallel computingIntroduction to parallel computing
Introduction to parallel computing
 
Parallel Computing - Lec 3
Parallel Computing - Lec 3Parallel Computing - Lec 3
Parallel Computing - Lec 3
 
Introducing Parallel Pixie Dust
Introducing Parallel Pixie DustIntroducing Parallel Pixie Dust
Introducing Parallel Pixie Dust
 
Designing High-Performance and Scalable Middleware for HPC, AI and Data Science
Designing High-Performance and Scalable Middleware for HPC, AI and Data ScienceDesigning High-Performance and Scalable Middleware for HPC, AI and Data Science
Designing High-Performance and Scalable Middleware for HPC, AI and Data Science
 
Using Eclipse and Lua for the Internet of Things - EclipseDay Googleplex 2012
Using Eclipse and Lua for the Internet of Things - EclipseDay Googleplex 2012Using Eclipse and Lua for the Internet of Things - EclipseDay Googleplex 2012
Using Eclipse and Lua for the Internet of Things - EclipseDay Googleplex 2012
 
EuroMPI 2013 presentation: McMPI
EuroMPI 2013 presentation: McMPIEuroMPI 2013 presentation: McMPI
EuroMPI 2013 presentation: McMPI
 

Plus de npinto

"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)npinto
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...npinto
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programmingnpinto
 
[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programmingnpinto
 
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basicsnpinto
 
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patternsnpinto
 
[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introductionnpinto
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...npinto
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)npinto
 
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...npinto
 

Plus de npinto (16)

"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming
 
[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming
 
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
 
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
 
[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
 
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
 
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
 
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
 
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
 
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
 
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
 
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
 

Dernier

Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxlancelewisportillo
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
TEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docxTEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docxruthvilladarez
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
Presentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxPresentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxRosabel UA
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptxmary850239
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxVanesaIglesias10
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptxiammrhaywood
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSMae Pangan
 
Expanded definition: technical and operational
Expanded definition: technical and operationalExpanded definition: technical and operational
Expanded definition: technical and operationalssuser3e220a
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmStan Meyer
 

Dernier (20)

Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
TEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docxTEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docx
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
Presentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxPresentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptx
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptx
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHS
 
Expanded definition: technical and operational
Expanded definition: technical and operationalExpanded definition: technical and operational
Expanded definition: technical and operational
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and Film
 

[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)

  • 1. Massively Parallel Computing CS 264 / CSCI E-292 Lecture #7: GPU Cluster Programming | March 8th, 2011 Nicolas Pinto (MIT, Harvard) pinto@mit.edu
  • 2. Administrativia • Homeworks: HW2 due Mon 3/14/11, HW3 out Fri 3/11/11 • Project info: http://www.cs264.org/projects/projects.html • Project ideas: http://forum.cs264.org/index.php?board=6.0 • Project proposal deadline: Fri 3/25/11 (but you should submit way before to start working on it asap) • Need a private repo for your project? Let us know! Poll on the forum: http://forum.cs264.org/index.php?topic=228.0
  • 3. Goodies • Guest Lectures: 14 distinguished speakers • Schedule updated (see website)
  • 4. Goodies (cont’d) • Amazon AWS free credits coming soon (only for students who completed HW0+1) • It’s more than $14,000 donation for the class! • Special thanks: Kurt Messersmith @ Amazon
  • 5. Goodies (cont’d) • Best Project Prize: Tesla C2070 (Fermi) Board • It’s more than $4,000 donation for the class! • Special thanks: David Luebke & Chandra Cheij @ NVIDIA
  • 6. During this course, we’ll try to “…” and use existing material ;-) (adapted for CS264)
  • 8. Outline 1. The problem 2. Intro to MPI 3. MPI Basics 4. MPI+CUDA 5. Other approaches
  • 9. Outline 1. The problem 2. Intro to MPI 3. MPI Basics 4. MPI+CUDA 5. Other approaches
  • 10. The Problem Many computational problems too big for single CPU Lack of RAM Lack of CPU cycles Want to distribute work between many CPUs slide by Richard Edgar
  • 11. Types of Parallelism Some computations are ‘embarrassingly parallel’ Can do a lot of computation on minimal data RC5 DES, SETI@HOME etc. Solution is to distribute across the Internet Use TCP/IP or similar slide by Richard Edgar
  • 12. Types of Parallelism Some computations very tightly coupled Have to communicate a lot of data at each step e.g. hydrodynamics Internet latencies much too high Need a dedicated machine slide by Richard Edgar
  • 13. Tightly Coupled Computing Two basic approaches Shared memory Distributed memory Each has advantages and disadvantages slide by Richard Edgar
  • 14. Some terminology: “distributed memory” means private memory for each processor, only accessible by that processor, so no synchronization for memory accesses is needed; information is exchanged by sending data from one processor to another over an interconnection network using explicit communication operations. [diagrams contrasting the “distributed memory” organization (each processor P with its own memory M, linked by an interconnection network) with the “shared memory” organization (processors P reaching shared memory modules M through an interconnection network)] Both approaches are increasingly common; now: mostly hybrid.
  • 15. Some terminology (cont’d): the same recap, contrasting the “distributed memory” organization (private memory per processor, data exchanged by explicit communication over the interconnection network) with the “shared memory” organization; now: mostly hybrid.
  • 16. Shared Memory Machines Have lots of CPUs share the same memory banks Spawn lots of threads Each writes to globally shared memory Multicore CPUs now ubiquitous Most computers now ‘shared memory machines’ slide by Richard Edgar
  • 17. Shared Memory Machines NASA ‘Columbia’ Computer Up to 2048 cores in single system slide by Richard Edgar
  • 18. Shared Memory Machines Spawning lots of threads (relatively) easy pthreads, OpenMP Don’t have to worry about data location Disadvantage is memory performance scaling Frontside bus saturates rapidly Can use Non-Uniform Memory Architecture (NUMA) Silicon Graphics Origin & Altix series Gets expensive very fast slide by Richard Edgar
  • 19. Some terminology (cont’d): the recap once more, returning to the “distributed memory” organization (each processor P with private memory M, data exchanged over the interconnection network by explicit communication) alongside “shared memory”; the approach is increasingly common, and now mostly hybrid.
  • 20. Distributed Memory Clusters Alternative is a lot of cheap machines High-speed network between individual nodes Network can cost as much as the CPUs! How do nodes communicate? slide by Richard Edgar
  • 21. Distributed Memory Clusters NASA ‘Pleiades’ Cluster 51,200 cores slide by Richard Edgar
  • 22. Distributed Memory Model Communication is key issue Each node has its own address space (exclusive access, no global memory?) Could use TCP/IP Painfully low level Solution: a communication protocol like message-passing (e.g. MPI) slide by Richard Edgar
  • 23. Distributed Memory Model All data must be explicitly partitioned Exchange of data by explicit communication slide by Richard Edgar
  • 24. Outline 1. The problem 2. Intro to MPI 3. MPI Basics 4. MPI+CUDA 5. Other approaches
  • 25. Message Passing Interface MPI is a communication protocol for parallel programs Language independent Open standard Originally created by working group at SC92 Bindings for C, C++, Fortran, Python, etc. http://www.mcs.anl.gov/research/projects/mpi/ http://www.mpi-forum.org/ slide by Richard Edgar
  • 26. Message Passing Interface MPI processes have independent address spaces Communicate by sending messages Means of sending messages invisible Use shared memory if available! (i.e. can be used behind the scenes on shared memory architectures) On Level 5 (Session) and higher of OSI model slide by Richard Edgar
  • 28. Message Passing Interface MPI is a standard, a specification, for message-passing libraries Two major implementations of MPI MPICH OpenMPI Programs should work with either slide by Richard Edgar
  • 29. Basic Idea • Usually programmed with SPMD model (single program, multiple data) • In MPI-1 number of tasks is static - cannot dynamically spawn new tasks at runtime. Enhanced in MPI-2. • No assumptions on type of interconnection network; all processors can send a message to any other processor. • All parallelism explicit - programmer responsible for correctly identifying parallelism and implementing parallel algorithms adapted from Berger & Klöckner (NYU 2010)
  • 31. Hello World #include <mpi.h> #include <stdio.h> int main(int argc, char** argv) { int rank, size; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &rank); MPI_Comm_size(MPI_COMM_WORLD, &size); printf("Hello world from %d of %d\n", rank, size); MPI_Finalize(); return 0; } adapted from Berger & Klöckner (NYU 2010)
  • 32. Hello World To compile: Need to load “MPI” wrappers in addition to the compiler modules (OpenMPI, MPICH, ...) module load mpi/openmpi/1.2.8/gnu module load openmpi/intel/1.3.3 To compile: mpicc hello.c To run: need to tell how many processes you are requesting mpiexec -n 10 a.out (mpirun -np 10 a.out) adapted from Berger & Klöckner (NYU 2010)
  • 33. The beauty of data visualization http://www.youtube.com/watch?v=pLqjQ55tz-U
  • 34. The beauty of data visualization http://www.youtube.com/watch?v=pLqjQ55tz-U
  • 36.
  • 37. “ They’ve done studies, you know. 60% of the time, it works every time... ” - Brian Fantana (Anchorman, 2004)
  • 38. Outline 1. The problem 2. Intro to MPI 3. MPI Basics 4. MPI+CUDA 5. Other approaches
  • 39. Basic MPI MPI is a library of routines Bindings exist for many languages Principal languages are C, C++ and Fortran Python: mpi4py We will discuss C++ bindings from now on http://www.mcs.anl.gov/research/projects/mpi/mpi-standard/mpi-report-2.0/node287.htm slide by Richard Edgar
  • 40. Basic MPI MPI allows processes to exchange messages Processes are members of communicators Communicator shared by all is MPI::COMM_WORLD In C++ API, communicators are objects Within a communicator, each process has unique ID slide by Richard Edgar
  • 41. A Minimal MPI Program #include <iostream> using namespace std; #include “mpi.h” int main( int argc, char* argv[] ) { MPI::Init( argc, argv ); cout << “Hello World!” << endl; MPI::Finalize(); return( EXIT_SUCCESS ); } Very much a minimal program; no actual communication occurs slide by Richard Edgar
  • 42. A Minimal MPI Program To compile MPI programs use mpic++ mpic++ -o MyProg myprog.cpp The mpic++ command is a wrapper for default compiler Adds in libraries Use mpic++ --show to see what it does Will also find mpicc, mpif77 and mpif90 (usually) slide by Richard Edgar
  • 43. A Minimal MPI Program To run the program, use mpirun mpirun -np 2 ./MyProg The -np 2 option launches two processes Check documentation for your cluster Number of processes might be implicit Program should print “Hello World” twice slide by Richard Edgar
  • 44. Communicators Processes are members of communicators A process can Find the size of a given communicator Determine its ID (or rank) within it Default communicator is MPI::COMM_WORLD slide by Richard Edgar
  • 45. Communicators int nProcs, iMyProc; MPI::Init( argc, argv ); nProcs = MPI::COMM_WORLD.Get_size(); iMyProc = MPI::COMM_WORLD.Get_rank(); cout << “Hello from process ”; cout << iMyProc << “ of ”; cout << nProcs << endl; MPI::Finalize(); Queries the COMM_WORLD communicator for the number of processes and the current process rank (ID), then prints these out; process rank counts from zero slide by Richard Edgar
  • 46. Communicators By convention, process with rank 0 is master const int iMasterProc = 0; Can have more than one communicator Process may have different rank within each slide by Richard Edgar
  • 47. Messages Haven’t sent any data yet Communicators have Send and Recv methods for this One process posts a Send Must be matched by Recv in the target process slide by Richard Edgar
  • 48. Sending Messages A sample send is as follows: int a[10]; MPI::COMM_WORLD.Send( a, 10, MPI::INT, iTargetProc, iTag ); The method prototype is void Comm::Send( const void* buf, int count, const Datatype& datatype, int dest, int tag) const MPI copies the buffer into a system buffer and returns No delivery notification slide by Richard Edgar
  • 49. Receiving Messages Similar call to receive: int a[10]; MPI::COMM_WORLD.Recv( a, 10, MPI::INT, iSrcProc, iMyTag ); (MPI::ANY_SOURCE and MPI::ANY_TAG can be used as wildcards for the source and tag) Function prototype is void Comm::Recv( void* buf, int count, const Datatype& datatype, int source, int tag) const Blocks until data arrives slide by Richard Edgar
  • 50. MPI Datatypes MPI datatypes are independent of language and endianness. The most common: MPI::CHAR = signed char, MPI::SHORT = signed short, MPI::INT = signed int, MPI::LONG = signed long, MPI::FLOAT = float, MPI::DOUBLE = double, MPI::BYTE = untyped byte data slide by Richard Edgar
  • 51. MPI Send & Receive if( iMyProc == iMasterProc ) { for( int i=1; i<nProcs; i++ ) { int iMessage = 2 * i + 1; cout << “Sending ” << iMessage << “ to process ” << i << endl; MPI::COMM_WORLD.Send( &iMessage, 1, MPI::INT, i, iTag ); } } else { int iMessage; MPI::COMM_WORLD.Recv( &iMessage, 1, MPI::INT, iMasterProc, iTag ); cout << “Process ” << iMyProc << “ received ” << iMessage << endl; } The master process sends out numbers; worker processes print out the number received slide by Richard Edgar
  • 52. Six Basic MPI Routines Have now encounted six MPI routines MPI::Init(), MPI::Finalize() MPI::COMM_WORLD.Get_size(), MPI::COMM_WORLD.Get_rank(), MPI::COMM_WORLD.Send(), MPI::COMM_WORLD.Recv() These are enough to get started ;-) More sophisticated routines available... slide by Richard Edgar
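As a rough illustration of those six routines, here is a minimal sketch written against the plain C API (the slides use the C++ bindings); the message values and the tag are arbitrary placeholders:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char** argv)
    {
        int rank, size, tag = 0;
        MPI_Init(&argc, &argv);                    /* 1. start MPI           */
        MPI_Comm_size(MPI_COMM_WORLD, &size);      /* 2. how many processes? */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* 3. which one am I?     */

        if (rank == 0) {                           /* master sends a number  */
            for (int i = 1; i < size; ++i) {
                int msg = 2 * i + 1;
                MPI_Send(&msg, 1, MPI_INT, i, tag, MPI_COMM_WORLD);      /* 4. */
            }
        } else {                                   /* workers receive it     */
            int msg;
            MPI_Recv(&msg, 1, MPI_INT, 0, tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                                 /* 5. */
            printf("Process %d received %d\n", rank, msg);
        }

        MPI_Finalize();                            /* 6. shut MPI down       */
        return 0;
    }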
  • 53. Collective Communications Send and Recv are point-to-point Communicate between specific processes Sometimes we want all processes to exchange data These are called collective communications slide by Richard Edgar
  • 54. Barriers Barriers require all processes to synchronise MPI::COMM_WORLD.Barrier(); Processes wait until all processes arrive at barrier Potential for deadlock Bad for performance Only use if necessary slide by Richard Edgar
  • 55. Broadcasts Suppose one process has array to be shared with all int a[10]; MPI::COMM_WORLD.Bcast( a, 10, MPI::INT, iSrcProc ); If process has rank iSrcProc, it will send the array Other processes will receive it All will have a[10] identical to iSrcProc on completion slide by Richard Edgar
  • 56. MPI Broadcast [diagram: before the call only P0 holds A; after the broadcast P0, P1, P2 and P3 all hold A] MPI_Bcast(&buf, count, datatype, root, comm) All processors must call MPI_Bcast with the same root value. adapted from Berger & Klöckner (NYU 2010)
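A minimal sketch of the same broadcast in the C API, assuming rank 0 is the root and that this snippet runs between MPI_Init and MPI_Finalize with rank already obtained as in the earlier skeleton:

    int a[10];
    if (rank == 0) {
        for (int i = 0; i < 10; ++i)   /* only the root needs valid data beforehand */
            a[i] = i;
    }
    /* every rank makes the same call; afterwards all copies of a[] match rank 0's */
    MPI_Bcast(a, 10, MPI_INT, 0, MPI_COMM_WORLD);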
  • 57. Reductions Suppose we have a large array split across processes We want to sum all the elements Use MPI::COMM_WORLD.Reduce() with MPI::Op SUM Also MPI::COMM_WORLD.Allreduce() variant Can perform MAX, MIN, MAXLOC, MINLOC too slide by Richard Edgar
  • 58. MPI Reduce [diagram: P0, P1, P2, P3 hold A, B, C, D; after the reduce P0 holds the combination of A, B, C and D] Reduction operators can be min, max, sum, multiply, logical ops, max value and location ... Must be associative (commutative optional) adapted from Berger & Klöckner (NYU 2010)
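A sketch of a sum-reduction in the C API, again assuming rank is already known; local_value is a placeholder for whatever each rank has computed, and only rank 0 (the root) receives the result. MPI_Allreduce takes the same arguments minus the root and leaves the sum on every rank.

    double local_value = (double)rank;   /* placeholder for a locally computed quantity */
    double global_sum  = 0.0;            /* meaningful only on the root after the call  */
    MPI_Reduce(&local_value, &global_sum, 1, MPI_DOUBLE,
               MPI_SUM, 0, MPI_COMM_WORLD);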
  • 59. Scatter and Gather Split a large array between processes Use MPI::COMM_WORLD.Scatter() Each process receives part of the array Combine small arrays into one large one Use MPI::COMM_WORLD.Gather() Designated process will construct entire array Has MPI::COMM_WORLD.Allgather() variant slide by Richard Edgar
  • 60. MPI Scatter/Gather [diagram: Scatter splits P0’s array A B C D so that P0..P3 each receive one piece; Gather collects the pieces back onto P0] adapted from Berger & Klöckner (NYU 2010)
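A sketch of the scatter/gather pair in the C API, assuming the array length N divides evenly by the number of processes (size) and that full[] only needs real data on the root; the doubling step stands in for whatever local work each rank does:

    #define N 1024
    double full[N];                /* whole array, significant on rank 0 only */
    double part[N];                /* each rank uses just N/size entries      */
    int chunk = N / size;

    MPI_Scatter(full, chunk, MPI_DOUBLE,   /* root hands out pieces of size chunk */
                part, chunk, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    for (int i = 0; i < chunk; ++i)        /* work on the local piece             */
        part[i] *= 2.0;

    MPI_Gather(part, chunk, MPI_DOUBLE,    /* root collects the pieces back       */
               full, chunk, MPI_DOUBLE,
               0, MPI_COMM_WORLD);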
  • 61. MPI Allgather [diagram: P0..P3 start with A, B, C, D respectively; after Allgather every process holds the full array A B C D] adapted from Berger & Klöckner (NYU 2010)
  • 62. MPI Alltoall [diagram: P0..P3 start with rows A0 A1 A2 A3, B0 B1 B2 B3, C0 C1 C2 C3, D0 D1 D2 D3; after Alltoall P0 holds A0 B0 C0 D0, P1 holds A1 B1 C1 D1, and so on, i.e. a transpose across processes] adapted from Berger & Klöckner (NYU 2010)
  • 63. Asynchronous Messages An asynchronous API exists too Have to allocate buffers Have to check if send or receive has completed Will give better performance Trickier to use slide by Richard Edgar
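A sketch of the non-blocking (asynchronous) calls in the C API: each rank exchanges one integer with its ring neighbours, and MPI_Waitall is the completion check mentioned above. The ring pattern and the exchanged values are just an example; rank and size are assumed from the earlier skeleton.

    int outgoing = rank, incoming = -1;
    int right = (rank + 1) % size;
    int left  = (rank + size - 1) % size;
    MPI_Request reqs[2];

    MPI_Irecv(&incoming, 1, MPI_INT, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&outgoing, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... do useful computation here while the messages are in flight ... */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* both transfers now complete */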
  • 64. User-Defined Datatypes Usually have complex data structures Require means of distributing these Can pack & unpack manually MPI allows us to define own datatypes for this slide by Richard Edgar
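A sketch of a user-defined datatype in the C API for a hypothetical Particle struct (the struct, its fields and the ranks involved are all made up for illustration); MPI_Type_create_struct describes the member layout so a whole struct can travel in one message instead of being packed by hand. At least two ranks are assumed.

    #include <stddef.h>   /* for offsetof */

    struct Particle { double pos[3]; double mass; int id; };

    int          blocklens[3] = { 3, 1, 1 };
    MPI_Aint     displs[3]    = { offsetof(struct Particle, pos),
                                  offsetof(struct Particle, mass),
                                  offsetof(struct Particle, id) };
    MPI_Datatype types[3]     = { MPI_DOUBLE, MPI_DOUBLE, MPI_INT };
    MPI_Datatype particle_type;

    MPI_Type_create_struct(3, blocklens, displs, types, &particle_type);
    MPI_Type_commit(&particle_type);

    struct Particle p = { {0.0, 0.0, 0.0}, 1.0, 42 };
    if (rank == 0)
        MPI_Send(&p, 1, particle_type, 1, 0, MPI_COMM_WORLD);     /* one Particle */
    else if (rank == 1)
        MPI_Recv(&p, 1, particle_type, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&particle_type);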
  • 65. MPI-2 • One-sided RMA (remote memory access) communication • potential for greater efficiency, easier programming. • Use ”windows” into memory to expose regions for access • Race conditions now possible. • Parallel I/O like message passing but to file system not other processes. • Allows for dynamic number of processes and inter-communicators (as opposed to intra-communicators) • Cleaned up MPI-1 adapted from Berger & Klöckner (NYU 2010)
  • 66. RMA • Processors can designate portions of their address space as available to other processors for read/write operations (MPI Get, MPI Put, MPI Accumulate). • RMA window objects created by collective window-creation fns. (MPI Win create must be called by all participants) • Before accessing, call MPI Win fence (or other synchr. mechanisms) to start RMA access epoch; fence (like a barrier) separates local ops on window from remote ops • RMA operations are non-blocking; separate synchronization needed to check completion. Call MPI Win fence again. [diagram: a Put from P0’s local memory into an RMA window in P1’s local memory] adapted from Berger & Klöckner (NYU 2010)
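A sketch of those one-sided calls in the C API, assuming at least two ranks and the usual rank variable: rank 0 puts one double into rank 1's exposed window, with fences opening and closing the access epoch.

    double window_buf[100] = { 0.0 };   /* memory this rank exposes to the others */
    double value = 3.14;
    MPI_Win win;

    MPI_Win_create(window_buf, 100 * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);              /* start the RMA access epoch             */
    if (rank == 0)
        MPI_Put(&value, 1, MPI_DOUBLE,  /* write 'value' ...                      */
                1, 5, 1, MPI_DOUBLE,    /* ... into slot 5 of rank 1's window     */
                win);
    MPI_Win_fence(0, win);              /* end the epoch; the put has completed   */

    MPI_Win_free(&win);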
  • 68. Sample MPI Bugs Only works for an even number of processors. What’s wrong? adapted from Berger & Klöckner (NYU 2010)
  • 69. Sample MPI Bugs Only works for an even number of processors. adapted from Berger & Klöckner (NYU 2010)
  • 70. Sample MPI Bugs Suppose you have a local variable “energy” and you want to sum all the processors’ energies to find the total energy of the system. Recall MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm) What’s wrong with using the same variable for both buffers, as in MPI_Reduce(energy, energy, 1, MPI_REAL, MPI_SUM, MPI_COMM_WORLD)? adapted from Berger & Klöckner (NYU 2010)
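The issue is that MPI does not allow the send and receive buffers of a reduction to overlap. Two common fixes, sketched in the C API with MPI_DOUBLE standing in for the Fortran MPI_REAL and rank assumed as before:

    double energy = 1.0;   /* placeholder for the locally computed energy */
    double total  = 0.0;

    /* fix 1: reduce into a separate variable; only the root gets the sum */
    MPI_Reduce(&energy, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    /* fix 2: let the root reuse 'energy' in place via MPI_IN_PLACE */
    if (rank == 0)
        MPI_Reduce(MPI_IN_PLACE, &energy, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    else
        MPI_Reduce(&energy, NULL, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);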
  • 72. Communication Topologies Some topologies very common Grid, hypercube etc. API provided to set up communicators following these slide by Richard Edgar
  • 73. Parallel Performance Recall Amdahl’s law: if T_1 = serial cost + parallel cost then T_p = serial cost + parallel cost/p But really T_p = serial cost + parallel cost/p + T_communication How expensive is it? adapted from Berger & Klöckner (NYU 2010)
  • 74. Network Characteristics Interconnection network connects nodes, transfers data Important qualities: • Topology - the structure used to connect the nodes • Routing algorithm - how messages are transmitted between processors, along which path (= nodes along which message transferred). • Switching strategy = how message is cut into pieces and assigned a path • Flow control (for dealing with congestion) - stall, store data in buffers, re-route data, tell source to halt, discard, etc. adapted from Berger & Klöckner (NYU 2010)
  • 75. Interconnection Network Represent as graph G = (V , E), V = set of nodes to be connected, E = direct links between the nodes. Links usually bidirectional - transfer msg in both directions at same time. Characterize network by: • diameter - maximum over all pairs of nodes of the shortest path between the nodes (length of path in message transmission) • degree - number of direct links for a node (number of direct neighbors) • bisection bandwidth - minimum number of edges that must be removed to partition network into two parts of equal size with no connection between them. (measures network capacity for transmitting messages simultaneously) • node/edge connectivity - numbers of node/edges that must fail to disconnect the network (measure of reliability) adapted from Berger & Klöckner (NYU 2010)
  • 76. Linear Array • p vertices, p − 1 links • Diameter = p − 1 • Degree = 2 • Bisection bandwidth = 1 • Node connectivity = 1, edge connectivity = 1 adapted from Berger & Klöckner (NYU 2010)
  • 77. Ring topology • diameter = p/2 • degree = 2 • bisection bandwidth = 2 • node connectivity = 2 edge connectivity = 2 adapted from Berger & Klöckner (NYU 2010)
  • 78. Mesh topology • diameter = 2(√p − 1) (a 3d mesh is 3(p^(1/3) − 1)) • degree = 4 (6 in 3d) • bisection bandwidth = √p • node connectivity 2, edge connectivity 2 Route along each dimension in turn adapted from Berger & Klöckner (NYU 2010)
  • 79. Torus topology Diameter halved, Bisection bandwidth doubled, Edge and Node connectivity doubled over mesh adapted from Berger & Klöckner (NYU 2010)
  • 80. Hypercube topology [diagram: 1-, 2-, 3- and 4-dimensional hypercubes with nodes labelled by binary strings] • p = 2^k processors labelled with binary numbers of length k • k-dimensional cube constructed from two (k − 1)-cubes • Connect corresponding procs if labels differ in 1 bit (Hamming distance d between 2 k-bit binary words = path of length d between 2 nodes) adapted from Berger & Klöckner (NYU 2010)
  • 81. Hypercube topology [diagram repeated] • diameter = k (= log p) • degree = k • bisection bandwidth = p/2 • node connectivity k, edge connectivity k adapted from Berger & Klöckner (NYU 2010)
  • 82. Dynamic Networks Above networks were direct, or static interconnection networks = processors connected directly with each through fixed physical links. Indirect or dynamic networks = contain switches which provide an indirect connection between the nodes. Switches configured dynamically to establish a connection. • bus • crossbar • multistage network - e.g. butterfly, omega, baseline adapted from Berger & Klöckner (NYU 2010)
  • 83. Crossbar [diagram: processors P1..Pn connected to memories M1..Mm through a grid of switches] • Connecting n inputs and m outputs takes nm switches. (Typically only for small numbers of processors) • At each switch can either go straight or change dir. • Diameter = 1, bisection bandwidth = p adapted from Berger & Klöckner (NYU 2010)
  • 84. Butterfly [diagram: a 16 × 16 butterfly network with stages 0 to 3 and rows labelled 000 to 111] for p = 2^(k+1) processors, k + 1 stages, 2^k switches per stage, 2 × 2 switches adapted from Berger & Klöckner (NYU 2010)
  • 85. Fat tree • Complete binary tree • Processors at leaves • Increase links for higher bandwidth near root adapted from Berger & Klöckner (NYU 2010)
  • 86. Current picture • Old style: mapped algorithms to topologies • New style: avoid topology-specific optimizations • Want code that runs on next year’s machines too. • Topology awareness in vendor MPI libraries? • Software topology - ease of programming, but not used for performance? adapted from Berger & Klöckner (NYU 2010)
  • 87. Should we care ? • Old school: map algorithms to specific topologies • New school: avoid topology-specific optimizations (the code should be optimal on next year’s infrastructure....) • Meta-programming / Auto-tuning ?
  • 88. Top500 Interconnects [screenshot: interconnect family statistics from the Top500 list, with a direct link to the statistics page] adapted from Berger & Klöckner (NYU 2010)
• 89. MPI References • Lawrence Livermore tutorial https://computing.llnl.gov/tutorials/mpi/ • Using MPI: Portable Parallel Programming with the Message-Passing Interface by Gropp, Lusk, Skjellum • Using MPI-2: Advanced Features of the Message-Passing Interface by Gropp, Lusk, Thakur • Lots of other on-line tutorials, books, etc. adapted from Berger & Klöckner (NYU 2010)
  • 90. Ignite: Google Trends http://www.youtube.com/watch?v=m0b-QX0JDXc
  • 91. Outline 1. The problem 2. Intro to MPI 3. MPI Basics 4. MPI+CUDA 5. Other approaches
  • 92. MPI with CUDA MPI and CUDA almost orthogonal Each node simply becomes faster Problem matching MPI processes to GPUs Use compute-exclusive mode on GPUs Tell cluster environment to limit processes per node Have to know your cluster documentation slide by Richard Edgar
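One common way to handle the process-to-GPU matching mentioned above is to give each MPI process a node-local rank and use it to pick a device. A minimal sketch, assuming one MPI process per GPU and an MPI-3 library (MPI_Comm_split_type); on the MPI-2 installations common at the time of this lecture you would instead derive the local rank from the hostname or simply use rank % deviceCount:

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Rank of this process among the processes running on the same node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    cudaSetDevice(local_rank % ngpus);   /* pin each local rank to its own GPU */

    printf("world rank %d -> local rank %d -> GPU %d of %d\n",
           world_rank, local_rank, local_rank % ngpus, ngpus);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}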
  • 93. Data Movement Communication now very expensive GPUs can only communicate via their hosts Very laborious Again: need to minimize communication slide by Richard Edgar
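Without a CUDA-aware MPI, a GPU-to-GPU exchange is staged through the hosts: copy device to host, send/receive with MPI, copy host to device. A hedged sketch of that pattern for two ranks swapping a buffer (names such as exchange and d_buf are ours, not from the slides):

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

/* Swap the contents of a device buffer of n floats between ranks 0 and 1. */
void exchange(float *d_buf, int n, int rank)
{
    int peer = 1 - rank;                       /* assumes exactly 2 ranks */
    float *h_send = (float *)malloc(n * sizeof(float));
    float *h_recv = (float *)malloc(n * sizeof(float));

    /* 1. Stage the device data on the host. */
    cudaMemcpy(h_send, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);

    /* 2. Exchange the host buffers over MPI. */
    MPI_Sendrecv(h_send, n, MPI_FLOAT, peer, 0,
                 h_recv, n, MPI_FLOAT, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* 3. Push the received data back to the device. */
    cudaMemcpy(d_buf, h_recv, n * sizeof(float), cudaMemcpyHostToDevice);

    free(h_send);
    free(h_recv);
}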
  • 94. MPI Summary MPI provides cross-platform interprocess communication Invariably available on computer clusters Only need six basic commands to get started Much more sophistication available slide by Richard Edgar
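The six basic calls usually meant here are MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Send, MPI_Recv, and MPI_Finalize. A minimal program using all six (illustration only):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);                              /* 1 */

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);                /* 2 */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);                /* 3 */

    if (rank == 1) {
        /* Rank 1 sends its rank number to rank 0. */
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);            /* 4 */
    } else if (rank == 0 && size > 1) {
        int msg;
        MPI_Recv(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                                   /* 5 */
        printf("rank 0 received %d from rank 1 (world size %d)\n", msg, size);
    }

    MPI_Finalize();                                      /* 6 */
    return 0;
}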
  • 95. Outline 1. The problem 2. Intro to MPI 3. MPI Basics 4. MPI+CUDA 5. Other approaches
  • 96.
  • 97. ZeroMQ • ‘messaging middleware’ ‘TCP on steroids’ ‘new layer on the networking stack’ • not a complete messaging system • just a simple messaging library to be used programmatically. • a “pimped” socket interface allowing you to quickly design / build a complex communication system without much effort http://nichol.as/zeromq-an-introduction http://zguide.zeromq.org/page:all
• 98. ZeroMQ • Fastest. Messaging. Ever. • Excellent documentation: • examples • white papers for everything • Bindings for Ada, Basic, C, Chicken Scheme, Common Lisp, C#, C++, D, Erlang*, Go*, Haskell*, Java, Lua, node.js, Objective-C, ooc, Perl, PHP, Python, Racket, Ruby, Tcl http://nichol.as/zeromq-an-introduction http://zguide.zeromq.org/page:all
  • 99. Message Patterns http://nichol.as/zeromq-an-introduction http://zguide.zeromq.org/page:all
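For concreteness, the simplest of these patterns is request-reply. Below is a minimal REQ client in C, written against the libzmq 3.x/4.x API (the 2.x API that was current in 2011 used zmq_init and zmq_msg_* instead); the endpoint tcp://localhost:5555 and the matching REP server are assumptions for the sketch, not something from the slides:

#include <zmq.h>
#include <stdio.h>

int main(void)
{
    void *ctx = zmq_ctx_new();
    void *req = zmq_socket(ctx, ZMQ_REQ);
    zmq_connect(req, "tcp://localhost:5555");

    zmq_send(req, "ping", 4, 0);                      /* send a request ...      */
    char buf[64] = {0};
    int n = zmq_recv(req, buf, sizeof(buf) - 1, 0);   /* ... block for the reply */
    if (n >= 0)
        printf("reply: %s\n", buf);

    zmq_close(req);
    zmq_ctx_destroy(ctx);
    return 0;
}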
  • 100. Demo: Why ZeroMQ ? http://www.youtube.com/watch?v=_JCBphyciAs
  • 101. MPI vs ZeroMQ ? • MPI is a specification, ZeroMQ is an implementation. • Design: • MPI is designed for tightly-coupled compute clusters with fast and reliable networks. • ZeroMQ is designed for large distributed systems (web-like). • Fault tolerance: • MPI has very limited facilities for fault tolerance (the default error handling behavior in most implementations is a system-wide fail, ouch!). • ZeroMQ is resilient to faults and network instability. • ZeroMQ could be a good transport layer for an MPI-like implementation. http://stackoverflow.com/questions/35490/spread-vs-mpi-vs-zeromq
• 102. Fast Forward CUDASA
  • 103.
• 104. CUDASA: Compute Unified Device and Systems Architecture (Visualization Research Institute (VISUS), University of Stuttgart). Motivation: extend the parallelism of CUDA for a single GPU to multi-GPU systems (Tesla, QuadroPlex, ...) and GPU cluster environments (Myrinet, InfiniBand, ...); consistent developer interface; easily embedded into the CUDA compile process.
• 105. System overview, GPU layer (layers: Application / Network layer / Bus layer / GPU layer): blocks run in parallel; no communication between blocks; unmodified CUDA programming interface; no code modification required.
• 106. System overview, bus layer: one GPU equals one POSIX thread; blocks run in parallel; all blocks share common system memory (e.g. CUDA global memory); workload-balanced scheduling of blocks to the GPUs.
• 107. System overview, network layer: one MPI process per cluster node; blocks run in parallel; no intrinsic global memory; distributed shared memory management; workload-balanced scheduling of blocks to the GPUs.
• 108. System overview, application layer: sequential application process; arbitrary C/C++ application code; allocation/deallocation of distributed/system memory; issues function calls on the network/bus level.
• 109. Language extensions, concepts: unchanged CUDA syntax and semantics for the GPU layer; minimal set of extensions for the additional abstraction layers; generalize the present programming paradigm; mimic the GPU interface for all new basic execution units; programmability: the underlying mechanisms for parallelism/communication stay hidden.
• 110. Language extensions (cont'd): annotated code example spanning the layers; the familiar CUDA function qualifiers and <<< Dg, Db, Ns >>> execution configuration are kept on the GPU layer, while new function qualifiers and a generalized execution configuration are introduced for the higher layers.
• 111. Language extensions (cont'd), per-layer qualifiers and built-ins: application layer: __sequence__; network layer: __job__ (internal __node__, built-ins jobIdx/jobDim); bus layer: __task__ (internal __host__, built-ins taskIdx/taskDim); GPU layer: __global__/__device__ with gridDim, blockIdx, blockDim, threadIdx. Exposed functions are accessible from the next higher abstraction; built-ins are automatically propagated to all underlying layers.
• 112. Implementation. Runtime library: idle basic execution units request new workload; distributed shared memory for the network layer; common interface functions (e.g. atomic functions). CUDASA compiler: code translation from CUDASA code to CUDA with threads/MPI; self-contained pre-compiler feeding the regular CUDA compiler process; based on Elsa (C++ parser) with added CUDA/CUDASA functionality; full analysis of syntax and semantics required.
• 113. Implementation, bus layer: source-to-source translation of __task__ functions; the task's parameters together with the taskIdx/taskDim built-ins are packed into a generated wrapper struct, and a wrapper function unpacks them before executing the original function body.
• 114. Implementation, bus layer (cont'd), generated code layout: copy the function parameters into the wrapper struct; populate the scheduler queue with all blocks of the task; determine the built-ins for each block; wake up CPU worker threads from the thread pool; idle CPUs request the next pending block from the queue; wait for all blocks to be processed; issuing a task function is a blocking call.
• 115. Implementation, network layer: analogous to the bus layer, using the MPI interface; nodes run a compile-time generated event loop (the equivalent of the thread pool); the application issues broadcast messages to launch job execution and to perform distributed shared memory operations.
• 116. Implementation, network layer (cont'd), distributed shared memory: enables computations exceeding the system memory of a single node; each cluster node dedicates part of its system memory to the DSM; continuous virtual address range; usage equivalent to CUDA global memory management; implemented using MPI Remote Memory Access (RMA); no guarantees for concurrent non-atomic memory accesses (as in CUDA); CUDASA atomic functions for the DSM.
• 117. Results, bus parallelism: test case 1: BLAS single-precision general matrix multiply (SGEMM); CPUs: AMD Opteron (2 x 2 cores), Intel Q6600 (4 cores); GPUs: NVIDIA Quadro FX5600, 8800 GTX Ultra (2 cards), 8800 GT (2 to 4 cards).
• 118. Results, bus parallelism (cont'd): test case 2: GPU-based global illumination ("Implicit Visibility and Antiradiance for Interactive Global Illumination", Dachsbacher et al. 2007); partition of all scene elements into task and kernel blocks; uniform directional radiance distribution (128 samples); timings in milliseconds for a single-node multi-GPU system with four NVIDIA 8800 GTs.
• 119. Results, network parallelism: proof of concept: SGEMM using network parallelism; test case: 2 cluster nodes with two 8800 GTs each over Gigabit Ethernet; high communication costs; a single PC with 4 GPUs used as 4 single-GPU cluster nodes (bus level only) requires inter-process communication; DSM accesses take longer than the computation itself; no awareness of data locality; high (unnecessary) communication overhead.
• 120. Conclusion: CUDASA is an extension to CUDA enabling bus- and network-level parallelism; minimal changes to the original language; low programming and learning overhead; good scaling behavior on the bus level, especially for very large target computations; easy to integrate into the CUDA development process. Current project state: extending CUDASA to add awareness of data locality (idea: callback mechanism in the execution configuration); minimize the amount of DSM data to be communicated; automatically make use of asynchronous data transfers to the GPUs; preparations for making CUDASA publicly available.
• 121. Fast Forward MultiGPU MapReduce
  • 122.
  • 123. MapReduce http://m.blog.hu/dw/dwbi/image/2009/Q4/mapreduce_small.png
  • 124. Why MapReduce? • Simple programming model • Parallel programming model • Scalable • Previous GPU work: neither multi-GPU nor out-of-core
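To make the "simple programming model" point concrete, here is a CPU-only toy version of the map / shuffle / reduce flow behind a word-occurrence count; it illustrates the model only and is not GPMR's API:

/* Map: emit (word, 1); shuffle: sort by key; reduce: sum runs of equal keys. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct { char key[32]; int value; } KV;

static int cmp_kv(const void *a, const void *b)
{
    return strcmp(((const KV *)a)->key, ((const KV *)b)->key);
}

int main(void)
{
    const char *words[] = { "gpu", "mpi", "gpu", "cuda", "gpu", "mpi" };
    const int n = sizeof(words) / sizeof(words[0]);

    /* Map phase: one (word, 1) pair per input word. */
    KV *pairs = malloc(n * sizeof(KV));
    for (int i = 0; i < n; ++i) {
        snprintf(pairs[i].key, sizeof(pairs[i].key), "%s", words[i]);
        pairs[i].value = 1;
    }

    /* Shuffle/sort phase: bring equal keys together. */
    qsort(pairs, n, sizeof(KV), cmp_kv);

    /* Reduce phase: sum the values of each run of identical keys. */
    for (int i = 0; i < n; ) {
        int j = i, sum = 0;
        while (j < n && strcmp(pairs[j].key, pairs[i].key) == 0)
            sum += pairs[j++].value;
        printf("%s: %d\n", pairs[i].key, sum);
        i = j;
    }

    free(pairs);
    return 0;
}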
  • 125. Benchmarks—Which • Matrix Multiplication (MM) • Word Occurrence (WO) • Sparse-Integer Occurrence (SIO) • Linear Regression (LR) • K-Means Clustering (KMC) • (Volume Renderer—presented 90 minutes ago @ MapReduce ’10)
  • 126. Benchmarks—Why • Needed to stress aspects of GPMR • Unbalanced work (WO) • Multiple emits/Non-uniform number of emits (LR, KMC, WO) • Sparsity of keys (SIO) • Accumulation (WO, LR, KMC) • Many key-value pairs (SIO) • Compute Bound Scalability (MM)
• 128. Benchmarks—Results
Speedup of GPMR over Phoenix (CPU MapReduce) on the large (second-biggest) input data from the first set; the exception is MM, which uses the small input set (Phoenix required almost twenty seconds to multiply two 1024 × 1024 matrices):
                 MM       KMC     LR     SIO    WO
  1-GPU Speedup  162.712  2.991   1.296  1.450  11.080
  4-GPU Speedup  559.209  11.726  4.085  2.322  18.441
Speedup of GPMR over Mars (GPU MapReduce) on 4096 × 4096 Matrix Multiplication, an 8M-point K-Means Clustering, and a 512 MB Word Occurrence; these sizes represent the largest problems that can meet the in-core memory requirements of Mars:
                 MM      KMC      WO
  1-GPU Speedup  2.695   37.344   3.098
  4-GPU Speedup  10.760  129.425  11.709
  • 132. one more thing or two...
  • 133. Life/Code Hacking #3 The Pomodoro Technique
  • 134. Life/Code Hacking #3 The Pomodoro Technique http://lifehacker.com/#!5554725/the-pomodoro-technique-trains-your-brain-away-from-distractions
• 136.