2. Administrativia
• Homeworks: HW2 due Mon 3/14/11, HW3 out Fri 3/11/11
• Project info: http://www.cs264.org/projects/projects.html
• Project ideas: http://forum.cs264.org/index.php?board=6.0
• Project proposal deadline: Fri 3/25/11
(but you should submit well before then so you can start working asap)
• Need a private repo for your project?
Let us know! Poll on the forum:
http://forum.cs264.org/index.php?topic=228.0
4. Goodies (cont’d)
• Amazon AWS free credits coming soon
(only for students who completed HW0+1)
• That’s more than a $14,000 donation to the class!
• Special thanks: Kurt Messersmith @ Amazon
5. Goodies (cont’d)
• Best Project Prize: Tesla C2070 (Fermi) Board
• That’s more than a $4,000 donation to the class!
• Special thanks:
David Luebke & Chandra Cheij @ NVIDIA
6. During this course, we’ll try to adapt and use existing material ;-)
[“adapted for CS264” stamp]
10. The Problem
Many computational problems too big for single CPU
Lack of RAM
Lack of CPU cycles
Want to distribute work between many CPUs
slide by Richard Edgar
11. Types of Parallelism
Some computations are ‘embarrassingly parallel’
Can do a lot of computation on minimal data
RC5/DES cracking, SETI@home, etc.
Solution is to distribute across the Internet
Use TCP/IP or similar
slide by Richard Edgar
12. Types of Parallelism
Some computations very tightly coupled
Have to communicate a lot of data at each step
e.g. hydrodynamics
Internet latencies much too high
Need a dedicated machine
slide by Richard Edgar
13. Tightly Coupled Computing
Two basic approaches
Shared memory
Distributed memory
Each has advantages and disadvantages
slide by Richard Edgar
14. Some terminology
Shared memory: global memory accessible by all processors; information exchanged through shared variables.
Distributed memory: private memory for each processor, only accessible by that processor, so no synchronization for memory accesses is needed. Information is exchanged by sending data from one processor to another over an interconnection network, using explicit communication operations.
[Diagram: “distributed memory” (each P paired with its own M, linked by an Interconnection Network) vs. “shared memory” (Ps linked through an Interconnection Network to shared Ms)]
Hybrid approach increasingly common; now: mostly hybrid
16. Shared Memory Machines
Have lots of CPUs share the same memory banks
Spawn lots of threads
Each writes to globally shared memory
Multicore CPUs now ubiquitous
Most computers now ‘shared memory machines’
slide by Richard Edgar
17. Shared Memory Machines
NASA ‘Columbia’ Computer
Up to 2048 cores in single system
slide by Richard Edgar
18. Shared Memory Machines
Spawning lots of threads (relatively) easy
pthreads, OpenMP
Don’t have to worry about data location
Disadvantage is memory performance scaling
Frontside bus saturates rapidly
Can use Non-Uniform Memory Architecture (NUMA)
Silicon Graphics Origin & Altix series
Gets expensive very fast
slide by Richard Edgar
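A minimal sketch of this shared-memory model with OpenMP (the array name and size are illustrative; compile with, e.g., g++ -fopenmp):
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const int N = 1000000;
    std::vector<double> a( N );       // one array, shared by all threads
    #pragma omp parallel for          // spawn threads, split the loop
    for( int i = 0; i < N; i++ )
        a[i] = 2.0 * i;               // each thread writes its own chunk
    std::printf( "ran with up to %d threads\n", omp_get_max_threads() );
    return 0;
}
No data placement is needed: every thread simply writes into the same globally shared array.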
20. Distributed Memory Clusters
Alternative is a lot of cheap machines
High-speed network between individual nodes
Network can cost as much as the CPUs!
How do nodes communicate?
slide by Richard Edgar
22. Distributed Memory Model
Communication is key issue
Each node has its own address space
(exclusive access, no global memory)
Could use TCP/IP
Painfully low level
Solution: a message-passing library (e.g. MPI)
slide by Richard Edgar
23. Distributed Memory Model
All data must be explicitly partitioned
Exchange of data by explicit communication
slide by Richard Edgar
25. Message Passing Interface
MPI is a communication protocol for parallel programs
Language independent
Open standard
Originally created by a working group at Supercomputing ’92 (SC92)
Bindings for C, C++, Fortran, Python, etc.
http://www.mcs.anl.gov/research/projects/mpi/
http://www.mpi-forum.org/
slide by Richard Edgar
26. Message Passing Interface
MPI processes have independent address spaces
Communicate by sending messages
Means of sending messages is invisible to the user
Use shared memory if available! (i.e. shared memory can be used
behind the scenes on shared-memory architectures)
Sits at Layer 5 (Session) and above of the OSI model
slide by Richard Edgar
28. Message Passing Interface
MPI is a standard, a specification, for message-passing
libraries
Two major implementations of MPI
MPICH
Open MPI
Programs should work with either
slide by Richard Edgar
29. Basic Idea
• Usually programmed with SPMD model (single program,
multiple data)
• In MPI-1 number of tasks is static - cannot dynamically
spawn new tasks at runtime. Enhanced in MPI-2.
• No assumptions on type of interconnection network; all
processors can send a message to any other processor.
• All parallelism explicit - programmer responsible for
correctly identifying parallelism and implementing parallel
algorithms
adapted from Berger & Klöckner (NYU 2010)
31. Hello World
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello world from %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
adapted from Berger & Klöckner (NYU 2010)
32. Hello World
To compile: need to load the MPI wrappers in addition to the
compiler modules (Open MPI, MPICH, ...), e.g.
module load mpi/openmpi/1.2.8/gnu
module load openmpi/intel/1.3.3
To compile: mpicc hello.c
To run: need to tell how many processes you are requesting
mpiexec -n 10 a.out (mpirun -np 10 a.out)
adapted from Berger & Klöckner (NYU 2010)
33. The beauty of data
visualization
http://www.youtube.com/watch?v=pLqjQ55tz-U
39. Basic MPI
MPI is a library of routines
Bindings exist for many languages
Principal languages are C, C++ and Fortran
Python: mpi4py
We will discuss C++ bindings from now on
http://www.mcs.anl.gov/research/projects/mpi/mpi-standard/mpi-report-2.0/node287.htm
slide by Richard Edgar
40. Basic MPI
MPI allows processes to exchange messages
Processes are members of communicators
Communicator shared by all is MPI::COMM_WORLD
In C++ API, communicators are objects
Within a communicator, each process has unique ID
slide by Richard Edgar
41. A Minimal MPI Program
#include <iostream>
#include <cstdlib>
using namespace std;
#include "mpi.h"

int main( int argc, char* argv[] ) {
    MPI::Init( argc, argv );           // Very much a minimal program
    cout << "Hello World!" << endl;    // No actual communication occurs
    MPI::Finalize();
    return( EXIT_SUCCESS );
}
slide by Richard Edgar
42. A Minimal MPI Program
To compile MPI programs use mpic++
mpic++ -o MyProg myprog.cpp
The mpic++ command is a wrapper for default compiler
Adds in libraries
Use mpic++ -show (MPICH) or mpic++ --showme (Open MPI) to see what it does
Will also find mpicc, mpif77 and mpif90 (usually)
slide by Richard Edgar
43. A Minimal MPI Program
To run the program, use mpirun
mpirun -np 2 ./MyProg
The -np 2 option launches two processes
Check documentation for your cluster
Number of processes might be implicit
Program should print “Hello World” twice
slide by Richard Edgar
44. Communicators
Processes are members of communicators
A process can
Find the size of a given communicator
Determine its ID (or rank) within it
Default communicator is MPI::COMM_WORLD
slide by Richard Edgar
45. Communicators
int nProcs, iMyProc;
MPI::Init( argc, argv );
nProcs = MPI::COMM_WORLD.Get_size();    // queries COMM_WORLD for the number of processes
iMyProc = MPI::COMM_WORLD.Get_rank();   // current process rank (ID)
cout << "Hello from process ";
cout << iMyProc << " of ";              // prints these out
cout << nProcs << endl;                 // process rank counts from zero
MPI::Finalize();
slide by Richard Edgar
46. Communicators
By convention, process with rank 0 is master
const int iMasterProc = 0;
Can have more than one communicator
Process may have different rank within each
slide by Richard Edgar
47. Messages
Haven’t sent any data yet
Communicators have Send and Recv methods for this
One process posts a Send
Must be matched by Recv in the target process
slide by Richard Edgar
48. Sending Messages
A sample send is as follows:
int a[10];
MPI::COMM_WORLD.Send( a, 10, MPI::INT, iTargetProc, iTag );
The method prototype is
void Comm::Send( const void* buf, int count,
const Datatype& datatype,
int dest, int tag) const
MPI copies the buffer into a system buffer and returns
No delivery notification
slide by Richard Edgar
49. Receiving Messages
Similar call to receive:
int a[10];
MPI::COMM_WORLD.Recv( a, 10, MPI::INT, iSrcProc, iMyTag );
Can use MPI::ANY_SOURCE and MPI::ANY_TAG as wildcards
Function prototype is
void Comm::Recv( void* buf, int count,
const Datatype& datatype,
int source, int tag) const
Blocks until data arrives
slide by Richard Edgar
50. MPI Datatypes
MPI datatypes are independent of language and endianness
Most common listed opposite:
MPI Datatype    C/C++
MPI::CHAR       signed char
MPI::SHORT      signed short
MPI::INT        signed int
MPI::LONG       signed long
MPI::FLOAT      float
MPI::DOUBLE     double
MPI::BYTE       untyped byte data
slide by Richard Edgar
51. MPI Send & Receive
if( iMyProc == iMasterProc ) {
    // Master process sends out numbers
    for( int i=1; i<nProcs; i++ ) {
        int iMessage = 2 * i + 1;
        cout << "Sending " << iMessage << " to process " << i << endl;
        MPI::COMM_WORLD.Send( &iMessage, 1, MPI::INT, i, iTag );
    }
} else {
    // Worker processes print out the number received
    int iMessage;
    MPI::COMM_WORLD.Recv( &iMessage, 1, MPI::INT, iMasterProc, iTag );
    cout << "Process " << iMyProc << " received " << iMessage << endl;
}
slide by Richard Edgar
52. Six Basic MPI Routines
Have now encountered six MPI routines
MPI::Init(), MPI::Finalize()
MPI::COMM_WORLD.Get_size(), MPI::COMM_WORLD.Get_rank(),
MPI::COMM_WORLD.Send(), MPI::COMM_WORLD.Recv()
These are enough to get started ;-)
More sophisticated routines available...
slide by Richard Edgar
53. Collective Communications
Send and Recv are point-to-point
Communicate between specific processes
Sometimes we want all processes to exchange data
These are called collective communications
slide by Richard Edgar
54. Barriers
Barriers require all processes to synchronise
MPI::COMM_WORLD.Barrier();
Processes wait until all processes arrive at barrier
Potential for deadlock
Bad for performance
Only use if necessary
slide by Richard Edgar
55. Broadcasts
Suppose one process has array to be shared with all
int a[10];
MPI::COMM_WORLD.Bcast( a, 10, MPI::INT, iSrcProc );
If process has rank iSrcProc, it will send the array
Other processes will receive it
On completion, every process’s a[] is identical to the copy on iSrcProc
slide by Richard Edgar
56. MPI Broadcast
[Diagram: before the call only P0 holds A; after Broadcast, P0–P3 all hold A]
MPI_Bcast(&buf, count, datatype, root, comm)
All processors must call MPI_Bcast with the same root value.
adapted from Berger & Klöckner (NYU 2010)
57. Reductions
Suppose we have a large array split across processes
We want to sum all the elements
Use MPI::COMM_WORLD.Reduce() with the MPI::SUM op (sketch below)
Also MPI::COMM_WORLD.Allreduce() variant
Can perform MAX, MIN, MAXLOC, MINLOC too
slide by Richard Edgar
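A minimal sketch of the reduction just described (the variable names are illustrative):
#include <iostream>
#include <cstdlib>
#include "mpi.h"

int main( int argc, char* argv[] ) {
    MPI::Init( argc, argv );
    const int iMasterProc = 0;
    int iMyProc = MPI::COMM_WORLD.Get_rank();

    double localSum = static_cast<double>( iMyProc );  // stand-in partial sum
    double totalSum = 0.0;

    // Combine every process's localSum with the SUM op; result lands on master
    MPI::COMM_WORLD.Reduce( &localSum, &totalSum, 1, MPI::DOUBLE,
                            MPI::SUM, iMasterProc );

    if( iMyProc == iMasterProc )
        std::cout << "Total: " << totalSum << std::endl;

    MPI::Finalize();
    return( EXIT_SUCCESS );
}
Swapping Reduce for Allreduce (and dropping the root argument) would leave totalSum on every process instead of just the master.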
58. MPI Reduce
[Diagram: before the call P0–P3 hold A, B, C, D; after Reduce, P0 holds op(A,B,C,D)]
Reduction operators can be min, max, sum, multiply, logical
ops, max value and location, ... Must be associative
(commutative optional)
adapted from Berger & Klöckner (NYU 2010)
59. Scatter and Gather
Split a large array between processes
Use MPI::COMM_WORLD.Scatter()
Each process receives part of the array
Combine small arrays into one large one
Use MPI::COMM_WORLD.Gather()
Designated process will construct entire array
Has MPI::COMM_WORLD.Allgather() variant
slide by Richard Edgar
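A minimal sketch of Scatter and Gather under the same conventions (the chunk size and contents are illustrative; for simplicity the full array is allocated on every rank, though only the master's copy matters):
#include <cstdlib>
#include <vector>
#include "mpi.h"

int main( int argc, char* argv[] ) {
    MPI::Init( argc, argv );
    const int iMasterProc = 0;
    int nProcs  = MPI::COMM_WORLD.Get_size();
    int iMyProc = MPI::COMM_WORLD.Get_rank();
    const int chunk = 4;

    // Master fills the big array; everyone has a small receive buffer
    std::vector<int> big( chunk * nProcs );
    if( iMyProc == iMasterProc )
        for( int i = 0; i < chunk * nProcs; i++ ) big[i] = i;
    std::vector<int> part( chunk );

    // Each process receives its own chunk-sized slice of the big array
    MPI::COMM_WORLD.Scatter( &big[0], chunk, MPI::INT,
                             &part[0], chunk, MPI::INT, iMasterProc );

    // Gather is the inverse: the slices are reassembled on the master
    MPI::COMM_WORLD.Gather( &part[0], chunk, MPI::INT,
                            &big[0], chunk, MPI::INT, iMasterProc );

    MPI::Finalize();
    return( EXIT_SUCCESS );
}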
60. MPI Scatter/Gather
[Diagram: Scatter splits P0’s A B C D so that P0–P3 receive A, B, C, D respectively; Gather is the inverse]
adapted from Berger & Klöckner (NYU 2010)
61. MPI Allgather
[Diagram: before the call P0–P3 hold A, B, C, D; after Allgather, every process holds A B C D]
adapted from Berger & Klöckner (NYU 2010)
63. Asynchronous Messages
An asynchronous API exists too
Have to allocate buffers
Have to check if send or receive has completed
Can give better performance by overlapping communication with computation
Trickier to use
slide by Richard Edgar
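A minimal sketch of the non-blocking calls (Isend/Irecv return a Request you must Wait() on), assuming exactly two processes:
#include <cstdlib>
#include "mpi.h"

int main( int argc, char* argv[] ) {
    MPI::Init( argc, argv );
    int iMyProc = MPI::COMM_WORLD.Get_rank();
    const int iTag = 0;
    int a[10];

    if( iMyProc == 0 ) {
        for( int i = 0; i < 10; i++ ) a[i] = i;
        // Isend returns immediately; buffer must stay valid until Wait()
        MPI::Request req = MPI::COMM_WORLD.Isend( a, 10, MPI::INT, 1, iTag );
        // ... overlap useful computation here ...
        req.Wait();   // now safe to reuse a[]
    } else if( iMyProc == 1 ) {
        MPI::Request req = MPI::COMM_WORLD.Irecv( a, 10, MPI::INT, 0, iTag );
        // ... overlap useful computation here ...
        req.Wait();   // data guaranteed present only after Wait()
    }

    MPI::Finalize();
    return( EXIT_SUCCESS );
}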
64. User-Defined Datatypes
Usually have complex data structures
Require means of distributing these
Can pack & unpack manually
MPI allows us to define own datatypes for this
slide by Richard Edgar
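A minimal sketch with a hypothetical three-double particle type, built with Create_contiguous (Create_struct handles mixed-type structs); assumes at least two processes:
#include <cstdlib>
#include "mpi.h"

struct Particle { double x, y, z; };

int main( int argc, char* argv[] ) {
    MPI::Init( argc, argv );
    int iMyProc = MPI::COMM_WORLD.Get_rank();
    const int iTag = 0;

    // Three contiguous doubles viewed as one "particle" element
    MPI::Datatype particleType = MPI::DOUBLE.Create_contiguous( 3 );
    particleType.Commit();   // must commit before first use

    Particle p = { 1.0, 2.0, 3.0 };
    if( iMyProc == 0 )
        MPI::COMM_WORLD.Send( &p, 1, particleType, 1, iTag );
    else if( iMyProc == 1 )
        MPI::COMM_WORLD.Recv( &p, 1, particleType, 0, iTag );

    particleType.Free();
    MPI::Finalize();
    return( EXIT_SUCCESS );
}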
65. MPI-2
• One-sided RMA (remote memory access) communication
• potential for greater efficiency, easier programming.
• Use "windows" into memory to expose regions for access
• Race conditions now possible.
• Parallel I/O: like message passing, but to the file system
rather than to other processes.
• Allows for a dynamic number of processes and
inter-communicators (as opposed to intra-communicators)
• Cleaned up MPI-1
adapted from Berger & Klöckner (NYU 2010)
66. RMA
• Processors can designate portions of their address space as
available to other processors for read/write operations
(MPI_Get, MPI_Put, MPI_Accumulate).
• RMA window objects created by collective window-creation
functions (MPI_Win_create must be called by all participants)
• Before accessing, call MPI_Win_fence (or other synchronization
mechanisms) to start an RMA access epoch; the fence (like a barrier)
separates local ops on the window from remote ops
• RMA operations are non-blocking; separate synchronization is
needed to check completion: call MPI_Win_fence again.
[Diagram: a Put from P0’s local memory into an RMA window in P1’s local memory]
adapted from Berger & Klöckner (NYU 2010)
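A minimal sketch of a fence-delimited put epoch, using the C API named on the slide and assuming at least two processes:
#include <mpi.h>

int main( int argc, char** argv ) {
    int rank;
    double buf = 0.0;          /* memory each rank exposes */
    double val = 42.0;         /* value rank 0 will put */
    MPI_Win win;

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    /* collective: every rank exposes one double as its window */
    MPI_Win_create( &buf, sizeof(double), sizeof(double),
                    MPI_INFO_NULL, MPI_COMM_WORLD, &win );

    MPI_Win_fence( 0, win );   /* start access epoch */
    if( rank == 0 )            /* write val into rank 1's window, displacement 0 */
        MPI_Put( &val, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win );
    MPI_Win_fence( 0, win );   /* the put is complete only after this fence */

    MPI_Win_free( &win );
    MPI_Finalize();
    return 0;
}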
68. Sample MPI Bugs
[Code example from the original slide not preserved in this transcript]
Only works for an even number of processors. What’s wrong?
adapted from Berger & Klöckner (NYU 2010)
70. Sample MPI Bugs
Suppose you have a local variable energy and you want to sum
all the processors’ energy to find the total energy of the system.
Recall
MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm)
What’s wrong with using the same variable for the send and
receive buffers, as in
MPI_Reduce(energy, energy, 1, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD)?
adapted from Berger & Klöckner (NYU 2010)
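One correct version, as a minimal sketch (the standard forbids aliasing sendbuf and recvbuf; use distinct buffers, or MPI_IN_PLACE on the root; MPI_DOUBLE stands in here for a C example):
#include <mpi.h>
#include <stdio.h>

int main( int argc, char** argv ) {
    int rank;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    double energy = 1.0 + rank;     /* hypothetical local value */
    double total_energy = 0.0;

    /* distinct send and receive buffers: no aliasing */
    MPI_Reduce( &energy, &total_energy, 1, MPI_DOUBLE, MPI_SUM,
                0, MPI_COMM_WORLD );

    if( rank == 0 ) printf( "total = %f\n", total_energy );
    MPI_Finalize();
    return 0;
}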
72. Communication Topologies
Some topologies very common
Grid, hypercube etc.
An API is provided to set up communicators following these topologies (sketch below)
slide by Richard Edgar
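A hedged sketch of that API via the C++ bindings (the 2D periodic grid here is illustrative):
#include <cstdlib>
#include "mpi.h"

int main( int argc, char* argv[] ) {
    MPI::Init( argc, argv );
    int nProcs = MPI::COMM_WORLD.Get_size();

    int dims[2] = { 0, 0 };              // let MPI pick the grid shape
    bool periods[2] = { true, true };    // wrap around (torus-like)
    MPI::Compute_dims( nProcs, 2, dims );

    // Communicator whose ranks are arranged on a 2D grid
    MPI::Cartcomm grid = MPI::COMM_WORLD.Create_cart( 2, dims, periods, true );

    int iUp, iDown;
    grid.Shift( 0, 1, iUp, iDown );      // neighbour ranks along dimension 0

    grid.Free();
    MPI::Finalize();
    return( EXIT_SUCCESS );
}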
73. Parallel Performance
Recall Amdahl’s law:
if T_1 = serial cost + parallel cost
then T_p = serial cost + parallel cost / p
But really:
T_p = serial cost + parallel cost / p + T_communication
How expensive is it?
adapted from Berger & Klöckner (NYU 2010)
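A quick illustrative check of how much that last term can matter (the numbers are hypothetical): with serial cost 1 s and parallel cost 99 s, T_1 = 100 s; at p = 100 the ideal T_p = 1 + 99/100 = 1.99 s (speedup ≈ 50). If communication adds just 1 s, T_p = 2.99 s and the speedup drops to ≈ 33.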
74. Network Characteristics
Interconnection network connects nodes, transfers data
Important qualities:
• Topology - the structure used to connect the nodes
• Routing algorithm - how messages are transmitted
between processors, along which path (= nodes along
which message transferred).
• Switching strategy = how message is cut into pieces and
assigned a path
• Flow control (for dealing with congestion) - stall, store data
in buffers, re-route data, tell source to halt, discard, etc.
adapted from Berger & Klöckner (NYU 2010)
75. Interconnection Network
Represent as graph G = (V , E), V = set of nodes to be
connected, E = direct links between the nodes. Links usually
bidirectional - can transfer messages in both directions at the same time.
Characterize network by:
• diameter - maximum over all pairs of nodes of the shortest
path between the nodes (length of path in message
transmission)
• degree - number of direct links for a node (number of direct
neighbors)
• bisection bandwidth - minimum number of edges that must
be removed to partition network into two parts of equal size
with no connection between them. (measures network
capacity for transmitting messages simultaneously)
• node/edge connectivity - numbers of node/edges that must
fail to disconnect the network (measure of reliability)
adapted from Berger & Klöckner (NYU 2010)
76. Linear Array
• p vertices, p − 1 links
• Diameter = p − 1
• Degree = 2
• Bisection bandwidth = 1
• Node connectivity = 1, edge connectivity = 1
adapted from Berger & Klöckner (NYU 2010)
78. Mesh topology
• diameter = 2(√p − 1); for a 3d mesh, 3(∛p − 1)
• degree = 4 (6 in 3d)
• bisection bandwidth = √p
• node connectivity = 2, edge connectivity = 2
Route along each dimension in turn
adapted from Berger & Klöckner (NYU 2010)
79. Torus topology
Diameter halved, bisection bandwidth doubled,
edge and node connectivity doubled relative to the mesh
adapted from Berger & Klöckner (NYU 2010)
80. Hypercube topology
[Diagram: hypercubes of dimension 1 to 4, nodes labelled with binary strings 0/1 through 0000–1111]
• p = 2^k processors labelled with binary numbers of length k
• k-dimensional cube constructed from two (k − 1)-cubes
• Connect corresponding procs if labels differ in 1 bit
(Hamming distance d between two k-bit binary words =
path of length d between 2 nodes)
adapted from Berger & Klöckner (NYU 2010)
82. Dynamic Networks
Above networks were direct, or static, interconnection networks
= processors connected directly with each other through fixed
physical links.
Indirect or dynamic networks = contain switches which provide
an indirect connection between the nodes. Switches configured
dynamically to establish a connection.
• bus
• crossbar
• multistage network - e.g. butterfly, omega, baseline
adapted from Berger & Klöckner (NYU 2010)
83. Crossbar
[Diagram: processors P1..Pn and memories M1..Mm joined by an n × m grid of switches]
• Connecting n inputs and m outputs takes nm switches.
(Typically only for small numbers of processors)
• At each switch can either go straight or change direction.
• Diameter = 1, bisection bandwidth = p
adapted from Berger & Klöckner (NYU 2010)
84. Butterfly
[Diagram: 16 × 16 butterfly network with stages 0–3 and rows labelled 000–111]
for p = 2^(k+1) processors: k + 1 stages, 2^k switches per stage,
2 × 2 switches
adapted from Berger & Klöckner (NYU 2010)
85. Fat tree
• Complete binary tree
• Processors at leaves
• Increase links for higher bandwidth near root
adapted from Berger & Klöckner (NYU 2010)
86. Current picture
• Old style: mapped algorithms to topologies
• New style: avoid topology-specific optimizations
• Want code that runs on next year’s machines too.
• Topology awareness in vendor MPI libraries?
• Software topology - ease of programming, but not used for
performance?
adapted from Berger & Klöckner (NYU 2010)
87. Should we care ?
• Old school: map algorithms to specific
topologies
• New school: avoid topology-specific
optimizations (the code should be optimal
on next year’s infrastructure....)
• Meta-programming / Auto-tuning ?
88. Top500 Interconnects
[Chart: Top500 interconnect family statistics]
adapted from Berger & Klöckner (NYU 2010)
89. MPI References
• Lawrence Livermore tutorial
https://computing.llnl.gov/tutorials/mpi/
• Using MPI
Portable Parallel Programming with the Message-Passing
Interface
by Gropp, Lusk, Skjellum
• Using MPI-2
Advanced Features of the Message Passing Interface
by Gropp, Lusk, Thakur
• Lots of other on-line tutorials, books, etc.
adapted from Berger & Klöckner (NYU 2010)
92. MPI with CUDA
MPI and CUDA almost orthogonal
Each node simply becomes faster
Problem: matching MPI processes to GPUs (see the sketch below)
Use compute-exclusive mode on GPUs
Tell cluster environment to limit processes per node
Have to know your cluster documentation
slide by Richard Edgar
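One common sketch for that process-to-GPU matching (this assumes ranks are packed onto nodes; a real cluster may need the node-local rank instead):
#include <mpi.h>
#include <cuda_runtime.h>

int main( int argc, char** argv ) {
    MPI_Init( &argc, &argv );
    int rank, nDevices;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    cudaGetDeviceCount( &nDevices );
    cudaSetDevice( rank % nDevices );   // naive rank-to-GPU mapping
    /* ... launch kernels, exchange results over MPI ... */
    MPI_Finalize();
    return 0;
}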
93. Data Movement
Communication now very expensive
GPUs can only communicate via their hosts
Very laborious
Again: need to minimize communication
slide by Richard Edgar
94. MPI Summary
MPI provides cross-platform interprocess
communication
Invariably available on computer clusters
Only need six basic commands to get started
Much more sophistication available
slide by Richard Edgar
97. ZeroMQ
• ‘messaging middleware’, ‘TCP on steroids’,
‘a new layer on the networking stack’
• not a complete messaging system
• just a simple messaging library to be
used programmatically.
• a “pimped” socket interface allowing you to
quickly design / build a complex
communication system without much effort
http://nichol.as/zeromq-an-introduction http://zguide.zeromq.org/page:all
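A minimal request/reply sketch with the libzmq C API (assumes libzmq 3.x or later; the endpoint and payloads are illustrative; link with -lzmq):
#include <zmq.h>
#include <cstdio>

int main() {
    void* ctx  = zmq_ctx_new();
    void* sock = zmq_socket( ctx, ZMQ_REP );   // reply-side socket
    zmq_bind( sock, "tcp://*:5555" );          // listen like a server

    char buf[10];
    zmq_recv( sock, buf, sizeof(buf), 0 );     // wait for a request
    zmq_send( sock, "world", 5, 0 );           // send the reply

    zmq_close( sock );
    zmq_ctx_destroy( ctx );
    return 0;
}
A matching client uses ZMQ_REQ, zmq_connect, then zmq_send followed by zmq_recv; ZeroMQ handles framing, queuing and reconnection behind this socket-like interface.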
101. MPI vs ZeroMQ ?
• MPI is a specification, ZeroMQ is an implementation.
• Design:
• MPI is designed for tightly-coupled compute clusters with fast and reliable
networks.
• ZeroMQ is designed for large distributed systems (web-like).
• Fault tolerance:
• MPI has very limited facilities for fault tolerance (the default error handling
behavior in most implementations is a system-wide fail, ouch!).
• ZeroMQ is resilient to faults and network instability.
• ZeroMQ could be a good transport layer for an MPI-like implementation.
http://stackoverflow.com/questions/35490/spread-vs-mpi-vs-zeromq
128. Benchmarks - Results
We test GPMR against all available input sets.

               MM       KMC      LR      SIO     WO
1-GPU Speedup  162.712  2.991    1.296   1.450   11.080
4-GPU Speedup  559.209  11.726   4.085   2.322   18.441

TABLE 2: Speedup for GPMR over Phoenix (vs. CPU) on our large (second-
biggest) input data from our first set. The exception is MM, for which
we use our small input set (Phoenix required almost twenty seconds
to multiply two 1024 × 1024 matrices).

               MM      KMC      WO
1-GPU Speedup  2.695   37.344   3.098
4-GPU Speedup  10.760  129.425  11.709

TABLE 3: Speedup for GPMR over Mars (vs. GPU) on 4096 × 4096 Matrix
Multiplication, an 8M-point K-Means Clustering, and a 512 MB
Word Occurrence. These sizes represent the largest problems that
can meet the in-core memory requirements of Mars.