2. Administrativia
• Homeworks: HW2 due Mon 3/14/11, HW3 out Fri 3/11/11
• Project info: http://www.cs264.org/projects/projects.html
• Project ideas: http://forum.cs264.org/index.php?board=6.0
• Project proposal deadline: Fri 3/25/11
(but you should submit well before then so you can start working asap)
• Need a private repo for your project?
Let us know! Poll on the forum:
http://forum.cs264.org/index.php?topic=228.0
4. Goodies (cont’d)
• Amazon AWS free credits coming soon
(only for students who completed HW0+1)
• That’s more than a $14,000 donation to the class!
• Special thanks: Kurt Messersmith @ Amazon
5. Goodies (cont’d)
• Best Project Prize: Tesla C2070 (Fermi) Board
• That’s more than a $4,000 donation to the class!
• Special thanks:
David Luebke & Chandra Cheij @ NVIDIA
6. During this course, we’ll try to adapt and use existing material ;-)
[“adapted for CS264” stamp]
10. The Problem
Many computational problems too big for single CPU
Lack of RAM
Lack of CPU cycles
Want to distribute work between many CPUs
slide by Richard Edgar
11. Types of Parallelism
Some computations are ‘embarrassingly parallel’
Can do a lot of computation on minimal data
RC5/DES cracking, SETI@home, etc.
Solution is to distribute across the Internet
Use TCP/IP or similar
slide by Richard Edgar
12. Types of Parallelism
Some computations very tightly coupled
Have to communicate a lot of data at each step
e.g. hydrodynamics
Internet latencies much too high
Need a dedicated machine
slide by Richard Edgar
13. Tightly Coupled Computing
Two basic approaches
Shared memory
Distributed memory
Each has advantages and disadvantages
slide by Richard Edgar
14. Some terminology
Shared memory: global memory accessible by all processors; information exchanged through shared variables.
Distributed memory: private memory for each processor, only accessible by that processor, so no synchronization for memory accesses is needed. Information is exchanged by sending data from one processor to another over an interconnection network, using explicit communication operations.
[Diagram: “distributed memory” (each P paired with its own M, linked by an Interconnection Network) vs. “shared memory” (Ps linked through an Interconnection Network to shared Ms)]
Hybrid approach increasingly common; now: mostly hybrid
16. Shared Memory Machines
Have lots of CPUs share the same memory banks
Spawn lots of threads
Each writes to globally shared memory
Multicore CPUs now ubiquitous
Most computers now ‘shared memory machines’
slide by Richard Edgar
17. Shared Memory Machines
NASA ‘Columbia’ Computer
Up to 2048 cores in single system
slide by Richard Edgar
18. Shared Memory Machines
Spawning lots of threads (relatively) easy
pthreads, OpenMP
Don’t have to worry about data location
Disadvantage is memory performance scaling
Frontside bus saturates rapidly
Can use Non-Uniform Memory Architecture (NUMA)
Silicon Graphics Origin & Altix series
Gets expensive very fast
slide by Richard Edgar
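A minimal sketch of this shared-memory model with OpenMP (the array name and size are illustrative; compile with, e.g., g++ -fopenmp):
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const int N = 1000000;
    std::vector<double> a( N );       // one array, shared by all threads
    #pragma omp parallel for          // spawn threads, split the loop
    for( int i = 0; i < N; i++ )
        a[i] = 2.0 * i;               // each thread writes its own chunk
    std::printf( "ran with up to %d threads\n", omp_get_max_threads() );
    return 0;
}
No data placement is needed: every thread simply writes into the same globally shared array.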
20. Distributed Memory Clusters
Alternative is a lot of cheap machines
High-speed network between individual nodes
Network can cost as much as the CPUs!
How do nodes communicate?
slide by Richard Edgar
22. Distributed Memory Model
Communication is key issue
Each node has its own address space
(exclusive access, no global memory)
Could use TCP/IP
Painfully low level
Solution: a message-passing library (e.g. MPI)
slide by Richard Edgar
23. Distributed Memory Model
All data must be explicitly partitioned
Exchange of data by explicit communication
slide by Richard Edgar
25. Message Passing Interface
MPI is a communication protocol for parallel programs
Language independent
Open standard
Originally created by a working group at Supercomputing ’92 (SC92)
Bindings for C, C++, Fortran, Python, etc.
http://www.mcs.anl.gov/research/projects/mpi/
http://www.mpi-forum.org/
slide by Richard Edgar
26. Message Passing Interface
MPI processes have independent address spaces
Communicate by sending messages
Means of sending messages is invisible to the user
Use shared memory if available! (i.e. shared memory can be used
behind the scenes on shared-memory architectures)
Sits at Layer 5 (Session) and above of the OSI model
slide by Richard Edgar
28. Message Passing Interface
MPI is a standard, a specification, for message-passing
libraries
Two major implementations of MPI
MPICH
Open MPI
Programs should work with either
slide by Richard Edgar
29. Basic Idea
• Usually programmed with SPMD model (single program,
multiple data)
• In MPI-1 number of tasks is static - cannot dynamically
spawn new tasks at runtime. Enhanced in MPI-2.
• No assumptions on type of interconnection network; all
processors can send a message to any other processor.
• All parallelism explicit - programmer responsible for
correctly identifying parallelism and implementing parallel
algorithms
adapted from Berger & Klöckner (NYU 2010)
31. Hello World
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello world from %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
adapted from Berger & Klöckner (NYU 2010)
32. Hello World
To compile: need to load the MPI wrappers in addition to the
compiler modules (Open MPI, MPICH, ...), e.g.
module load mpi/openmpi/1.2.8/gnu
module load openmpi/intel/1.3.3
To compile: mpicc hello.c
To run: need to tell how many processes you are requesting
mpiexec -n 10 a.out (mpirun -np 10 a.out)
adapted from Berger & Klöckner (NYU 2010)
33. The beauty of data
visualization
http://www.youtube.com/watch?v=pLqjQ55tz-U
39. Basic MPI
MPI is a library of routines
Bindings exist for many languages
Principal languages are C, C++ and Fortran
Python: mpi4py
We will discuss C++ bindings from now on
http://www.mcs.anl.gov/research/projects/mpi/mpi-standard/mpi-report-2.0/node287.htm
slide by Richard Edgar
40. Basic MPI
MPI allows processes to exchange messages
Processes are members of communicators
Communicator shared by all is MPI::COMM_WORLD
In C++ API, communicators are objects
Within a communicator, each process has unique ID
slide by Richard Edgar
41. A Minimal MPI Program
#include <iostream>
#include <cstdlib>
using namespace std;
#include "mpi.h"

int main( int argc, char* argv[] ) {
    MPI::Init( argc, argv );           // Very much a minimal program
    cout << "Hello World!" << endl;    // No actual communication occurs
    MPI::Finalize();
    return( EXIT_SUCCESS );
}
slide by Richard Edgar
42. A Minimal MPI Program
To compile MPI programs use mpic++
mpic++ -o MyProg myprog.cpp
The mpic++ command is a wrapper for default compiler
Adds in libraries
Use mpic++ -show (MPICH) or mpic++ --showme (Open MPI) to see what it does
Will also find mpicc, mpif77 and mpif90 (usually)
slide by Richard Edgar
43. A Minimal MPI Program
To run the program, use mpirun
mpirun -np 2 ./MyProg
The -np 2 option launches two processes
Check documentation for your cluster
Number of processes might be implicit
Program should print “Hello World” twice
slide by Richard Edgar
44. Communicators
Processes are members of communicators
A process can
Find the size of a given communicator
Determine its ID (or rank) within it
Default communicator is MPI::COMM_WORLD
slide by Richard Edgar
45. Communicators
int nProcs, iMyProc;
MPI::Init( argc, argv );
nProcs = MPI::COMM_WORLD.Get_size();    // queries COMM_WORLD for the number of processes
iMyProc = MPI::COMM_WORLD.Get_rank();   // current process rank (ID)
cout << "Hello from process ";
cout << iMyProc << " of ";              // prints these out
cout << nProcs << endl;                 // process rank counts from zero
MPI::Finalize();
slide by Richard Edgar
46. Communicators
By convention, process with rank 0 is master
const int iMasterProc = 0;
Can have more than one communicator
Process may have different rank within each
slide by Richard Edgar
47. Messages
Haven’t sent any data yet
Communicators have Send and Recv methods for this
One process posts a Send
Must be matched by Recv in the target process
slide by Richard Edgar
48. Sending Messages
A sample send is as follows:
int a[10];
MPI::COMM_WORLD.Send( a, 10, MPI::INT, iTargetProc, iTag );
The method prototype is
void Comm::Send( const void* buf, int count,
const Datatype& datatype,
int dest, int tag) const
MPI copies the buffer into a system buffer and returns
No delivery notification
slide by Richard Edgar
49. Receiving Messages
Similar call to receive:
int a[10];
MPI::COMM_WORLD.Recv( a, 10, MPI::INT, iSrcProc, iMyTag );
Can use MPI::ANY_SOURCE and MPI::ANY_TAG as wildcards
Function prototype is
void Comm::Recv( void* buf, int count,
const Datatype& datatype,
int source, int tag) const
Blocks until data arrives
slide by Richard Edgar
50. MPI Datatypes
MPI datatypes are independent of language and endianness
Most common listed opposite:
MPI Datatype    C/C++
MPI::CHAR       signed char
MPI::SHORT      signed short
MPI::INT        signed int
MPI::LONG       signed long
MPI::FLOAT      float
MPI::DOUBLE     double
MPI::BYTE       untyped byte data
slide by Richard Edgar
51. MPI Send & Receive
if( iMyProc == iMasterProc ) {
    // Master process sends out numbers
    for( int i=1; i<nProcs; i++ ) {
        int iMessage = 2 * i + 1;
        cout << "Sending " << iMessage << " to process " << i << endl;
        MPI::COMM_WORLD.Send( &iMessage, 1, MPI::INT, i, iTag );
    }
} else {
    // Worker processes print out the number received
    int iMessage;
    MPI::COMM_WORLD.Recv( &iMessage, 1, MPI::INT, iMasterProc, iTag );
    cout << "Process " << iMyProc << " received " << iMessage << endl;
}
slide by Richard Edgar
52. Six Basic MPI Routines
Have now encountered six MPI routines
MPI::Init(), MPI::Finalize()
MPI::COMM_WORLD.Get_size(), MPI::COMM_WORLD.Get_rank(),
MPI::COMM_WORLD.Send(), MPI::COMM_WORLD.Recv()
These are enough to get started ;-)
More sophisticated routines available...
slide by Richard Edgar
53. Collective Communications
Send and Recv are point-to-point
Communicate between specific processes
Sometimes we want all processes to exchange data
These are called collective communications
slide by Richard Edgar
54. Barriers
Barriers require all processes to synchronise
MPI::COMM_WORLD.Barrier();
Processes wait until all processes arrive at barrier
Potential for deadlock
Bad for performance
Only use if necessary
slide by Richard Edgar
55. Broadcasts
Suppose one process has array to be shared with all
int a[10];
MPI::COMM_WORLD.Bcast( a, 10, MPI::INT, iSrcProc );
If process has rank iSrcProc, it will send the array
Other processes will receive it
On completion, every process’s a[] is identical to the copy on iSrcProc
slide by Richard Edgar
56. MPI Broadcast
[Diagram: before the call only P0 holds A; after Broadcast, P0–P3 all hold A]
MPI_Bcast(&buf, count, datatype, root, comm)
All processors must call MPI_Bcast with the same root value.
adapted from Berger & Klöckner (NYU 2010)
57. Reductions
Suppose we have a large array split across processes
We want to sum all the elements
Use MPI::COMM_WORLD.Reduce() with the MPI::SUM op (sketch below)
Also MPI::COMM_WORLD.Allreduce() variant
Can perform MAX, MIN, MAXLOC, MINLOC too
slide by Richard Edgar
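A minimal sketch of the reduction just described (the variable names are illustrative):
#include <iostream>
#include <cstdlib>
#include "mpi.h"

int main( int argc, char* argv[] ) {
    MPI::Init( argc, argv );
    const int iMasterProc = 0;
    int iMyProc = MPI::COMM_WORLD.Get_rank();

    double localSum = static_cast<double>( iMyProc );  // stand-in partial sum
    double totalSum = 0.0;

    // Combine every process's localSum with the SUM op; result lands on master
    MPI::COMM_WORLD.Reduce( &localSum, &totalSum, 1, MPI::DOUBLE,
                            MPI::SUM, iMasterProc );

    if( iMyProc == iMasterProc )
        std::cout << "Total: " << totalSum << std::endl;

    MPI::Finalize();
    return( EXIT_SUCCESS );
}
Swapping Reduce for Allreduce (and dropping the root argument) would leave totalSum on every process instead of just the master.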
58. MPI Reduce
[Diagram: before the call P0–P3 hold A, B, C, D; after Reduce, P0 holds op(A,B,C,D)]
Reduction operators can be min, max, sum, multiply, logical
ops, max value and location, ... Must be associative
(commutative optional)
adapted from Berger & Klöckner (NYU 2010)
59. Scatter and Gather
Split a large array between processes
Use MPI::COMM_WORLD.Scatter()
Each process receives part of the array
Combine small arrays into one large one
Use MPI::COMM_WORLD.Gather()
Designated process will construct entire array
Has MPI::COMM_WORLD.Allgather() variant
slide by Richard Edgar
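A minimal sketch of Scatter and Gather under the same conventions (the chunk size and contents are illustrative; for simplicity the full array is allocated on every rank, though only the master's copy matters):
#include <cstdlib>
#include <vector>
#include "mpi.h"

int main( int argc, char* argv[] ) {
    MPI::Init( argc, argv );
    const int iMasterProc = 0;
    int nProcs  = MPI::COMM_WORLD.Get_size();
    int iMyProc = MPI::COMM_WORLD.Get_rank();
    const int chunk = 4;

    // Master fills the big array; everyone has a small receive buffer
    std::vector<int> big( chunk * nProcs );
    if( iMyProc == iMasterProc )
        for( int i = 0; i < chunk * nProcs; i++ ) big[i] = i;
    std::vector<int> part( chunk );

    // Each process receives its own chunk-sized slice of the big array
    MPI::COMM_WORLD.Scatter( &big[0], chunk, MPI::INT,
                             &part[0], chunk, MPI::INT, iMasterProc );

    // Gather is the inverse: the slices are reassembled on the master
    MPI::COMM_WORLD.Gather( &part[0], chunk, MPI::INT,
                            &big[0], chunk, MPI::INT, iMasterProc );

    MPI::Finalize();
    return( EXIT_SUCCESS );
}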
60. MPI Scatter/Gather
[Diagram: Scatter splits P0’s A B C D so that P0–P3 receive A, B, C, D respectively; Gather is the inverse]
adapted from Berger & Klöckner (NYU 2010)
61. MPI Allgather
[Diagram: before the call P0–P3 hold A, B, C, D; after Allgather, every process holds A B C D]
adapted from Berger & Klöckner (NYU 2010)
63. Asynchronous Messages
An asynchronous API exists too
Have to allocate buffers
Have to check if send or receive has completed
Can give better performance by overlapping communication with computation
Trickier to use
slide by Richard Edgar
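A minimal sketch of the non-blocking calls (Isend/Irecv return a Request you must Wait() on), assuming exactly two processes:
#include <cstdlib>
#include "mpi.h"

int main( int argc, char* argv[] ) {
    MPI::Init( argc, argv );
    int iMyProc = MPI::COMM_WORLD.Get_rank();
    const int iTag = 0;
    int a[10];

    if( iMyProc == 0 ) {
        for( int i = 0; i < 10; i++ ) a[i] = i;
        // Isend returns immediately; buffer must stay valid until Wait()
        MPI::Request req = MPI::COMM_WORLD.Isend( a, 10, MPI::INT, 1, iTag );
        // ... overlap useful computation here ...
        req.Wait();   // now safe to reuse a[]
    } else if( iMyProc == 1 ) {
        MPI::Request req = MPI::COMM_WORLD.Irecv( a, 10, MPI::INT, 0, iTag );
        // ... overlap useful computation here ...
        req.Wait();   // data guaranteed present only after Wait()
    }

    MPI::Finalize();
    return( EXIT_SUCCESS );
}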
64. User-Defined Datatypes
Usually have complex data structures
Require means of distributing these
Can pack & unpack manually
MPI allows us to define own datatypes for this
slide by Richard Edgar
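A minimal sketch with a hypothetical three-double particle type, built with Create_contiguous (Create_struct handles mixed-type structs); assumes at least two processes:
#include <cstdlib>
#include "mpi.h"

struct Particle { double x, y, z; };

int main( int argc, char* argv[] ) {
    MPI::Init( argc, argv );
    int iMyProc = MPI::COMM_WORLD.Get_rank();
    const int iTag = 0;

    // Three contiguous doubles viewed as one "particle" element
    MPI::Datatype particleType = MPI::DOUBLE.Create_contiguous( 3 );
    particleType.Commit();   // must commit before first use

    Particle p = { 1.0, 2.0, 3.0 };
    if( iMyProc == 0 )
        MPI::COMM_WORLD.Send( &p, 1, particleType, 1, iTag );
    else if( iMyProc == 1 )
        MPI::COMM_WORLD.Recv( &p, 1, particleType, 0, iTag );

    particleType.Free();
    MPI::Finalize();
    return( EXIT_SUCCESS );
}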
65. MPI-2
• One-sided RMA (remote memory access) communication
• potential for greater efficiency, easier programming.
• Use "windows" into memory to expose regions for access
• Race conditions now possible.
• Parallel I/O: like message passing, but to the file system
rather than to other processes.
• Allows for a dynamic number of processes and
inter-communicators (as opposed to intra-communicators)
• Cleaned up MPI-1
adapted from Berger & Klöckner (NYU 2010)
66. RMA
• Processors can designate portions of their address space as
available to other processors for read/write operations
(MPI_Get, MPI_Put, MPI_Accumulate).
• RMA window objects created by collective window-creation
functions (MPI_Win_create must be called by all participants)
• Before accessing, call MPI_Win_fence (or other synchronization
mechanisms) to start an RMA access epoch; the fence (like a barrier)
separates local ops on the window from remote ops
• RMA operations are non-blocking; separate synchronization is
needed to check completion: call MPI_Win_fence again.
[Diagram: a Put from P0’s local memory into an RMA window in P1’s local memory]
adapted from Berger & Klöckner (NYU 2010)
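A minimal sketch of a fence-delimited put epoch, using the C API named on the slide and assuming at least two processes:
#include <mpi.h>

int main( int argc, char** argv ) {
    int rank;
    double buf = 0.0;          /* memory each rank exposes */
    double val = 42.0;         /* value rank 0 will put */
    MPI_Win win;

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    /* collective: every rank exposes one double as its window */
    MPI_Win_create( &buf, sizeof(double), sizeof(double),
                    MPI_INFO_NULL, MPI_COMM_WORLD, &win );

    MPI_Win_fence( 0, win );   /* start access epoch */
    if( rank == 0 )            /* write val into rank 1's window, displacement 0 */
        MPI_Put( &val, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win );
    MPI_Win_fence( 0, win );   /* the put is complete only after this fence */

    MPI_Win_free( &win );
    MPI_Finalize();
    return 0;
}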
68. Sample MPI Bugs
[Code example from the original slide not preserved in this transcript]
Only works for an even number of processors. What’s wrong?
adapted from Berger & Klöckner (NYU 2010)
70. Sample MPI Bugs
Suppose you have a local variable energy and you want to sum
all the processors’ energy to find the total energy of the system.
Recall
MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm)
What’s wrong with using the same variable for the send and
receive buffers, as in
MPI_Reduce(energy, energy, 1, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD)?
adapted from Berger & Klöckner (NYU 2010)
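One correct version, as a minimal sketch (the standard forbids aliasing sendbuf and recvbuf; use distinct buffers, or MPI_IN_PLACE on the root; MPI_DOUBLE stands in here for a C example):
#include <mpi.h>
#include <stdio.h>

int main( int argc, char** argv ) {
    int rank;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    double energy = 1.0 + rank;     /* hypothetical local value */
    double total_energy = 0.0;

    /* distinct send and receive buffers: no aliasing */
    MPI_Reduce( &energy, &total_energy, 1, MPI_DOUBLE, MPI_SUM,
                0, MPI_COMM_WORLD );

    if( rank == 0 ) printf( "total = %f\n", total_energy );
    MPI_Finalize();
    return 0;
}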
72. Communication Topologies
Some topologies very common
Grid, hypercube etc.
An API is provided to set up communicators following these topologies (sketch below)
slide by Richard Edgar
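A hedged sketch of that API via the C++ bindings (the 2D periodic grid here is illustrative):
#include <cstdlib>
#include "mpi.h"

int main( int argc, char* argv[] ) {
    MPI::Init( argc, argv );
    int nProcs = MPI::COMM_WORLD.Get_size();

    int dims[2] = { 0, 0 };              // let MPI pick the grid shape
    bool periods[2] = { true, true };    // wrap around (torus-like)
    MPI::Compute_dims( nProcs, 2, dims );

    // Communicator whose ranks are arranged on a 2D grid
    MPI::Cartcomm grid = MPI::COMM_WORLD.Create_cart( 2, dims, periods, true );

    int iUp, iDown;
    grid.Shift( 0, 1, iUp, iDown );      // neighbour ranks along dimension 0

    grid.Free();
    MPI::Finalize();
    return( EXIT_SUCCESS );
}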
73. Parallel Performance
Recall Amdahl’s law:
if T_1 = serial cost + parallel cost
then T_p = serial cost + parallel cost / p
But really:
T_p = serial cost + parallel cost / p + T_communication
How expensive is it?
adapted from Berger & Klöckner (NYU 2010)
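A quick illustrative check of how much that last term can matter (the numbers are hypothetical): with serial cost 1 s and parallel cost 99 s, T_1 = 100 s; at p = 100 the ideal T_p = 1 + 99/100 = 1.99 s (speedup ≈ 50). If communication adds just 1 s, T_p = 2.99 s and the speedup drops to ≈ 33.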
74. Network Characteristics
Interconnection network connects nodes, transfers data
Important qualities:
• Topology - the structure used to connect the nodes
• Routing algorithm - how messages are transmitted
between processors, along which path (= nodes along
which message transferred).
• Switching strategy = how message is cut into pieces and
assigned a path
• Flow control (for dealing with congestion) - stall, store data
in buffers, re-route data, tell source to halt, discard, etc.
adapted from Berger & Klöckner (NYU 2010)
75. Interconnection Network
Represent as graph G = (V , E), V = set of nodes to be
connected, E = direct links between the nodes. Links usually
bidirectional - can transfer messages in both directions at the same time.
Characterize network by:
• diameter - maximum over all pairs of nodes of the shortest
path between the nodes (length of path in message
transmission)
• degree - number of direct links for a node (number of direct
neighbors)
• bisection bandwidth - minimum number of edges that must
be removed to partition network into two parts of equal size
with no connection between them. (measures network
capacity for transmitting messages simultaneously)
• node/edge connectivity - numbers of node/edges that must
fail to disconnect the network (measure of reliability)
adapted from Berger & Klöckner (NYU 2010)
76. Linear Array
• p vertices, p − 1 links
• Diameter = p − 1
• Degree = 2
• Bisection bandwidth = 1
• Node connectivity = 1, edge connectivity = 1
adapted from Berger & Klöckner (NYU 2010)
78. Mesh topology
• diameter = 2(√p − 1); for a 3d mesh, 3(∛p − 1)
• degree = 4 (6 in 3d)
• bisection bandwidth = √p
• node connectivity = 2, edge connectivity = 2
Route along each dimension in turn
adapted from Berger & Klöckner (NYU 2010)
79. Torus topology
Diameter halved, bisection bandwidth doubled,
edge and node connectivity doubled relative to the mesh
adapted from Berger & Klöckner (NYU 2010)
80. Hypercube topology
[Diagram: hypercubes of dimension 1 to 4, nodes labelled with binary strings 0/1 through 0000–1111]
• p = 2^k processors labelled with binary numbers of length k
• k-dimensional cube constructed from two (k − 1)-cubes
• Connect corresponding procs if labels differ in 1 bit
(Hamming distance d between two k-bit binary words =
path of length d between 2 nodes)
adapted from Berger & Klöckner (NYU 2010)
82. Dynamic Networks
Above networks were direct, or static, interconnection networks
= processors connected directly with each other through fixed
physical links.
Indirect or dynamic networks = contain switches which provide
an indirect connection between the nodes. Switches configured
dynamically to establish a connection.
• bus
• crossbar
• multistage network - e.g. butterfly, omega, baseline
adapted from Berger & Klöckner (NYU 2010)
83. Crossbar
[Diagram: processors P1..Pn and memories M1..Mm joined by an n × m grid of switches]
• Connecting n inputs and m outputs takes nm switches.
(Typically only for small numbers of processors)
• At each switch can either go straight or change direction.
• Diameter = 1, bisection bandwidth = p
adapted from Berger & Klöckner (NYU 2010)
84. Butterfly
[Diagram: 16 × 16 butterfly network with stages 0–3 and rows labelled 000–111]
for p = 2^(k+1) processors: k + 1 stages, 2^k switches per stage,
2 × 2 switches
adapted from Berger & Klöckner (NYU 2010)
85. Fat tree
• Complete binary tree
• Processors at leaves
• Increase links for higher bandwidth near root
adapted from Berger & Klöckner (NYU 2010)
86. Current picture
• Old style: mapped algorithms to topologies
• New style: avoid topology-specific optimizations
• Want code that runs on next year’s machines too.
• Topology awareness in vendor MPI libraries?
• Software topology - ease of programming, but not used for
performance?
adapted from Berger & Klöckner (NYU 2010)
87. Should we care ?
• Old school: map algorithms to specific
topologies
• New school: avoid topology-specific
optimizations (the code should be optimal
on next year’s infrastructure....)
• Meta-programming / Auto-tuning ?
88. Top500 Interconnects
[Chart: Top500 interconnect family statistics]
adapted from Berger & Klöckner (NYU 2010)
89. MPI References
• Lawrence Livermore tutorial
https://computing.llnl.gov/tutorials/mpi/
• Using MPI
Portable Parallel Programming with the Message-Passing
Interface
by Gropp, Lusk, Skjellum
• Using MPI-2
Advanced Features of the Message Passing Interface
by Gropp, Lusk, Thakur
• Lots of other on-line tutorials, books, etc.
adapted from Berger & Klöckner (NYU 2010)
92. MPI with CUDA
MPI and CUDA almost orthogonal
Each node simply becomes faster
Problem: matching MPI processes to GPUs (see the sketch below)
Use compute-exclusive mode on GPUs
Tell cluster environment to limit processes per node
Have to know your cluster documentation
slide by Richard Edgar
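One common sketch for that process-to-GPU matching (this assumes ranks are packed onto nodes; a real cluster may need the node-local rank instead):
#include <mpi.h>
#include <cuda_runtime.h>

int main( int argc, char** argv ) {
    MPI_Init( &argc, &argv );
    int rank, nDevices;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    cudaGetDeviceCount( &nDevices );
    cudaSetDevice( rank % nDevices );   // naive rank-to-GPU mapping
    /* ... launch kernels, exchange results over MPI ... */
    MPI_Finalize();
    return 0;
}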
93. Data Movement
Communication now very expensive
GPUs can only communicate via their hosts
Very laborious
Again: need to minimize communication
slide by Richard Edgar
94. MPI Summary
MPI provides cross-platform interprocess
communication
Invariably available on computer clusters
Only need six basic commands to get started
Much more sophistication available
slide by Richard Edgar
97. ZeroMQ
• ‘messaging middleware’, ‘TCP on steroids’,
‘a new layer on the networking stack’
• not a complete messaging system
• just a simple messaging library to be
used programmatically.
• a “pimped” socket interface allowing you to
quickly design / build a complex
communication system without much effort
http://nichol.as/zeromq-an-introduction http://zguide.zeromq.org/page:all
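A minimal request/reply sketch with the libzmq C API (assumes libzmq 3.x or later; the endpoint and payloads are illustrative; link with -lzmq):
#include <zmq.h>
#include <cstdio>

int main() {
    void* ctx  = zmq_ctx_new();
    void* sock = zmq_socket( ctx, ZMQ_REP );   // reply-side socket
    zmq_bind( sock, "tcp://*:5555" );          // listen like a server

    char buf[10];
    zmq_recv( sock, buf, sizeof(buf), 0 );     // wait for a request
    zmq_send( sock, "world", 5, 0 );           // send the reply

    zmq_close( sock );
    zmq_ctx_destroy( ctx );
    return 0;
}
A matching client uses ZMQ_REQ, zmq_connect, then zmq_send followed by zmq_recv; ZeroMQ handles framing, queuing and reconnection behind this socket-like interface.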
101. MPI vs ZeroMQ ?
• MPI is a specification, ZeroMQ is an implementation.
• Design:
• MPI is designed for tightly-coupled compute clusters with fast and reliable
networks.
• ZeroMQ is designed for large distributed systems (web-like).
• Fault tolerance:
• MPI has very limited facilities for fault tolerance (the default error handling
behavior in most implementations is a system-wide fail, ouch!).
• ZeroMQ is resilient to faults and network instability.
• ZeroMQ could be a good transport layer for an MPI-like implementation.
http://stackoverflow.com/questions/35490/spread-vs-mpi-vs-zeromq
128. Benchmarks - Results
We test GPMR against all available input sets.

               MM       KMC      LR      SIO     WO
1-GPU Speedup  162.712  2.991    1.296   1.450   11.080
4-GPU Speedup  559.209  11.726   4.085   2.322   18.441

TABLE 2: Speedup for GPMR over Phoenix (vs. CPU) on our large (second-
biggest) input data from our first set. The exception is MM, for which
we use our small input set (Phoenix required almost twenty seconds
to multiply two 1024 × 1024 matrices).

               MM      KMC      WO
1-GPU Speedup  2.695   37.344   3.098
4-GPU Speedup  10.760  129.425  11.709

TABLE 3: Speedup for GPMR over Mars (vs. GPU) on 4096 × 4096 Matrix
Multiplication, an 8M-point K-Means Clustering, and a 512 MB
Word Occurrence. These sizes represent the largest problems that
can meet the in-core memory requirements of Mars.