©Jesper Larsson Träff, 21.11. and 5.12.2012, 12.6.2013
History and Development of the
MPI Standard
Jesper Larsson Träff
Vienna University of Technology
Faculty of Informatics, Institute of Information Systems
Research Group Parallel Computing
Favoritenstraße 16, 1040 Wien
www.par.tuwien.ac.at
21 September 2012, MPI Forum meeting in Vienna:
MPI 3.0 has just been released …
… but MPI has a long history, and it is instructive to look at it.
www.mpi-forum.org
"Those who cannot remember the past are condemned to repeat it", George Santayana, The Life of Reason, 1905-1906
“History always repeats itself twice: first time as tragedy,
second time as farce”, Karl Marx
“History is written by the winners”, George Orwell, 1944 (but
he quotes from someone else)
On the last quote: "history" depends. Who tells it, and why? What information is available? What's at stake?
My stake:
•Convinced of MPI as a well-designed and extremely useful standard that has posed productive research/development problems with broader relevance to parallel computing
•Critical of the current standardization effort, MPI 3.0
•MPI implementer, 2000-2010, with NEC
•MPI Forum member 2008-2010 (with Hubert Ritzdorf, representing NEC)
•Voted "no" to MPI 2.2
A long debate: shared-memory vs. distributed memory
Question: What shall a parallel machine look like?
[Diagram: processors P connected to a memory M – shared or distributed?]
The answer depends:
•What are your concerns?
•What is desirable?
•What is feasible?
This has been causing debate since (at least) the 1970s and 1980s.
Hoare/Dijkstra:
Parallel programs shall be structured as collections of
communicating, sequential processes
Their concern: CORRECTNESS
Wyllie, Vishkin:
A parallel algorithm is like a collection of synchronized sequential
algorithms that access a common shared memory, and the machine
is a PRAM
Their concern: (asymptotic) PERFORMANCE
And, of course, PERFORMANCE: many, many practitioners
And, of course, CORRECTNESS: Hoare semantics
Hoare/Dijkstra:
Parallel programs shall be structured as collections of
communicating, sequential processes
Wyllie, Vishkin:
A parallel algorithm is like a collection of synchronized sequential
algorithms that access a common shared memory, and the machine
is a PRAM
[Fortune, Wyllie: Parallelism in Random Access Machines. STOC
1978: 114-118]
[Shiloach, Vishkin: Finding the Maximum, Merging, and Sorting in a
Parallel Computation Model. Jour. Algorithms 2(1): 88-102, 1981]
[C. A. R. Hoare: Communicating Sequential Processes. Comm. ACM
21(8): 666-677, 1978]
Hoare/Dijkstra:
Parallel programs shall be structured as collections of
communicating, sequential processes
Wyllie, Vishkin (and many, many practitioners; Burton Smith, …):
A parallel algorithm is like a collection of synchronized sequential
algorithms that access a common shared memory, and the machine
is a PRAM
Neither perhaps cared too much about how to build machines (in the beginning)…
…but others (fortunately) did:
The INMOS transputer T400, T800, from ca. 1985 – a complete architecture entirely based on the CSP idea, with an original programming language, OCCAM (1983, 1987).
Parsytec (ca. 1988-1995)
Intel iPSC/2 ca. 1990
Intel Paragon, ca. 1992
IBM SP/2 ca. 1996
Thinking Machines CM5, ca. 1994
MasPar (1987-1996) MP2
Thinking Machines (1982-94) CM2, CM5
KSR 2, ca. 1992
Denelcor HEP, mid-1980s
Ironically…
Despite the algorithmically stronger properties of shared-memory models (like the PRAM) and their potential for scaling to much, much larger numbers of processors, in practice high-performance systems with (quite) substantial parallelism have all been distributed-memory systems, and the corresponding de facto standard – MPI (the Message-Passing Interface) – is much stronger than (say) OpenMP.
Sources of MPI: the early years
Commercial vendors and national laboratories (including
many European) needed practically working programming
support for their machines and applications
The early 1990s were fruitful years for practical parallel computing (funding for "grand challenge" and "star wars" programs).
Vendors and labs proposed and maintained their own languages, interfaces, and libraries for parallel programming (early 1990s):
•Intel NX, Express, Zipcode, PARMACS, IBM EUI/CCL,
PVM, P4, OCCAM, …
– all intended for distributed-memory machines, and centered around similar concepts.
Similar enough to warrant an effort towards creating a common standard for message-passing based parallel programming.
Portability problem: wasted effort in maintaining one's own interface for a small user group; lack of portability across systems.
Message-passing interfaces/languages, early 1990s:
•Intel NX: send-receive message passing (non-blocking, buffering?), tags (tag groups?), no group concept, some collectives, weak encapsulation
•IBM EUI: point-to-point and collectives (more than in MPI), group concept, high performance (??) [Snir et al.]
•IBM CCL: point-to-point and collectives, encapsulation
•Zipcode/Express: point-to-point, emphasis on library building [Skjellum]
•PARMACS/Express: point-to-point, topological mapping [Hempel]
•PVM: point-to-point communication, some collectives, virtual machine abstraction, fault tolerance
Some odd men out
•Linda: tuple space get/put – a first PGAS approach?
•Active messages; seems to presuppose an SPMD model?
•OCCAM: too strict CSP-based, synchronous message passing?
•PVM: heterogeneous systems, fault-tolerance, …
[Hempel, Hey, McBryan, Walker: Special Issue – Message Passing
Interfaces. Parallel Computing 29(4), 1994]
Standardization: the MPI Forum and MPI 1.0
[Hempel, Walker: The emergence of the MPI message passing standard
for parallel computing. Computer Standards & Interfaces, 21: 51-62,
1999]
A standardization effort was started in early 1992; key players: Dongarra, Hempel, Hey, Walker.
Goal: to come out, within a time frame of a few years, with a standard for message-passing parallel programming, building on lessons learned from existing interfaces/languages.
•Not a research effort (as such)!
•Open to participation from all interested parties
Key technical design points
MPI should encompass and enable:
•basic message-passing and related functionality (collective communication!)
•library building: safe encapsulation of messages (and other things, e.g. query functionality)
•high performance, across all available and future systems!
•a scalable design
•support for C and Fortran
The MPI Forum
Not an ANSI/IEEE standardization body; nobody "owns" the MPI standard; "free"
Open to participation for all interested parties; protocols open (votes, email discussions)
Regular meetings, at 6-8 week intervals
Those who participate at meetings (with a history) can vote, one vote per organization (current discussion: quorum, semantics of abstaining)
The 1st MPI Forum set out to work in early 1993
After 7 meetings, the 1st version of the MPI standard was ready in early 1994 (two finalizing meetings in February 1994):
MPI: A Message-Passing Interface Standard. May 5th, 1994
The standard is the 226-page PDF document, as voted by the MPI Forum, that can be found at www.mpi-forum.org
Errata, minor adjustments: MPI 1.0, 1.1, 1.2: 1994-1995
Take note:
The MPI 1 standardization process was followed hand-in-hand
by a(n amazingly good) prototype implementation: mpich from
Argonne National Laboratory (Gropp, Lusk, …)
[W. Gropp, E. L. Lusk, N. E. Doss, A. Skjellum: A High-Performance,
Portable Implementation of the MPI Message Passing Interface
Standard. Parallel Computing 22(6): 789-828, 1996]
Other parties, vendors could build on this implementation (and
did!), so that MPI was quickly supported on many parallel
systems
Why MPI has been successful: an appreciation
MPI
•made some fundamental abstractions, but is still close enough to common architectures to allow efficient, low-overhead implementations ("MPI is the assembler of parallel computing…");
•is formulated with care and precision, but is not a formal specification;
•is complete (to a high degree), based on few, powerful, largely orthogonal key concepts (few exceptions, few optionals);
•and made few mistakes.
Message-passing abstraction
[Diagram: MPI processes i, j, k, l, m connected through a communication medium]
Entities: MPI processes, which can be implemented as "processes" (most MPI implementations), "threads", …
The processes can communicate through a communication medium (a concrete network, …), the nature of which is of no concern to the MPI standard:
•No explicit requirements on network structure or capabilities
•No performance model or requirements
Basic message-passing: point-to-point communication
MPI_Send(&data,count,type,j,tag,comm);   /* on process i */
MPI_Recv(&data,count,type,i,tag,comm,&status);   /* on process j */
Only processes in the same communicator – a ranked set of processes with a unique "context" – can communicate.
Fundamental library-building concept: isolates communication in library routines from application communication.
Basic message-passing: point-to-point communication
MPI_Send(&data,count,type,j,tag,comm);
MPI_Recv(&data,count,type,i,tag,comm,&status);
The receiving process blocks until the data have been transferred.
The MPI implementation must ensure reliable transmission; no timeout (see RT-MPI).
Semantics: messages from the same sender are delivered in order; it is possible to write fully deterministic programs.
Basic message-passing: point-to-point communication
MPI_Send(&data,count,type,j,tag,comm);
MPI_Recv(&data,count,type,i,tag,comm,&status);
The receiving process blocks until the data have been transferred.
The sending process may block or not… this is not synchronous communication (as in CSP; closest to that is the synchronous MPI_Ssend).
Semantics: upon return, the data buffer can safely be reused.
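To make the calling conventions concrete, here is a minimal sketch (not from the original slides) of a complete two-process program around the calls shown above; buffer size and tag value are arbitrary illustration choices.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
    int rank, data[4] = {0, 1, 2, 3};
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
      /* process 0 sends 4 ints to process 1 under tag 99 */
      MPI_Send(data, 4, MPI_INT, 1, 99, MPI_COMM_WORLD);
    } else if (rank == 1) {
      /* process 1 blocks until the data have been transferred */
      MPI_Recv(data, 4, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
      printf("got 4 ints from rank %d\n", status.MPI_SOURCE);
    }

    MPI_Finalize();
    return 0;
  }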
Basic message-passing: point-to-point communication
Non-blocking communication: MPI_Isend/MPI_Irecv
MPI_Isend(&data,count,type,j,tag,comm,&req);
MPI_Irecv(&data,count,type,i,tag,comm,&req);
The receiving process returns immediately; the data buffer must not be touched until completion.
Explicit completion: MPI_Wait(&req,&status), …
Design principle: the MPI specification shall not enforce internal buffering; all communication memory is in user space…
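A minimal sketch (mine, not the slides') of the typical non-blocking pattern: post both operations, overlap independent work, then complete explicitly; it assumes exactly two processes.

  #include <mpi.h>

  int main(int argc, char *argv[])
  {
    int rank, peer, sendbuf = 42, recvbuf = 0;
    MPI_Request sreq, rreq;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;  /* assumes exactly 2 processes, for illustration */

    /* post both operations; buffers must not be touched until completion */
    MPI_Irecv(&recvbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &rreq);
    MPI_Isend(&sendbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &sreq);

    /* ... independent computation could overlap with communication here ... */

    /* explicit completion */
    MPI_Wait(&rreq, MPI_STATUS_IGNORE);
    MPI_Wait(&sreq, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
  }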
Basic message-passing: point-to-point communication
Non-blocking communication: MPI_Isend/MPI_Irecv
MPI_Isend(&data,count,type,j,tag,comm,&req);
MPI_Irecv(&data,count,type,i,tag,comm,&req);
Explicit completion: MPI_Wait(&req,&status), …
Design choice: no progress rule – communication will/must eventually happen.
Basic message-passing: point-to-point communication
MPI_Isend(&data,count,type,j,tag,comm,&req);
MPI_Irecv(&data,count,type,i,tag,comm,&req);
Completeness: MPI_Send, Isend, Issend, …, MPI_Recv, Irecv can be combined, and the semantics make sense.
Basic message-passing: point-to-point communication
MPI_Send(&data,count,type,j,tag,comm);
MPI_Recv(&data,count,type,i,tag,comm,&status);
MPI_Datatype describes the structure of the communication data buffer: basetypes MPI_INT, MPI_DOUBLE, …, and recursively applicable type constructors.
Basic message-passing: point-to-point communication
MPI_Send(&data,count,type,j,tag,comm);
MPI_Recv(&data,count,type,i,tag,comm,&status);
Orthogonality: any MPI_Datatype can be used in any communication operation.
Semantics: only the signatures of the data sent and the data received must match (performance!).
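As a hedged illustration of the type constructors (again not from the slides): a column of a row-major matrix can be described once with MPI_Type_vector and sent like any other buffer; the receiver may use a different but signature-matching description.

  #include <mpi.h>

  #define N 8

  int main(int argc, char *argv[])
  {
    double A[N][N] = {{0}};
    int rank;
    MPI_Datatype column;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* N blocks of 1 double, stride N doubles apart: one matrix column */
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0) {
      MPI_Send(&A[0][2], 1, column, 1, 0, MPI_COMM_WORLD);  /* send column 2 */
    } else if (rank == 1) {
      double col[N];  /* received as N contiguous doubles: only signatures must match */
      MPI_Recv(col, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
  }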
Other functionality (supporting library building)
•Attributes to describe MPI objects (communicators,
datatypes)
•Query functionality for MPI objects (MPI_Status)
•Error handlers to influence behavior on errors
•MPI_Group objects for manipulating ordered sets of processes
Often-heard objections/complaints:
"MPI is too large"
"MPI is the assembler of parallel computing"…
…and two answers:
"MPI is designed not to make easy things easy, but to make difficult things possible" – Gropp, EuroPVM/MPI 2004
Conjecture (tested at EuroPVM/MPI 2002): for any MPI feature there will be at least one (significant) user depending essentially on exactly this feature
Collective communication: patterns of process communication
Fundamental, well-studied, and useful parallel communication
patterns captured in MPI 1.0 as so-called collective
operations:
•MPI_Barrier(comm);
•MPI_Bcast(…,comm);
•MPI_Gather(…,comm); MPI_Scatter(…,comm);
•MPI_Allgather(…,comm);
•MPI_Alltoall(…,comm);
•MPI_Reduce(…,comm); MPI_Allreduce(…,comm);
•MPI_Reduce_scatter(…,comm);
•MPI_Scan(…,comm);
Semantics: all processes in comm participate; blocking; no tags
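A small sketch of two of these collectives in a common pattern (my example, not the slides'): the root broadcasts a parameter, each process computes a local partial value, and MPI_Allreduce combines the partial values on all processes.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
    int rank, size, n = 0;
    double local, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) n = 1000;              /* root chooses a problem size */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    local = (double)n / size;             /* stand-in for a local partial result */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("global = %f\n", global);

    MPI_Finalize();
    return 0;
  }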
Collective communication: patterns of process communication
Fundamental, well-studied, and useful parallel communication
patterns captured in MPI 1.0 as so-called collective
operations:
•MPI_Barrier(comm);
•MPI_Bcast(…,comm);
•MPI_Gather(…,comm); MPI_Scatter(…,comm);
•MPI_Allgather(…,comm);
•MPI_Alltoall(…,comm);
•MPI_Reduce(…,comm); MPI_Allreduce(…,comm);
•MPI_Reduce_scatter(…,comm);
•MPI_Scan(…,comm);
Completeness: MPI_Bcast dual of MPI_Reduce; MPI_Gather
dual of MPI_Scatter. Regular and irregular (vector) variants
Collective communication: patterns of process communication
Fundamental, well-studied, and useful parallel communication
patterns captured in MPI 1.0 as so-called collective
operations:
•MPI_Gatherv(…,comm); MPI_Scatterv(…,comm);
•MPI_Allgatherv(…,comm);
•MPI_Alltoallv(…,comm);
•MPI_Reduce_scatter(…,comm);
Completeness: MPI_Bcast dual of MPI_Reduce; MPI_Gather
dual of MPI_Scatter. Regular and irregular (vector) variants
Collective communication: patterns of process communication
Fundamental, well-studied, and useful parallel communication
patterns captured in MPI 1.0 as so-called collective
operations
Collectives capture complex patterns, often with non-trivial
algorithms and implementations: delegate work to library
implementer, save work for the application programmer
Obligation: the MPI implementation must be of sufficiently high quality – otherwise the application programmer will not use the library collectives and will implement their own.
This did (and does) happen! Likewise for datatypes: unused for a long time.
Collective communication: patterns of process communication
Fundamental, well-studied, and useful parallel communication
patterns captured in MPI 1.0 as so-called collective
operations
Collectives capture complex patterns, often with non-trivial
algorithms and implementations: delegate work to library
implementer, save work for the application programmer
Completeness: MPI makes it possible to (almost) implement MPI
collectives „on top of“ MPI point-to-point communication
Some exceptions for reductions, MPI_Op; datatypes
Collective communication: patterns of process communication
Fundamental, well-studied, and useful parallel communication
patterns captured in MPI 1.0 as so-called collective
operations
Collectives capture complex patterns, often with non-trivial
algorithms and implementations: delegate work to library
implementer, save work for the application programmer
Conjecture: well-implemented collective operations contribute significantly towards application "performance portability"
[Träff, Gropp, Thakur: Self-Consistent MPI Performance Guidelines.
IEEE TPDS 21(5): 698-709, 2010]
Three algorithms for matrix-vector multiplication
An m×n matrix A and an n-element vector y, distributed evenly across p MPI processes (process 0, …, process p-1): compute z = Ay
Algorithm 1:
•Row-wise matrix distribution
•Each process needs the full vector y: MPI_Allgather
•Compute blocks of the result vector locally (a code sketch follows below)
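A hedged sketch of Algorithm 1 for the regular case (p divides both m and n); the routine and variable names are mine, error handling is omitted, and it is meant to be called from an already initialized MPI program.

  #include <stdlib.h>
  #include <mpi.h>

  /* Algorithm 1, regular case: each process owns m/p rows of A and n/p
     elements of y; gather the full vector, then multiply locally. */
  void matvec_rowwise(int m, int n, int p,
                      const double *localA,  /* (m/p) x n row block     */
                      const double *localy,  /* n/p local vector block  */
                      double *localz,        /* m/p result block (out)  */
                      MPI_Comm comm)
  {
    int i, j, rows = m / p;
    double *y = malloc(n * sizeof(double));

    /* every process obtains the full vector y */
    MPI_Allgather(localy, n / p, MPI_DOUBLE, y, n / p, MPI_DOUBLE, comm);

    for (i = 0; i < rows; i++) {
      localz[i] = 0.0;
      for (j = 0; j < n; j++) localz[i] += localA[i * n + j] * y[j];
    }
    free(y);
  }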
Three algorithms for matrix-vector multiplication
Algorithm 2:
•Column-wise matrix distribution
•Compute a local partial result vector
•MPI_Reduce_scatter to sum and distribute the partial results
Three algorithms for matrix-vector multiplication
Algorithm 3:
•Matrix distribution into blocks of m/r × n/c elements
•Algorithm 1 on columns (MPI_Allgather within each process column)
•Algorithm 2 on rows
Three algorithms for matrix-vector multiplication
Algorithm 3 (p = rc):
•Matrix distribution into blocks of m/r × n/c elements
•Algorithm 1 on columns (MPI_Allgather)
•Algorithm 2 on rows (MPI_Reduce_scatter)
Algorithm 3 is more scalable. Partitioning the set of processes (new communicators) is essential – see the MPI_Comm_split sketch below.
Interfaces that do not support collectives on subsets of processes are not able to express Algorithm 3: case in point, UPC.
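A minimal sketch (my variable names) of the process partitioning Algorithm 3 relies on, using MPI_Comm_split to build a row communicator and a column communicator; it assumes the number of processes is a multiple of the chosen number of columns c.

  #include <mpi.h>

  int main(int argc, char *argv[])
  {
    int rank, c = 4;   /* c process columns; assumes p is a multiple of c */
    MPI_Comm rowcomm, colcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* same color = same row (resp. column); key orders the processes within it */
    MPI_Comm_split(MPI_COMM_WORLD, rank / c, rank % c, &rowcomm);
    MPI_Comm_split(MPI_COMM_WORLD, rank % c, rank / c, &colcomm);

    /* Algorithm 3: MPI_Allgather within colcomm, MPI_Reduce_scatter within rowcomm */

    MPI_Comm_free(&rowcomm);
    MPI_Comm_free(&colcomm);
    MPI_Finalize();
    return 0;
  }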
Three algorithms for matrix-vector multiplication
For the „regular“ case where p divides n (and p=rc)
•Regular collectives: MPI_Allgather, MPI_Reduce_scatter
For the „irregular“ case
•Irregular collectives: MPI_Allgatherv, MPI_Reduce_scatter
MPI 1.0 defined regular/irregular versions – completeness –
for all the considered collective patterns; except for
MPI_Reduce_scatter
Performance: the irregular collectives subsume their regular counterparts; but much better algorithms are known for the regular ones
[R. A. van de Geijn, J. Watts: SUMMA: scalable
universal matrix multiplication algorithm. Concurrency -
Practice and Experience 9(4): 255-274 (1997)]
[Ernie Chan, Marcel Heimlich, Avi Purkayastha, Robert A. van de Geijn:
Collective communication: theory, practice, and experience. Concurrency
and Computation: Practice and Experience 19(13): 1749-1783 (2007)]
[F. G. van Zee, E. Chan, R. A. van de Geijn, E. S.
Quintana-Ortí, G. Quintana-Ortí: The libflame Library
for Dense Matrix Computations. Computing in Science
and Engineering 11(6): 56-63 (2009)]
A lesson: Dense Linear Algebra and (regular) collective
communication as offered by MPI go hand in hand
Note: most of these collective communication algorithms are a factor of 2 off from the best possible
Another example: integer (bucket) sort
n integers in a given range [0,R-1], distributed evenly across p MPI processes: m = n/p integers per process
[Figure: local array A = 0 1 3 0 0 2 0 1 …, with local bucket counts B = 4 2 1 3]
Step 1: bucket sort locally; let B[i] be the number of local elements with key i
Step 2: MPI_Allreduce(B,AllB,R,MPI_INT,MPI_SUM,comm);
Step 3: MPI_Exscan(B,RelB,R,MPI_INT,MPI_SUM,comm);
Now: element A[j] needs to go to position AllB[A[j]-1]+RelB[A[j]]+j’
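A compact sketch of Steps 1-3 (the helper name is mine); note that MPI_Exscan leaves the result on rank 0 undefined, so it is zeroed explicitly here.

  #include <mpi.h>

  /* Steps 1-3 for R possible keys: B = local counts, AllB = global counts,
     RelB = counts on lower-ranked processes (exclusive prefix sums over ranks) */
  void count_keys(const int *A, int m, int R, MPI_Comm comm,
                  int *B, int *AllB, int *RelB)
  {
    int j, rank;

    MPI_Comm_rank(comm, &rank);

    for (j = 0; j < R; j++) B[j] = 0;
    for (j = 0; j < m; j++) B[A[j]]++;                   /* Step 1: local bucket counts */

    MPI_Allreduce(B, AllB, R, MPI_INT, MPI_SUM, comm);   /* Step 2 */
    MPI_Exscan(B, RelB, R, MPI_INT, MPI_SUM, comm);      /* Step 3 */
    if (rank == 0)                /* MPI_Exscan leaves rank 0's result undefined */
      for (j = 0; j < R; j++) RelB[j] = 0;
  }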
Another example: integer (bucket) sort
Step 4: compute the number of elements to be sent to each other process, sendelts[i], i=0,…,p-1
Step 5: MPI_Alltoall(sendelts,1,MPI_INT,recvelts,1,MPI_INT,comm);
Step 6: redistribute the elements:
MPI_Alltoallv(A,sendelts,sdispls,…,comm);
Another example: integer (bucket) sort
The algorithm is stable (the building block for radix sort).
The choice of radix R depends on properties of the network (fully connected, fat tree, mesh/torus, …) and on the quality of the reduction/scan algorithms.
The algorithm is portable (by virtue of the MPI collectives), but tuning depends on the system – a concrete performance model is needed, but this is outside the scope of MPI.
Note: on a strong network, T(MPI_Allreduce(m)) = O(m + log p), NOT O(m log p)
A last feature
Process topologies:
Specify the application communication pattern (as either a directed graph or a Cartesian grid) to the MPI library, and let the library assign processes to processors so as to improve communication following the specified pattern.
MPI version: collective communicator construction functions,
process ranks in new communicator represent new (improved)
mapping
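A minimal sketch of the Cartesian flavor of this interface (grid shape and periodicity are arbitrary choices here): reorder = 1 allows the library to renumber the processes, i.e., to compute the improved mapping.

  #include <mpi.h>

  int main(int argc, char *argv[])
  {
    int dims[2] = {0, 0}, periods[2] = {1, 1}, coords[2];
    int size, rank, left, right;
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Dims_create(size, 2, dims);            /* factor p into a 2D grid */
    /* reorder = 1: the library may renumber processes for a better mapping */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

    MPI_Comm_rank(cart, &rank);                /* rank in the new communicator */
    MPI_Cart_coords(cart, rank, 2, coords);    /* own grid coordinates */
    MPI_Cart_shift(cart, 1, 1, &left, &right); /* neighbors along dimension 1 */

    /* point-to-point or collective communication now uses the ranks in cart */

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
  }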
And a very last: (simple) tool building support – the MPI profiling
interface
The mistakes
•MPI_Cancel(), semantically ill-defined, difficult to implement; a
concession to RT?
•MPI_Rsend(); vendors got too much leverage?
•MPI_Pack/Unpack; was added as an afterthought in last 1994
meetings
•Some functions enforce full copy of argument (list)s into
library
The mistakes
•MPI_Cancel(), semantically ill-defined, difficult to implement; a
concession to RT? (but useful for certain patterns, e.g., double
buffering, client-server-like, …)
•MPI_Rsend(); vendors got too much leverage? (but
advantageous in some scenarios)
•MPI_Pack/Unpack; was added as an afterthought in last 1994
meetings (functionality is useful/needed, limitations in the
specification)
•Some functions enforce full copy of argument (list)s into
library
Missing functionality
•Datatype query functions – not possible to query/reconstruct
structure specified by given datatype
•Some MPI objects are not first class citizens (MPI_Aint,
MPI_Op, MPI_Datatype); makes it difficult to build certain
types of libraries
•Reductions cannot be performed locally
Is MPI scalable?
Definition: an MPI construct is non-scalable if its memory or time overhead(*) is Ω(p), p being the number of processes.
(*) overhead that cannot be accounted for in the application
Questions:
•Are there aspects of the MPI specification that are non-scalable (force Ω(p) memory or time)?
•Are there aspects of (typical) MPI implementations that are non-scalable?
The question must distinguish between specification and implementation.
The answer is "yes" to both questions.
Example: irregular collective all-to-all communication (each process exchanges some data with each other process)
MPI_Alltoallw(sendbuf,sendcounts[],senddispls[],sendtypes[],
recvbuf,recvcounts[],recvdispls[],recvtypes[],…)
takes 6 p-sized arrays (of 4- or 8-byte integers) ~ 5 MBytes, 10% of the memory on BlueGene/L.
Sparse usage pattern: often each process exchanges with only a few neighbors, so most send/recvcounts[i] = 0.
MPI_Alltoallw is non-scalable.
Experiment: sendcounts[i] = 0, recvcounts[i] = 0 for all processes and all i – entails no communication.
[Plot: measured time on the Argonne National Laboratory BlueGene/L]
[Balaji, …, Träff: MPI on millions of cores. Parallel Processing Letters 21(1): 45-60, 2011]
Definitely non-scalable features in MPI 1.0
•Irregular collectives: p-sized lists of counts, displacements,
types
•Graph topology interface: requires specification of full
process topology (communication graph) by all processes
(Cartesian topology interface is perfectly scalable, and much
used)
MPI 2: what (almost) went wrong
A number of issues/desired functionality were left open by
MPI 1.0, either because of
•no agreement
•deadline, desire to get a consolidated standard out in time
Major open issues
•Parallel IO
•One-sided communication
•Dynamic process management
were partly described in the so-called JOD: "Journal of Development" (see www.mpi-forum.org)
The challenge from PVM; the challenge from SHMEM…
The MPI Forum started to reconvene already in 1995.
Between 1995 and 1997 there were 16 meetings, which led to MPI 2.0.
MPI 1.0:
226 pages
MPI 2.0: 356
additional pages
Major new features, with new concepts: extended message
passing models
1. Dynamic process management
2. One-sided communication
3. MPI-IO
1. Dynamic process management:
MPI 1.0 was completely static: a communicator cannot change
(design principle: no MPI object can change; new objects can
be created and old ones destroyed), so the number of
processes in MPI_COMM_WORLD cannot change: therefore
not possible to add or remove processes from a running
application
MPI 2.0 process management relies on inter-communicators
(from MPI 1.0) to establish communication with newly started
processes or already running applications
•MPI_Comm_spawn
•MPI_Comm_connect/MPI_Comm_accept
•MPI_Intercomm_merge
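A hedged sketch of the parent side of these calls; the worker executable name "worker" and the number of spawned processes are placeholders.

  #include <mpi.h>

  int main(int argc, char *argv[])
  {
    MPI_Comm children, everyone;

    MPI_Init(&argc, &argv);

    /* start 4 additional processes running the (hypothetical) "worker" binary */
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL, 0,
                   MPI_COMM_WORLD, &children, MPI_ERRCODES_IGNORE);

    /* turn the parent-children inter-communicator into one intra-communicator */
    MPI_Intercomm_merge(children, 0, &everyone);

    /* ... collectives over "everyone" now span old and new processes ... */

    MPI_Comm_free(&everyone);
    MPI_Comm_free(&children);
    MPI_Finalize();
    return 0;
  }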
1. Dynamic process management:
MPI 1.0 was completely static: a communicator cannot change
(design principle: no MPI object can change; new objects can
be created and old ones destroyed), so the number of
processes in MPI_COMM_WORLD cannot change: therefore
not possible to add or remove processes from a running
application
What if a process (in a communicator) dies? The fault-
tolerance problem
Most (all) MPI implementations also die – but this may be an
implementation issue
1. Dynamic process management:
MPI 1.0 was completely static: a communicator cannot change
(design principle: no MPI object can change; new objects can
be created and old ones destroyed), so the number of
processes in MPI_COMM_WORLD cannot change: therefore
not possible to add or remove processes from a running
application
What if a process (in a communicator) dies? The fault-
tolerance problem
If implementation does not die, it might be possible to
program around/isolate faults using MPI 1.0 error handlers
and inter-communicators
[W. Gropp, E. Lusk: Fault Tolerance in Message Passing Interface
Programs. IJHPCA 18(3): 363-372, 2004]
1. Dynamic process management:
MPI 1.0 was completely static: a communicator cannot change
(design principle: no MPI object can change; new objects can
be created and old ones destroyed), so the number of
processes in MPI_COMM_WORLD cannot change: therefore
not possible to add or remove processes from a running
application
What if a process (in a communicator) dies? The fault-
tolerance problem
The issue is contentious & contagious…
2. One-sided communication
Motivations/arguments:
•Expressivity/convenience: For applications where only one
process may readily know with which process to communicate
data, the point-to-point message-passing communication model
may be inconvenient
•Performance: On some architectures point-to-point
communication could be inefficient; e.g. if shared-memory is
available
Challenge: define a model that captures the essence of one-
sided communication, but can be implemented without requiring
specific hardware support
2. One-sided communication
Challenge: define a model that captures the essence of one-
sided communication, but can be implemented without requiring
specific hardware support
New MPI 2.0 concepts: communication window, communication
epoch
MPI one-sided model cleanly separates communication from
synchronization; three specific synchronization mechanisms
•MPI_Win_fence
•MPI_Win_start/complete/post/wait
•MPI_Win_lock/unlock
with cleverly thought out semantics and memory model
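A minimal sketch of the MPI 2.0 model with fence synchronization (target rank and window size are illustration choices, and at least two processes are assumed): the epoch between the two fences is where the MPI_Put is allowed to take effect.

  #include <mpi.h>

  int main(int argc, char *argv[])
  {
    int rank, localval, window_mem = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    localval = rank;

    /* every process exposes one int of its own memory in the window */
    MPI_Win_create(&window_mem, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                  /* open an access/exposure epoch */
    if (rank == 0)                          /* rank 0 writes into rank 1's window */
      MPI_Put(&localval, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                  /* close the epoch: puts are complete */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
  }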
2. One-sided communication
MPI one-sided model cleanly separates communication from
synchronization; three specific synchronization mechanisms
•MPI_Win_fence
•MPI_Win_start/complete/post/wait
•MPI_Win_lock/unlock
with cleverly thought out semantics and memory model
Unfortunately, application programmers did not seem to like it
•“too complicated“
•“too rigid”
•“not efficient“
•…
3. MPI-IO
Communication with external (disk/file) memory. Could leverage
MPI concepts and implementations:
•Datatypes to describe file structure
•Collective communication for utilizing local file systems
•Fast communication
MPI datatype mechanism is essential, and the power of this
concept starts to become clear
MPI 2.0 introduces (inelegant!) functionality to decode a datatype, i.e., to discover the structure described by the datatype. Needed for an MPI-IO implementation (on top of MPI), and it supports library building.
Take note:
Apart from MPI-IO (ROMIO), the MPI 2.0 standardization
process was not followed by prototype implementations
New concept (IO only): split collectives
Not discussed:
Thread-support/compliance, the ability of MPI to work in a
threaded environment
•MPI 1.0: the design is (largely; exception: MPI_Probe/MPI_Recv) thread safe; recommendation that MPI implementations be thread safe (contrast: the PVM design)
•MPI 2.0: the level of thread support can be requested and queried; an MPI library is not required to support the requested level, but returns information on the (possibly lower) level that it does support – as sketched below
MPI_THREAD_SINGLE
MPI_THREAD_FUNNELED
MPI_THREAD_SERIALIZED
MPI_THREAD_MULTIPLE
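A minimal sketch of the request/query mechanism: the application asks for a level with MPI_Init_thread and must check what the library actually provides.

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
    int provided;

    /* ask for full multi-threaded support ... */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    /* ... but the library only promises what it returns in "provided" */
    if (provided < MPI_THREAD_MULTIPLE)
      printf("multiple threads may not call MPI concurrently here\n");

    MPI_Finalize();
    return 0;
  }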
Quiet years: 1997-2006
No standardization activity from 1997
MPI 2.0 implementations
•Fujitsu (claim) 1999
•NEC 2000
•mpich 2004
•Open MPI 2005
•LAM/MPI 2005(?)
•…
Ca. 2006, most/many implementations supported (mostly) full MPI 2.0
Implementations evolved and improved; MPI was an interesting
topic to work on, good MPI work was/is acceptable to all parallel
computing conferences (SC, IPDPS, ICPP, Euro-Par, PPoPP, SPAA)
[J. L. Träff, H. Ritzdorf, R. Hempel:
The Implementation of MPI-2 One-
Sided Communication for the NEC
SX-5. SC 2000]
Euro(PVM/)MPI conference series: dedicated to MPI (…biased towards MPI implementation)
2012 Wien, Austria
2011 Santorini, Greece
2010 Stuttgart, Germany – EuroMPI (no longer PVM)
2009 Helsinki, Finland – EuroPVM/MPI
2008 Dublin, Ireland
2007 Paris, France
2006 Bonn, Germany
2005 Sorrento, Italy
2004 Budapest, Hungary
2003 Venice, Italy
2002 Linz, Austria
2001 Santorini, Greece (9.11 – did not actually take place)
2000 Balatonfüred, Hungary
1999 Barcelona, Spain
1998 Liverpool, UK
1997 Cracow, Poland – now EuroPVM/MPI
1996 Munich, Germany
1995 Lyon, France
1994 Rome, Italy – EuroPVM
(The slide's timeline also marks the MPI Forum meetings.)
[Thanks to Xavier Vigouroux, Vienna 2012]
Bonn 2006: discussions ("Open Forum") on restarting the MPI Forum
The MPI 2.2 – MPI 3.0 process
Late 2007: the MPI Forum reconvenes – again.
Consolidate the standard: MPI 1.2 and MPI 2.0 merged into a single standard document, MPI 2.1 (Sept. 4th, 2008).
MPI 2.2: an intermediate step towards 3.0 (586 pages → 623 pages)
•Address scalability problems
•Add missing functionality
•BUT preserve backwards compatibility
Some MPI 2.2 features
•Addressing scalability problems: new topology interface,
application communication graph is specified in a distributed
fashion
•Library building: MPI_Reduce_local
•Missing function: regular MPI_Reduce_scatter_block
•More flexible MPI_Comm_create (more in MPI 3.0:
MPI_Comm_split_type)
•New datatypes, e.g. MPI_AINT
[T. Hoefler, R. Rabenseifner, H. Ritzdorf, B. R. de Supinski,
R. Thakur, J. L. Träff: The scalable process topology
interface of MPI 2.2. Concurrency and Computation: Practice
and Experience 23(4): 293-310, 2011]
Some MPI 2.2 features
C++ bindings (since MPI 2.0) deprecated! With the intention
that they will be removed
MPI_Op, MPI_Datatype still not first class citizens (datatype
support is weak and cumbersome)
Fortran bindings modernized and corrected
MPI 2.1, 2.2, and 3.0 process
•7 meetings 2008
•6 meetings 2009
•7 meetings 2010
•6 meetings 2011
•6 meetings 2012
Total: 32 meetings (and counting…)
Recall:
•MPI 1: 7 meetings
•MPI 2.0: 16 meetings
MPI Forum rules: presence at physical meetings with a history
(presence at past two meetings) required to vote
Requirement: new functionality must be supported by use-case
and prototype implementation; backwards compatibility not strict
MPI 2.2 – MPI 3.0 process had working groups on
•Collective Operations
•Fault Tolerance
•Fortran bindings
•Generalized requests ("on hold")
•Hybrid Programming
•Point to point (this working group is "on hold")
•Remote Memory Access
•Tools
•MPI subsetting ("on hold")
•Backward Compatibility
•Miscellaneous Items
•Persistence
MPI 3.0: new features, new themes, new opportunities
Major new functionalities:
1. Non-blocking collectives
2. Sparse collectives
3. New one-sided communication
4. Performance tool support
Deprecated functions removed: C++ interface has gone
MPI 3.0, 21 September 2012: 822 pages
Implementation status: mpich should cover MPI 3.0
Performance/quality?
1. Non-blocking collectives
Introduced for performance (overlap) and convenience reasons
Similarly to non-blocking point-to-point routines; MPI_Request
object to check and enforce progress
Sound semantics based on ordering, no tags
Different from point-to-point (with good reason): blocking and non-blocking collectives do not mix and match: it is incorrect to match an MPI_Ibcast() with an MPI_Bcast()
Incomplete: non-blocking versions for some other collectives (MPI_Comm_idup)
Non-orthogonal: split collectives and non-blocking collectives
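A small sketch of the non-blocking collective pattern (my example): start the broadcast, overlap independent work, then complete explicitly; as for point-to-point, completion goes through an MPI_Request.

  #include <mpi.h>

  int main(int argc, char *argv[])
  {
    int data[1024] = {0};
    MPI_Request req;

    MPI_Init(&argc, &argv);

    /* all processes start the broadcast from root 0 */
    MPI_Ibcast(data, 1024, MPI_INT, 0, MPI_COMM_WORLD, &req);

    /* ... computation not involving "data" can overlap here ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* explicit completion */

    MPI_Finalize();
    return 0;
  }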
2. Sparse collectives
Addresses scalability problem of irregular collectives.
Neighborhood specified with topology functionality
MPI_Neighbor_allgather(…,comm);
MPI_Neighbor_allgatherv(…,comm);
MPI_Neighbor_alltoall(…,comm);
MPI_Neighbor_alltoallv(…,comm);
MPI_Neighbor_alltoallw(…,comm);
and corresponding non-blocking versions
[T. Hoefler, J. L. Träff: Sparse collective operations for MPI. IPDPS
2009]
Will users take up? Optimization potential?
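A hedged sketch of how the pieces fit together: the neighborhood comes from a distributed graph topology (here a simple ring built with MPI_Dist_graph_create_adjacent, assuming at least three processes so the two neighbors are distinct), and MPI_Neighbor_allgather then exchanges one int with each neighbor.

  #include <mpi.h>

  int main(int argc, char *argv[])
  {
    int rank, size, sources[2], destinations[2], sendval, recvvals[2];
    MPI_Comm ringcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* a ring: each process receives from and sends to its two neighbors */
    sources[0] = destinations[0] = (rank + size - 1) % size;
    sources[1] = destinations[1] = (rank + 1) % size;

    MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                   2, sources, MPI_UNWEIGHTED,
                                   2, destinations, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0 /* no reorder */, &ringcomm);

    sendval = rank;
    /* one int goes to each outgoing neighbor; one arrives from each incoming one */
    MPI_Neighbor_allgather(&sendval, 1, MPI_INT,
                           recvvals, 1, MPI_INT, ringcomm);

    MPI_Comm_free(&ringcomm);
    MPI_Finalize();
    return 0;
  }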
3. One-sided communication
Model extension for better performance on hybrid/shared-memory systems
Atomic operations (lacking in the MPI 2.0 model)
Per-operation local completion, MPI_Rget, MPI_Rput, … (but only for passive target synchronization)
[Hoefler, Dinan, Buntinas, Balaji, Barrett, Brightwell, Gropp, Kale, Thakur: Leveraging MPI's one-sided communication for shared-memory programming. EuroMPI 2012, LNCS 7490, 133-141, 2012]
4. Performance tool support
The MPI 1.0 problem of allowing only one profiling interface at a time (linker interception of MPI calls) is NOT solved
Functionality added to query certain internals of the MPI
library
Will tool writers take up?
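For reference, a sketch of the classic linker-interception mechanism the slide refers to: a tool provides its own MPI_Send and forwards to the name-shifted PMPI_Send; since there is only one name shift, only one such tool can be interposed at a time, which is exactly the unsolved problem. The counting statistic is an illustration of mine.

  #include <stdio.h>
  #include <mpi.h>

  static long sendcount = 0;   /* simple per-process statistic */

  /* the tool's MPI_Send replaces the library's at link time ... */
  int MPI_Send(const void *buf, int count, MPI_Datatype type,
               int dest, int tag, MPI_Comm comm)
  {
    sendcount++;
    /* ... and forwards to the real implementation via the PMPI_ name */
    return PMPI_Send(buf, count, type, dest, tag, comm);
  }

  int MPI_Finalize(void)
  {
    printf("MPI_Send called %ld times\n", sendcount);
    return PMPI_Finalize();
  }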
MPI at a turning point
Extremely large-scale systems
now appearing stretch the
scalability of MPI
Is MPI for exascale systems:
•heterogeneous?
•memory constrained?
•with low bisection width?
•unreliable?
MPI Forum at a turning point
Attendance large enough?
Attendance broad enough?
The MPI 2.1-MPI 3.0 process has been long and exhausting: attendance driven by implementers, relatively little input from users and applications, non-technical goals have played a role; research was conducted that did not lead to a useful outcome for the standard (fault tolerance, thread/hybrid support, persistence, …)
Perhaps time to take a break?
More meetings,
smaller attendance
The recent votes (MPI 2.1 to MPI 3.0)
MPI 2.1, first vote (June 30 - July 2, 2008): YES 22, NO 0, ABSTAIN 0, MISSED 0 – passed
MPI 1.3, second vote (September 3-5, 2008): YES 22, NO 1, ABSTAIN 0, MISSED 0 – passed
MPI 2.1, second vote: YES 23, NO 0, ABSTAIN 0, MISSED 0 – passed
MPI 2.2 (September 2-4, 2009): YES 24, NO 1, ABSTAIN 0, MISSED 0 – passed
MPI 3.0 (September 20-21, 2012): YES 17, NO/ABSTAIN 0/0 – passed; change/discussion of rules
See meetings.mpi-forum.org/secretary/
Summary:
Study history and learn from it: how to do better than MPI
Standardization is a major effort; it has taken a lot of dedication and effort from a relatively large (but declining?) group of people and institutions/companies.
MPI 3.0 will raise many new implementation challenges
MPI 3.0 is not the end of the (hi)story
Thanks to the MPI Forum; discussions with Bill Gropp, Rusty Lusk, Rajeev Thakur, Jeff Squyres, and others
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 

Dernier (20)

Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 

MPI History

  • 10. Intel iPSC/2, ca. 1990; Intel Paragon, ca. 1992; IBM SP/2, ca. 1996; Thinking Machines CM5, ca. 1994
  • 11. MasPar (1987-1996) MP2; Thinking Machines (1982-94) CM2, CM5; KSR 2, ca. 1992; Denelcor HEP, mid-1980s
  • 12. Ironically… despite the algorithmically stronger properties of shared-memory models (like the PRAM) and their potential for scaling to much, much larger numbers of processors, in practice high-performance systems with (quite) substantial parallelism have all been distributed-memory systems, and the corresponding de facto standard – MPI (the Message-Passing Interface) – is much stronger than (say) OpenMP
  • 13. Sources of MPI: the early years. Commercial vendors and national laboratories (including many European) needed practically working programming support for their machines and applications. The early 1990s were fruitful years for practical parallel computing (funding for „grand challenge“ and „star wars“). Vendors and labs proposed and maintained their own languages, interfaces, and libraries for parallel programming (early 1990s):
    •Intel NX, Express, Zipcode, PARMACS, IBM EUI/CCL, PVM, P4, OCCAM, …
  • 14. •Intel NX, Express, Zipcode, PARMACS, IBM EUI/CCL, PVM, P4, OCCAM, … were intended for distributed-memory machines and centered around similar concepts. Similar enough to warrant an effort towards creating a common standard for message-passing based parallel programming. Portability problem: wasted effort in maintaining an own interface for a small user group, lack of portability across systems
  • 15. Message-passing interfaces/languages of the early 1990s:
    •Intel NX: send-receive message passing (non-blocking, buffering?), tags (tag groups?), no group concept, some collectives, weak encapsulation
    •IBM EUI: point-to-point and collectives (more than in MPI), group concept, high performance (??) [Snir et al.]
    •IBM CCL: point-to-point and collectives, encapsulation
    •Zipcode/Express: point-to-point, emphasis on library building [Skjellum]
    •PARMACS/Express: point-to-point, topological mapping [Hempel]
    •PVM: point-to-point communication, some collectives, virtual machine abstraction, fault-tolerance
  • 16. Some odd men out:
    •Linda: tuple space get/put – a first PGAS approach?
    •Active messages: seems to presuppose an SPMD model?
    •OCCAM: too strictly CSP-based, synchronous message passing?
    •PVM: heterogeneous systems, fault-tolerance, …
    [Hempel, Hey, McBryan, Walker: Special Issue – Message Passing Interfaces. Parallel Computing 29(4), 1994]
  • 17. Standardization: the MPI Forum and MPI 1.0. A standardization effort was started early 1992; key people: Dongarra, Hempel, Hey, Walker. Goal: to come out within a few years' time frame with a standard for message-passing parallel programming, building on lessons learned from existing interfaces/languages.
    •Not a research effort (as such)!
    •Open to participation from all interested parties
    [Hempel, Walker: The emergence of the MPI message passing standard for parallel computing. Computer Standards & Interfaces, 21: 51-62, 1999]
  • 18. Key technical design points – MPI should encompass and enable:
    •Basic message-passing and related functionality (collective communication!)
    •Library building: safe encapsulation of messages (and other things, e.g. query functionality)
    •High performance, across all available and future systems!
    •Scalable design
    •Support for C and Fortran
  • 19. The MPI Forum. Not an ANSI/IEEE standardization body; nobody „owns“ the MPI standard; „free“. Open to participation for all interested parties; protocols open (votes, email discussions). Regular meetings at 6-8 week intervals. Those who participate at meetings (with a history) can vote, one vote per organization (current discussion: quorum, semantics of abstaining). The 1st MPI Forum set out to work early 1993
  • 20. After 7 meetings, the 1st version of the MPI standard was ready early 1994; two finalizing meetings in February 1994. MPI: A Message-Passing Interface standard, May 5th, 1994. The standard is the 226-page PDF document that can be found at www.mpi-forum.org, as voted by the MPI Forum. Errata and minor adjustments: MPI 1.0, 1.1, 1.2: 1994-1995
  • 21. Take note: the MPI 1 standardization process was accompanied hand-in-hand by a(n amazingly good) prototype implementation: mpich from Argonne National Laboratory (Gropp, Lusk, …). Other parties and vendors could build on this implementation (and did!), so that MPI was quickly supported on many parallel systems. [W. Gropp, E. L. Lusk, N. E. Doss, A. Skjellum: A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard. Parallel Computing 22(6): 789-828, 1996]
  • 22. Why MPI has been successful: an appreciation. MPI made some fundamental
    •abstractions, but is still close enough to common architectures to allow efficient, low-overhead implementations („MPI is the assembler of parallel computing…“);
    •is formulated with care and precision, but is not a formal specification;
    •is complete (to a high degree), based on few, powerful, largely orthogonal key concepts (few exceptions, few optionals);
    •and made few mistakes
  • 23. Message-passing abstraction. [Figure: processes i, j, k, l, m connected through a communication medium.] Entities: MPI processes, which can communicate through a communication medium; they can be implemented as „processes“ (most MPI implementations), „threads“, … Communication medium: a concrete network, …, the nature of which is of no concern to the MPI standard:
    •No explicit requirements on network structure or capabilities
    •No performance model or requirements
  • 24. Basic message-passing: point-to-point communication.
    MPI_Send(&data,count,type,j,tag,comm); on process i
    MPI_Recv(&data,count,type,i,tag,comm,&status); on process j
    Only processes in the same communicator – a ranked set of processes with a unique „context“ – can communicate. Fundamental library-building concept: isolates communication in library routines from application communication
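To make the calling convention above concrete, a minimal complete sketch (not from the slides; the buffer contents, tag value 0, and ranks 0/1 are illustrative assumptions):

    #include <mpi.h>
    /* Rank 0 sends one int to rank 1 over MPI_COMM_WORLD. */
    int main(int argc, char **argv)
    {
        int rank, data = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            data = 42;
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Status status;
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        }
        MPI_Finalize();
        return 0;
    }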
  • 25. Basic message-passing: point-to-point communication. The receiving process blocks until the data have been transferred. The MPI implementation must ensure reliable transmission; no timeout (see RT-MPI). Semantics: messages from the same sender are delivered in order; it is possible to write fully deterministic programs
  • 26. Basic message-passing: point-to-point communication. The receiving process blocks until the data have been transferred. The sending process may block or not… this is not synchronous communication (as in CSP; close to this: the synchronous MPI_Ssend). Semantics: upon return, the data buffer can safely be reused
  • 27. Basic message-passing: point-to-point communication.
    MPI_Isend(&data,count,type,j,tag,comm,&req);
    MPI_Irecv(&data,count,type,i,tag,comm,&req);
    Non-blocking communication: MPI_Isend/MPI_Irecv. The receiving process returns immediately; the data buffer must not be touched. Explicit completion: MPI_Wait(&req,&status), … Design principle: the MPI specification shall not enforce internal buffering; all communication memory is in user space…
  • 28. Basic message-passing: point-to-point communication. Non-blocking communication: MPI_Isend/MPI_Irecv; the receiving process returns immediately, the data buffer must not be touched. Explicit completion: MPI_Wait(&req,&status), … Design choice: no progress rule – communication will/must eventually happen
  • 29. Basic message-passing: point-to-point communication. Completeness: MPI_Send, Isend, Issend, …, MPI_Recv, Irecv can be combined, and the semantics make sense
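A sketch of the non-blocking variant with explicit completion (illustrative fragment; rank, sbuf, rbuf, count and comm are assumed to be set up elsewhere):

    /* Ranks 0 and 1 exchange buffers without relying on any send/receive ordering. */
    int other = (rank == 0) ? 1 : 0;
    MPI_Request reqs[2];
    MPI_Irecv(rbuf, count, MPI_DOUBLE, other, 0, comm, &reqs[0]);
    MPI_Isend(sbuf, count, MPI_DOUBLE, other, 0, comm, &reqs[1]);
    /* ... local work that touches neither sbuf nor rbuf ... */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);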
  • 30. Basic message-passing: point-to-point communication.
    MPI_Send(&data,count,type,j,tag,comm);
    MPI_Recv(&data,count,type,i,tag,comm,&status);
    The receiving process blocks until the data have been transferred. An MPI_Datatype describes the structure of the communication data buffer: base types MPI_INT, MPI_DOUBLE, …, and recursively applicable type constructors
  • 31. Basic message-passing: point-to-point communication. Orthogonality: any MPI_Datatype can be used in any communication operation. Semantics: only the signatures of the data sent and the data received must match (performance!)
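As an illustration of the type constructors, a sketch that sends one column of a row-major n×m matrix of doubles as a strided derived datatype (the names A, n, m, j, dest, tag, comm are assumptions):

    /* n blocks of 1 double each, consecutive blocks m doubles apart. */
    MPI_Datatype column;
    MPI_Type_vector(n, 1, m, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);
    MPI_Send(&A[0][j], 1, column, dest, tag, comm);   /* any operation may use it */
    MPI_Type_free(&column);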
  • 32. Other functionality (supporting library building):
    •Attributes to describe MPI objects (communicators, datatypes)
    •Query functionality for MPI objects (MPI_Status)
    •Error handlers to influence behavior on errors
    •MPI_Groups for manipulating ordered sets of processes
  • 33. Often heard objections/complaints, and two answers. „MPI is too large“; „MPI is the assembler of parallel computing“… „MPI is designed not to make easy things easy, but to make difficult things possible“ – Gropp, EuroPVM/MPI 2004. Conjecture (tested at EuroPVM/MPI 2002): for any MPI feature there will be at least one (significant) user depending essentially on exactly this feature
  • 34. Collective communication: patterns of process communication. Fundamental, well-studied, and useful parallel communication patterns are captured in MPI 1.0 as so-called collective operations:
    •MPI_Barrier(comm);
    •MPI_Bcast(…,comm);
    •MPI_Gather(…,comm); MPI_Scatter(…,comm);
    •MPI_Allgather(…,comm);
    •MPI_Alltoall(…,comm);
    •MPI_Reduce(…,comm); MPI_Allreduce(…,comm);
    •MPI_Reduce_scatter(…,comm);
    •MPI_Scan(…,comm);
    Semantics: all processes in comm participate; blocking; no tags
  • 35. Collective communication: patterns of process communication (the same set of operations). Completeness: MPI_Bcast is the dual of MPI_Reduce; MPI_Gather is the dual of MPI_Scatter. Regular and irregular (vector) variants
  • 36. Collective communication: patterns of process communication. The irregular (vector) variants:
    •MPI_Gatherv(…,comm); MPI_Scatterv(…,comm);
    •MPI_Allgatherv(…,comm);
    •MPI_Alltoallv(…,comm);
    •MPI_Reduce_scatter(…,comm);
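To illustrate the calling convention of the collectives, a small sketch (the buffer names and the root rank 0 are illustrative):

    /* All processes in comm participate; rank 0 acts as root of the broadcast. */
    int n;                               /* set at the root before the call */
    MPI_Bcast(&n, 1, MPI_INT, 0, comm);
    double local_sum = 0.0, global_sum;  /* local_sum computed locally beforehand */
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, comm);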
  • 37. Collective communication: patterns of process communication. Collectives capture complex patterns, often with non-trivial algorithms and implementations: delegate work to the library implementer, save work for the application programmer. Obligation: the MPI implementation must be of sufficiently high quality – otherwise application programmers will not use the collectives and will implement their own. This did (and does) happen! For datatypes: unused for a long time
  • 38. Collective communication: patterns of process communication. Completeness: MPI makes it possible to (almost) implement the MPI collectives „on top of“ MPI point-to-point communication. Some exceptions for reductions (MPI_Op) and datatypes
  • 39. Collective communication: patterns of process communication. Conjecture: well-implemented collective operations contribute significantly towards application „performance portability“. [Träff, Gropp, Thakur: Self-Consistent MPI Performance Guidelines. IEEE TPDS 21(5): 698-709, 2010]
  • 40. Three algorithms for matrix-vector multiplication. An m×n matrix A and an n-element vector y, distributed evenly across p MPI processes: compute z = Ay. [Figure: row-wise distribution of A and the vector across processes 0, …, p-1.] Algorithm 1:
    •Row-wise matrix distribution
    •Each process needs the full vector: MPI_Allgather
    •Compute a block of the result vector locally
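A sketch of Algorithm 1, under the simplifying assumption that p divides both m and n (Alocal, ylocal, zlocal and their allocation are illustrative; the fragment assumes <mpi.h> and <stdlib.h>):

    /* Row-wise distribution: each process owns m/p rows of A and n/p vector elements. */
    int rows = m / p, nloc = n / p;
    double *yfull = malloc(n * sizeof(double));
    MPI_Allgather(ylocal, nloc, MPI_DOUBLE, yfull, nloc, MPI_DOUBLE, comm);
    for (int i = 0; i < rows; i++) {          /* local block of the result */
        zlocal[i] = 0.0;
        for (int j = 0; j < n; j++)
            zlocal[i] += Alocal[i * n + j] * yfull[j];
    }
    free(yfull);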
  • 41. Three algorithms for matrix-vector multiplication. [Figure: column-wise distribution of A across processes 0, …, p-1.] Algorithm 2:
    •Column-wise matrix distribution
    •Compute a local partial result vector
    •MPI_Reduce_scatter to sum and distribute the partial results
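A corresponding sketch of Algorithm 2, under the same divisibility assumptions (names again illustrative; assumes <mpi.h> and <stdlib.h>):

    /* Column-wise distribution: each process owns n/p columns and computes a
       full-length partial result, summed and scattered by one collective. */
    int cols = n / p, rows_per_proc = m / p;
    double *partial = calloc(m, sizeof(double));
    for (int j = 0; j < cols; j++)
        for (int i = 0; i < m; i++)
            partial[i] += Alocal[i * cols + j] * ylocal[j];
    int recvcounts[p];
    for (int r = 0; r < p; r++) recvcounts[r] = rows_per_proc;
    MPI_Reduce_scatter(partial, zlocal, recvcounts, MPI_DOUBLE, MPI_SUM, comm);
    free(partial);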
  • 42. Three algorithms for matrix-vector multiplication. [Figure: 2D block distribution of A; MPI_Allgather within each process column.] Algorithm 3:
    •Matrix distribution into blocks of m/r × n/c elements
    •Algorithm 1 on columns
    •Algorithm 2 on rows
  • 43. Three algorithms for matrix-vector multiplication, p = rc. [Figure: 2D block distribution of A; MPI_Reduce_scatter within each process row.] Algorithm 3:
    •Matrix distribution into blocks of m/r × n/c elements
    •Algorithm 1 on columns
    •Algorithm 2 on rows
    Algorithm 3 is more scalable. Partitioning the set of processes (new communicators) is essential! Interfaces that do not support collectives on subsets of processes are not able to express Algorithm 3: case in point, UPC
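The essential MPI step for Algorithm 3 is partitioning the processes into row and column communicators; a sketch assuming p = r*c processes numbered row-major (the per-row/per-column phases then follow the Allgather/Reduce_scatter sketches above):

    /* Split comm into r row communicators and c column communicators. */
    int rank;
    MPI_Comm rowcomm, colcomm;
    MPI_Comm_rank(comm, &rank);
    int myrow = rank / c, mycol = rank % c;
    MPI_Comm_split(comm, myrow, mycol, &rowcomm);   /* same color = same row */
    MPI_Comm_split(comm, mycol, myrow, &colcomm);   /* same color = same column */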
  • 44. Three algorithms for matrix-vector multiplication. For the „regular“ case where p divides n (and p = rc): regular collectives MPI_Allgather, MPI_Reduce_scatter. For the „irregular“ case: irregular collectives MPI_Allgatherv, MPI_Reduce_scatter. MPI 1.0 defined regular/irregular versions – completeness – for all the considered collective patterns, except for MPI_Reduce_scatter. Performance: the irregular versions subsume their regular counterparts, but much better algorithms are known for the regular ones
  • 45. A lesson: dense linear algebra and (regular) collective communication as offered by MPI go hand in hand. Note: most of these collective communication algorithms are a factor 2 off from best possible.
    [R. A. van de Geijn, J. Watts: SUMMA: scalable universal matrix multiplication algorithm. Concurrency - Practice and Experience 9(4): 255-274 (1997)]
    [E. Chan, M. Heimlich, A. Purkayastha, R. A. van de Geijn: Collective communication: theory, practice, and experience. Concurrency and Computation: Practice and Experience 19(13): 1749-1783 (2007)]
    [F. G. van Zee, E. Chan, R. A. van de Geijn, E. S. Quintana-Ortí, G. Quintana-Ortí: The libflame Library for Dense Matrix Computations. Computing in Science and Engineering 11(6): 56-63 (2009)]
  • 46. Another example: integer (bucket) sort. n integers in a given range [0,R-1], distributed evenly across p MPI processes: m = n/p integers per process.
    Step 1: bucket sort locally; let B[i] be the number of elements with key i
    Step 2: MPI_Allreduce(B,AllB,R,MPI_INT,MPI_SUM,comm);
    Step 3: MPI_Exscan(B,RelB,R,MPI_INT,MPI_SUM,comm);
    Now: element A[j] needs to go to position AllB[A[j]-1]+RelB[A[j]]+j'
  • 47. Another example: integer (bucket) sort.
    Step 4: compute the number of elements to be sent to each other process, sendelts[i], i=0,…,p-1
    Step 5: MPI_Alltoall(sendelts,1,MPI_INT,recvelts,1,MPI_INT,comm);
    Step 6: redistribute the elements: MPI_Alltoallv(A,sendelts,sdispls,…,comm);
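A condensed sketch of steps 2-6 (the arrays B, AllB, RelB, sendelts, recvelts, sdispls, rdispls and the locally sorted array A follow the naming in the slides; their computation and allocation, and the output array Asorted, are assumed):

    /* Steps 2-3: global key counts and exclusive prefix counts over processes. */
    MPI_Allreduce(B, AllB, R, MPI_INT, MPI_SUM, comm);
    MPI_Exscan(B, RelB, R, MPI_INT, MPI_SUM, comm);  /* RelB is undefined on rank 0:
                                                        treat it as all zeros there */
    /* Steps 4-6: exchange per-destination counts, then redistribute the elements. */
    MPI_Alltoall(sendelts, 1, MPI_INT, recvelts, 1, MPI_INT, comm);
    /* sdispls/rdispls: local exclusive prefix sums of sendelts/recvelts. */
    MPI_Alltoallv(A, sendelts, sdispls, MPI_INT,
                  Asorted, recvelts, rdispls, MPI_INT, comm);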
  • 48. Another example: integer (bucket) sort. The algorithm is stable, so it can serve as the basis for radix sort. The choice of radix R depends on properties of the network (fully connected, fat tree, mesh/torus, …) and on the quality of the reduction/scan algorithms. The algorithm is portable (by virtue of the MPI collectives), but tuning depends on the system – a concrete performance model is needed, but this is outside the scope of MPI. Note: on a strong network, T(MPI_Allreduce(m)) = O(m + log p), NOT O(m log p)
  • 49. A last feature – process topologies: specify the application communication pattern (as either a directed graph or a Cartesian grid) to the MPI library, and let the library assign processes to processors so as to improve communication following the specified pattern. MPI version: collective communicator construction functions; process ranks in the new communicator represent the new (improved) mapping. And a very last feature: (simple) tool-building support – the MPI profiling interface
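A sketch of the Cartesian topology interface mentioned above (the choice of a 2-dimensional, non-periodic grid is illustrative):

    /* Let MPI factor p into a 2D grid and reorder ranks to fit the machine. */
    int dims[2] = {0, 0}, periods[2] = {0, 0};
    MPI_Comm cartcomm;
    MPI_Dims_create(p, 2, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1 /* reorder */, &cartcomm);
    /* Ranks in cartcomm reflect the (possibly improved) mapping. */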
  • 50. The mistakes:
    •MPI_Cancel(): semantically ill-defined, difficult to implement; a concession to RT?
    •MPI_Rsend(): vendors got too much leverage?
    •MPI_Pack/Unpack: added as an afterthought in the last 1994 meetings
    •Some functions enforce a full copy of argument (list)s into the library
  • 51. The mistakes:
    •MPI_Cancel(): semantically ill-defined, difficult to implement; a concession to RT? (but useful for certain patterns, e.g., double buffering, client-server-like, …)
    •MPI_Rsend(): vendors got too much leverage? (but advantageous in some scenarios)
    •MPI_Pack/Unpack: added as an afterthought in the last 1994 meetings (the functionality is useful/needed; limitations in the specification)
    •Some functions enforce a full copy of argument (list)s into the library
  • 52. Missing functionality:
    •Datatype query functions – not possible to query/reconstruct the structure specified by a given datatype
    •Some MPI objects are not first-class citizens (MPI_Aint, MPI_Op, MPI_Datatype); this makes it difficult to build certain types of libraries
    •Reductions cannot be performed locally
  • 53. Is MPI scalable? The question must distinguish between specification and implementation. Definition: an MPI construct is non-scalable if its memory or time overhead(*) is Ω(p), p the number of processes. Questions:
    •Are there aspects of the MPI specification that are non-scalable (force Ω(p) memory or time)?
    •Are there aspects of (typical) MPI implementations that are non-scalable?
    (*) cannot be accounted for in the application
  • 54. The answer is “yes” to both questions. Example: irregular collective all-to-all communication (each process exchanges some data with each other process). MPI_Alltoallw(sendbuf,sendcounts[],senddispls[],sendtypes[], recvbuf,recvcounts[],recvdispls[],recvtypes[],…) takes 6 p-sized arrays (4- or 8-byte integers) – ~5 MBytes, 10% of the memory on BlueGene/L. Sparse usage pattern: often each process exchanges with only a few neighbors, so most send/recvcounts[i]=0. MPI_Alltoallw is non-scalable
  • 55. Experiment: sendcounts[i]=0, recvcounts[i]=0 for all processes and all i – this entails no communication. [Figure: experimental results on the Argonne National Laboratory BlueGene/L.] [Balaji, …, Träff: MPI on millions of cores. Parallel Processing Letters 21(1): 45-60, 2011]
  • 56. Definitely non-scalable features in MPI 1.0:
    •Irregular collectives: p-sized lists of counts, displacements, types
    •Graph topology interface: requires specification of the full process topology (communication graph) by all processes
    (The Cartesian topology interface is perfectly scalable, and much used)
  • 57. MPI 2: what (almost) went wrong. A number of issues and desired functionality were left open by MPI 1.0, either because of
    •no agreement
    •the deadline and the desire to get a consolidated standard out in time
    Major open issues – parallel IO, one-sided communication, dynamic process management – were partly described in the so-called JOD, the „Journal of Development“ (see www.mpi-forum.org). The challenge from PVM; the challenge from SHMEM…
  • 58. The MPI Forum started to reconvene already in 1995. Between 1995 and 1997 there were 16 meetings, which led to MPI 2.0. MPI 1.0: 226 pages; MPI 2.0: 356 additional pages. Major new features, with new concepts: extended message-passing models
    1. Dynamic process management
    2. One-sided communication
    3. MPI-IO
  • 59. 1. Dynamic process management: MPI 1.0 was completely static: a communicator cannot change (design principle: no MPI object can change; new objects can be created and old ones destroyed), so the number of processes in MPI_COMM_WORLD cannot change; therefore it is not possible to add or remove processes from a running application. MPI 2.0 process management relies on inter-communicators (from MPI 1.0) to establish communication with newly started processes or already running applications:
    •MPI_Comm_spawn
    •MPI_Comm_connect/MPI_Comm_accept
    •MPI_Intercomm_merge
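A sketch of the spawn-and-merge pattern (the executable name "worker" and the count of 4 processes are assumptions):

    /* Start 4 more processes and merge the inter-communicator into one intra-communicator. */
    MPI_Comm intercomm, allcomm;
    int errcodes[4];
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &intercomm, errcodes);
    MPI_Intercomm_merge(intercomm, 0, &allcomm);   /* 0: spawning side ordered first */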
  • 60. 1. Dynamic process management. What if a process (in a communicator) dies? The fault-tolerance problem. Most (all) MPI implementations also die – but this may be an implementation issue
  • 61. 1. Dynamic process management. What if a process (in a communicator) dies? The fault-tolerance problem. If the implementation does not die, it might be possible to program around/isolate faults using MPI 1.0 error handlers and inter-communicators. [W. Gropp, E. Lusk: Fault Tolerance in Message Passing Interface Programs. IJHPCA 18(3): 363-372, 2004]
  • 62. 1. Dynamic process management. What if a process (in a communicator) dies? The fault-tolerance problem. The issue is contentious & contagious…
  • 63. 2. One-sided communication. Motivations/arguments:
    •Expressivity/convenience: for applications where only one process may readily know with which process to communicate data, the point-to-point message-passing communication model may be inconvenient
    •Performance: on some architectures point-to-point communication could be inefficient, e.g. if shared memory is available
    Challenge: define a model that captures the essence of one-sided communication, but can be implemented without requiring specific hardware support
  • 64. 2. One-sided communication. New MPI 2.0 concepts: communication window, communication epoch. The MPI one-sided model cleanly separates communication from synchronization; three specific synchronization mechanisms:
    •MPI_Win_fence
    •MPI_Win_start/complete/post/wait
    •MPI_Win_lock/unlock
    with cleverly thought-out semantics and memory model
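A sketch of the fence-synchronized (active target) model (the buffer size N, the value x, and the target rank 1 are illustrative):

    /* Expose a local buffer in a window; rank 0 writes into rank 1's buffer. */
    MPI_Win win;
    double buf[N], x = 1.0;
    MPI_Win_create(buf, N * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, comm, &win);
    MPI_Win_fence(0, win);                 /* open an access/exposure epoch */
    if (rank == 0)
        MPI_Put(&x, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                 /* close: data now visible at the target */
    MPI_Win_free(&win);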
  • 65. 2. One-sided communication. Unfortunately, application programmers did not seem to like the model:
    •“too complicated“
    •“too rigid”
    •“not efficient“
    •…
  • 66. 3. MPI-IO: communication with external (disk/file) memory. Could leverage MPI concepts and implementations:
    •Datatypes to describe file structure
    •Collective communication for utilizing local file systems
    •Fast communication
    The MPI datatype mechanism is essential, and the power of this concept starts to become clear. MPI 2.0 introduces (inelegant!) functionality to decode a datatype, i.e., to discover the structure described by a datatype. Needed for an MPI-IO implementation (on top of MPI), and supports library building
  • 67. Take note: apart from MPI-IO (ROMIO), the MPI 2.0 standardization process was not accompanied by prototype implementations. New concept (IO only): split collectives
  • 68. Not discussed: thread support/compliance, the ability of MPI to work in a threaded environment.
    •MPI 1.0: the design is (largely; exception: MPI_Probe/MPI_Recv) thread safe; recommendation that MPI implementations be thread safe (contrast: the PVM design)
    •MPI 2.0: the level of thread support can be requested and queried; an MPI library is not required to support the requested level, but returns information on the level it actually provides: MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED, MPI_THREAD_SERIALIZED, MPI_THREAD_MULTIPLE
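A sketch of how an application requests a thread support level at initialization:

    /* Ask for full multi-threaded support; the library reports what it provides. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        /* fall back, e.g. funnel all MPI calls through a single thread */
    }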
  • 69. Quiet years: 1997-2006. No standardization activity from 1997. MPI 2.0 implementations:
    •Fujitsu (claim) 1999
    •NEC 2000
    •mpich 2004
    •Open MPI 2005
    •LAM/MPI 2005(?)
    •…
    Ca. 2006 most/many implementations support mostly full MPI 2.0. Implementations evolved and improved; MPI was an interesting topic to work on, and good MPI work was/is acceptable to all parallel computing conferences (SC, IPDPS, ICPP, Euro-Par, PPoPP, SPAA). [J. L. Träff, H. Ritzdorf, R. Hempel: The Implementation of MPI-2 One-Sided Communication for the NEC SX-5. SC 2000]
  • 70. The Euro(PVM/)MPI conference series: dedicated to MPI, biased towards MPI implementation (the timeline also marks MPI Forum meetings):
    1994 Rome, Italy (EuroPVM); 1995 Lyon, France; 1996 Munich, Germany; 1997 Cracow, Poland (now EuroPVM/MPI); 1998 Liverpool, UK; 1999 Barcelona, Spain; 2000 Balatonfüred, Hungary; 2001 Santorini, Greece (9.11 – did not actually take place); 2002 Linz, Austria; 2003 Venice, Italy; 2004 Budapest, Hungary; 2005 Sorrento, Italy; 2006 Bonn, Germany; 2007 Paris, France; 2008 Dublin, Ireland; 2009 Helsinki, Finland; 2010 Stuttgart, Germany (EuroMPI, no longer PVM); 2011 Santorini, Greece; 2012 Wien, Austria
  • 71. Bonn 2006: discussions („Open Forum“) on restarting the MPI Forum. [Thanks to Xavier Vigouroux, Vienna 2012]
  • 72. The MPI 2.2 – MPI 3.0 process. Late 2007: the MPI Forum reconvenes – again. Consolidate the standard: MPI 1.2 and MPI 2.0 into a single standard document, MPI 2.1 (Sept. 4th, 2008). MPI 2.2 as an intermediate step towards 3.0 (document sizes: 586 and 623 pages):
    •Address scalability problems
    •Missing functionality
    •BUT preserve backwards compatibility
  • 73. Some MPI 2.2 features:
    •Addressing scalability problems: new topology interface, the application communication graph is specified in a distributed fashion
    •Library building: MPI_Reduce_local
    •Missing function: the regular MPI_Reduce_scatter_block
    •More flexible MPI_Comm_create (more in MPI 3.0: MPI_Comm_split_type)
    •New datatypes, e.g. MPI_AINT
    [T. Hoefler, R. Rabenseifner, H. Ritzdorf, B. R. de Supinski, R. Thakur, J. L. Träff: The scalable process topology interface of MPI 2.2. Concurrency and Computation: Practice and Experience 23(4): 293-310, 2011]
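A sketch of the distributed graph constructor from the new topology interface, where each process declares only its own neighbors (the neighbor ranks left and right are illustrative assumptions):

    /* Each process specifies only its local part of the communication graph. */
    MPI_Comm graphcomm;
    int sources[2]      = { left, right };
    int destinations[2] = { left, right };
    MPI_Dist_graph_create_adjacent(comm, 2, sources, MPI_UNWEIGHTED,
                                   2, destinations, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 1 /* reorder */, &graphcomm);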
  • 74. Some MPI 2.2 features. The C++ bindings (introduced with MPI 2.0) are deprecated, with the intention that they will be removed. MPI_Op and MPI_Datatype are still not first-class citizens (datatype support is weak and cumbersome). Fortran bindings modernized and corrected
  • 75. The MPI 2.1, 2.2, and 3.0 process:
    •6 meetings 2012
    •6 meetings 2011
    •7 meetings 2010
    •6 meetings 2009
    •7 meetings 2008
    Total: 32 meetings (and counting…). Recall: MPI 1: 7 meetings; MPI 2.0: 16 meetings. MPI Forum rules: presence at physical meetings, with a history (presence at the past two meetings), is required to vote. Requirement: new functionality must be supported by a use case and a prototype implementation; backwards compatibility is not strict
  • 76. (no text on this slide)
  • 77. The MPI 2.2 – MPI 3.0 process had working groups on:
    •Collective Operations
    •Fault Tolerance
    •Fortran bindings
    •Generalized requests ("on hold")
    •Hybrid Programming
    •Point to point (this working group is "on hold")
    •Remote Memory Access
    •Tools
    •MPI subsetting ("on hold")
    •Backward Compatibility
    •Miscellaneous Items
    •Persistence
  • 78. MPI 3.0: new features, new themes, new opportunities. Major new functionalities:
    1. Non-blocking collectives
    2. Sparse collectives
    3. New one-sided communication
    4. Performance tool support
    Deprecated functions removed: the C++ interface is gone. MPI 3.0, 21. September 2012: 822 pages. Implementation status: mpich should cover MPI 3.0. Performance/quality?
  • 79. 1. Non-blocking collectives. Introduced for performance (overlap) and convenience reasons. Similar to the non-blocking point-to-point routines; an MPI_Request object to check and enforce progress. Sound semantics based on ordering, no tags. Different from point-to-point (with good reason): blocking and non-blocking collectives do not mix and match – MPI_Ibcast() is incorrect with MPI_Bcast(). Incomplete: non-blocking versions for some other collective operations (MPI_Comm_idup). Non-orthogonal: split and non-blocking collectives
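A sketch of a non-blocking broadcast overlapped with independent local work (buf, count, the root rank 0 and comm are assumptions):

    MPI_Request req;
    MPI_Ibcast(buf, count, MPI_DOUBLE, 0, comm, &req);
    /* ... local work that does not touch buf ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);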
  • 80. 2. Sparse collectives. Address the scalability problem of irregular collectives. The neighborhood is specified with the topology functionality:
    MPI_Neighbor_allgather(…,comm);
    MPI_Neighbor_allgatherv(…,comm);
    MPI_Neighbor_alltoall(…,comm);
    MPI_Neighbor_alltoallv(…,comm);
    MPI_Neighbor_alltoallw(…,comm);
    and corresponding non-blocking versions. Will users take this up? Optimization potential? [T. Hoefler, J. L. Träff: Sparse collective operations for MPI. IPDPS 2009]
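A sketch of a neighborhood collective on a communicator with a 2D Cartesian topology (cartcomm as in the topology sketch above; the block size is illustrative):

    /* Exchange one block with each of the 2*ndims = 4 topological neighbors. */
    enum { BLOCK = 128 };
    double sendblock[BLOCK], recvblocks[4 * BLOCK];
    MPI_Neighbor_allgather(sendblock, BLOCK, MPI_DOUBLE,
                           recvblocks, BLOCK, MPI_DOUBLE, cartcomm);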
  • 81. 3. One-sided communication. Model extension for better performance on hybrid/shared-memory systems. Atomic operations (lacking in the MPI 2.0 model). Per-operation local completion, MPI_Rget, MPI_Rput, … (but only for passive synchronization). [Hoefler, Dinan, Buntinas, Balaji, Barrett, Brightwell, Gropp, Kale, Thakur: Leveraging MPI's one-sided communication for shared-memory programming. EuroMPI 2012, LNCS 7490, 133-141, 2012]
  • 82. 4. Performance tool support. The MPI 1.0 problem of allowing only one profiling interface at a time (linker interception of MPI calls) is NOT solved. Functionality added to query certain internals of the MPI library. Will tool writers take this up?
  • 83. MPI at a turning point. The extremely large-scale systems now appearing stretch the scalability of MPI. Is MPI for exascale? For systems that are heterogeneous? memory-constrained? of low bisection width? unreliable?
  • 84. The MPI Forum at a turning point. Is attendance large enough? Is attendance broad enough? The MPI 2.1–MPI 3.0 process has been long and exhausting; attendance driven by implementers, with relatively little input from users and applications; non-technical goals have played a role; research was conducted that did not lead to a useful outcome for the standard (fault tolerance, thread/hybrid support, persistence, …). Perhaps time to take a break? More meetings, smaller attendance
  • 85. The recent votes (MPI 2.1 to MPI 3.0); see meetings.mpi-forum.org/secretary/
    •MPI 2.1, first vote: YES 22, NO 0, ABSTAIN 0, MISSED 0 – passed, June 30-July 2, 2008
    •MPI 2.1, second vote: YES 23, NO 0, ABSTAIN 0, MISSED 0 – passed
    •MPI 1.3, second vote: YES 22, NO 1, ABSTAIN 0, MISSED 0 – passed, September 3-5, 2008
    •MPI 2.2: YES 24, NO 1, ABSTAIN 0, MISSED 0 – passed, September 2-4, 2009
    •MPI 3.0: YES 17, NO/ABSTAIN 0/0 – passed, September 20-21, 2012
    Change/discussion of rules
  • 86. Summary: study history and learn from it: how to do better than MPI. Standardization is a major effort; it has taken a lot of dedication and effort from a relatively large (but declining?) group of people and institutions/companies. MPI 3.0 will raise many new implementation challenges. MPI 3.0 is not the end of the (hi)story. Thanks to the MPI Forum; discussions with Bill Gropp, Rusty Lusk, Rajeev Thakur, Jeff Squyres, and others