1. Pregel: A System for Large-Scale
Graph Processing
Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik,
James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski
In Proceedings of the 2010 ACM SIGMOD International
Conference on Management of Data (pp. 135-146). ACM
2. Source: SIGMETRICS ’09 Tutorial – MapReduce: The Programming Model and Practice, by Jerry Zhao
3. Outline
• Introduction
• Computation Model
• Writing a Pregel Program
• System Implementation
• Applications
• Experiments
• Related Work
• Conclusion & Future Work
4. The Problem
• Many practical computing problems concern large
graphs.
Large graph data: Web graph, transportation routes,
citation relationships, social networks.
Graph algorithms: PageRank, shortest path,
connected components, clustering techniques.
• Efficient processing of large graphs is challenging:
Poor locality of memory access
Very little work per vertex
Changing degree of parallelism
Running over many machines makes the problem worse
5. Want to Process a Large Scale Graph? The Options:
1. Crafting a custom distributed infrastructure.
Substantial engineering effort.
2. Relying on an existing distributed platform, e.g.,
MapReduce.
Inefficient: the graph state must be stored and reloaded at
each stage, causing too much communication between stages.
3. Using a single-computer graph algorithm library.
Not scalable.
4. Using an existing parallel graph system.
Not fault tolerant.
6. Pregel
• To overcome these challenges, Google came up with
Pregel.
Provides scalability
Fault-tolerance
Flexibility to express arbitrary algorithms
• The high level organization of Pregel programs is
inspired by Valiant’s Bulk Synchronous Parallel
model [45].
[45] Leslie G. Valiant, A Bridging Model for Parallel Computation. Comm. ACM 33(8), 1990
7. Bulk Synchronous Parallel
(Diagram: Input → a series of supersteps → all vote to halt → Output)
• A series of iterations (supersteps).
• Each vertex V invokes a function in parallel.
• It can read messages sent in the previous superstep (S-1).
• It can send messages, to be read in the next superstep (S+1).
• It can modify the state of its outgoing edges.
8. Advantages of the Vertex-Centric Approach
• Users focus on a local action,
processing each item independently.
• This ensures that Pregel programs are inherently free of
deadlocks and data races common in asynchronous
systems.
9. Outline
• Introduction
• Computation Model
• Writing a Pregel Program
• System Implementation
• Applications
• Experiments
• Related Work
• Conclusion & Future Work
10. Model of Computation
(Diagram: Input → supersteps run at every vertex → all vote to halt → Output)
• A directed graph is given to Pregel.
• It runs the computation at each vertex,
• until all vertices vote to halt.
• Then Pregel gives you a directed graph back.
11. Vertex State Machine
• Algorithm termination is based on every vertex
voting to halt.
• In superstep 0, every vertex is in the active state.
• A vertex deactivates itself by voting to halt.
• It can be reactivated by receiving an (external)
message.
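This active/halted life cycle can be sketched as a toy state machine (an illustrative sketch with made-up names, not the actual Pregel implementation):

```cpp
#include <cassert>

// Toy model of the vertex state machine described above: a vertex
// starts active (superstep 0), deactivates by voting to halt, and is
// reactivated when an external message arrives.
class ToyVertexState {
 public:
  ToyVertexState() : active_(true) {}           // superstep 0: active
  void VoteToHalt() { active_ = false; }        // vertex deactivates itself
  void OnMessageReceived() { active_ = true; }  // external message reactivates
  bool IsActive() const { return active_; }

 private:
  bool active_;
};
```

The algorithm as a whole terminates once every vertex is simultaneously in the halted state with no messages in flight.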
13. Outline
• Introduction
• Computation Model
• Writing a Pregel Program
• System Implementation
• Applications
• Experiments
• Related Work
• Conclusion & Future Work
14. The C++ API
• Users subclass the predefined Vertex class and override
its Compute() method.
Compute(): executed at each active vertex in every superstep.
• Can get/set the vertex value:
GetValue() / MutableValue()
• Can get/set outgoing edge values:
GetOutEdgeIterator()
• Can send/receive messages:
SendMessageTo() / incoming messages arrive via Compute()
15. The C++ API – Vertex Class
(Figure: the Vertex class template takes three value types: the vertex
value, the edge value, and the message value. Compute() is the method
to override; incoming messages arrive via an iterator, and outgoing
messages are MessageValue instances.)
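A simplified sketch of the shape the figure shows (the member layout and the vector-based message delivery here are stand-ins; the real API passes a MessageIterator and is distributed, not in-memory):

```cpp
#include <cassert>
#include <vector>

// Sketch of the three-way templated Vertex base class: VertexValue,
// EdgeValue, and MessageValue are the three user-chosen value types,
// and Compute() is the method users override.
template <typename VertexValue, typename EdgeValue, typename MessageValue>
class SketchVertex {
 public:
  virtual ~SketchVertex() {}
  // Called at each active vertex in every superstep, with the messages
  // sent to this vertex in the previous superstep.
  virtual void Compute(const std::vector<MessageValue>& in_msgs) = 0;

  const VertexValue& GetValue() const { return value_; }
  VertexValue* MutableValue() { return &value_; }
  void VoteToHalt() { active_ = false; }
  bool IsActive() const { return active_; }

 protected:
  VertexValue value_ = VertexValue();
  bool active_ = true;
};

// Example subclass: sums incoming int messages into the vertex value,
// and votes to halt once no messages arrive.
class SumVertex : public SketchVertex<int, int, int> {
 public:
  void Compute(const std::vector<int>& in_msgs) override {
    for (int m : in_msgs) *MutableValue() += m;
    if (in_msgs.empty()) VoteToHalt();
  }
};
```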
16. The C++ API
Message passing:
• No guaranteed message delivery order.
• Messages are delivered exactly once.
• Can send messages to any node.
• If dest_vertex doesn’t exist, a user-defined handler is
called.
void SendMessageTo(const string& dest_vertex,
const MessageValue& message);
17. The C++ API
Combiners (not active by default):
• Sending a message to another vertex that exists on a
different machine has some overhead.
• The user specifies a way to reduce many messages into
one value (like Reduce in MapReduce)
by overriding the Combine() method.
Must be commutative and associative.
• Exceedingly useful in certain contexts (e.g., 4x
speedup on shortest-path computation).
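In a shortest-path computation, for instance, only the minimum of the incoming distances matters, so all messages bound for one vertex can be collapsed into a single value. A toy sketch of such a commutative, associative combine step (the real API overrides Combine() on a Combiner class; this standalone function is only illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Reduce many distance messages headed for the same vertex to their
// minimum before sending, cutting network traffic. min is commutative
// and associative, so messages can be combined in any order or grouping.
int CombineMin(const std::vector<int>& msgs) {
  int combined = msgs.front();
  for (int m : msgs) combined = std::min(combined, m);
  return combined;
}
```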
18. The C++ API
Aggregators:
• A mechanism for global communication, monitoring,
and data.
Each vertex can produce a value in a superstep S for the
Aggregator to use.
The Aggregated value is available to all the vertices in
superstep S+1.
• Aggregators can be used for statistics and for global
communication.
E.g., Sum applied to the out-edge count of each vertex
generates the total number of edges in the graph and
communicates it to all the vertices.
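The out-edge example can be sketched like this: vertices supply values during superstep S, and the published sum becomes readable by every vertex in superstep S+1 (an illustrative toy, not the Pregel Aggregator API):

```cpp
#include <cassert>

// Toy sum aggregator: values supplied during superstep S are only
// published at the superstep barrier, so every vertex reads the same
// aggregate in superstep S+1.
class ToySumAggregator {
 public:
  void Supply(long v) { pending_ += v; }   // called during superstep S
  void EndSuperstep() {                    // the barrier publishes the sum
    published_ = pending_;
    pending_ = 0;
  }
  long Read() const { return published_; } // visible in superstep S+1

 private:
  long pending_ = 0;
  long published_ = 0;
};
```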
19. The C++ API
Topology mutations:
• Some graph algorithms need to change the graph's
topology.
E.g., a clustering algorithm may need to replace a cluster
with a single vertex.
• Vertices can create / destroy vertices at will.
• Resolving conflicting requests:
Partial ordering:
edge removal, then vertex removal, then vertex addition,
then edge addition.
User-defined handlers:
you resolve the conflicts yourself.
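The partial ordering amounts to applying a superstep's pending mutation requests so that edge removals run first and edge additions last (a sketch of the ordering rule only; the names and the sort-based application are illustrative, not Pregel's implementation):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// The slide's partial order: E Remove < V Remove < V Add < E Add,
// encoded as ascending enum values.
enum Mutation { kEdgeRemove = 0, kVertexRemove = 1, kVertexAdd = 2, kEdgeAdd = 3 };

// Order pending mutations for application; stable_sort preserves the
// arrival order of requests of the same kind.
std::vector<Mutation> ApplyOrder(std::vector<Mutation> pending) {
  std::stable_sort(pending.begin(), pending.end());
  return pending;
}
```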
20. The C++ API
Input and output:
• Readers/Writers are provided for common formats:
Text files
Vertices in a relational database
Rows in BigTable
• Users can customize Readers/Writers for new
inputs/outputs
by subclassing the Reader/Writer classes.
21. Outline
• Introduction
• Computation Model
• Writing a Pregel Program
• System Implementation
• Applications
• Experiments
• Related Work
• Conclusion & Future Work
22. Implementation
• Pregel was designed for the Google cluster
architecture.
• Persistent data is stored as files on a distributed
storage system like GFS or BigTable.
• Temporary data is stored on local disk.
• Vertices are assigned to machines based on their
vertex ID ( hash(ID) ), so every machine can determine
which worker holds any given vertex.
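Because the assignment is purely a function of the vertex ID, no lookup table is needed; a sketch might look like this (taking the hash modulo the number of partitions is an assumption about the unspecified hash scheme):

```cpp
#include <cassert>
#include <functional>
#include <string>

// Default-style partitioning: partition = hash(ID) mod N. The mapping
// depends only on the ID, so any machine can locate any vertex locally.
int PartitionFor(const std::string& vertex_id, int num_partitions) {
  return static_cast<int>(std::hash<std::string>{}(vertex_id) %
                          static_cast<size_t>(num_partitions));
}
```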
23. System Architecture
• Executable is copied to many machines.
• One machine becomes the Master.
Maintains the list of workers.
Recovers from worker faults.
Provides a Web-UI monitoring tool for job progress.
• The other machines become Workers.
Each processes its assigned partitions.
Each communicates with the other workers.
24. Pregel Execution
1. User programs are copied on machines.
2. One machine becomes the master.
The other machines find the master via a name service and
register themselves with it.
The master determines how many partitions the graph will have.
3. The master assigns one or more partitions and a
portion of the user input to each worker.
4. The workers run the Compute() function for active
vertices and send messages asynchronously.
There is one thread per partition in each worker.
When a superstep finishes, each worker tells the master how
many of its vertices will be active in the next superstep.
26. Fault Tolerance
• Checkpointing
The master periodically instructs the workers to save the
state of their partitions to persistent storage.
e.g., Vertex values, edge values, incoming messages.
• Failure detection
Using regular “ping” messages.
• Recovery
The master reassigns graph partitions to the currently
available workers.
All workers reload their partition state from the most
recent available checkpoint.
27. Outline
• Introduction
• Computation Model
• Writing a Pregel Program
• System Implementation
• Applications
• Experiments
• Related Work
• Conclusion & Future Work
28. Application – Page Rank
• A = A given page
• T1 …. Tn = Pages that point to page A (citations)
• d = Damping factor between 0 and 1 (usually set to
0.85)
• C(T) = number of links going out of T
• PR(A) = the PageRank of page A
PR(A) = (1 - d) + d ( PR(T1)/C(T1) + PR(T2)/C(T2) + ... + PR(Tn)/C(Tn) )
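As a worked check of the formula, take a two-page cycle where A links only to B and B links only to A: then C(A) = C(B) = 1, and the fixed point is PR(A) = PR(B) = 1. A small sketch iterating the formula (illustrative only, not Pregel code):

```cpp
#include <cassert>
#include <cmath>

// Iterate PR(A) = (1 - d) + d * PR(B)/C(B) for the two-page cycle
// A -> B -> A, where C(A) = C(B) = 1 and d = 0.85.
void IteratePageRank(double* pr_a, double* pr_b, int iters, double d = 0.85) {
  for (int i = 0; i < iters; ++i) {
    double next_a = (1.0 - d) + d * (*pr_b);  // B's only link feeds A
    double next_b = (1.0 - d) + d * (*pr_a);  // A's only link feeds B
    *pr_a = next_a;
    *pr_b = next_b;
  }
}
```

Since d < 1, the error shrinks by a factor of d each iteration, so the iteration converges to the fixed point from any starting values.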
30. Application – Page Rank
class PageRankVertex
    : public Vertex<double, void, double> {
 public:
  virtual void Compute(MessageIterator* msgs) {
    if (superstep() >= 1) {
      double sum = 0;
      for (; !msgs->Done(); msgs->Next())
        sum += msgs->Value();
      // Store and carry the PageRank in the vertex value.
      *MutableValue() = 0.15 / NumVertices() + 0.85 * sum;
    }
    if (superstep() < 30) {
      const int64 n = GetOutEdgeIterator().size();
      SendMessageToAllNeighbors(GetValue() / n);
    } else {
      VoteToHalt();
    }
  }
};

For convergence, either there is a limit on the number of supersteps
or aggregators are used to detect convergence.
31. Application – Shortest Path
class ShortestPathVertex
    : public Vertex<int, int, int> {
  void Compute(MessageIterator* msgs) {
    // INF: a constant larger than any feasible distance.
    int mindist = IsSource(vertex_id()) ? 0 : INF;
    for (; !msgs->Done(); msgs->Next())
      mindist = min(mindist, msgs->Value());
    // In the 1st superstep, only the source vertex will update its
    // value (from INF to zero).
    if (mindist < GetValue()) {
      *MutableValue() = mindist;
      OutEdgeIterator iter = GetOutEdgeIterator();
      for (; !iter.Done(); iter.Next())
        SendMessageTo(iter.Target(), mindist + iter.GetValue());
    }
    VoteToHalt();
  }
};
41. Outline
• Introduction
• Computation Model
• Writing a Pregel Program
• System Implementation
• Applications
• Experiments
• Related Work
• Conclusion & Future Work
42. Experiments
• 300 multicore commodity PCs used.
• Only running time is counted.
Checkpointing disabled.
• Measures scalability of Worker tasks.
• Measures scalability w.r.t. the number of vertices
in binary trees and log-normal random graphs.
• Naïve single-source shortest paths (SSSP)
implementation.
The weight of all edges = 1
43. SSSP - 1 billion vertex binary tree:
# of Pregel workers varies from 50 to 800
Runtime drops from 174 s with 50 workers to 17.3 s with 800 workers:
16× the workers gives a speedup of about 10.
44. SSSP – binary trees:
varying graph sizes on 800 worker tasks
Runtimes range from 17.3 s to 702 s.
For a graph with a low average outdegree, the runtime
increases linearly in the graph size.
45. SSSP – log-normal random graphs (mean outdegree = 127.1):
varying graph sizes on 800 worker tasks
The runtime increases linearly in the graph size, too.
46. Outline
• Introduction
• Computation Model
• Writing a Pregel Program
• System Implementation
• Applications
• Experiments
• Related Work
• Conclusion & Future Work
47. Related Work
• MapReduce
Pregel is similar in concept to MapReduce, but with a
natural graph API and much more efficient support for
iterative computations over the graph.
• Bulk Synchronous Parallel model
the Oxford BSP Library[38], Green BSP library[21], BSPlib[26]
and Paderborn University BSP library.
Their scalability and fault-tolerance implementations have not
been evaluated beyond several dozen machines,
and none of them provides a graph-specific API.
48. Related Work
• The closest matches to Pregel are:
Parallel Boost Graph Library [22],[23]
unlike it, Pregel provides fault tolerance.
CGMgraph [8]
unlike it, Pregel keeps an object-oriented programming style,
at some performance cost.
• Few systems have reported experimental results for graphs
at the scale of billions of vertices.
49. Outline
• Introduction
• Computation Model
• Writing a Pregel Program
• System Implementation
• Applications
• Experiments
• Related Work
• Conclusion & Future Work
50. Conclusion & Future Work
• Pregel is a scalable and fault-tolerant platform with
an API that is sufficiently flexible to express arbitrary
graph algorithms.
• Future work
Relaxing the synchronicity of the model,
so as not to wait for slower workers at inter-superstep barriers.
Assigning vertices to machines to minimize inter-machine
communication.
Handling dense graphs in which most vertices send messages
to most other vertices.
51. Comment
• No comparison with other systems.
• Users have to modify Pregel substantially in order to
tailor it to their needs.
• No failure detection is mentioned for the master,
making it a single point of failure.
Each cluster consists of thousands of commodity PCs organized into racks with high intra-rack bandwidth. Clusters are interconnected but distributed geographically.