GraphX is a graph processing framework built into Apache Spark. This talk introduces GraphX, describes key features of its API, and gives an update on its status.
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
1. GraphX: Graph Analytics in Spark
Ankur Dave
Graduate Student, UC Berkeley AMPLab
Joint work with Joseph Gonzalez, Reynold Xin, Daniel Crankshaw, Michael Franklin, and Ion Stoica
15. Modern Analytics
[Figure: Wikipedia analysis pipeline. Stages shown: Raw Wikipedia (XML), Hyperlinks, Link Table (Title, Link), PageRank, Top 20 Pages (Title, PR), Editor Table (Editor, Title), Editor Graph, Community Detection, User Community (User, Com.), Top Communities (Com., PR).]
16. Tables
[Figure: the Wikipedia analysis pipeline of slide 15, repeated under the label "Tables". Tables shown: Link Table (Title, Link), Editor Table (Editor, Title), User Community (User, Com.), Top 20 Pages (Title, PR), Top Communities (Com., PR).]
17. Graphs
[Figure: the Wikipedia analysis pipeline of slide 15, repeated under the label "Graphs". Graph stages shown: Hyperlinks, PageRank, Editor Graph, Community Detection.]
19. Property Graphs
Vertex Property:
• User Profile
• Current PageRank Value
Edge Property:
• Weights
• Relationships
• Timestamps
20. Creating a Graph (Scala)
type VertexId = Long
val vertices: RDD[(VertexId, String)] =
  sc.parallelize(List(
    (1L, "Alice"),
    (2L, "Bob"),
    (3L, "Charlie")))
class Edge[ED](
  val srcId: VertexId,
  val dstId: VertexId,
  val attr: ED)
val edges: RDD[Edge[String]] =
  sc.parallelize(List(
    Edge(1L, 2L, "coworker"),
    Edge(2L, 3L, "friend")))
val graph = Graph(vertices, edges)
Graph: vertices 1 (Alice), 2 (Bob), 3 (Charlie); edges 1 → 2 (coworker), 2 → 3 (friend)
23. The triplets view
class Graph[VD, ED] {
  def triplets: RDD[EdgeTriplet[VD, ED]]
}
class EdgeTriplet[VD, ED](
  val srcId: VertexId,
  val dstId: VertexId,
  val attr: ED,
  val srcAttr: VD,
  val dstAttr: VD)
Graph: vertices 1 (Alice), 2 (Bob), 3 (Charlie); edges coworker, friend
triplets (RDD):
srcAttr   attr       dstAttr
Alice     coworker   Bob
Bob       friend     Charlie
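The triplets view can be read as a join of each edge with the attributes of its two endpoints. The following is a minimal sketch of that idea using plain Scala collections in place of RDDs, so it runs without Spark. The case classes are simplified stand-ins for the GraphX classes on the slide, and the object name `TripletsSketch` is hypothetical.

```scala
object TripletsSketch {
  type VertexId = Long

  // Simplified stand-ins for the GraphX classes shown on the slide.
  case class Edge[ED](srcId: VertexId, dstId: VertexId, attr: ED)
  case class EdgeTriplet[VD, ED](
      srcId: VertexId, dstId: VertexId, attr: ED, srcAttr: VD, dstAttr: VD)

  // Join every edge with the attributes of its source and destination vertex.
  def triplets[VD, ED](
      vertices: Seq[(VertexId, VD)],
      edges: Seq[Edge[ED]]): Seq[EdgeTriplet[VD, ED]] = {
    val attrOf = vertices.toMap
    edges.map(e =>
      EdgeTriplet(e.srcId, e.dstId, e.attr, attrOf(e.srcId), attrOf(e.dstId)))
  }

  // The example graph from the "Creating a Graph" slide.
  val vertices = Seq((1L, "Alice"), (2L, "Bob"), (3L, "Charlie"))
  val edges = Seq(Edge(1L, 2L, "coworker"), Edge(2L, 3L, "friend"))

  def main(args: Array[String]): Unit =
    triplets(vertices, edges).foreach { t =>
      println(s"${t.srcAttr} ${t.attr} ${t.dstAttr}")
    }
}
```

Running `main` prints one line per edge, with the source name, the edge attribute, and the destination name, which mirrors the table on the slide.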
24. The subgraph transformation
class Graph[VD, ED] {
  def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,
               vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
}
graph.subgraph(epred = (edge) => edge.attr != "relative")
Input graph: vertices Alice, Bob, Charlie, David; edges coworker, friend, relative, relative
Result of subgraph: vertices Alice, Bob, Charlie, David; edges coworker, friend
25. The subgraph transformation
class Graph[VD, ED] {
  def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,
               vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
}
graph.subgraph(vpred = (id, name) => name != "Bob")
Input graph: vertices Alice, Bob, Charlie, David; edges coworker, friend, relative, relative
Result of subgraph: vertices Alice, Charlie, David; edges relative, relative
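The two subgraph slides suggest the rule that an edge survives only if it passes the edge predicate and both of its endpoints pass the vertex predicate. Below is a minimal sketch of that rule on plain Scala collections rather than RDDs. The `Graph` class here is a toy stand-in, not the GraphX class. The ID 4 for David, and the endpoints and directions of the two "relative" edges, are assumptions chosen to match the slides' diagrams.

```scala
object SubgraphSketch {
  type VertexId = Long

  case class Edge[ED](srcId: VertexId, dstId: VertexId, attr: ED)
  case class EdgeTriplet[VD, ED](
      srcId: VertexId, dstId: VertexId, attr: ED, srcAttr: VD, dstAttr: VD)

  // Toy in-memory graph; a stand-in for the distributed GraphX Graph.
  case class Graph[VD, ED](vertices: Map[VertexId, VD], edges: Seq[Edge[ED]]) {
    def subgraph(
        epred: EdgeTriplet[VD, ED] => Boolean = (t: EdgeTriplet[VD, ED]) => true,
        vpred: (VertexId, VD) => Boolean = (id: VertexId, attr: VD) => true
    ): Graph[VD, ED] = {
      // Keep only the vertices that satisfy the vertex predicate.
      val kept = vertices.filter { case (id, attr) => vpred(id, attr) }
      // Keep an edge only if both endpoints survived and the edge predicate holds.
      val keptEdges = edges.filter { e =>
        kept.contains(e.srcId) && kept.contains(e.dstId) &&
        epred(EdgeTriplet(e.srcId, e.dstId, e.attr, kept(e.srcId), kept(e.dstId)))
      }
      Graph(kept, keptEdges)
    }
  }

  // Assumed layout of the four-person example graph.
  val graph = Graph(
    Map(1L -> "Alice", 2L -> "Bob", 3L -> "Charlie", 4L -> "David"),
    Seq(Edge(1L, 2L, "coworker"), Edge(2L, 3L, "friend"),
        Edge(1L, 3L, "relative"), Edge(3L, 4L, "relative")))

  // Slide 24: drop the "relative" edges, keep every vertex.
  val noRelatives = graph.subgraph(epred = edge => edge.attr != "relative")
  // Slide 25: drop Bob, which also drops the edges touching him.
  val noBob = graph.subgraph(vpred = (id, name) => name != "Bob")
}
```

Under these assumptions, `noRelatives` keeps all four vertices with only the coworker and friend edges, and `noBob` keeps Alice, Charlie, and David with the two relative edges.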
26. Computation with mapReduceTriplets
(upgrade to aggregateMessages in Spark 1.2.0)
class Graph[VD, ED] {
  def mapReduceTriplets[A](
    sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
    mergeMsg: (A, A) => A): RDD[(VertexId, A)]
}
graph.mapReduceTriplets(
  edge => Iterator(
    (edge.srcId, 1),
    (edge.dstId, 1)),
  _ + _)
Input graph: vertices Alice, Bob, Charlie, David; edges coworker, friend, relative, relative
Result of mapReduceTriplets (RDD):
vertex id   degree
Alice       2
Bob         2
Charlie     3
David       1
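The degree computation above can be understood as two phases: apply sendMsg to every triplet to emit (vertex ID, message) pairs, then combine the messages addressed to the same vertex with mergeMsg. The sketch below models those two phases on plain Scala collections. It is not the GraphX implementation, and the edge layout and the ID 4 for David are assumptions chosen to reproduce the degrees on the slide.

```scala
object MapReduceTripletsSketch {
  type VertexId = Long

  case class EdgeTriplet[VD, ED](
      srcId: VertexId, dstId: VertexId, attr: ED, srcAttr: VD, dstAttr: VD)

  // Collection-based model of the operator: map each triplet to messages,
  // group the messages by target vertex, then reduce each group.
  def mapReduceTriplets[VD, ED, A](
      triplets: Seq[EdgeTriplet[VD, ED]],
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A): Map[VertexId, A] =
    triplets
      .flatMap(t => sendMsg(t))
      .groupBy(_._1)
      .map { case (id, msgs) => id -> msgs.map(_._2).reduce(mergeMsg) }

  // Assumed layout of the four-person example graph.
  val triplets = Seq(
    EdgeTriplet(1L, 2L, "coworker", "Alice", "Bob"),
    EdgeTriplet(2L, 3L, "friend", "Bob", "Charlie"),
    EdgeTriplet(1L, 3L, "relative", "Alice", "Charlie"),
    EdgeTriplet(3L, 4L, "relative", "Charlie", "David"))

  // Each edge sends a 1 to both endpoints; summing the 1s gives the degree.
  val degrees: Map[VertexId, Int] = mapReduceTriplets[String, String, Int](
    triplets,
    edge => Iterator((edge.srcId, 1), (edge.dstId, 1)),
    _ + _)
}
```

With this layout, `degrees` maps vertex 1 (Alice) to 2, 2 (Bob) to 2, 3 (Charlie) to 3, and 4 (David) to 1, matching the table on the slide. Vertices that receive no message would simply be absent from the result.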
28. Encoding Property Graphs as RDDs
[Figure: a property graph on vertices A to F is split by a vertex cut across Machine 1 and Machine 2 (Part. 1 and Part. 2) and stored as three RDDs.]
Vertex Table (RDD): A, B, C, D, E, F
Edge Table (RDD): A B, A C, B C, C D, A E, A F, E D, E F
Routing Table (RDD): A → 1, 2; B → 1; C → 1; D → 1, 2; E → 2; F → 2
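One way to read the routing table is as an index from each vertex to the edge partitions that mention it, which tells the system where a vertex's attribute has to be shipped. The sketch below derives such a table from a partitioned edge list using plain Scala collections. Assigning the first four edges to partition 1 and the last four to partition 2 is an assumption read off the figure, and the names `RoutingTableSketch` and `routingTable` are hypothetical.

```scala
object RoutingTableSketch {
  type PartitionId = Int

  // Edge table split into two partitions (assumed assignment).
  val edgePartitions: Map[PartitionId, Seq[(String, String)]] = Map(
    1 -> Seq(("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")),
    2 -> Seq(("A", "E"), ("A", "F"), ("E", "D"), ("E", "F")))

  // For every vertex, collect the set of edge partitions that reference it.
  def routingTable(
      parts: Map[PartitionId, Seq[(String, String)]]): Map[String, Set[PartitionId]] =
    parts.toSeq
      .flatMap { case (pid, edges) =>
        edges.flatMap { case (src, dst) => Seq(src -> pid, dst -> pid) }
      }
      .groupBy(_._1)
      .map { case (vertex, pairs) => vertex -> pairs.map(_._2).toSet }

  // Vertices referenced from more than one partition are the ones the cut splits.
  def cutVertices(table: Map[String, Set[PartitionId]]): Set[String] =
    table.collect { case (v, pids) if pids.size > 1 => v }.toSet
}
```

Under the assumed assignment, only A and D are referenced from both partitions, so those are the two vertices whose attributes would need to be mirrored on both machines.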
29. Graph System Optimizations
• Specialized Data-Structures
• Vertex-Cuts Partitioning
• Remote Caching / Mirroring
• Message Combiners
• Active Set Tracking
31. Future of GraphX
1. Language support
a) Java API: PR #3234
b) Python API: collaborating with Intel, SPARK-3789
2. More algorithms
a) LDA (topic modeling): PR #2388
b) Correlation clustering
c) Your algorithm here?
3. Speculative
a) Streaming/time-varying graphs
b) Graph database–like queries
32. Try GraphX Today
[Figure: Wikipedia pipeline excerpt. Stages shown: Raw Wikipedia (XML), Text Table (Title, Body), Berkeley subgraph, Hyperlinks, PageRank, Top 20 Pages (Title, PR).]