SlideShare une entreprise Scribd logo
1  sur  33
Télécharger pour lire hors ligne
UC 
BERKELEY 
GraphX 
Graph Analytics in Spark 
Ankur Dave 
Graduate Student, UC Berkeley AMPLab 
Joint work with Joseph Gonzalez, Reynold Xin, Daniel 
Crankshaw, Michael Franklin, and Ion Stoica
Graphs are Everywhere
Social Networks
Web Graphs
Web Graphs 
“The political blogosphere and the 2004 U.S. election: divided they blog.” 
Adamic and Glance, 2005.
User-Item Graphs 
⋆⋆⋆⋆⋆ 
⋆ 
⋆⋆⋆⋆ 
⋆⋆⋆⋆ 
⋆⋆⋆⋆⋆
Graph Algorithms
PageRank
Triangle Counting
Collaborative Filtering 
⋆⋆⋆⋆⋆ 
⋆ 
⋆⋆⋆⋆ 
⋆⋆⋆⋆ 
⋆⋆⋆⋆⋆ 
⚙ 
♫ 
⚙ 
♫
The Graph-Parallel Pattern
The Graph-Parallel Pattern
The Graph-Parallel Pattern
Many Graph-Parallel Algorithms 
Collaborative Filtering 
» Alternating Least Squares 
» Stochastic Gradient Descent 
» Tensor Factorization 
Structured Prediction 
» Loopy Belief Propagation 
» Max-Product Linear 
Programs 
» Gibbs Sampling 
Semi-supervised ML 
» Graph SSL 
» CoEM 
Community Detection 
» Triangle-Counting 
» K-core Decomposition 
» K-Truss 
Graph Analytics 
» PageRank 
» Personalized PageRank 
» Shortest Path 
» Graph Coloring 
Classification 
» Neural Networks
Raw 
Wikipedia 
 / /   /  
XML 
Hyperlinks PageRank Top 20 Pages 
Title PR 
Link Table 
Title Link 
Editor Graph 
Community 
Detection 
User 
Community 
User Com. 
Editor 
Table 
Editor Title 
Top Communities 
Com. PR.. 
Modern Analytics
Tables 
Raw 
Wikipedia 
 / /   /  
XML 
Hyperlinks PageRank Top 20 Pages 
Title PR 
Link Table 
Title Link 
Editor Graph 
Community 
Detection 
User 
Community 
User Com. 
Top Communities 
Com. PR.. 
Editor 
Table 
Editor Title
Editor 
Table 
Editor Title 
Raw 
Wikipedia 
 / /   /  
XML 
Hyperlinks PageRank Top 20 Pages 
Title PR 
Link Table 
Title Link 
Editor Graph 
Community 
Detection 
User 
Community 
User Com. 
Top Communities 
Com. PR.. 
Graphs
The GraphX API
Property Graphs 
Vertex Property: 
• User Profile 
• Current PageRank Value 
Edge Property: 
• Weights 
• Relationships 
• Timestamps
type 
Creating a Graph (Scala) 
VertexId 
= 
Long 
Graph 
val 
vertices: 
RDD[(VertexId, 
String)] 
= 
sc.parallelize(List( 
(1L, 
“Alice”), 
(2L, 
“Bob”), 
(3L, 
“Charlie”))) 
class 
Edge[ED]( 
val 
srcId: 
VertexId, 
val 
dstId: 
VertexId, 
val 
attr: 
ED) 
val 
edges: 
RDD[Edge[String]] 
= 
sc.parallelize(List( 
Edge(1L, 
2L, 
“coworker”), 
Edge(2L, 
3L, 
“friend”))) 
val 
graph 
= 
Graph(vertices, 
edges) 
1 
2 
3 
Alice 
Bob 
Charlie 
coworker 
friend
Graph Operations (Scala) 
class 
Graph[VD, 
ED] 
{ 
// 
Table 
Views 
-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ 
def 
vertices: 
RDD[(VertexId, 
VD)] 
def 
edges: 
RDD[Edge[ED]] 
def 
triplets: 
RDD[EdgeTriplet[VD, 
ED]] 
// 
Transformations 
-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ 
def 
mapVertices[VD2](f: 
(VertexId, 
VD) 
= 
VD2): 
Graph[VD2, 
ED] 
def 
mapEdges[ED2](f: 
Edge[ED] 
= 
ED2): 
Graph[VD2, 
ED] 
def 
reverse: 
Graph[VD, 
ED] 
def 
subgraph(epred: 
EdgeTriplet[VD, 
ED] 
= 
Boolean, 
vpred: 
(VertexId, 
VD) 
= 
Boolean): 
Graph[VD, 
ED] 
// 
Joins 
-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ 
def 
outerJoinVertices[U, 
VD2] 
(tbl: 
RDD[(VertexId, 
U)]) 
(f: 
(VertexId, 
VD, 
Option[U]) 
= 
VD2): 
Graph[VD2, 
ED] 
// 
Computation 
-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ 
def 
mapReduceTriplets[A]( 
sendMsg: 
EdgeTriplet[VD, 
ED] 
= 
Iterator[(VertexId, 
A)], 
mergeMsg: 
(A, 
A) 
= 
A): 
RDD[(VertexId, 
A)]
// 
Continued 
from 
previous 
slide 
def 
pageRank(tol: 
Double): 
Graph[Double, 
Double] 
def 
triangleCount(): 
Graph[Int, 
ED] 
def 
connectedComponents(): 
Graph[VertexId, 
ED] 
// 
...and 
more: 
org.apache.spark.graphx.lib 
} 
Built-in Algorithms (Scala) 
PageRank Triangle Count Connected 
Components
The triplets view 
} 
class 
EdgeTriplet[VD, 
ED]( 
val 
srcId: 
VertexId, 
val 
dstId: 
VertexId, 
val 
attr: 
ED, 
val 
srcAttr: 
VD, 
val 
dstAttr: 
VD) 
RDD 
class 
Graph[VD, 
ED] 
{ 
def 
triplets: 
RDD[EdgeTriplet[VD, 
ED]] 
Graph 
1 
2 
3 
Alice 
Bob 
Charlie 
coworker 
friend 
srcAttr dstAttr attr 
Alice coworker Bob 
Bob friend Charlie 
triplets
The subgraph transformation 
class 
Graph[VD, 
ED] 
{ 
def 
subgraph(epred: 
EdgeTriplet[VD, 
ED] 
= 
Boolean, 
vpred: 
(VertexId, 
VD) 
= 
Boolean): 
Graph[VD, 
ED] 
} 
graph.subgraph(epred 
= 
(edge) 
= 
edge.attr 
!= 
“relative”) 
subgraph 
Graph 
Alice Bob 
relative 
Charlie 
coworker 
friend 
relative David 
Graph 
Alice Bob 
Charlie 
coworker 
friend 
David
The subgraph transformation 
class 
Graph[VD, 
ED] 
{ 
def 
subgraph(epred: 
EdgeTriplet[VD, 
ED] 
= 
Boolean, 
vpred: 
(VertexId, 
VD) 
= 
Boolean): 
Graph[VD, 
ED] 
} 
graph.subgraph(vpred 
= 
(id, 
name) 
= 
name 
!= 
“Bob”) 
subgraph 
Graph 
Alice Bob 
relative 
Charlie 
coworker 
friend 
relative David 
Graph 
Alice 
relative 
Charlie 
relative David
Computation with mapReduceTriplets 
class 
Graph[VD, 
ED] 
{ 
def 
mapReduceTriplets[A]( 
upgrade to aggregateMessages 
sendMsg: 
EdgeTriplet[VD, 
ED] 
= 
Iterator[(VertexId, 
A)], 
mergeMsg: 
(A, 
A) 
= 
A): 
RDD[(VertexId, 
A)] 
} 
graph.mapReduceTriplets( 
edge 
= 
Iterator( 
(edge.srcId, 
1), 
(edge.dstId, 
1)), 
_ 
+ 
_) 
Graph 
Alice Bob 
relative 
Charlie 
friend 
coworker 
relative David 
mapReduceTriplets 
RDD 
vertex id degree 
Alice 2 
Bob 2 
Charlie 3 
David 1 
in Spark 1.2.0
How GraphX Works
Encoding Property Graphs as RDDs 
Part. 1 
A D 
Part. 2 
Vertex 
Table 
(RDD) 
C 
A D 
F 
E 
D 
Property Graph 
B A 
Machine 1 Machine 2 
Edge Table 
(RDD) 
A B 
A C 
B C 
C D 
A E 
A F 
E D 
E F 
A 
B 
C 
D 
E 
F 
Routing 
Table 
(RDD) 
A 
B 
C 
D 
E 
F 
1 
2 
1 
1 
1 
2 
2 
2 
Vertex Cut
Graph System Optimizations 
Specialized 
Vertex-Cuts 
Remote 
Data-Structures 
Partitioning 
Caching / Mirroring 
Message Combiners Active Set Tracking
3500 
3000 
EC2 Cluster of 16 x m2.4xLarge (8 cores) + 1GigE 
Seconds) 
2500 
2000 
(Runtime 1500 
1000 
500 
0 
state-of-the-art graph processing systems. PageRank Benchmark 
Twitter Graph (42M Vertices,1.5B Edges) UK-Graph (106M Vertices, 3.7B Edges) 
9000 
8000 
7000 
6000 
5000 
7x 18x 
4000 
3000 
2000 
1000 
0 
GraphX performs comparably to
Future of GraphX 
1. Language support 
a) Java API: PR #3234 
b) Python API: collaborating with Intel, SPARK-3789 
2. More algorithms 
a) LDA (topic modeling): PR #2388 
b) Correlation clustering 
c) Your algorithm here? 
3. Speculative 
a) Streaming/time-varying graphs 
b) Graph database–like queries
Raw 
Wikipedia 
 / /   /  
Try GraphX Today 
XML Hyperlinks 
PageRank 
Top 20 Pages 
Title PR 
Text 
Table 
Title Body 
Berkeley 
subgraph
Thanks! 
http://spark.apache.org/graphx 
ankurd@eecs.berkeley.edu 
jegonzal@eecs.berkeley.edu 
rxin@eecs.berkeley.edu 
crankshaw@eecs.berkeley.edu

Contenu connexe

Tendances

Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks DataWorks Summit/Hadoop Summit
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in SparkPaco Nathan
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustSpark Summit
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengDatabricks
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingPetr Zapletal
 
Graph computation
Graph computationGraph computation
Graph computationSigmoid
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Spark Summit
 
Signals from outer space
Signals from outer spaceSignals from outer space
Signals from outer spaceGraphAware
 
Gephi, Graphx, and Giraph
Gephi, Graphx, and GiraphGephi, Graphx, and Giraph
Gephi, Graphx, and GiraphDoug Needham
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™Databricks
 
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Spark Summit
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...MLconf
 
Graph Analytics for big data
Graph Analytics for big dataGraph Analytics for big data
Graph Analytics for big dataSigmoid
 
Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Databricks
 
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)Spark Summit
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 
AMP Camp 5 Intro
AMP Camp 5 IntroAMP Camp 5 Intro
AMP Camp 5 Introjeykottalam
 
Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Databricks
 

Tendances (20)

Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
 
Spark graphx
Spark graphxSpark graphx
Spark graphx
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, Streaming
 
Graph computation
Graph computationGraph computation
Graph computation
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
 
Signals from outer space
Signals from outer spaceSignals from outer space
Signals from outer space
 
Gephi, Graphx, and Giraph
Gephi, Graphx, and GiraphGephi, Graphx, and Giraph
Gephi, Graphx, and Giraph
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™
 
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
 
Graph Analytics for big data
Graph Analytics for big dataGraph Analytics for big data
Graph Analytics for big data
 
Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™
 
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
AMP Camp 5 Intro
AMP Camp 5 IntroAMP Camp 5 Intro
AMP Camp 5 Intro
 
Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™
 

Similaire à GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)

Graphs in data structures are non-linear data structures made up of a finite ...
Graphs in data structures are non-linear data structures made up of a finite ...Graphs in data structures are non-linear data structures made up of a finite ...
Graphs in data structures are non-linear data structures made up of a finite ...bhargavi804095
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide trainingSpark Summit
 
High-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingHigh-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingNesreen K. Ahmed
 
Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)
Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)
Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)Spark Summit
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsFlink Forward
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkDB Tsai
 
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...Databricks
 
GraphQL & DGraph with Go
GraphQL & DGraph with GoGraphQL & DGraph with Go
GraphQL & DGraph with GoJames Tan
 
BUILDING WHILE FLYING
BUILDING WHILE FLYINGBUILDING WHILE FLYING
BUILDING WHILE FLYINGKamal Shannak
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Databricks
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingDatabricks
 
"Визуализация данных с помощью d3.js", Михаил Дунаев, MoscowJS 19
"Визуализация данных с помощью d3.js", Михаил Дунаев, MoscowJS 19"Визуализация данных с помощью d3.js", Михаил Дунаев, MoscowJS 19
"Визуализация данных с помощью d3.js", Михаил Дунаев, MoscowJS 19MoscowJS
 
Shark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at ScaleShark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at ScaleDataWorks Summit
 
Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADtab0ris_1
 
OrientDB - The 2nd generation of (multi-model) NoSQL
OrientDB - The 2nd generation of  (multi-model) NoSQLOrientDB - The 2nd generation of  (multi-model) NoSQL
OrientDB - The 2nd generation of (multi-model) NoSQLRoberto Franchini
 

Similaire à GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20) (20)

Graphs in data structures are non-linear data structures made up of a finite ...
Graphs in data structures are non-linear data structures made up of a finite ...Graphs in data structures are non-linear data structures made up of a finite ...
Graphs in data structures are non-linear data structures made up of a finite ...
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide training
 
High-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingHigh-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and Modeling
 
Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)
Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)
Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)
 
DEX: Seminar Tutorial
DEX: Seminar TutorialDEX: Seminar Tutorial
DEX: Seminar Tutorial
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache Spark
 
3DRepo
3DRepo3DRepo
3DRepo
 
SVGD3Angular2React
SVGD3Angular2ReactSVGD3Angular2React
SVGD3Angular2React
 
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
 
GraphQL & DGraph with Go
GraphQL & DGraph with GoGraphQL & DGraph with Go
GraphQL & DGraph with Go
 
BUILDING WHILE FLYING
BUILDING WHILE FLYINGBUILDING WHILE FLYING
BUILDING WHILE FLYING
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and Streaming
 
Apache spark core
Apache spark coreApache spark core
Apache spark core
 
"Визуализация данных с помощью d3.js", Михаил Дунаев, MoscowJS 19
"Визуализация данных с помощью d3.js", Михаил Дунаев, MoscowJS 19"Визуализация данных с помощью d3.js", Михаил Дунаев, MoscowJS 19
"Визуализация данных с помощью d3.js", Михаил Дунаев, MoscowJS 19
 
Shark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at ScaleShark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at Scale
 
Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADta
 
OrientDB - The 2nd generation of (multi-model) NoSQL
OrientDB - The 2nd generation of  (multi-model) NoSQLOrientDB - The 2nd generation of  (multi-model) NoSQL
OrientDB - The 2nd generation of (multi-model) NoSQL
 
The D3 Toolbox
The D3 ToolboxThe D3 Toolbox
The D3 Toolbox
 

Dernier

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 

Dernier (20)

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 

GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)

  • 1. UC BERKELEY GraphX Graph Analytics in Spark Ankur Dave Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez, Reynold Xin, Daniel Crankshaw, Michael Franklin, and Ion Stoica
  • 5. Web Graphs “The political blogosphere and the 2004 U.S. election: divided they blog.” Adamic and Glance, 2005.
  • 6. User-Item Graphs ⋆⋆⋆⋆⋆ ⋆ ⋆⋆⋆⋆ ⋆⋆⋆⋆ ⋆⋆⋆⋆⋆
  • 10. Collaborative Filtering ⋆⋆⋆⋆⋆ ⋆ ⋆⋆⋆⋆ ⋆⋆⋆⋆ ⋆⋆⋆⋆⋆ ⚙ ♫ ⚙ ♫
  • 14. Many Graph-Parallel Algorithms Collaborative Filtering » Alternating Least Squares » Stochastic Gradient Descent » Tensor Factorization Structured Prediction » Loopy Belief Propagation » Max-Product Linear Programs » Gibbs Sampling Semi-supervised ML » Graph SSL » CoEM Community Detection » Triangle-Counting » K-core Decomposition » K-Truss Graph Analytics » PageRank » Personalized PageRank » Shortest Path » Graph Coloring Classification » Neural Networks
  • 15. Raw Wikipedia / / / XML Hyperlinks PageRank Top 20 Pages Title PR Link Table Title Link Editor Graph Community Detection User Community User Com. Editor Table Editor Title Top Communities Com. PR.. Modern Analytics
  • 16. Tables Raw Wikipedia / / / XML Hyperlinks PageRank Top 20 Pages Title PR Link Table Title Link Editor Graph Community Detection User Community User Com. Top Communities Com. PR.. Editor Table Editor Title
  • 17. Editor Table Editor Title Raw Wikipedia / / / XML Hyperlinks PageRank Top 20 Pages Title PR Link Table Title Link Editor Graph Community Detection User Community User Com. Top Communities Com. PR.. Graphs
  • 19. Property Graphs Vertex Property: • User Profile • Current PageRank Value Edge Property: • Weights • Relationships • Timestamps
  • 20. type Creating a Graph (Scala) VertexId = Long Graph val vertices: RDD[(VertexId, String)] = sc.parallelize(List( (1L, “Alice”), (2L, “Bob”), (3L, “Charlie”))) class Edge[ED]( val srcId: VertexId, val dstId: VertexId, val attr: ED) val edges: RDD[Edge[String]] = sc.parallelize(List( Edge(1L, 2L, “coworker”), Edge(2L, 3L, “friend”))) val graph = Graph(vertices, edges) 1 2 3 Alice Bob Charlie coworker friend
  • 21. Graph Operations (Scala) class Graph[VD, ED] { // Table Views -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ def vertices: RDD[(VertexId, VD)] def edges: RDD[Edge[ED]] def triplets: RDD[EdgeTriplet[VD, ED]] // Transformations -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ def mapVertices[VD2](f: (VertexId, VD) = VD2): Graph[VD2, ED] def mapEdges[ED2](f: Edge[ED] = ED2): Graph[VD2, ED] def reverse: Graph[VD, ED] def subgraph(epred: EdgeTriplet[VD, ED] = Boolean, vpred: (VertexId, VD) = Boolean): Graph[VD, ED] // Joins -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ def outerJoinVertices[U, VD2] (tbl: RDD[(VertexId, U)]) (f: (VertexId, VD, Option[U]) = VD2): Graph[VD2, ED] // Computation -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ def mapReduceTriplets[A]( sendMsg: EdgeTriplet[VD, ED] = Iterator[(VertexId, A)], mergeMsg: (A, A) = A): RDD[(VertexId, A)]
  • 22. // Continued from previous slide def pageRank(tol: Double): Graph[Double, Double] def triangleCount(): Graph[Int, ED] def connectedComponents(): Graph[VertexId, ED] // ...and more: org.apache.spark.graphx.lib } Built-in Algorithms (Scala) PageRank Triangle Count Connected Components
  • 23. The triplets view } class EdgeTriplet[VD, ED]( val srcId: VertexId, val dstId: VertexId, val attr: ED, val srcAttr: VD, val dstAttr: VD) RDD class Graph[VD, ED] { def triplets: RDD[EdgeTriplet[VD, ED]] Graph 1 2 3 Alice Bob Charlie coworker friend srcAttr dstAttr attr Alice coworker Bob Bob friend Charlie triplets
  • 24. The subgraph transformation class Graph[VD, ED] { def subgraph(epred: EdgeTriplet[VD, ED] = Boolean, vpred: (VertexId, VD) = Boolean): Graph[VD, ED] } graph.subgraph(epred = (edge) = edge.attr != “relative”) subgraph Graph Alice Bob relative Charlie coworker friend relative David Graph Alice Bob Charlie coworker friend David
  • 25. The subgraph transformation class Graph[VD, ED] { def subgraph(epred: EdgeTriplet[VD, ED] = Boolean, vpred: (VertexId, VD) = Boolean): Graph[VD, ED] } graph.subgraph(vpred = (id, name) = name != “Bob”) subgraph Graph Alice Bob relative Charlie coworker friend relative David Graph Alice relative Charlie relative David
  • 26. Computation with mapReduceTriplets class Graph[VD, ED] { def mapReduceTriplets[A]( upgrade to aggregateMessages sendMsg: EdgeTriplet[VD, ED] = Iterator[(VertexId, A)], mergeMsg: (A, A) = A): RDD[(VertexId, A)] } graph.mapReduceTriplets( edge = Iterator( (edge.srcId, 1), (edge.dstId, 1)), _ + _) Graph Alice Bob relative Charlie friend coworker relative David mapReduceTriplets RDD vertex id degree Alice 2 Bob 2 Charlie 3 David 1 in Spark 1.2.0
  • 28. Encoding Property Graphs as RDDs Part. 1 A D Part. 2 Vertex Table (RDD) C A D F E D Property Graph B A Machine 1 Machine 2 Edge Table (RDD) A B A C B C C D A E A F E D E F A B C D E F Routing Table (RDD) A B C D E F 1 2 1 1 1 2 2 2 Vertex Cut
  • 29. Graph System Optimizations Specialized Vertex-Cuts Remote Data-Structures Partitioning Caching / Mirroring Message Combiners Active Set Tracking
  • 30. 3500 3000 EC2 Cluster of 16 x m2.4xLarge (8 cores) + 1GigE Seconds) 2500 2000 (Runtime 1500 1000 500 0 state-of-the-art graph processing systems. PageRank Benchmark Twitter Graph (42M Vertices,1.5B Edges) UK-Graph (106M Vertices, 3.7B Edges) 9000 8000 7000 6000 5000 7x 18x 4000 3000 2000 1000 0 GraphX performs comparably to
  • 31. Future of GraphX 1. Language support a) Java API: PR #3234 b) Python API: collaborating with Intel, SPARK-3789 2. More algorithms a) LDA (topic modeling): PR #2388 b) Correlation clustering c) Your algorithm here? 3. Speculative a) Streaming/time-varying graphs b) Graph database–like queries
  • 32. Raw Wikipedia / / / Try GraphX Today XML Hyperlinks PageRank Top 20 Pages Title PR Text Table Title Body Berkeley subgraph
  • 33. Thanks! http://spark.apache.org/graphx ankurd@eecs.berkeley.edu jegonzal@eecs.berkeley.edu rxin@eecs.berkeley.edu crankshaw@eecs.berkeley.edu