FAUNUS
MARKO A. RODRIGUEZ
http://THINKAURELIUS.COM
GRAPH ANALYTICS ENGINE
Faunus is a graph analytics engine built atop the Hadoop
distributed computing platform. The graph representation is
a distributed adjacency list, whereby a vertex and its
incident edges are co-located on the same machine.
Querying a Faunus graph is possible with a MapReduce
variant of the Gremlin graph traversal language. A Gremlin
expression compiles down to a series of MapReduce steps
that are sequence-optimized and then executed by Hadoop.
Results are stored as transformations to the input graph
(graph derivations) or computational side-effects such as
aggregates (graph statistics). Beyond querying, a collection
of input/output formats is supported, which enables Faunus
to load and store graphs in the distributed graph database Titan,
in various graph formats stored in HDFS, and via arbitrary
user-defined functions. This presentation will focus primarily
on Faunus, but will also review the satellite technologies
that enable it.
ABSTRACT
http://FAUNUS.THINKAURELIUS.COM
SPONSORED BY
ECCO, the Evolution, Complexity and Cognition group, is a multidisciplinary
research group directed by Francis Heylighen. The group is based at the
Vrije Universiteit Brussel (VUB), although members are distributed across
four continents. Researchers come from a wide variety of backgrounds,
from physical science and technology to the social sciences and humanities.
The philosophy is intrinsically transdisciplinary, transcending the traditional
boundaries between "hard" and "soft" sciences, and between philosophical
foundations and practical applications.
The Big-Data Interest Group (BIGDIG) is a focus group at LANL meeting
monthly to explore big-data methods and architectures. One goal of the
group is to identify early adopters and learn from their experiences.
Furthermore, the group would like to involve scientists who are looking for
big-data solutions and foster collaboration with those who might provide the
needed technology. The BIGDIG group includes members from all
domains: science, security, sensing, computing, library, and more.
The EgoSystem project is creating an integrated social model of the Los Alamos National Laboratory
and its surroundings using numerous online services such as Twitter, LinkedIn, MS Academic,
Wikipedia, and more. The model is seeded with LANL PostDocs and their created artifacts, and it
continuously grows to encompass their relations to other people and institutions. EgoSystem is a
Director-sponsored project engineered by the Digital Library Research and Prototyping Team using
Big Graph Data technology provided by Aurelius.
VERTEX
[Figure: a vertex has an ID (0) and PROPERTIES (name:faunus, born:2012).]
EDGE
[Figure: an edge joins vertex 0 (name:faunus, born:2012) and vertex 1 (name:hadoop, born:2005). The edge has an ID (5), a LABEL (dependsOn), and PROPERTIES (since:2012).]
VERTICES + EDGES (ELEMENTS)
[Figure: an example graph assembled in stages. VERTEX IDS: 0, 1, 2, 3. EDGE IDS: 4, 5, 6, 7. EDGE LABELS: A, B, A, C. ELEMENT PROPERTIES: a:b, c:d, e:f, g:h, i:j.]
[Figure: vertex 1 and its incident edges written out as one row.]
1 e:f 4 c:d A 2 5 B 0 6 g:h A 3 7 C 3
Row layout: a vertex (id props) followed by its edges (id props label id), incoming edges first and outgoing edges second. Edges without properties are written as (id label id).
AN ADJACENCY LIST
[Figure: one such row per vertex, for vertices 0 to 11.]
AN ADJACENCY LIST + CLUSTER
[Figure: the rows next to a cluster of three machines: 127.0.0.2, 127.0.0.3, 127.0.0.4.]
A DISTRIBUTED ADJACENCY LIST
[Figure: rows 0 to 11 partitioned across 127.0.0.2, 127.0.0.3, and 127.0.0.4.]
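The distributed adjacency list can be sketched in a few lines of plain Python. This is only a model of the idea, not Faunus code: the class names, the `machine_for` placement rule (modulo on the vertex id), and the choice to treat edge 7 as outgoing are all assumptions made for illustration. On a real cluster, placement follows how HDFS stores the file.

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    id: int
    label: str
    other: int                                   # id of the vertex at the far end
    props: dict = field(default_factory=dict)

@dataclass
class VertexRow:
    """One row of the adjacency list: a vertex plus its incident edges."""
    id: int
    props: dict = field(default_factory=dict)
    in_edges: list = field(default_factory=list)
    out_edges: list = field(default_factory=list)

def machine_for(vertex_id, machines):
    # Hypothetical placement rule, used only to make the sketch deterministic.
    return machines[vertex_id % len(machines)]

def distribute(rows, machines):
    """Assign every row, with all of its edges, to exactly one machine."""
    cluster = {m: [] for m in machines}
    for row in rows:
        cluster[machine_for(row.id, machines)].append(row)
    return cluster

machines = ["127.0.0.2", "127.0.0.3", "127.0.0.4"]
rows = [VertexRow(id=i) for i in range(12)]
rows[1].props["e"] = "f"
rows[1].out_edges.append(Edge(id=7, label="C", other=3))  # direction assumed
cluster = distribute(rows, machines)
```

Because a row carries its own edges, whichever machine holds vertex 1 can read its neighborhood without contacting another machine.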
Hadoop is a distributed computing platform composed of two key components:
HDFS:
A distributed file system that stores arbitrarily large files within a cluster.
MapReduce:
A parallel functional computing model for key/value pair data.
HADOOP
http://hadoop.apache.org
[Figure: the distributed adjacency list (rows 0 to 11 across 127.0.0.2, 127.0.0.3, 127.0.0.4), annotated with Structure and Process.]
Faunus provides graph input/output formats (structure)
and a traversal language for graphs (process).
FAUNUS AND HADOOP
PROCESSING GRAPHS
WITH FAUNUS
[Figure: the Graph of the Gods, a property graph with 12 vertices.]
Vertices:
0 name:saturn type:titan
1 name:jupiter type:god
2 name:neptune type:god
3 name:pluto type:god
4 name:sky type:location
5 name:sea type:location
6 name:tartarus type:location
7 name:hercules type:demigod
8 name:alcmene type:human
9 name:nemean type:monster
10 name:hydra type:monster
11 name:cerberus type:monster
Edge labels: father, mother, brother, lives, pet, battled.
The three battled edges carry the properties time:1, time:2, and time:12.
GRAPH
OF THE GODS
* Toy graph distributed with Faunus.
http://gremlin.tinkerpop.com
faunus$ bin/gremlin.sh
,,,/
(o o)
-----oOOo-(_)-oOOo-----
gremlin> hdfs.ls()
gremlin> hdfs.copyFromLocal('graph-of-the-gods.json','graph-of-the-gods.json')
==>null
gremlin> hdfs.ls()
==>rw-r--r-- marko supergroup 2028 graph-of-the-gods.json
gremlin>
[Figure: graph-of-the-gods.json in HDFS; the adjacency list rows 0 to 11 are spread across 127.0.0.2, 127.0.0.3, and 127.0.0.4.]
gremlin> g = FaunusFactory.open('bin/faunus.properties')
==>faunusgraph[graphsoninputformat->graphsonoutputformat]
gremlin> g.getConf('faunus')
==>faunus.graph.input.format=com.thinkaurelius.faunus.formats.graphson.GraphSONInputFormat
==>faunus.input.location=graph-of-the-gods.json
==>faunus.graph.output.format=com.thinkaurelius.faunus.formats.graphson.GraphSONOutputFormat
==>faunus.output.location=output
==>faunus.output.location.overwrite=true
==>faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
gremlin> g.V
13/05/07 12:07:09 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)
13/05/07 12:07:09 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1:
MapSequence[com.thinkaurelius.faunus.mapreduce.transform.VerticesMap.Map]
13/05/07 12:07:09 INFO mapreduce.FaunusCompiler: Job data location: output/job-0
13/05/07 12:07:10 INFO input.FileInputFormat: Total input paths to process : 1
13/05/07 12:07:10 INFO mapred.JobClient: Running job: job_201304251105_0004
13/05/07 12:07:11 INFO mapred.JobClient: map 0% reduce 0%
...
[Figure: traverser counters after g.V. Every vertex (0 to 11) holds a counter of 1.]
gremlin> g.V.has('type','god')
13/05/07 12:08:55 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)
13/05/07 12:08:55 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1:
MapSequence[com.thinkaurelius.faunus.mapreduce.transform.VerticesMap.Map,
com.thinkaurelius.faunus.mapreduce.filter.PropertyFilterMap.Map]
13/05/07 12:08:55 INFO mapreduce.FaunusCompiler: Job data location: output/job-0
13/05/07 12:08:56 INFO input.FileInputFormat: Total input paths to process : 1
13/05/07 12:08:57 INFO mapred.JobClient: Running job: job_201304251105_0005
13/05/07 12:08:58 INFO mapred.JobClient: map 0% reduce 0%
...
[Figure: traverser counters after g.V.has('type','god'). Vertices 1 (jupiter), 2 (neptune), and 3 (pluto) hold a counter of 1; all other vertices hold 0.]
gremlin> g.V.has('type','god').in('father')
13/05/07 12:13:03 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)
13/05/07 12:13:03 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1:
MapSequence[com.thinkaurelius.faunus.mapreduce.transform.VerticesMap.Map,
com.thinkaurelius.faunus.mapreduce.filter.PropertyFilterMap.Map,
com.thinkaurelius.faunus.mapreduce.transform.VerticesVerticesMapReduce.Map,
com.thinkaurelius.faunus.mapreduce.transform.VerticesVerticesMapReduce.Reduce]
13/05/07 12:13:03 INFO mapreduce.FaunusCompiler: Job data location: output/job-0
13/05/07 12:13:03 INFO input.FileInputFormat: Total input paths to process : 1
13/05/07 12:13:04 INFO mapred.JobClient: Running job: job_201304251105_0006
13/05/07 12:13:05 INFO mapred.JobClient: map 0% reduce 0%
...
[Figure: traverser counters after g.V.has('type','god').in('father'). Only vertex 7 (hercules) holds a counter of 1; all other vertices hold 0.]
gremlin> g.V.has('type','god').in('father').out('mother').name
13/05/07 12:25:18 INFO mapreduce.FaunusCompiler: Compiled to 3 MapReduce job(s)
13/05/07 12:25:18 INFO mapreduce.FaunusCompiler: Executing job 1 out of 3:
MapSequence[com.thinkaurelius.faunus.mapreduce.transform.VerticesMap.Map,
com.thinkaurelius.faunus.mapreduce.filter.PropertyFilterMap.Map,
com.thinkaurelius.faunus.mapreduce.transform.VerticesVerticesMapReduce.Map,
com.thinkaurelius.faunus.mapreduce.transform.VerticesVerticesMapReduce.Reduce]
13/05/07 12:25:18 INFO mapreduce.FaunusCompiler: Job data location: output/job-0
13/05/07 12:25:18 INFO input.FileInputFormat: Total input paths to process : 1
13/05/07 12:25:18 INFO mapred.JobClient: Running job: job_201305071220_0007
...
==>alcmene
gremlin>
[Figure: traverser counters after g.V.has('type','god').in('father').out('mother'). Only vertex 8 (alcmene) holds a counter of 1; all other vertices hold 0.]
[Figure: one adjacency list row (vertex, incoming edges, outgoing edges); the vertex and each edge carry properties such as k1:v1 and k2:v2.]
TRAVERSAL DATA
1. A long counter denoting how many
traversers exist at the element.
-OR-
2. A list of lists denoting path history of
individual traversers at the element.
counter = cheap
enumerative = expensive
* Each element in a row
maintains traversal data as well.
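The cost difference between the two kinds of traversal data can be seen in a small Python model. The four-vertex diamond graph and the function names are hypothetical; the sketch only contrasts a counter per vertex with a list of path histories per vertex.

```python
# Hypothetical toy graph: vertex id -> ids of its out-neighbors.
out_edges = {0: [1, 2], 1: [3], 2: [3], 3: []}

def step_counters(counters):
    """Move every traverser one hop, keeping only a count per vertex."""
    nxt = {v: 0 for v in out_edges}
    for v, n in counters.items():
        for w in out_edges[v]:
            nxt[w] += n
    return nxt

def step_paths(paths):
    """Move every traverser one hop, keeping each full path history."""
    nxt = {v: [] for v in out_edges}
    for v, histories in paths.items():
        for w in out_edges[v]:
            nxt[w].extend(h + [w] for h in histories)
    return nxt

counters = {0: 1, 1: 0, 2: 0, 3: 0}       # one traverser starts at vertex 0
paths = {0: [[0]], 1: [], 2: [], 3: []}   # the same traverser, with history
for _ in range(2):
    counters = step_counters(counters)
    paths = step_paths(paths)
```

After two hops the counter at vertex 3 is a single number, while the path variant stores one list per traverser, which is why path calculations grow with the number of traversers.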
gremlin> g.V.has('type','god').in('father').out('mother').path
13/05/07 14:37:59 WARN mapreduce.FaunusCompiler: Path calculations are enabled for
this Faunus job (space and time expensive)
13/05/07 14:37:59 INFO mapreduce.FaunusCompiler: Compiled to 3 MapReduce job(s)
13/05/07 14:37:59 INFO mapreduce.FaunusCompiler: Executing job 1 out of 3:
MapSequence[com.thinkaurelius.faunus.mapreduce.transform.VerticesMap.Map,
com.thinkaurelius.faunus.mapreduce.filter.PropertyFilterMap.Map,
com.thinkaurelius.faunus.mapreduce.transform.VerticesVerticesMapReduce.Map,
com.thinkaurelius.faunus.mapreduce.transform.VerticesVerticesMapReduce.Reduce]
13/05/07 14:38:00 INFO mapred.JobClient: Running job: job_201305071220_0005
...
==>[v[1], v[7], v[8]]
gremlin>
[Figure: with path calculations enabled, vertex 8 (alcmene) holds the path [1,7,8] instead of a counter.]
GREMLIN
GRAPH TRAVERSAL LANGUAGE
TRANSFORM: $t : (V \cup E) \to \mathcal{P}(V \cup E)$
transform{}, V, id, label, out, in, outE, inE, inV, map, order, ...
FILTER: $f : (V \cup E) \to (V \cup E \cup \emptyset)$
filter{}, has, hasNot, [0..10], random, simplePath, back, ...
SIDE-EFFECT: $s : (V \cup E) \to (V \cup E)$
sideEffect{}, groupCount, groupBy, aggregate, table, store, linkIn, linkOut, count, ...
BRANCH
loop, copySplit, fairMerge, exhaustMerge, ...
[Figure: a traversal as a chain of functions f1, f2, f3, ..., f4.]
Gremlin is a functional graph language where traversals are
defined using function composition. A set of useful predefined
functions is provided with the language, and generic
lambdas/closures are possible for arbitrary mappings.
http://gremlin.tinkerpop.com
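Function composition can be illustrated without Gremlin itself. The Python sketch below is an analogy, not the Gremlin implementation: the step functions `V`, `has`, `out`, and `values` and the helper `compose` are hypothetical names, and the three-vertex graph is a fragment of the Graph of the Gods.

```python
from functools import reduce

# Toy graph: vertex id -> properties and outgoing (label, target id) edges.
vertices = {
    1: {"type": "god", "name": "jupiter", "out": [("father", 0), ("brother", 2)]},
    2: {"type": "god", "name": "neptune", "out": [("brother", 1)]},
    0: {"type": "titan", "name": "saturn", "out": []},
}

def V(_):
    """Start step: emit every vertex id."""
    return list(vertices)

def has(key, value):
    """Filter step: keep vertices whose property matches."""
    return lambda ids: [i for i in ids if vertices[i].get(key) == value]

def out(label):
    """Transform step: follow outgoing edges with the given label."""
    return lambda ids: [t for i in ids for (l, t) in vertices[i]["out"] if l == label]

def values(key):
    """Transform step: map each vertex to one of its property values."""
    return lambda ids: [vertices[i][key] for i in ids]

def compose(*steps):
    """Chain the steps left to right, as a traversal reads."""
    return lambda start=None: reduce(lambda acc, step: step(acc), steps, start)

traversal = compose(V, has("type", "god"), out("father"), values("name"))
result = traversal()
```

The composed `traversal` reads like g.V.has('type','god').out('father').name: each step consumes the output of the step before it.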
EXAMPLE TRAVERSALS
g.V.count()
"How many vertices are in the graph?"
g.V.has('type','person').out('attends')
.has('type','academy').name.groupCount
"How many people attend each academy?"
g.V.out.out.out.simplePath.count()
"How many 3-step acyclic paths exist in the graph?"
g.V.sideEffect{it.degree = it.inE('friend').count()}
.degree.groupCount
"What is the in-degree distribution of the friendship subgraph?"
* The only memory structure is the graph,
thus all data must be in the graph.
g.V.as('x').out('father').out('father')
.linkIn('grandfather','x')
"Derive all implicit grandfather relations in the graph."
* Mutates the graph.
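The grandfather derivation can be sketched in plain Python. The sketch assumes that linkIn('grandfather','x') adds an edge from the vertex named 'x' into the vertex reached after the two father hops; the dictionary and the function name are hypothetical, and the two father relations are those of the Graph of the Gods.

```python
# child -> father, using vertex names in place of ids.
father = {"hercules": "jupiter", "jupiter": "saturn"}

def derive_grandfather_edges(father):
    """Return one (grandchild, 'grandfather', grandfather) edge per two-hop father path."""
    edges = []
    for x, parent in father.items():
        grandparent = father.get(parent)   # second father hop, if it exists
        if grandparent is not None:
            edges.append((x, "grandfather", grandparent))
    return edges

new_edges = derive_grandfather_edges(father)
```

Only hercules has a two-hop father path here, so a single edge is derived; in Faunus the new edges become part of the output graph.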
[Figure: the traversal g.V.out.out.count() compiled into a chain of MapReduce jobs. The jobs write to hdfs://user/ubuntu/output/job-0/, output/job-1/, and output/job-2/, where a job directory holds graph* and sideeffect* files. Types are shown as <key, value>.]
MAP ONLY STEPS (NO REDUCE NEEDED)
map: <NullWritable, FaunusVertex> to <NullWritable, FaunusVertex>
MAP/REDUCE STEPS
map: <NullWritable, FaunusVertex> to <LongWritable, Holder<FaunusElement>>
reduce: <LongWritable, Iterable<Holder<FaunusElement>>> to <NullWritable, FaunusVertex>
FAUNUS DATA FLOW
GREMLIN IN MAP/REDUCE
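map(null, vertex, context) {
  // the key and value to match are supplied through the job configuration
  key = context.getConf().get('provided.key')
  value = context.getConf().get('provided.value')
  // a vertex that fails the predicate stays in the stream but loses its traversers
  if(!vertex.getProperty(key).equals(value)) {
    vertex.clearPaths();
  }
  context.write(null, vertex);
}
FILTER
$f : (V \cup E) \to (V \cup E \cup \emptyset)$
g.V.has('type','god')
* Most filters are map-only steps.
If the predicate returns false,
then all the path metadata is cleared from the element.
[Figure: f(v) applied with key 'type' and value 'god'.]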
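A runnable model of the filter step, written in Python rather than against the Hadoop API. `ToyVertex` and `filter_map` are hypothetical stand-ins; the point is that a filter never drops a vertex from the stream, it only zeroes the traversers on vertices that fail the predicate.

```python
class ToyVertex:
    """Hypothetical stand-in for a vertex: properties plus a traverser counter."""
    def __init__(self, vid, props, paths=1):
        self.id = vid
        self.props = props
        self.paths = paths

    def clear_paths(self):
        self.paths = 0

def filter_map(vertex, key, value):
    """Map-only filter: always emit the vertex, clearing traversers on a mismatch."""
    if vertex.props.get(key) != value:
        vertex.clear_paths()
    return (None, vertex)

graph = [
    ToyVertex(0, {"name": "saturn", "type": "titan"}),
    ToyVertex(1, {"name": "jupiter", "type": "god"}),
    ToyVertex(2, {"name": "neptune", "type": "god"}),
]
emitted = [filter_map(v, "type", "god")[1] for v in graph]
survivors = [v.id for v in emitted if v.paths > 0]
```

Every vertex is emitted so that later steps still see the whole graph structure; only the counters differ.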
map(null, vertex, context) {
  // send this vertex's traversers to each adjacent vertex
  for(e : vertex.getEdges(OUT)) {
    context.write(e.getVertex(IN).id, holder('p',vertex.pathsOnly()))
  }
  // send the vertex itself so that its structure can be rebuilt in the reducer
  context.write(vertex.id, holder('v',vertex))
}
reduce(id, iterable<holder> holders, context) {
  vertex = new FaunusVertex(id)
  for(h : holders) {
    if(h.getTag() == 'v')
      vertex.addAll(h.getVertex())
    else
      vertex.addPaths(h.getVertex())
  }
  context.write(null, vertex)
}
[Figure: messages shuffled between 127.0.0.2, 127.0.0.3, and 127.0.0.4.]
GREMLIN IN MAP/REDUCE
TRANSFORM
$t : (V \cup E) \to \mathcal{P}(V \cup E)$
g.V.out
* Traversals implement a reduce-side join.
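The reduce-side join behind g.V.out can be simulated in memory. In this Python sketch the dictionary `shuffled` stands in for Hadoop's shuffle, and the three-vertex input, the tags, and the function names are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical input: vertex id -> (out-neighbor ids, traverser count).
graph = {0: ([1, 2], 1), 1: ([2], 1), 2: ([], 1)}

def map_out(vid, neighbors, count):
    """Emit the traverser count to each out-neighbor, plus the vertex itself."""
    for target in neighbors:
        yield target, ("p", count)        # 'p': traversers moving to target
    yield vid, ("v", neighbors)           # 'v': the vertex structure

def reduce_out(vid, holders):
    """Join the vertex structure with the traversers that arrived at it."""
    neighbors, count = [], 0
    for tag, payload in holders:
        if tag == "v":
            neighbors = payload
        else:
            count += payload
    return vid, (neighbors, count)

def run_out_step(graph):
    shuffled = defaultdict(list)          # groups map output by key
    for vid, (neighbors, count) in graph.items():
        for key, holder in map_out(vid, neighbors, count):
            shuffled[key].append(holder)
    return dict(reduce_out(vid, holders) for vid, holders in shuffled.items())

after = run_out_step(graph)
```

Vertex 2 ends with a count of 2 because traversers arrive from both vertex 0 and vertex 1, while vertex 0 ends with 0 because nothing points at it.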
map(null, vertex, context) {
  key = context.getConf().get('provided.key')
  // the graph itself passes through unchanged
  context.write('graph',null,vertex)
  // each vertex contributes its traverser count under its property value
  context.write('sideeffect',
    vertex.getProperty(key),vertex.getPathCount())
}
reduce(object, iterable<long> longs, context) {
  sum = 0
  for(l : longs) { sum += l }
  context.write('sideeffect',object,sum)
}
GREMLIN IN MAP/REDUCE
SIDE-EFFECT
$s : (V \cup E) \to (V \cup E)$
g.V.type.groupCount()
[Figure: s(v) applied with key 'type'.]
* Leverages MultipleInputs/Outputs
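The groupCount side-effect reduces to a sum per key. The Python sketch below models only the 'sideeffect' output stream; the function names are hypothetical, and the input lists the vertex types of the Graph of the Gods, each with one traverser as after g.V.

```python
from collections import defaultdict

# (type, traverser count) for vertices 0 to 11 of the Graph of the Gods.
vertices = [
    ("titan", 1), ("god", 1), ("god", 1), ("god", 1),
    ("location", 1), ("location", 1), ("location", 1),
    ("demigod", 1), ("human", 1),
    ("monster", 1), ("monster", 1), ("monster", 1),
]

def map_group_count(vertex_type, path_count):
    """Emit the property value as key and the traverser count as value."""
    return vertex_type, path_count

def reduce_group_count(key, counts):
    """Sum the counts that share a key."""
    return key, sum(counts)

grouped = defaultdict(list)               # stands in for the shuffle
for vertex_type, path_count in vertices:
    key, value = map_group_count(vertex_type, path_count)
    grouped[key].append(value)
side_effect = dict(reduce_group_count(k, vs) for k, vs in grouped.items())
```

The resulting dictionary is the aggregate that a job would write to its side-effect output, separate from the graph output.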
STRUCTURING GRAPHS
WITH FAUNUS
INPUT/OUTPUT FORMATS
SequenceFileOutputFormat
A list of serialized vertex objects in a compressed binary format.
<NullWritable,FaunusVertex>
The intermediate data format between MapReduce jobs
within a Faunus pipeline.
Fastest available format for both reading and writing.
Compressed using variable-width and prefix encodings.
gremlin> g
==>faunusgraph[graphsoninputformat->graphsonoutputformat]
gremlin> g.setGraphOutputFormat(SequenceFileOutputFormat)
==>null
gremlin> g
==>faunusgraph[graphsoninputformat->sequencefileoutputformat]
gremlin>
SequenceFileInputFormat
INPUT/OUTPUT FORMATS
GraphSONOutputFormat
A verbose JSON-based text-format. Each vertex is a single JSON document.
Easy for developers to generate. Useful for testing and examples.
Limited to JSON supported datatypes for element property values.
{"name":"saturn","type":"titan","_id":0,"_inE":[{"_label":"father","_id":12,"_outV":1}]}
{"name":"jupiter","type":"god","_id":1,"_outE":[{"_label":"lives","_id":13,"_inV":4},
{"_label":"brother","_id":16,"_inV":3},{"_label":"brother","_id":14,"_inV":2},
{"_label":"father","_id":12,"_inV":0}],"_inE":[{"_label":"brother","_id":17,"_outV":3},
{"_label":"brother","_id":15,"_outV":2},{"_label":"father","_id":24,"_outV":7}]}
{"name":"neptune","type":"god","_id":2,"_outE":[{"_label":"lives","_id":20,"_inV":5},
{"_label":"brother","_id":19,"_inV":3},{"_label":"brother","_id":15,"_inV":1}],"_inE":
[{"_label":"brother","_id":18,"_outV":3},{"_label":"brother","_id":14,"_outV":1}]}
...
GraphSONInputFormat
* JSON specification is available at http://json.org
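Because each vertex is one JSON document, a line can be parsed on its own. The Python sketch below reads the saturn line shown above with the standard json module; `parse_vertex` is a hypothetical helper, not part of Faunus, and it treats every key that starts with an underscore as reserved.

```python
import json

line = '{"name":"saturn","type":"titan","_id":0,"_inE":[{"_label":"father","_id":12,"_outV":1}]}'

def parse_vertex(line):
    """Split one GraphSON vertex line into id, properties, and incident edges."""
    doc = json.loads(line)
    in_edges = [(e["_id"], e["_label"], e["_outV"]) for e in doc.get("_inE", [])]
    out_edges = [(e["_id"], e["_label"], e["_inV"]) for e in doc.get("_outE", [])]
    props = {k: v for k, v in doc.items() if not k.startswith("_")}
    return doc["_id"], props, in_edges, out_edges

vid, props, in_edges, out_edges = parse_vertex(line)
```

The parsed result is one adjacency list row: vertex 0 with two properties and a single incoming father edge from vertex 1.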
INPUT/OUTPUT FORMATS
faunus.graph.input.format=
com.thinkaurelius.faunus.formats.edgelist.rdf.RDFInputFormat
faunus.input.location=graph-example-1.ntriple
faunus.graph.input.rdf.format=n-triples
faunus.graph.input.rdf.as-properties=http://www.w3.org/1999/02/22-rdf-syntax-ns#type
faunus.graph.input.rdf.use-localname=true
faunus.graph.input.rdf.literal-as-property=true
RDFInputFormat
Maps popular RDF text formats to a property graph.
Configurations allow for different mappings of RDF to the property graph model.
Utilizes a MapReduce step to convert an edge-list into an adjacency list.
[Figure: the triple ex:marko foaf:age 33^^xsd:int is mapped to vertex 0 with the properties uri:ex:marko and age:33.]
* RDF parsers provided by http://openrdf.org
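Converting an edge list into an adjacency list is a grouping problem, which is why it needs a reduce. The Python sketch below shows the idea on three made-up triples; the names and the record layout are assumptions, and the dictionary `shuffled` stands in for the MapReduce shuffle.

```python
from collections import defaultdict

# Hypothetical triples (subject, predicate, object), one edge per record.
edge_list = [
    ("ex:marko", "foaf:knows", "ex:peter"),
    ("ex:marko", "foaf:knows", "ex:josh"),
    ("ex:peter", "foaf:knows", "ex:josh"),
]

def map_edge(s, p, o):
    """Emit the edge once under each endpoint, so both rows learn about it."""
    yield s, ("out", p, o)
    yield o, ("in", p, s)

def reduce_vertex(vertex, records):
    """Collect all records for one vertex into a single adjacency row."""
    row = {"out": [], "in": []}
    for direction, label, other in records:
        row[direction].append((label, other))
    return vertex, row

shuffled = defaultdict(list)
for triple in edge_list:
    for key, record in map_edge(*triple):
        shuffled[key].append(record)
adjacency = dict(reduce_vertex(v, recs) for v, recs in shuffled.items())
```

After the reduce, each vertex owns one row that lists both its outgoing and its incoming edges.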
INPUT/OUTPUT FORMATS
RexsterInputFormat
Rexster
{
"results": {
"_type":"vertex",
"_id":1,
"name":"tiberius",
"age":29
},
"queryTime":0.123
}
HTTP REXPRO
http://.../vertices/1
g.v(1).out('mother')
.out('mother').name
==>aurelia
Rexster is a graph server that is accessed via
REST (HTTP) and a binary Gremlin protocol (RexPro).
Rexster supports any Blueprints-enabled graph database.
http://rexster.tinkerpop.com
INPUT/OUTPUT FORMATS
A Gremlin script stored in HDFS (distributed cache) allows for an arbitrary parse.
def boolean read(FaunusVertex v, String line) {
  // each line has the form id:neighbor1,neighbor2,...
  parts = line.split(':');
  v.reuse(Long.valueOf(parts[0]))
  parts[1].split(',').each {
    v.addEdge(OUT, 'linkedTo', Long.valueOf(it));
  }
  return true;
}
ScriptInputFormat
0:1,2,3,4
1:2,3
2:0,3,5,6
3:1,2
...
def void write(FaunusVertex vertex, DataOutput output) {
  output.writeUTF(vertex.getId().toString() + ':');
  Iterator<Edge> itty = vertex.getEdges(OUT).iterator()
  while (itty.hasNext()) {
    output.writeUTF(
      itty.next().getVertex(IN).getId() + ',');
  }
  output.writeUTF('\n');
}
ScriptOutputFormat
0:1,2,3,4
1:2,3
2:0,3,5,6
3:1,2
...
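The line format handled by the two scripts can be exercised in plain Python. `read_line` and `write_line` are hypothetical helpers that mirror the id:neighbor,neighbor layout shown above; unlike the write script, `write_line` joins the ids so that no trailing comma is produced.

```python
def read_line(line):
    """Parse 'id:n1,n2,...' into (vertex id, list of out-neighbor ids)."""
    vertex_id, _, targets = line.strip().partition(":")
    return int(vertex_id), [int(t) for t in targets.split(",") if t]

def write_line(vertex_id, targets):
    """Format a vertex and its out-neighbors back into 'id:n1,n2,...'."""
    return f"{vertex_id}:{','.join(str(t) for t in targets)}"

lines = ["0:1,2,3,4", "1:2,3", "2:0,3,5,6", "3:1,2"]
rows = [read_line(l) for l in lines]
round_trip = [write_line(v, ts) for v, ts in rows]
```

Reading and then writing returns the original lines, which is a quick check that a custom input script and output script agree on the format.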
Adam Jacobs. 2009. The Pathologies of Big Data. Communications of the ACM 52, 8 (August 2009), 36-44.
doi:10.1145/1536616.1536632 http://doi.acm.org/10.1145/1536616.1536632
[Figure: two copies of the rows 0 to 11, one labeled Serial Key/Value Data Structure and the other labeled Indexed Key/Indexed Value Data Structure.]
GLOBAL VS. LOCAL
GRAPH ANALYSIS
TITAN
DISTRIBUTED GRAPH DATABASE
Application Servers Reading/Writing Graph Data
Titan Cluster Processing Gremlin Traversals and Writes
The biggest known Titan/Cassandra cluster to date:
~120 billion edge graph stored in a 16 hi1.4xlarge machine cluster.
Ego-centric graph traversals are requested by 80 m1.large machines.
The cluster serves ~10,000 transactions a second w/ ~200ms return times.
http://titan.thinkaurelius.com
http://thinkaurelius.com/2013/05/13/educating-the-planet-with-pearson/
FAUNUS AND TITAN
SUPPORTED TITAN INPUT/OUTPUT FORMATS
TitanCassandraInputFormat
TitanCassandraOutputFormat
TitanHBaseInputFormat
TitanHBaseOutputFormat
FAUNUS AND TITAN
Faunus/Hadoop Titan/Cassandra
INTRA-CLUSTER CONFIGURATION
Data is processed on the machine where it is located.
Limited network communication.
FAUNUS AND TITAN
INTER-CLUSTER CONFIGURATION
Graph data is offloaded to another cluster.
Repeated analysis does not interfere with production graph database.
Graph g
long counter = 0
def setup(args) {
  g = TitanFactory.open('cassandra:localhost')
}
def map(vertex, args) {
  g.v(vertex.id).as('x').out('father')
    .out('father').linkIn('grandfather','x')
  if(counter++ % 1000 == 0) g.commit()
}
FAUNUS AND TITAN
VERTEX-CENTRIC COMPUTING WITH GREMLIN
A Gremlin script is stored in HDFS (distributed cache).
Vertex long ids are pulled out of Titan (FaunusVertex with id only).
The Gremlin script is evaluated concurrently for every vertex long id.
Guaranteed co-location of Gremlin script JVM and Titan vertex.
* Provided by the Gremlin script()-step
CREDITS
PRESENTED BY
MARKO A. RODRIGUEZ
SUPPORTED BY
LOS ALAMOS NATIONAL LABORATORY
LANL RESEARCH LIBRARY
VRIJE UNIVERSITEIT BRUSSEL
MANY THANKS TO
MATTHIAS BRÖCHELER
STEPHEN MALLETTE
PAVEL YASKEVICH
DAN LAROCQUE
AURELIUS COMMUNITY
TINKERPOP COMMUNITY
KETRINA YIM

Contenu connexe

Tendances

GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)Ankur Dave
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Makoto Yui
 
OrientDB for real & Web App development
OrientDB for real & Web App developmentOrientDB for real & Web App development
OrientDB for real & Web App developmentLuca Garulli
 
GraphFrames: Graph Queries in Spark SQL by Ankur Dave
GraphFrames: Graph Queries in Spark SQL by Ankur DaveGraphFrames: Graph Queries in Spark SQL by Ankur Dave
GraphFrames: Graph Queries in Spark SQL by Ankur DaveSpark Summit
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLGraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLSpark Summit
 
Mining Big Data in Real Time
Mining Big Data in Real TimeMining Big Data in Real Time
Mining Big Data in Real TimeAlbert Bifet
 
OrientDB - the 2nd generation of (Multi-Model) NoSQL
OrientDB - the 2nd generation  of  (Multi-Model) NoSQLOrientDB - the 2nd generation  of  (Multi-Model) NoSQL
OrientDB - the 2nd generation of (Multi-Model) NoSQLLuigi Dell'Aquila
 
R programming groundup-basic-section-i
R programming groundup-basic-section-iR programming groundup-basic-section-i
R programming groundup-basic-section-iDr. Awase Khirni Syed
 
Gremlin Queries with DataStax Enterprise Graph
Gremlin Queries with DataStax Enterprise GraphGremlin Queries with DataStax Enterprise Graph
Gremlin Queries with DataStax Enterprise GraphStephen Mallette
 
Gephi, Graphx, and Giraph
Gephi, Graphx, and GiraphGephi, Graphx, and Giraph
Gephi, Graphx, and GiraphDoug Needham
 
Graphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXGraphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXAndrea Iacono
 
BSSML16 L9. Advanced Workflows: Feature Selection, Boosting, Gradient Descent...
BSSML16 L9. Advanced Workflows: Feature Selection, Boosting, Gradient Descent...BSSML16 L9. Advanced Workflows: Feature Selection, Boosting, Gradient Descent...
BSSML16 L9. Advanced Workflows: Feature Selection, Boosting, Gradient Descent...BigML, Inc
 
Data visualization in Python
Data visualization in PythonData visualization in Python
Data visualization in PythonMarc Garcia
 
Signals from outer space
Signals from outer spaceSignals from outer space
Signals from outer spaceGraphAware
 
A Map of the PyData Stack
A Map of the PyData StackA Map of the PyData Stack
A Map of the PyData StackPeadar Coyle
 
R language tutorial
R language tutorialR language tutorial
R language tutorialDavid Chiu
 
BSSML16 L8. REST API, Bindings, and Basic Workflows
BSSML16 L8. REST API, Bindings, and Basic WorkflowsBSSML16 L8. REST API, Bindings, and Basic Workflows
BSSML16 L8. REST API, Bindings, and Basic WorkflowsBigML, Inc
 
Scalable Algorithm Design with MapReduce
Scalable Algorithm Design with MapReduceScalable Algorithm Design with MapReduce
Scalable Algorithm Design with MapReducePietro Michiardi
 

Tendances (20)

GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0
 
OrientDB for real & Web App development
OrientDB for real & Web App developmentOrientDB for real & Web App development
OrientDB for real & Web App development
 
GraphFrames: Graph Queries in Spark SQL by Ankur Dave
GraphFrames: Graph Queries in Spark SQL by Ankur DaveGraphFrames: Graph Queries in Spark SQL by Ankur Dave
GraphFrames: Graph Queries in Spark SQL by Ankur Dave
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLGraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQL
 
Mining Big Data in Real Time
Mining Big Data in Real TimeMining Big Data in Real Time
Mining Big Data in Real Time
 
OrientDB - the 2nd generation of (Multi-Model) NoSQL
OrientDB - the 2nd generation  of  (Multi-Model) NoSQLOrientDB - the 2nd generation  of  (Multi-Model) NoSQL
OrientDB - the 2nd generation of (Multi-Model) NoSQL
 
R programming groundup-basic-section-i
R programming groundup-basic-section-iR programming groundup-basic-section-i
R programming groundup-basic-section-i
 
Gremlin Queries with DataStax Enterprise Graph
Gremlin Queries with DataStax Enterprise GraphGremlin Queries with DataStax Enterprise Graph
Gremlin Queries with DataStax Enterprise Graph
 
Gephi, Graphx, and Giraph
Gephi, Graphx, and GiraphGephi, Graphx, and Giraph
Gephi, Graphx, and Giraph
 
Graphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXGraphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphX
 
BSSML16 L9. Advanced Workflows: Feature Selection, Boosting, Gradient Descent...
BSSML16 L9. Advanced Workflows: Feature Selection, Boosting, Gradient Descent...BSSML16 L9. Advanced Workflows: Feature Selection, Boosting, Gradient Descent...
BSSML16 L9. Advanced Workflows: Feature Selection, Boosting, Gradient Descent...
 
Data visualization in Python
Data visualization in PythonData visualization in Python
Data visualization in Python
 
Signals from outer space
Signals from outer spaceSignals from outer space
Signals from outer space
 
A Map of the PyData Stack
A Map of the PyData StackA Map of the PyData Stack
A Map of the PyData Stack
 
R language tutorial
R language tutorialR language tutorial
R language tutorial
 
H20: A platform for big math
H20: A platform for big math H20: A platform for big math
H20: A platform for big math
 
BSSML16 L8. REST API, Bindings, and Basic Workflows
BSSML16 L8. REST API, Bindings, and Basic WorkflowsBSSML16 L8. REST API, Bindings, and Basic Workflows
BSSML16 L8. REST API, Bindings, and Basic Workflows
 
Scalable Algorithm Design with MapReduce
Scalable Algorithm Design with MapReduceScalable Algorithm Design with MapReduce
Scalable Algorithm Design with MapReduce
 

Faunus: Graph Analytics Engine

  • 2. ABSTRACT Faunus is a graph analytics engine built atop the Hadoop distributed computing platform. The graph representation is a distributed adjacency list, whereby a vertex and its incident edges are co-located on the same machine. Querying a Faunus graph is possible with a MapReduce variant of the Gremlin graph traversal language. A Gremlin expression compiles down to a series of MapReduce steps that are sequence-optimized and then executed by Hadoop. Results are stored as transformations to the input graph (graph derivations) or as computational side-effects such as aggregates (graph statistics). Beyond querying, a collection of input/output formats is supported, which enables Faunus to load and store graphs in the distributed graph database Titan, in various graph formats stored in HDFS, and via arbitrary user-defined functions. This presentation will focus primarily on Faunus, but will also review the satellite technologies that enable it. http://FAUNUS.THINKAURELIUS.COM
  • 3. SPONSORED BY
    ECCO, the Evolution, Complexity and Cognition group, is a multidisciplinary research group directed by Francis Heylighen. It is based at the Vrije Universiteit Brussel (VUB), although its members are distributed across four continents. Researchers come from a wide variety of backgrounds, from physical science and technology to the social sciences and humanities. The philosophy is intrinsically transdisciplinary, transcending the traditional boundaries between "hard" and "soft" sciences, and between philosophical foundations and practical applications.
    The Big-Data Interest Group (BIGDIG) is a focus group at LANL that meets monthly to explore big-data methods and architectures. One goal of the group is to identify early adopters and learn from their experiences. Furthermore, the group would like to involve scientists who are looking for big-data solutions and foster collaboration with those who might provide the needed technology. The BIGDIG group includes members from all domains: science, security, sensing, computing, library, and more.
    The EgoSystem project is creating an integrated social model of the Los Alamos National Laboratory and its surroundings using numerous online services such as Twitter, LinkedIn, MS Academic, Wikipedia, and more. The model is seeded with LANL PostDocs and the artifacts they have created, and it continuously grows to encompass their relations to other people and institutions. EgoSystem is a Director-sponsored project engineered by the Digital Library Research and Prototyping Team using Big Graph Data technology provided by Aurelius.
  • 30. [Figure: a graph with vertices 0 to 3 connected by edges 4 to 7 (labels A, B, A, C; properties a:b, c:d, e:f, g:h, i:j), and the adjacency-list row for vertex 1: (id 1, props e:f) followed by its edges (id 4, props c:d, label A, id 2), (id 5, label B, id 0), (id 6, props g:h, label A, id 3), (id 7, label C, id 3).]
  • 31. [Figure: the same row, with the first field group marked "vertex" and each following group marked "edge".]
  • 32. [Figure: the same row, with the edge groups divided into "incoming edges" and "outgoing edges".]
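The row layout on slides 30 to 32 can be sketched as a small data structure. This is a minimal illustration in Python, not Faunus's own (Java) vertex class: the names `Edge`, `VertexRow`, and `out_neighbors` are hypothetical, and the split of edges 4 and 5 as incoming and 6 and 7 as outgoing is an assumption, since the flattened figure does not say which group each edge belongs to.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    id: int
    label: str
    other: int                      # id of the vertex at the other end
    props: dict = field(default_factory=dict)


@dataclass
class VertexRow:
    """One adjacency-list row: a vertex plus all of its incident edges."""
    id: int
    props: dict
    in_edges: list = field(default_factory=list)
    out_edges: list = field(default_factory=list)


# Row for vertex 1 from the slide. The incoming/outgoing split is assumed.
row = VertexRow(
    id=1,
    props={"e": "f"},
    in_edges=[Edge(4, "A", 2, {"c": "d"}), Edge(5, "B", 0)],
    out_edges=[Edge(6, "A", 3, {"g": "h"}), Edge(7, "C", 3)],
)


def out_neighbors(row, label=None):
    """Ids of vertices reached over outgoing edges, optionally filtered by label."""
    return [e.other for e in row.out_edges if label is None or e.label == label]


a_neighbors = out_neighbors(row, "A")
```

Because every incident edge travels with its vertex, a one-hop step such as `out_neighbors` needs only the single row and no lookup on another machine.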
  • 34. AN ADJACENCY LIST + CLUSTER [Figure: adjacency-list rows 0 to 11 beside a cluster of three machines, 127.0.0.2, 127.0.0.3, and 127.0.0.4.]
  • 35. A DISTRIBUTED ADJACENCY LIST [Figure: rows 0 to 11 spread across the machines 127.0.0.2, 127.0.0.3, and 127.0.0.4.]
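One way to picture the distribution of rows over the cluster is a simple contiguous split. The sketch below is only an illustration: the slides do not state how rows are assigned to machines (in practice HDFS block placement decides this), and the `partition` helper is hypothetical.

```python
MACHINES = ["127.0.0.2", "127.0.0.3", "127.0.0.4"]


def partition(vertex_ids, machines):
    """Split vertex rows into contiguous, near-equal chunks, one per machine."""
    ids = sorted(vertex_ids)
    size, extra = divmod(len(ids), len(machines))
    placement, start = {}, 0
    for i, machine in enumerate(machines):
        # The first `extra` machines take one additional row each.
        end = start + size + (1 if i < extra else 0)
        placement[machine] = ids[start:end]
        start = end
    return placement


placement = partition(range(12), MACHINES)
```

With twelve rows and three machines, each machine holds four complete rows, so a vertex and its incident edges always sit on the same machine.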
  • 36. HADOOP
    Hadoop is a distributed computing platform composed of two key components:
    HDFS: A distributed file system that stores arbitrarily large files within a cluster.
    MapReduce: A parallel functional computing model for key/value pair data.
    http://hadoop.apache.org
  • 37. FAUNUS AND HADOOP
    Faunus provides graph input/output formats (structure) and a traversal language for graphs (process).
    [Figure: adjacency-list rows 0 to 11 on the machines 127.0.0.2, 127.0.0.3, and 127.0.0.4, annotated "Structure" and "Process".]
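The key/value model behind MapReduce can be sketched in a few lines of single-process Python. This is a toy illustration of the map, shuffle, and reduce phases, not Hadoop's API: the function names are hypothetical and the input records (vertex id, outgoing edge labels) are made-up sample data.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input key/value pair, emitting new pairs."""
    for key, value in records:
        yield from mapper(key, value)


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into one result."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Toy input: (vertex id, labels of its outgoing edges). Sample data only.
records = [
    (0, ["father", "lives", "brother"]),
    (1, ["lives", "brother"]),
    (2, ["battled", "battled", "father"]),
]


def label_mapper(vertex_id, labels):
    for label in labels:
        yield label, 1          # emit one count per edge label


def sum_reducer(label, counts):
    return sum(counts)


label_counts = reduce_phase(shuffle(map_phase(records, label_mapper)), sum_reducer)
```

Each mapper call sees one record in isolation, which is what lets the map phase run in parallel across the machines that hold the data.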
  • 41. faunus$ bin/gremlin.sh 1 6 0 3 name:tartarus type:location name:pluto type:god lives brother name:jupiter type:god 2 brother name:neptune type:god pet 11 name:cerberus type:monster lives father name:saturn type:titan brother 5 name:sea type:location lives 4 name:sky type:location lives 7 father battled name:hercules type:demigod 10 name:hydra type:monster battled 9 name:nemean type:monster battled 8 name:alcmene type:human mother time:1 time:2 time:12 127.0.0.2 127.0.0.3 127.0.0.4 http://gremlin.tinkerpop.com
  • 42. faunus$ bin/gremlin.sh ,,,/ (o o) -----oOOo-(_)-oOOo----- gremlin> 1 6 0 3 name:tartarus type:location name:pluto type:god lives brother name:jupiter type:god 2 brother name:neptune type:god pet 11 name:cerberus type:monster lives father name:saturn type:titan brother 5 name:sea type:location lives 4 name:sky type:location lives 7 father battled name:hercules type:demigod 10 name:hydra type:monster battled 9 name:nemean type:monster battled 8 name:alcmene type:human mother time:1 time:2 time:12 127.0.0.2 127.0.0.3 127.0.0.4
  • 43. faunus$ bin/gremlin.sh ,,,/ (o o) -----oOOo-(_)-oOOo----- gremlin> hdfs.ls() gremlin> 1 6 0 3 name:tartarus type:location name:pluto type:god lives brother name:jupiter type:god 2 brother name:neptune type:god pet 11 name:cerberus type:monster lives father name:saturn type:titan brother 5 name:sea type:location lives 4 name:sky type:location lives 7 father battled name:hercules type:demigod 10 name:hydra type:monster battled 9 name:nemean type:monster battled 8 name:alcmene type:human mother time:1 time:2 time:12 127.0.0.2 127.0.0.3 127.0.0.4
  • 44. faunus$ bin/gremlin.sh
               \,,,/
               (o o)
      -----oOOo-(_)-oOOo-----
      gremlin> hdfs.ls()
      gremlin> hdfs.copyFromLocal('graph-of-the-gods.json','graph-of-the-gods.json')
      ==>null
      gremlin>
  • 45. faunus$ bin/gremlin.sh
               \,,,/
               (o o)
      -----oOOo-(_)-oOOo-----
      gremlin> hdfs.ls()
      gremlin> hdfs.copyFromLocal('graph-of-the-gods.json','graph-of-the-gods.json')
      ==>null
      gremlin> hdfs.ls()
      ==>rw-r--r-- marko supergroup 2028 graph-of-the-gods.json
      gremlin>
  • 46. gremlin> g = FaunusFactory.open('bin/faunus.properties')
      ==>faunusgraph[graphsoninputformat->graphsonoutputformat]
      gremlin> g.getConf('faunus')
      ==>faunus.graph.input.format=com.thinkaurelius.faunus.formats.graphson.GraphSONInputFormat
      ==>faunus.input.location=graph-of-the-gods.json
      ==>faunus.graph.output.format=com.thinkaurelius.faunus.formats.graphson.GraphSONOutputFormat
      ==>faunus.output.location=output
      ==>faunus.output.location.overwrite=true
      ==>faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
  • 47. gremlin> g.V
      13/05/07 12:07:09 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)
      13/05/07 12:07:09 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1: MapSequence[com.thinkaurelius.faunus.mapreduce.transform.VerticesMap.Map]
      13/05/07 12:07:09 INFO mapreduce.FaunusCompiler: Job data location: output/job-0
      13/05/07 12:07:10 INFO input.FileInputFormat: Total input paths to process : 1
      13/05/07 12:07:10 INFO mapred.JobClient: Running job: job_201304251105_0004
      13/05/07 12:07:11 INFO mapred.JobClient: map 0% reduce 0%
      ...
      [Figure: traverser counters on vertices 0-11 (machines 127.0.0.2, 127.0.0.3, 127.0.0.4): 1 1 1 1 1 1 1 1 1 1 1 1]
  • 48. gremlin> g.V.has('type','god')
      13/05/07 12:08:55 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)
      13/05/07 12:08:55 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1: MapSequence[com.thinkaurelius.faunus.mapreduce.transform.VerticesMap.Map, com.thinkaurelius.faunus.mapreduce.filter.PropertyFilterMap.Map]
      13/05/07 12:08:55 INFO mapreduce.FaunusCompiler: Job data location: output/job-0
      13/05/07 12:08:56 INFO input.FileInputFormat: Total input paths to process : 1
      13/05/07 12:08:57 INFO mapred.JobClient: Running job: job_201304251105_0005
      13/05/07 12:08:58 INFO mapred.JobClient: map 0% reduce 0%
      ...
      [Figure: traverser counters on vertices 0-11 (machines 127.0.0.2, 127.0.0.3, 127.0.0.4): 0 1 1 1 0 0 0 0 0 0 0 0]
  • 49. gremlin> g.V.has('type','god').in('father')
      13/05/07 12:13:03 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)
      13/05/07 12:13:03 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1: MapSequence[com.thinkaurelius.faunus.mapreduce.transform.VerticesMap.Map, com.thinkaurelius.faunus.mapreduce.filter.PropertyFilterMap.Map, com.thinkaurelius.faunus.mapreduce.transform.VerticesVerticesMapReduce.Map, com.thinkaurelius.faunus.mapreduce.transform.VerticesVerticesMapReduce.Reduce]
      13/05/07 12:13:03 INFO mapreduce.FaunusCompiler: Job data location: output/job-0
      13/05/07 12:13:03 INFO input.FileInputFormat: Total input paths to process : 1
      13/05/07 12:13:04 INFO mapred.JobClient: Running job: job_201304251105_0006
      13/05/07 12:13:05 INFO mapred.JobClient: map 0% reduce 0%
      ...
      [Figure: traverser counters on vertices 0-11 (machines 127.0.0.2, 127.0.0.3, 127.0.0.4): 0 0 0 0 0 0 0 1 0 0 0 0]
  • 50. gremlin> g.V.has('type','god').in('father').out('mother').name
      13/05/07 12:25:18 INFO mapreduce.FaunusCompiler: Compiled to 3 MapReduce job(s)
      13/05/07 12:25:18 INFO mapreduce.FaunusCompiler: Executing job 1 out of 3: MapSequence[com.thinkaurelius.faunus.mapreduce.transform.VerticesMap.Map, com.thinkaurelius.faunus.mapreduce.filter.PropertyFilterMap.Map, com.thinkaurelius.faunus.mapreduce.transform.VerticesVerticesMapReduce.Map, com.thinkaurelius.faunus.mapreduce.transform.VerticesVerticesMapReduce.Reduce]
      13/05/07 12:25:18 INFO mapreduce.FaunusCompiler: Job data location: output/job-0
      13/05/07 12:25:18 INFO input.FileInputFormat: Total input paths to process : 1
      13/05/07 12:25:18 INFO mapred.JobClient: Running job: job_201305071220_0007
      ...
      ==>alcmene
      gremlin>
      [Figure: traverser counters on vertices 0-11 (machines 127.0.0.2, 127.0.0.3, 127.0.0.4): 0 0 0 0 0 0 0 0 1 0 0 0]
  • 51. TRAVERSAL DATA
      [Figure: a row holding a vertex (id 1, properties k1:v1, k2:v2) together with its incoming edges and outgoing edges (vertex ids 2, 3, 4, 5); edges carry properties such as k1:v1.]
      Each element stores one of the following:
      1. A long counter denoting how many traversers exist at the element (counter = cheap).
      -OR-
      2. A list of lists denoting path history of individual traversers at the element (enumerative = expensive).
      * Each element in a row maintains traversal data as well.
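The cost difference between the two kinds of traversal data can be illustrated with a small Python sketch. It is an in-memory illustration on a hypothetical four-vertex graph, not Faunus code; `step_counters` and `step_paths` are made-up names.

```python
from collections import Counter

def step_counters(counters, adjacency):
    """Move traverser counts one step along outgoing edges (one number per vertex)."""
    moved = Counter()
    for vertex, count in counters.items():
        for neighbor in adjacency.get(vertex, ()):
            moved[neighbor] += count
    return moved

def step_paths(paths, adjacency):
    """Move full path histories one step along outgoing edges (one list per traverser)."""
    moved = {}
    for vertex, histories in paths.items():
        for neighbor in adjacency.get(vertex, ()):
            moved.setdefault(neighbor, []).extend(h + [neighbor] for h in histories)
    return moved

# Hypothetical graph: 1 -> 2 -> 4 and 1 -> 3 -> 4.
adjacency = {1: [2, 3], 2: [4], 3: [4], 4: []}

# Two steps from vertex 1, tracked both ways.
counters = step_counters(step_counters(Counter({1: 1}), adjacency), adjacency)
paths = step_paths(step_paths({1: [[1]]}, adjacency), adjacency)
```

After two steps the counter representation holds a single number for vertex 4, while the path representation holds one list per traverser, which is why path tracking grows with the number of paths rather than the number of vertices.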
  • 52. gremlin> g.V.has('type','god').in('father').out('mother').path
      13/05/07 14:37:59 WARN mapreduce.FaunusCompiler: Path calculations are enabled for this Faunus job (space and time expensive)
      13/05/07 14:37:59 INFO mapreduce.FaunusCompiler: Compiled to 3 MapReduce job(s)
      13/05/07 14:37:59 INFO mapreduce.FaunusCompiler: Executing job 1 out of 3: MapSequence[com.thinkaurelius.faunus.mapreduce.transform.VerticesMap.Map, com.thinkaurelius.faunus.mapreduce.filter.PropertyFilterMap.Map, com.thinkaurelius.faunus.mapreduce.transform.VerticesVerticesMapReduce.Map, com.thinkaurelius.faunus.mapreduce.transform.VerticesVerticesMapReduce.Reduce]
      13/05/07 14:38:00 INFO mapred.JobClient: Running job: job_201305071220_0005
      ...
      ==>[v[1], v[7], v[8]]
      gremlin>
      [Figure: traverser path history [1,7,8] on the graph (machines 127.0.0.2, 127.0.0.3, 127.0.0.4).]
  • 53. GREMLIN GRAPH TRAVERSAL LANGUAGE
      Gremlin is a functional graph language where traversals are defined using function composition. A set of useful predefined functions is provided with the language, and generic lambdas/closures are possible for arbitrary mappings.
      [Figure: a pipeline of composed functions f1, f2, f3, ..., f4.]
      TRANSFORM: $t : (V \cup E) \to \mathcal{P}(V \cup E)$. Steps: transform{}, V, id, label, out, in, outE, inE, inV, map, order, ...
      FILTER: $f : (V \cup E) \to (V \cup E \cup \emptyset)$. Steps: filter{}, has, hasNot, [0..10], random, simplePath, back, ...
      SIDE-EFFECT: $s : (V \cup E) \nrightarrow (V \cup E)$. Steps: sideEffect{}, groupCount, groupBy, aggregate, table, store, linkIn, linkOut, count, ...
      BRANCH. Steps: loop, copySplit, fairMerge, exhaustMerge, ...
      http://gremlin.tinkerpop.com
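Traversal by function composition can be mimicked in plain Python. The sketch below is not Gremlin: `compose`, `transform`, `filter_step`, `out`, and `in_` are hypothetical helpers, and the three vertices and two edges are a small excerpt of the deck's example graph.

```python
from functools import reduce

def compose(*steps):
    """Chain steps left to right; each step maps a list of elements to a list."""
    return lambda elements: reduce(lambda acc, step: step(acc), steps, elements)

def transform(fn):
    """Transform step: each element maps to zero or more elements."""
    return lambda elements: [y for x in elements for y in fn(x)]

def filter_step(predicate):
    """Filter step: keep only elements for which the predicate holds."""
    return lambda elements: [x for x in elements if predicate(x)]

VERTICES = {
    1: {"name": "jupiter", "type": "god"},
    7: {"name": "hercules", "type": "demigod"},
    8: {"name": "alcmene", "type": "human"},
}
EDGES = [(7, "father", 1), (7, "mother", 8)]  # (tail, label, head)

def out(label):
    return lambda v: [head for tail, lbl, head in EDGES if tail == v and lbl == label]

def in_(label):
    return lambda v: [tail for tail, lbl, head in EDGES if head == v and lbl == label]

# Analogue of g.V.has('type','god').in('father').out('mother').name
traversal = compose(
    filter_step(lambda v: VERTICES[v]["type"] == "god"),
    transform(in_("father")),
    transform(out("mother")),
    transform(lambda v: [VERTICES[v]["name"]]),
)
names = traversal(list(VERTICES))
```

Each step is an ordinary function from a collection of elements to a collection of elements, so a traversal is just the composition of its steps.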
  • 54. EXAMPLE TRAVERSALS
      "How many vertices are in the graph?"
      g.V.count()
      "How many 3-step acyclic paths exist in the graph?"
      g.V.out.out.out.simplePath.count()
      "How many people attend each academy?"
      g.V.has('type','person').out('attends').has('type','academy').name.groupCount
      "What is the in-degree distribution of the friendship subgraph?"
      g.V.sideEffect{it.degree = it.inE('friend').count()}.degree.groupCount
      "Derive all implicit grandfather relations in the graph."
      g.V.as('x').out('father').out('father').linkIn('grandfather','x')
      * The only memory structure is the graph, thus all data must be in the graph.
      * Mutates the graph.
  • 55. FAUNUS DATA FLOW
      Traversal: g.V.out.out.count()
      Output location: hdfs://user/ubuntu/ with output/job-0/, output/job-1/, and output/job-2/; a job directory holds graph* and sideeffect* files.
      Key/value types per step:
      MAP ONLY STEPS (NO REDUCE NEEDED): map reads and writes <NullWritable, FaunusVertex>.
      MAP/REDUCE STEPS: map reads <NullWritable, FaunusVertex> and emits <LongWritable, Holder<FaunusElement>>; reduce receives <LongWritable, Iterable<Holder<FaunusElement>>> and emits <NullWritable, FaunusVertex>.
  • 56. GREMLIN IN MAP/REDUCE: FILTER
      $f : (V \cup E) \to (V \cup E \cup \emptyset)$
      g.V.has('type','god')
      map(null, vertex, context) {
        key = context.getConf().get('provided.key')
        value = context.getConf().get('provided.value')
        if(!vertex.getProperty(key).equals(value)) {
          vertex.clearPaths();
        }
        context.write(vertex);
      }
      * Most filters are map-only steps. If the predicate returns false, then all the path metadata is cleared from the element.
      [Figure: f(v) applied with key 'type' and value 'god'.]
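The filter step can be simulated in plain Python to show that a filtered-out vertex is still written, only with its traversal data cleared. This is a sketch, not Faunus code: vertices are plain dictionaries, the traversal data is assumed to be a simple counter under the hypothetical key `paths`, and `has_filter_map` is a made-up name.

```python
def has_filter_map(vertex, key, value):
    """Map-only filter: keep the vertex, but zero its traverser count on a mismatch."""
    if vertex["properties"].get(key) != value:
        return {**vertex, "paths": 0}  # analogue of vertex.clearPaths()
    return vertex

# Vertices 0-3 of the example graph, each starting with one traverser (as after g.V).
vertices = [
    {"id": 0, "properties": {"name": "saturn", "type": "titan"}, "paths": 1},
    {"id": 1, "properties": {"name": "jupiter", "type": "god"}, "paths": 1},
    {"id": 2, "properties": {"name": "neptune", "type": "god"}, "paths": 1},
    {"id": 3, "properties": {"name": "pluto", "type": "god"}, "paths": 1},
]

# Analogue of g.V.has('type','god')
filtered = [has_filter_map(v, "type", "god") for v in vertices]
counters = [v["paths"] for v in filtered]
```

The vertex for saturn survives in the output with a counter of zero, so the graph structure stays intact for later steps.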
  • 57. GREMLIN IN MAP/REDUCE: TRANSFORM
      $t : (V \cup E) \to \mathcal{P}(V \cup E)$
      g.V.out
      map(null, vertex, context) {
        for(e : vertex.getEdges(OUT)) {
          context.write(e.getVertex(IN).id, holder('p',vertex.pathsOnly()))
        }
        context.write(vertex.id, holder('v',vertex))
      }
      reduce(long, iterable<holder> holders, context) {
        vertex = new FaunusVertex(long)
        for(h : holders) {
          if(h.getTag() == 'v')
            vertex.addAll(h.getVertex())
          else
            vertex.addPaths(h.getVertex())
        }
        context.write(null, vertex)
      }
      * Traversals implement a reduce-side join.
      [Figure: machines 127.0.0.2, 127.0.0.3, 127.0.0.4.]
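The reduce-side join behind g.V.out can be simulated in plain Python. This is a sketch under stated assumptions, not Faunus code: vertices are dictionaries, traversal data is a counter under the hypothetical key `paths`, and the reducer is assumed to rebuild each vertex with only the traversers that arrived over incoming messages.

```python
from collections import defaultdict

def out_map(vertex):
    """Send this vertex's traverser count to each out-neighbor, plus its own structure."""
    for target in vertex["out"]:
        yield target, ("p", vertex["paths"])  # 'p': traversers moving along an edge
    yield vertex["id"], ("v", vertex)         # 'v': the vertex itself

def out_reduce(vertex_id, holders):
    """Join the vertex structure with the traversers that arrived at it."""
    vertex = {"id": vertex_id, "out": [], "paths": 0}
    for tag, payload in holders:
        if tag == "v":
            vertex["out"] = list(payload["out"])
        else:
            vertex["paths"] += payload
    return vertex

def run_out_step(vertices):
    """Simulate map, shuffle (group by vertex id), and reduce in memory."""
    shuffled = defaultdict(list)
    for vertex in vertices:
        for key, holder in out_map(vertex):
            shuffled[key].append(holder)
    return [out_reduce(key, holders) for key, holders in sorted(shuffled.items())]

# Hypothetical graph: 1 -> 2, 1 -> 3, 2 -> 3, with one traverser on every vertex.
graph = [
    {"id": 1, "out": [2, 3], "paths": 1},
    {"id": 2, "out": [3], "paths": 1},
    {"id": 3, "out": [], "paths": 1},
]
after = run_out_step(graph)
counts = {v["id"]: v["paths"] for v in after}
```

Keying both message kinds by vertex id is what makes this a join: the shuffle brings a vertex and every traverser headed for it to the same reducer call.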
  • 58. GREMLIN IN MAP/REDUCE: SIDE-EFFECT
      $s : (V \cup E) \nrightarrow (V \cup E)$
      g.V.type.groupCount()
      map(null, vertex, context) {
        key = context.getConf().get('provided.key')
        context.write('graph',null,vertex)
        context.write('sideeffect', vertex.getProperty(key),vertex.getPathCount())
      }
      reduce(object, iterable<long> longs, context) {
        sum = 0
        for(l : longs) {
          sum += l
        }
        context.write('sideeffect',object,sum)
      }
      * Leverages MultipleInputs/Outputs
      [Figure: s(v) applied with key 'type'.]
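The groupCount side-effect can be simulated in plain Python. This sketch models only the 'sideeffect' output, not the pass-through 'graph' output; the function names and the `paths` counter key are hypothetical, and the twelve vertex types are taken from the example graph figure.

```python
from collections import defaultdict

def group_count_map(vertex, key):
    """Emit (property value, traverser count) for the side-effect output."""
    yield vertex["properties"].get(key), vertex["paths"]

def group_count_reduce(group, counts):
    """Sum the traverser counts that share a property value."""
    return group, sum(counts)

def run_group_count(vertices, key):
    """Simulate map, shuffle (group by property value), and reduce in memory."""
    shuffled = defaultdict(list)
    for vertex in vertices:
        for group, count in group_count_map(vertex, key):
            shuffled[group].append(count)
    return dict(group_count_reduce(g, c) for g, c in shuffled.items())

# Types of vertices 0-11 in the example graph, one traverser per vertex.
TYPES = ["titan", "god", "god", "god", "location", "location", "location",
         "demigod", "human", "monster", "monster", "monster"]
vertices = [{"id": i, "properties": {"type": t}, "paths": 1}
            for i, t in enumerate(TYPES)]

# Analogue of g.V.type.groupCount()
distribution = run_group_count(vertices, "type")
```

Because the count emitted per vertex is its traverser count rather than a constant 1, the same job also works after filters or multi-step traversals.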
  • 60. INPUT/OUTPUT FORMATS: SequenceFileInputFormat / SequenceFileOutputFormat
      A list of serialized vertex objects in a compressed binary format: <NullWritable,FaunusVertex>.
      The intermediate data format between MapReduce jobs within a Faunus pipeline.
      Fastest available format for both reading and writing.
      Compressed using variable-width and prefix encodings.
      gremlin> g
      ==>faunusgraph[graphsoninputformat->graphsonoutputformat]
      gremlin> g.setGraphOutputFormat(SequenceFileOutputFormat)
      ==>null
      gremlin> g
      ==>faunusgraph[graphsoninputformat->sequencefileoutputformat]
      gremlin>
  • 61. INPUT/OUTPUT FORMATS: GraphSONInputFormat / GraphSONOutputFormat
      A verbose JSON-based text format. Each vertex is a single JSON document.
      Easy for developers to generate. Useful for testing and examples.
      Limited to JSON-supported datatypes for element property values.
      {"name":"saturn","type":"titan","_id":0,"_inE":[{"_label":"father","_id":12,"_outV":1}]}
      {"name":"jupiter","type":"god","_id":1,"_outE":[{"_label":"lives","_id":13,"_inV":4},{"_label":"brother","_id":16,"_inV":3},{"_label":"brother","_id":14,"_inV":2},{"_label":"father","_id":12,"_inV":0}],"_inE":[{"_label":"brother","_id":17,"_outV":3},{"_label":"brother","_id":15,"_outV":2},{"_label":"father","_id":24,"_outV":7}]}
      {"name":"neptune","type":"god","_id":2,"_outE":[{"_label":"lives","_id":20,"_inV":5},{"_label":"brother","_id":19,"_inV":3},{"_label":"brother","_id":15,"_inV":1}],"_inE":[{"_label":"brother","_id":18,"_outV":3},{"_label":"brother","_id":14,"_outV":1}]}
      ...
      * JSON specification is available at http://json.org
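Since each vertex is one JSON document per line, the format can be read with any JSON library. The Python sketch below parses the saturn line from the slide; `parse_graphson_vertex` is a hypothetical helper, and it assumes (from the sample lines only) that keys starting with an underscore are reserved and all other keys are vertex properties.

```python
import json

def parse_graphson_vertex(line):
    """Split one GraphSON line into id, properties, out-edges, and in-edges."""
    doc = json.loads(line)
    # Assumption: keys without a leading underscore are user properties.
    properties = {k: v for k, v in doc.items() if not k.startswith("_")}
    out_edges = [(e["_label"], e["_inV"]) for e in doc.get("_outE", [])]
    in_edges = [(e["_label"], e["_outV"]) for e in doc.get("_inE", [])]
    return doc["_id"], properties, out_edges, in_edges

line = ('{"name":"saturn","type":"titan","_id":0,'
        '"_inE":[{"_label":"father","_id":12,"_outV":1}]}')
vertex_id, properties, out_edges, in_edges = parse_graphson_vertex(line)
```

Each line carries both the incoming and the outgoing edges of its vertex, which matches the adjacency-list representation described in the abstract: every edge appears once at its tail vertex and once at its head vertex.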
  • 62. INPUT/OUTPUT FORMATS: RDFInputFormat
      Maps popular RDF text formats to a property graph.
      Configurations allow for different mappings of RDF to the property graph model.
      Utilizes a MapReduce step to convert an edge-list into an adjacency list.
      faunus.graph.input.format=com.thinkaurelius.faunus.formats.edgelist.rdf.RDFInputFormat
      faunus.input.location=graph-example-1.ntriple
      faunus.graph.input.rdf.format=n-triples
      faunus.graph.input.rdf.as-properties=http://www.w3.org/1999/02/22-rdf-syntax-ns#type
      faunus.graph.input.rdf.use-localname=true
      faunus.graph.input.rdf.literal-as-property=true
      [Figure: the triple ex:marko foaf:age 33^^xsd:int becomes the property age:33 on vertex 0 (uri:ex:marko).]
      * RDF parsers provided by http://openrdf.org
  • 63. INPUT/OUTPUT FORMATS: RexsterInputFormat
      Rexster is a graph server that is accessed via REST and a Gremlin binary protocol. Rexster supports any Blueprints-enabled graph database.
      HTTP: http://.../vertices/1
      { "results": { "_type":"vertex", "_id":1, "name":"tiberius", "age":29 }, "queryTime":0.123 }
      REXPRO: g.v(1).out('mother').out('mother').name
      ==>aurelia
      http://rexster.tinkerpop.com
  • 64. INPUT/OUTPUT FORMATS: ScriptInputFormat / ScriptOutputFormat
      A Gremlin script stored in HDFS (distributed cache) allows for an arbitrary parse.
      ScriptInputFormat, reading lines such as:
      0:1,2,3,4
      1:2,3
      2:0,3,5,6
      3:1,2
      ...
      def boolean read(FaunusVertex v, String line) {
        parts = line.split(':');
        v.reuse(Long.valueOf(parts[0]))
        parts[1].split(',').each {
          v.addEdge(OUT, 'linkedTo', Long.valueOf(it));
        }
        return true;
      }
      ScriptOutputFormat, writing lines such as:
      0:1,2,3,4
      1:2,3
      2:0,3,5,6
      3:1,2
      ...
      def void write(FaunusVertex vertex, DataOutput output) {
        output.writeUTF(vertex.getId().toString() + ':');
        Iterator<Edge> itty = vertex.getEdges(OUT).iterator()
        while (itty.hasNext()) {
          output.writeUTF(itty.next().getVertex(IN).getId() + ',');
        }
        output.writeUTF('\n');
      }
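The line format handled by these scripts, a vertex id followed by a colon and a comma-separated list of adjacent ids, can be sketched in plain Python. This is not the Faunus script API; `read_line` and `write_line` are hypothetical stand-ins for the read and write functions on the slide.

```python
def read_line(line):
    """Parse 'id:n1,n2,...' into (vertex id, list of out-neighbor ids)."""
    vertex_id, _, targets = line.strip().partition(":")
    neighbors = [int(t) for t in targets.split(",") if t]
    return int(vertex_id), neighbors

def write_line(vertex_id, neighbors):
    """Serialize a vertex and its out-neighbors back to 'id:n1,n2,...'."""
    return f"{vertex_id}:" + ",".join(str(n) for n in neighbors)

lines = ["0:1,2,3,4", "1:2,3", "2:0,3,5,6", "3:1,2"]
parsed = [read_line(line) for line in lines]
round_trip = [write_line(vertex_id, neighbors) for vertex_id, neighbors in parsed]
```

Unlike the write script on the slide, which appends a comma after every neighbor, this sketch joins neighbors with commas so that the output matches the sample lines exactly.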
  • 65. Adam Jacobs. 2009. The Pathologies of Big Data. Communications of the ACM 52, 8 (August 2009), 36-44. doi:10.1145/1536616.1536632 http://doi.acm.org/10.1145/1536616.1536632
  • 66. GLOBAL VS. LOCAL GRAPH ANALYSIS
      [Figure: a row of vertex ids shown twice, once as a serial key/value data structure and once as an indexed key/indexed value data structure.]
  • 67. TITAN DISTRIBUTED GRAPH DATABASE
      [Figure: application servers reading/writing graph data; a Titan cluster processing Gremlin traversals and writes.]
      The biggest known Titan/Cassandra cluster to date: a graph of ~120 billion edges stored in a cluster of 16 hi1.4xlarge machines. Ego-centric graph traversals are requested by 80 m1.large machines. The cluster serves ~10,000 transactions per second with ~200ms return times.
      http://titan.thinkaurelius.com
      http://thinkaurelius.com/2013/05/13/educating-the-planet-with-pearson/
  • 68. FAUNUS AND TITAN: SUPPORTED TITAN INPUT/OUTPUT FORMATS
      TitanCassandraInputFormat
      TitanCassandraOutputFormat
      TitanHBaseInputFormat
      TitanHBaseOutputFormat
  • 69. FAUNUS AND TITAN: INTRA-CLUSTER CONFIGURATION
      [Figure: Faunus/Hadoop and Titan/Cassandra on the same cluster.]
      Data is processed on the machine where it is located. Limited network communication.
  • 70. FAUNUS AND TITAN INTER-CLUSTER CONFIGURATION Graph data is offloaded to another cluster. Repeated analysis does not interfere with production graph database.
  • 71. FAUNUS AND TITAN: VERTEX-CENTRIC COMPUTING WITH GREMLIN
      A Gremlin script is stored in HDFS (distributed cache).
      Vertex long ids are pulled out of Titan (FaunusVertex with id only).
      The Gremlin script is evaluated concurrently for every vertex long id.
      Guaranteed co-location of Gremlin script JVM and Titan vertex.
      Graph g
      long counter = 0
      def setup(args) {
        g = TitanFactory.open('cassandra:localhost')
      }
      def map(vertex, args) {
        g.v(vertex.id).as('x').out('father').out('father').linkIn('grandfather','x')
        if(counter++ % 1000 == 0)
          g.commit()
      }
      * Provided by the Gremlin script()-step
  • 72. CREDITS PRESENTED BY MARKO A. RODRIGUEZ SUPPORTED BY LOS ALAMOS NATIONAL LABORATORY LANL RESEARCH LIBRARY VRIJE UNIVERSITEIT BRUSSEL MANY THANKS TO MATTHIAS BRöCHELER STEPHEN MALLETTE PAVEL YASKEVICH DAN LAROCQUE AURELIUS COMMUNITY TINKERPOP COMMUNITY KETRINA YIM