A quick overview of the history, motivation, and uses of graph modeling and graph databases in various industries. Covers a brief introduction to graph databases with an emphasis on the Tinkerpop stack and Gremlin query language. These concepts are then solidified through a hands-on lab modeling a blog engine using Titan and Gremlin.
See more at http://allthingsgraphed.com.
4. Warren Weaver
• 17th - 19th century
• Problems of simplicity
• How one element interacts with another
• First half of 20th century
• Problem of disorganized complexity
• Many elements operating in a system without regard to how they interact with each other
• Predicted
• Problem of organized complexity
• Many elements operating in a system, taking into account how they interact with each other
• Would require computational power far beyond what was currently available
Science and Complexity (1948)
ENIAC (1946)
19. Types of Networks
Neuron network of a mouse
Millennium Simulation (2005): largest astronomical simulation ever on the structure and evolution of galaxies in the universe. 25 TB of data and 20 million galaxies.
20. Use Cases
• Recommendation engines (avoid relational N-JOIN or self-JOIN)
• Ranking/credibility (Google’s PageRank)
• Path finding (shortest, longest, mutual friends)
• Social (friendship, following, key connectors)
21. Graphs
• Node/Vertex: An entity that can have zero or more edges connected to it.
(Figure: three nodes labeled 1, 2, 3)
• Edge: An entity which connects two nodes. May be directed or undirected.
(Figure: an edge between nodes 1 and 2, and an edge between nodes A and B)
22. Adjacency Matrix
• If the graph is undirected, the adjacency matrix is symmetric
• Thus, the transpose of the matrix represents the same graph
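The symmetry claim can be checked with a small sketch. The three-node graph and the helper names below are illustrative assumptions, not taken from the slides:

```python
# Illustrative graph on 3 nodes with edges 0-1 and 1-2 (an assumption for this sketch).
edge_list = [(0, 1), (1, 2)]
node_count = 3


def adjacency_matrix(n, edges, directed=False):
    """Build an n x n 0/1 adjacency matrix from an edge list."""
    matrix = [[0] * n for _ in range(n)]
    for src, dst in edges:
        matrix[src][dst] = 1
        if not directed:
            matrix[dst][src] = 1  # an undirected edge is stored in both directions
    return matrix


def transpose(matrix):
    """Swap rows and columns."""
    return [list(row) for row in zip(*matrix)]


undirected = adjacency_matrix(node_count, edge_list)
directed = adjacency_matrix(node_count, edge_list, directed=True)

print(undirected == transpose(undirected))  # True: the matrix is symmetric
print(directed == transpose(directed))      # False: transposing reverses every edge
```

For a directed graph the transpose is generally a different graph (every edge reversed), which is why the symmetry property only holds in the undirected case.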
23. Adjacency Matrix
• Some graphs have different ‘types’ or dimensions of edges
24. Property Graphs
Vertex: id 1, name Ivan
Vertex: id 2, name Bob
Vertex: id 3, name Eve
Vertex: id 4, name Alice
Edge: id E1, type cousin, separation 1
Edge: id E2, type knows, since 2013-09-01
Edge: id E3, type knows, since 2013-09-01
Edge: id E4, type sibling, twins true
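A property graph like this one can be sketched with plain dictionaries. The vertex and edge properties come from the slide; which vertices each edge connects is not recoverable from the transcript, so the "out"/"in" endpoints below are assumptions made only for illustration, as is the helper name:

```python
# Vertex properties as listed on the slide.
vertices = {
    1: {"name": "Ivan"},
    2: {"name": "Bob"},
    3: {"name": "Eve"},
    4: {"name": "Alice"},
}

# Edge properties as listed on the slide. The "out"/"in" endpoints are
# ASSUMPTIONS: the original figure showing the connections is not available.
edges = {
    "E1": {"out": 1, "in": 2, "type": "cousin", "separation": 1},
    "E2": {"out": 3, "in": 2, "type": "knows", "since": "2013-09-01"},
    "E3": {"out": 2, "in": 4, "type": "knows", "since": "2013-09-01"},
    "E4": {"out": 3, "in": 4, "type": "sibling", "twins": True},
}


def neighbor_names(vertex_id, edge_type):
    """Names of vertices reached from vertex_id over outgoing edges of one type."""
    return sorted(
        vertices[edge["in"]]["name"]
        for edge in edges.values()
        if edge["out"] == vertex_id and edge["type"] == edge_type
    )


print(neighbor_names(3, "knows"))
print(neighbor_names(3, "sibling"))
```

The point of the model is that both vertices and edges carry arbitrary key/value properties, and the edge type acts as a label that a traversal can filter on.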
25. Traversals
• Breadth-first
• 3, 2, 4, 1
• Depth-first
• 3, 2, 1, 4
• Breadth-first and depth-first search can be combined.
• Filtering
• Ability to filter/sort paths in traversal
• Aggregating
• Ability to aggregate/count properties as traversal occurs and affect traversal with result of aggregation (e.g. power-grid load distribution)
• Backtracking
• Leave a marker in the traversal and come back to it when certain criteria are met in a later step
(Figure: example graph with nodes 1, 2, 3, 4)
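The two visiting orders listed above can be reproduced with a short sketch. The edge set is an assumption (3 connected to 2 and 4, 2 connected to 1), chosen because it is consistent with the orders on the slide; the original figure may contain other edges:

```python
from collections import deque

# ASSUMED undirected edges 1-2, 2-3, 3-4, stored as adjacency lists.
graph = {1: [2], 2: [3, 1], 3: [2, 4], 4: [3]}


def breadth_first(graph, start):
    """Visit nodes level by level using a FIFO queue."""
    visited = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.append(neighbor)
                queue.append(neighbor)
    return visited


def depth_first(graph, start, visited=None):
    """Follow each branch to its end before trying the next neighbor."""
    if visited is None:
        visited = []
    visited.append(start)
    for neighbor in graph[start]:
        if neighbor not in visited:
            depth_first(graph, neighbor, visited)
    return visited


print(breadth_first(graph, 3))  # [3, 2, 4, 1]
print(depth_first(graph, 3))    # [3, 2, 1, 4]
```

Breadth-first reaches both neighbors of node 3 before going deeper, while depth-first follows 3, 2, 1 to the end of that branch before returning for node 4.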
27. Tinkerpop
• A comprehensive, open-source graph framework (http://www.tinkerpop.com/)
• Blueprints: Property graph model that is DB agnostic. A kind of JDBC for graphs.
• Pipes: Data flow API for processing graphs. Underlying component for graph traversals.
• Gremlin: DSL for traversing property graphs. Implemented in JSR-223.
• Frames: Maps between domain objects and the graph’s nodes and edges. Like ORM for graphs.
• Furnace: Collection of common graph analysis algorithms for property graphs.
• Rexster: Exposes any Blueprints graph via a uniform RESTful API.
28. Tinkerpop Stack
• Different components all build on each other
• Provides abstraction from HTTP layer, to object mapping layer, to traversal scripting, to pluggable graph API
• Blueprints underpins the stack, making it all DB agnostic
• Blueprints implementations:
• Neo4j, Sail, OrientDB, Dex
• Implemented by 3rd parties: Accumulo, ArangoDB, Bitsy, FluxGraph, FoundationDB, InfiniteGraph, MongoDB, Oracle NoSQL, TitanDB
29. Tinkerpop - Rexster
• Provides REST and binary (RexPro - grizzly) protocols
• Flexible extension model (e.g. ad-hoc Gremlin queries)
• Server-side stored procedures (Gremlin)
• Browser-based interface (Dog House)
• Command-line tool for interacting with API
• Pluggable security
• SPARQL plugin to work against Sail graphs (OpenRDF)
• More information:
https://github.com/tinkerpop/rexster/wiki
30. Tinkerpop - Furnace
• Collection of industry-standard algorithms for traversing or analyzing graphs.
• Network generators (by clique or degree distribution)
• Search: A*, Breadth-first, Depth-first
• Shortest path
• Bellman-Ford (like Dijkstra’s but can handle negative edge weights)
• PageRank
• Degree Distribution
• More information:
https://github.com/tinkerpop/furnace/wiki
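To make the Bellman-Ford remark concrete, here is a minimal standalone sketch of the algorithm. It is not Furnace's implementation or API; the function name and the sample graph are assumptions for illustration:

```python
def bellman_ford(num_nodes, edges, source):
    """Shortest distances from source; edges are (src, dst, weight) triples."""
    distance = [float("inf")] * num_nodes
    distance[source] = 0
    # A shortest path uses at most num_nodes - 1 edges, so that many
    # rounds of relaxing every edge are enough.
    for _ in range(num_nodes - 1):
        for src, dst, weight in edges:
            if distance[src] + weight < distance[dst]:
                distance[dst] = distance[src] + weight
    # If anything still improves, the graph has a negative-weight cycle.
    for src, dst, weight in edges:
        if distance[src] + weight < distance[dst]:
            raise ValueError("graph contains a negative-weight cycle")
    return distance


# Illustrative directed graph with one negative edge (2 -> 1, weight -3).
weighted_edges = [(0, 1, 4), (0, 2, 5), (2, 1, -3), (1, 3, 2)]
print(bellman_ford(4, weighted_edges, 0))  # [0, 2, 5, 4]
```

The direct edge 0 -> 1 costs 4, but the detour through node 2 costs 5 - 3 = 2. A Dijkstra-style search would settle node 1 at distance 4 before it ever sees the negative edge, which is the case Bellman-Ford handles.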
31. Tinkerpop - Frames
More Information: https://github.com/tinkerpop/frames/wiki
32. Tinkerpop - Pipes
• Dataflow framework using process graphs.
• Each computational step becomes a node, and an edge is a communication channel between steps.
• Pipes are then chained and nested.
• Custom pipes can be created.
• Pipe types:
• Transform – emit transformation of object
• Dozens of different types of transforms
• Filter – decide whether to include/exclude object in traversal
• ~20 different types of filters
• sideEffect – include object but produce side-effect from it
• ~15 different types of sideEffects (e.g. group, count, table, tree)
• Branch – decide which step to take next in traversal
• Several different branching options
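The pipe types above can be imitated with Python generators. This is only an analogy for how chained pipes pass objects along, not the Pipes API; the function names are assumptions, and the sample records reuse names and ages that appear in the Gremlin examples later in this deck:

```python
def filter_pipe(source, predicate):
    """Filter: pass an object along only if the predicate accepts it."""
    for item in source:
        if predicate(item):
            yield item


def side_effect_pipe(source, action):
    """sideEffect: pass every object along unchanged, but act on it first."""
    for item in source:
        action(item)
        yield item


def transform_pipe(source, function):
    """Transform: emit a transformation of each incoming object."""
    for item in source:
        yield function(item)


people = [
    {"name": "marko", "age": 29},
    {"name": "vadas", "age": 27},
    {"name": "josh", "age": 32},
]

seen = []  # filled in by the side-effect step
pipeline = transform_pipe(
    side_effect_pipe(
        filter_pipe(iter(people), lambda person: person["age"] > 28),
        lambda person: seen.append(person["name"]),
    ),
    lambda person: person["name"].upper(),
)

result = list(pipeline)
print(result)  # ['MARKO', 'JOSH']
print(seen)    # ['marko', 'josh']
```

Each stage pulls objects lazily from the stage before it, so nothing is computed until the end of the chain is consumed.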
33. Tinkerpop - Blueprints
• Like JDBC but for graphs.
• Common API for Property Graphs which are very flexible
• Foundational component for Pipes, Gremlin, Frames, Furnace, and Rexster
• Supports transactions (if underlying DB engine does)
• Multi-threaded transactions supported
• Format readers/writers (GML, GraphML, GraphSON)
• More Information:
https://github.com/tinkerpop/blueprints/wiki
34. Tinkerpop - Gremlin
• Graph traversal scripting language.
• Works against the Blueprints API and is “compiled” into Pipes data-flows.
• Both native Java and Groovy (JSR-223) supported.
• Step library (https://github.com/tinkerpop/gremlin/wiki/Gremlin-Steps)
• Transform – emit transformation of object
• Dozens of different types of transforms
• Filter – decide whether to include/exclude object in traversal
• ~20 different types of filters
• sideEffect – include object but produce side-effect from it
• ~15 different types of sideEffects (e.g. group, count, table, tree)
• Branch – decide which step to take next in traversal
• Several different branching options
35. SQL → Gremlin (secret decoder ring)
• Get all users
SQL: select * from users
Gremlin: g.V('type', 'user').map()
• Get user names
SQL: select name from users
Gremlin: g.V('type', 'user').name
• Get user names/ages
SQL: select name, age from users
Gremlin: g.V('type', 'user').transform({ ['name': it.getProperty('name'), 'age': it.getProperty('age')] })
• Get distinct user ages
SQL: select distinct(age) from users
Gremlin: g.V('type', 'user').age.dedup()
• Get oldest user
SQL: select max(age) from users
Gremlin: g.V('type', 'user').age.max()
36. SQL → Gremlin (secret decoder ring)
• Select by equality
SQL: select * from users where age = 35
Gremlin: g.V('type', 'user').has('age', 35).map()
• Select by comparison
SQL: select * from users where age > 21
Gremlin: g.V('type', 'user').has('age', T.gt, 21).map()
• Select by multiple criteria
SQL: select * from users where sex = "M" and age > 25
Gremlin: g.V('type', 'user').has('age', T.gt, 25).has('sex', 'M').map()
• Order by age (switch 'a' and 'b' to do asc)
SQL: select * from users order by age desc
Gremlin: g.V('type', 'user').order({ it.b.getProperty('age') <=> it.a.getProperty('age') }).map()
• Paging
SQL: select * from users order by age desc limit 5 offset 5
Gremlin: g.V('type', 'user').order({ it.b.getProperty('age') <=> it.a.getProperty('age') })[5..<10].map()
37. SQL → Gremlin (secret decoder ring)
• Join
SQL: select users.* from users inner join groups on users.gId = groups.id where groups.name = "devs"
Gremlin: g.V('type', 'groups').has('name', 'devs').in('inGroup').map()
• Join-on-join-on-join …
SQL:
SELECT TOP (5) [t14].[ProductName]
FROM (SELECT COUNT(*) AS [value], [t13].[ProductName]
  FROM [customers] AS [t0]
  CROSS APPLY (SELECT [t9].[ProductName]
    FROM [orders] AS [t1]
    CROSS JOIN [order details] AS [t2]
    INNER JOIN [products] AS [t3] ON [t3].[ProductID] = [t2].[ProductID]
    CROSS JOIN [order details] AS [t4]
    INNER JOIN [orders] AS [t5] ON [t5].[OrderID] = [t4].[OrderID]
    LEFT JOIN [customers] AS [t6] ON [t6].[CustomerID] = [t5].[CustomerID]
    CROSS JOIN ([orders] AS [t7]
      CROSS JOIN [order details] AS [t8]
      INNER JOIN [products] AS [t9] ON [t9].[ProductID] = [t8].[ProductID])
    WHERE NOT EXISTS(SELECT NULL AS [EMPTY]
        FROM [orders] AS [t10]
        CROSS JOIN [order details] AS [t11]
        INNER JOIN [products] AS [t12] ON [t12].[ProductID] = [t11].[ProductID]
        WHERE [t9].[ProductID] = [t12].[ProductID]
          AND [t10].[CustomerID] = [t0].[CustomerID]
          AND [t11].[OrderID] = [t10].[OrderID])
      AND [t6].[CustomerID] <> [t0].[CustomerID]
      AND [t1].[CustomerID] = [t0].[CustomerID]
      AND [t2].[OrderID] = [t1].[OrderID]
      AND [t4].[ProductID] = [t3].[ProductID]
      AND [t7].[CustomerID] = [t6].[CustomerID]
      AND [t8].[OrderID] = [t7].[OrderID]) AS [t13]
  WHERE [t0].[CustomerID] = N'ALFKI'
  GROUP BY [t13].[ProductName]) AS [t14]
ORDER BY [t14].[value] DESC
Gremlin:
g.V('customerId','ALFKI').as('customer')
  .out('ordered').out('contains').out('is').as('products')
  .in('is').in('contains').in('ordered').except('customer')
  .out('ordered').out('contains').out('is').except('products')
  .groupCount().cap()
  .orderMap(T.decr)[0..<5]
  .productName
41. Tinkerpop - Gremlin
// get vertices known by marko
gremlin> g.v(1).outE('knows').inV
==>v[2]
==>v[4]
// get properties of vertices known by marko
gremlin> g.v(1).outE('knows').inV.map
==>{age=27, name=vadas}
==>{age=32, name=josh}
// filter by those older than 30
gremlin> g.v(1).outE('knows').inV.filter{it.age > 30}.map
==>{age=32, name=josh}
// just get name
gremlin> g.v(1).outE('knows').inV.filter{it.age > 30}.name
==>josh
// find nodes who 'know' someone older than 30
gremlin> g.V.as('x').outE('knows').inV.has('age', T.gt, 30).back('x').map
==>{age=29, name=marko}
42. Tinkerpop - Gremlin
// find edges with weight > .5
gremlin> g.E.filter{it.weight > 0.5}
==>e[10][4-created->5]
==>e[8][1-knows->4]
// find edges w/ weight > .5 from marko
gremlin> g.E.filter{it.weight > 0.5}.as('x').outV.has('name', T.eq, 'marko').back('x')
==>e[8][1-knows->4]
// find nodes 'created' by other nodes
gremlin> g.V.as('x').inE('created').back('x').map
==>{name=lop, lang=java}
==>{name=ripple, lang=java}
gremlin> g.E.filter{it.label == 'created'}.inV.dedup().map
==>{name=lop, lang=java}
==>{name=ripple, lang=java}
// find nodes 'created' by more than 1 node
gremlin> g.E.filter{it.label == 'created'}.inV.groupCount().cap()
==>{v[3]=3, v[5]=1}
// find nodes 'created' by marko's friends
gremlin> g.v(1).outE('knows').inV.outE('created').inV.map
==>{name=ripple, lang=java}
==>{name=lop, lang=java}
46. Titan Graph Database
• Optimized to work against billions of nodes and edges
• Theoretical limitation of 2^60 edges and half as many nodes (2^59)
• Works with several different distributed DBs including Cassandra and HBase
• Supports many concurrent users doing complex graph traversals simultaneously
• Native integration with Tinkerpop stack
• Supports integration with search technologies such as Lucene and Elasticsearch
• Created by Thinkaurelius (http://thinkaurelius.com/)
47. Titan Distributed Architecture
• TitanDB can integrate with distributed architectures in a few different ways
Native
• Connects remotely to cluster (or local)
• Can scale size as far as cluster can
• Native Titan API
• Possible processing bottleneck
Remote
• Put Rexster in front to allow RESTful access
• Connects remotely to cluster
• Can scale size as far as cluster can
• Possible processing bottleneck
Embedded
• TitanDB and Rexster run on each node in the cluster
• Can run on same JVM
• Considerable performance/scalability improvement
48. Titan Indexing
• Standard index
• Internal to Titan
• Very fast but only supports exact matches
• External index
• Use indexing engine external to Titan (Lucene or Elasticsearch)
• Supports range queries
• Lucene
• Limited to only one machine (small-sized datasets)
• Also has a richer set of search features (than Elasticsearch)
• Elasticsearch
• Distributed
• Not as feature-filled as Lucene
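The difference between an exact-match index and one that supports range queries can be illustrated with a simplified analogy. This is not how Titan, Lucene, or Elasticsearch implement their indexes; the data structures and function name are assumptions chosen to show why a hash-style lookup cannot answer range queries while a sorted structure can:

```python
import bisect

# Sample records reusing names and ages from the Gremlin examples in this deck.
users = [("marko", 29), ("vadas", 27), ("josh", 32)]

# Exact-match index: a hash map from value to names. Very fast for equality,
# but it has no notion of ordering, so "age > 28" would need a full scan.
exact_index = {}
for name, age in users:
    exact_index.setdefault(age, []).append(name)

# Range-capable index: entries kept sorted by age, searched with binary search.
sorted_entries = sorted((age, name) for name, age in users)
sorted_ages = [age for age, _ in sorted_entries]


def names_with_age_between(low, high):
    """Names with low <= age <= high, found by binary search on the sorted ages."""
    start = bisect.bisect_left(sorted_ages, low)
    end = bisect.bisect_right(sorted_ages, high)
    return [name for _, name in sorted_entries[start:end]]


print(exact_index.get(27, []))
print(names_with_age_between(28, 32))
```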
49. Distributed Titan Limitations/Gotchas
• Limitations which are present but which are scheduled to be remedied
• Property indexes must be created before the property is ever used
• Unable to drop indices
• Types cannot be changed once created
• Gotchas
• Multiple graphs on the same backend require specific configurations per graph
• Ghost vertices – certain concurrency circumstances can leave traces of vertices. Recommendation is to allow this and periodically clean them up
56. “Bloggie Blog” Requirements
• Create users, posts, and comments
• Retrieve all posts for a user
• Retrieve posts by time range
• Retrieve all comments for a user
• Retrieve all comments for a post, sorted by vote
• Retrieve the top N posts, sorted by vote
• User can only vote *once* on a post or comment
57. Get Cassandra Titan
• https://github.com/thinkaurelius/titan/wiki/Downloads (0.3.2 stable)
$ $TITAN_LOCATION/bin/gremlin.sh
         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin> g = new TinkerGraph();
==>tinkergraph[vertices:0 edges:0]
gremlin>
58. Modeling Entities (User, Post, Comment)
• There’s no one way to model this.
• General rules to follow:
• 1-N relationships can be modeled as one node with N edges pointing to other nodes
• 1-1 relationships can be modeled as a simple edge between two nodes
• M-N relationships are just more edges
• It is important to categorize the different types of edges since many different types of edges will connect to a single node
• Don’t shy away from attaching properties to edges. Remember that edges are just as queryable as nodes.
• A common practice is to model “actions” as edges and “actors”/“artifacts” as nodes
• Denormalize to minimize traversals
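One possible way to apply these rules to the blog requirements is sketched below with a tiny in-memory graph. The `Graph` class, the edge labels ("posted", "commented", "commentOn"), and the property names are all hypothetical choices for illustration, not the model used in the lab:

```python
class Graph:
    """Minimal in-memory property graph (illustrative only)."""

    def __init__(self):
        self.nodes = {}
        self.edges = []
        self._next_id = 0

    def add_node(self, kind, **props):
        """Actors and artifacts (user, post, comment) become nodes."""
        node_id = self._next_id
        self._next_id += 1
        self.nodes[node_id] = {"type": kind, **props}
        return node_id

    def add_edge(self, out_id, label, in_id, **props):
        """Actions become labeled edges, which may carry their own properties."""
        self.edges.append({"out": out_id, "label": label, "in": in_id, **props})

    def out(self, node_id, label):
        """Follow outgoing edges with the given label."""
        return [e["in"] for e in self.edges
                if e["out"] == node_id and e["label"] == label]


g = Graph()
alice = g.add_node("user", email="alice@example.com")
first = g.add_node("post", title="Hello graphs")
second = g.add_node("post", title="More graphs")
comment = g.add_node("comment", text="Nice post")

# 1-N: one user node with N "posted" edges; the date lives on the edge.
g.add_edge(alice, "posted", first, date=1)
g.add_edge(alice, "posted", second, date=2)
g.add_edge(alice, "commented", comment, date=3)
g.add_edge(comment, "commentOn", first)

titles = [g.nodes[post]["title"] for post in g.out(alice, "posted")]
print(titles)
```

Because every edge has a label, "all posts for a user" and "all comments for a user" are separate one-step traversals from the same user node.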
65. User Can Only Vote Once
• Could enforce using external unique indexes
• Or do 2-step incrementing in Gremlin (small chance of dups)
gremlin> user = g.v(0); post = g.v(1);
  // only add the vote edge if this user has not voted on this post yet
  if (post.inE('postVote').outV.has('email', user.email).count() == 0) {
    g.addEdge(user, post, 'postVote', [date: new Date().getTime()]);
    // keep a denormalized vote count on the post
    if (post.getProperty('votes') != null) {
      post.votes++;
    } else {
      post.votes = 1;
    }
  }
==>1
gremlin> // same command as above
==>null
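The check-then-increment logic can also be sketched outside the graph. This is a simplified stand-in, assuming a set of (user, target) pairs plays the role of the 'postVote' edges; the function and variable names are hypothetical:

```python
def vote(existing_votes, vote_counts, user, target):
    """Record a vote unless this user already voted on this target.

    Returns the new vote count, or None if the vote was rejected.
    """
    if (user, target) in existing_votes:   # step 1: does the vote edge exist?
        return None
    existing_votes.add((user, target))     # step 2: add the edge...
    vote_counts[target] = vote_counts.get(target, 0) + 1  # ...and bump the count
    return vote_counts[target]


existing_votes = set()
vote_counts = {}

first_try = vote(existing_votes, vote_counts, "user0", "post1")
second_try = vote(existing_votes, vote_counts, "user0", "post1")
print(first_try)   # 1
print(second_try)  # None
```

As on the slide, the check and the update are two separate steps, so two concurrent requests could both pass the check before either writes. That is the small chance of duplicates that a unique index would rule out.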
67. Areas Not Covered
• Map/Reduce
• Gremlin has its own built-in M/R API
• Indexing
• Titan currently has limitation requiring all indexes are created up-front
• Integration with other backends
• HBase, Oracle Berkeley DB, Hazelcast, Persistit
• Detailed full-text search through external indexes
• Graph analytics engine (Faunus)
• Deep dive into Gremlin query language and Groovy
• Seriously, there’s a TON there.