Finding Top-k Similar Graphs in Graph Database @ ReadingCircle

M1 Ishikawa Yasutaka

About this paper
A paper in the field of graph theory
 Topic: graph similarity queries
 Proposes a new technique that gives more accurate answers and reduces computational cost
Published in the Proceedings of the 15th International Conference on Extending Database Technology (EDBT '12)
 Authors: Yuanyuan Zhu, Lu Qin, Jeffrey Xu Yu, Hong Cheng

Outline
1. Background of graph theory
2. Introduction
3. Problem statement
4. The framework
5. Pruning without indexing
6. Pruning with indexing
7. Performance studies
8. Conclusion

What is “graph”?
A graph is denoted by 𝑔 = (𝑉, 𝐸, 𝑙)
 𝑉 is the set of vertices
 𝐸 ⊆ 𝑉 × 𝑉 is the set of edges
 𝑙 is a labeling function, 𝑙: 𝑉 → Σ
 Σ is the set of labels
In this paper, the edges of a graph are unweighted

Subgraph・Supergraph
Given two graphs 𝑔 and 𝑔′, if 𝑔 ⊂ 𝑔′, then
 𝑔 is a subgraph of 𝑔′
 𝑔′ is a supergraph of 𝑔
[Figure: example of a supergraph and one of its subgraphs]

Maximum Common Subgraph
If 𝑔 is a common subgraph of 𝑔1 and 𝑔2 and there is no other common subgraph 𝑔′ of 𝑔1 and 𝑔2 such that |𝐸(𝑔′)| > |𝐸(𝑔)|, then 𝑔 is a maximum common subgraph (MCS) of the two graphs
Computing an MCS is NP-hard
[Figure: example graphs and their MCS]

Bipartite graph
A graph whose vertices can be divided into two disjoint sets 𝑈 and 𝑉
 𝑈 and 𝑉 are each independent sets (no edge joins two vertices of the same set)
[Figure: bipartite graph with vertex sets 𝑈 and 𝑉]

Matching of bipartite graph
An edge set 𝑀 is called a matching if no two edges in 𝑀 share a vertex
[Figure: a matching between vertex sets 𝑈 and 𝑉]
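The matching condition can be checked mechanically. The following is a minimal sketch, assuming edges are given as (u, v) pairs with u taken from 𝑈 and v from 𝑉; the function name `is_matching` and the vertex names are hypothetical and used only for illustration.

```python
def is_matching(edges):
    """Return True if no two edges in `edges` share an endpoint."""
    used = set()
    for u, v in edges:
        if u in used or v in used:
            return False  # this edge touches an already matched vertex
        used.add(u)
        used.add(v)
    return True


# Hypothetical vertex names: u* belong to U, v* belong to V.
print(is_matching([("u1", "v1"), ("u2", "v2")]))  # True
print(is_matching([("u1", "v1"), ("u1", "v2")]))  # False, u1 is used twice
```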


Graph query processing(1)
A graph is used as the query against a graph database
Graph queries have attracted much attention in recent years
 Image retrieval
 Chemical compound structure search
[Figure: a query graph is issued against the graph DB, which returns the result graphs]

Graph query processing(2)
Existing work mainly falls into two categories
 Subgraph containment search
Identify the set of graphs that contain the query graph
 Supergraph containment search
Identify the set of graphs that are contained in the query graph
Besides exact subgraph/supergraph containment queries, some studies allow a small number of edges or nodes to be missing in the query result
→ Graph similarity search is important

Graph similarity search
The main theme of this paper
Compute the similarity between a query graph and each graph in the database
 The “top-k similar graphs” are the k graphs that are most similar to the query graph
[Figure: a query graph and its top-3 similar graphs, ranked 1 to 3]

Existing graph similarity search(1)
Related work covers two kinds of graph similarity search
 Subgraph similarity search
H. Shang, X. Lin, Y. Zhang, J. X. Yu, and W. Wang. Connected substructure similarity search. In SIGMOD, pages 903–914, 2010.
X. Yan, P. Yu, and J. Han. Substructure similarity search in graph databases. In SIGMOD, pages 766–777, 2005.
 Supergraph similarity search
H. Shang, K. Zhu, X. Lin, Y. Zhang, and R. Ichise. Similarity search on supergraph containment. In ICDE, pages 637–648, 2010.
To compute similarity, a distance between graphs must be defined: 𝑑𝑖𝑠𝑡(𝑞, 𝑔)

Existing graph similarity search(2)
Subgraph similarity search
 𝑑𝑖𝑠𝑡(𝑞, 𝑔) = |𝐸(𝑞)| − |𝐸(𝑚𝑐𝑠(𝑞, 𝑔))|
Supergraph similarity search
 𝑑𝑖𝑠𝑡(𝑞, 𝑔) = |𝐸(𝑔)| − |𝐸(𝑚𝑐𝑠(𝑞, 𝑔))|
※ These definitions of 𝑑𝑖𝑠𝑡(𝑞, 𝑔) do not satisfy the axioms of a metric space, because they are not symmetric
 𝑑𝑖𝑠𝑡(𝑞, 𝑔) ≠ 𝑑𝑖𝑠𝑡(𝑔, 𝑞) whenever |𝐸(𝑞)| ≠ |𝐸(𝑔)|
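The asymmetry is easy to see numerically. The sketch below works only with edge counts (it does not compute an MCS); the function names are hypothetical, and the counts |𝐸(𝑞)| = 7, |𝐸(𝑔1)| = 3, |𝐸(𝑚𝑐𝑠(𝑞, 𝑔1))| = 2 are the ones that appear in the worked example on the following slides.

```python
def subgraph_dist(num_edges_query, num_edges_mcs):
    """Subgraph similarity distance: |E(query)| - |E(mcs)|."""
    return num_edges_query - num_edges_mcs


def supergraph_dist(num_edges_data, num_edges_mcs):
    """Supergraph similarity distance: |E(data graph)| - |E(mcs)|."""
    return num_edges_data - num_edges_mcs


E_Q, E_G1, E_MCS = 7, 3, 2  # edge counts taken from the slides' example

# Swapping the roles of query and data graph changes the subgraph distance.
print(subgraph_dist(E_Q, E_MCS))   # q as the query:  7 - 2 = 5
print(subgraph_dist(E_G1, E_MCS))  # g1 as the query: 3 - 2 = 1
```

The MCS itself is the same in both directions, so the two values differ exactly by |𝐸(𝑞)| − |𝐸(𝑔1)|.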

Ex:existing similarity search(1)
Query 𝑞 and a sample graph database 𝐷 = {𝑔1, 𝑔2, 𝑔3}
Bold edges mark the MCS of 𝑞 and each 𝑔
[Figure: query 𝑞 and the database graphs 𝑔1, 𝑔2, 𝑔3 with labeled nodes]

If we use the subgraph query distance, 𝑑𝑖𝑠𝑡(𝑞, 𝑔) = |𝐸(𝑞)| − |𝐸(𝑚𝑐𝑠(𝑞, 𝑔))|, 𝑔3 is returned as the answer
𝑑𝑖𝑠𝑡(𝑞, 𝑔3) = 7 − 6 = 1
[Figure: query 𝑞 and the database graphs 𝑔1, 𝑔2, 𝑔3 with labeled nodes]

If we use the supergraph query distance, 𝑑𝑖𝑠𝑡(𝑞, 𝑔) = |𝐸(𝑔)| − |𝐸(𝑚𝑐𝑠(𝑞, 𝑔))|, 𝑔1 is returned as the answer
𝑑𝑖𝑠𝑡(𝑞, 𝑔1) = 3 − 2 = 1
[Figure: query 𝑞 and the database graphs 𝑔1, 𝑔2, 𝑔3 with labeled nodes]

From the user's perspective, however, the best answer should be 𝑔2
Neither of these ways of computing 𝑑𝑖𝑠𝑡 gives a good result
[Figure: query 𝑞 and the database graphs 𝑔1, 𝑔2, 𝑔3 with labeled nodes]

Main contributions of this paper
1. Studying top-k graph similarity query processing based on a new MCS-based similarity measure
2. Deriving several distance lower bounds (without and with an index) to reduce the number of MCS computations
3. Conducting extensive performance studies on a real dataset to test the performance of the algorithms


Definitions(1)
In this paper, 𝑑𝑖𝑠𝑡(𝑞, 𝑔) is defined as
𝑑𝑖𝑠𝑡(𝑞, 𝑔) = |𝐸(𝑞)| + |𝐸(𝑔)| − 2 × |𝐸(𝑚𝑐𝑠(𝑞, 𝑔))|
※ This 𝑑𝑖𝑠𝑡(𝑞, 𝑔) (maybe) satisfies the axioms of a metric space
 𝑥 = 𝑦 ⇔ 𝑑𝑖𝑠𝑡(𝑥, 𝑦) = 0
 𝑑𝑖𝑠𝑡(𝑦, 𝑥) = 𝑑𝑖𝑠𝑡(𝑥, 𝑦)
 𝑑𝑖𝑠𝑡(𝑥, 𝑦) ≥ 0
 𝑑𝑖𝑠𝑡(𝑥, 𝑦) + 𝑑𝑖𝑠𝑡(𝑦, 𝑧) ≥ 𝑑𝑖𝑠𝑡(𝑥, 𝑧)
These properties become important later, when the triangle inequality is used for pruning
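The distance formula is plain arithmetic on three edge counts once the MCS size is known. A minimal sketch, assuming the MCS size has already been computed elsewhere (the function name `mcs_distance` is hypothetical); the numbers are the ones used in the worked example on the following slides.

```python
def mcs_distance(num_edges_q, num_edges_g, num_edges_mcs):
    """dist(q, g) = |E(q)| + |E(g)| - 2 * |E(mcs(q, g))|."""
    return num_edges_q + num_edges_g - 2 * num_edges_mcs


# Values from the slides' example (q has 12 edges).
print(mcs_distance(12, 6, 6))    # dist(q, g1) = 6
print(mcs_distance(12, 12, 10))  # dist(q, g2) = 4, disconnected MCS allowed
print(mcs_distance(12, 12, 5))   # dist(q, g2) = 14, connected MCS required
```

Because the formula treats |𝐸(𝑞)| and |𝐸(𝑔)| identically, swapping the two graphs gives the same value, which is the symmetry axiom listed above.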

Definition(2)
In this paper, the MCS of two graphs is allowed to be disconnected
 It can potentially capture more common substructures of the two graphs
 It can also evaluate the structural similarity of two graphs more globally

Ex:𝒅𝒊𝒔𝒕(𝒒, 𝒈) of this paper(1)
Query 𝑞 and a sample graph database 𝐷 = {𝑔1, 𝑔2}
Bold edges mark the edges common to 𝑞 and each 𝑔
[Figure: query 𝑞 and the database graphs 𝑔1 and 𝑔2 with labeled nodes]

If we require the MCS to be connected, 𝑔1 is returned as the answer
 𝑑𝑖𝑠𝑡(𝑞, 𝑔1) = 12 + 6 − 2 × 6 = 6
 𝑑𝑖𝑠𝑡(𝑞, 𝑔2) = 12 + 12 − 2 × 5 = 14
[Figure: query 𝑞 and the database graphs 𝑔1 and 𝑔2 with labeled nodes]

If we allow the MCS to be disconnected, 𝑔2 is returned as the answer
 𝑑𝑖𝑠𝑡(𝑞, 𝑔1) = 12 + 6 − 2 × 6 = 6
 𝑑𝑖𝑠𝑡(𝑞, 𝑔2) = 12 + 12 − 2 × 10 = 4
𝑔2 is the result users would want
[Figure: query 𝑞 and the database graphs 𝑔1 and 𝑔2 with labeled nodes]


Pruning strategy
As mentioned previously, computing the MCS is an NP-hard problem
In this paper, the authors derive lower bounds on the distance to reduce the number of MCS computations
 They do not make the MCS computation itself faster
If 𝑑𝑖𝑠𝑡(𝑞, 𝑔), or any lower bound of it, is no less than the largest distance among the current top-k answers, 𝑔 is not a top-k answer and can be pruned safely

Base algorithm(1)
The algorithm uses a max-heap Α and a min-heap ℋ

Base algorithm(2)
If the lower bound of 𝑑𝑖𝑠𝑡(𝑞, 𝑔) is smaller than the top value of the current top-k answers, the exact 𝑑𝑖𝑠𝑡(𝑞, 𝑔) is computed and compared with the current top value again
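The pruning loop can be sketched with Python's `heapq`. This is an illustrative reading of the slides, not necessarily the authors' exact procedure: it assumes the min-heap holds candidates keyed by a lower bound and the max-heap holds the current top-k answers keyed by exact distance. The "graphs" are plain integers and both distance functions are toy stand-ins, so that the example runs without an MCS solver.

```python
import heapq


def top_k_similar(query, database, k, lower_bound, exact_dist):
    """Return (answers, n_exact): the k nearest items as (dist, item) pairs
    and the number of exact distance computations that were needed."""
    # Min-heap H: candidates ordered by their lower bound.
    H = [(lower_bound(query, g), i) for i, g in enumerate(database)]
    heapq.heapify(H)
    A = []  # max-heap of current answers, stored as (-dist, index)
    n_exact = 0
    while H:
        lb, i = heapq.heappop(H)
        if len(A) == k and lb >= -A[0][0]:
            break  # all remaining candidates have an even larger lower bound
        d = exact_dist(query, database[i])  # the expensive step
        n_exact += 1
        if len(A) < k:
            heapq.heappush(A, (-d, i))
        elif d < -A[0][0]:
            heapq.heapreplace(A, (-d, i))
    answers = sorted((-neg_d, database[i]) for neg_d, i in A)
    return answers, n_exact


# Toy stand-ins (assumptions): |q - g| as the distance, half of it as a bound.
def toy_exact(q, g):
    return abs(q - g)


def toy_lower_bound(q, g):
    return abs(q - g) // 2


answers, n_exact = top_k_similar(10, [1, 4, 9, 15, 22, 30], 2,
                                 toy_lower_bound, toy_exact)
print(answers, n_exact)  # [(1, 9), (5, 15)] 4
```

In this toy run two of the six candidates are pruned without an exact distance computation; a tighter lower bound would prune more.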


Edge frequency based lower bound
Finding a lower bound of 𝑑𝑖𝑠𝑡(𝑞, 𝑔) is equivalent to finding an upper bound of |𝐸(𝑚𝑐𝑠(𝑞, 𝑔))|
Denote the set of distinct edges in 𝑔 as 𝐸_𝑑(𝑔) (an edge is identified by the labels of its two endpoints)
Denote the frequency of edge 𝑒 in 𝑔 as 𝑓(𝑒, 𝑔)
𝑒𝑚𝑐𝑠1(𝑞, 𝑔) = ∑_{𝑒 ∈ 𝐸_𝑑(𝑞) ∪ 𝐸_𝑑(𝑔)} min{𝑓(𝑒, 𝑞), 𝑓(𝑒, 𝑔)}
𝑑𝑖𝑠𝑡1(𝑞, 𝑔) = |𝐸(𝑞)| + |𝐸(𝑔)| − 2 × 𝑒𝑚𝑐𝑠1(𝑞, 𝑔)
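The edge frequency bound maps directly onto multiset intersection. A minimal sketch using `collections.Counter`, assuming a graph is given as a dict of node labels plus a list of (u, v) edges; all function names are hypothetical. The frequencies for 𝑞 are the ones quoted on the next slide, while those for 𝑔1 are an assumption chosen to be consistent with that slide's totals.

```python
from collections import Counter


def edge_type_counts(labels, edges):
    """Count edges by the sorted pair of their endpoint labels."""
    return Counter(tuple(sorted((labels[u], labels[v]))) for u, v in edges)


def emcs1(counts_q, counts_g):
    """Sum over edge types of min(f(e, q), f(e, g))."""
    # Counter & Counter keeps the minimum count of every key.
    return sum((counts_q & counts_g).values())


def dist1(counts_q, counts_g):
    """|E(q)| + |E(g)| - 2 * emcs1(q, g)."""
    total = sum(counts_q.values()) + sum(counts_g.values())
    return total - 2 * emcs1(counts_q, counts_g)


q = Counter({("A", "C"): 4, ("B", "C"): 3, ("C", "C"): 6})   # 13 edges
g1 = Counter({("A", "C"): 4, ("B", "C"): 3, ("C", "C"): 5})  # 12 edges, assumed split
print(emcs1(q, g1), dist1(q, g1))  # 12 1
```

Only label pairs are compared, so the bound is cheap to compute but ignores how the edges are connected.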

Ex:using the 𝒅𝒊𝒔𝒕𝟏(𝒒, 𝒈) (1)
In 𝑞, the frequencies of the edges (A,C), (B,C), (C,C) are 4, 3, 6
𝑒𝑚𝑐𝑠1(𝑞, 𝑔1) = 4 + 3 + 5 = 12
𝑑𝑖𝑠𝑡1(𝑞, 𝑔1) = 13 + 12 − 2 × 12 = 1
[Figure: query 𝑞 and the database graphs with nodes labeled A, B, C]

Ex:using the 𝒅𝒊𝒔𝒕𝟏(𝒒, 𝒈) (2)
𝑒𝑚𝑐𝑠1(𝑞, 𝑔2) = 3 + 3 + 6 = 12
𝑑𝑖𝑠𝑡1(𝑞, 𝑔2) = 13 + 13 − 2 × 12 = 2
In fact, these lower bounds are not tight compared with the actual 𝑑𝑖𝑠𝑡
[Figure: query 𝑞 and the database graphs with nodes labeled A, B, C]

Adjacency List Based Lower Bound(1)
Construct a bipartite graph 𝐵(𝑞, 𝑔)
For each pair of nodes 𝑢 ∈ 𝑉(𝑞) and 𝑣 ∈ 𝑉(𝑔), there is an edge between 𝑏(𝑢) and 𝑏(𝑣) if 𝑙(𝑢) = 𝑙(𝑣)
𝐿(𝑎𝑑𝑗(𝑢)) is the multiset of the labels of all nodes adjacent to 𝑢
Example: if the nodes adjacent to 𝑢 are labeled A, A and B, then 𝐿(𝑎𝑑𝑗(𝑢)) = {𝐴, 𝐴, 𝐵}

Adjacency List Based Lower Bound(2)
The weight of an edge is defined as 𝑤(𝑏(𝑢), 𝑏(𝑣)) = |𝐿(𝑎𝑑𝑗(𝑢)) ∩ 𝐿(𝑎𝑑𝑗(𝑣))|
𝑀(𝑞, 𝑔) is the maximum weighted bipartite matching of 𝐵(𝑞, 𝑔)
𝑒𝑚𝑐𝑠2(𝑞, 𝑔) = (1/2) × ∑_{(𝑏(𝑢), 𝑏(𝑣)) ∈ 𝑀(𝑞, 𝑔)} 𝑤(𝑏(𝑢), 𝑏(𝑣))
𝑑𝑖𝑠𝑡2(𝑞, 𝑔) = |𝐸(𝑞)| + |𝐸(𝑔)| − 2 × 𝑒𝑚𝑐𝑠2(𝑞, 𝑔)
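For very small graphs the adjacency list bound can be computed by brute force. The sketch below tries every assignment of the smaller node set to the larger one instead of running a proper maximum weight matching algorithm, so it is only meant to illustrate the definition; the function names and the two toy graphs are assumptions, not taken from the slides.

```python
from collections import Counter
from itertools import permutations


def adjacent_label_multisets(labels, edges):
    """Map each node u to L(adj(u)), the multiset of its neighbours' labels."""
    adj = {u: Counter() for u in labels}
    for u, v in edges:
        adj[u][labels[v]] += 1
        adj[v][labels[u]] += 1
    return adj


def emcs2(q_labels, q_edges, g_labels, g_edges):
    """Half the weight of a maximum weight matching in B(q, g), by brute force."""
    adj_q = adjacent_label_multisets(q_labels, q_edges)
    adj_g = adjacent_label_multisets(g_labels, g_edges)

    def weight(u, v):  # u is a node of q, v is a node of g
        if q_labels[u] != g_labels[v]:
            return 0  # no edge in B(q, g); weight 0 equals leaving u unmatched
        return sum((adj_q[u] & adj_g[v]).values())

    small, large = list(q_labels), list(g_labels)
    swapped = len(small) > len(large)
    if swapped:
        small, large = large, small
    best = 0
    for chosen in permutations(large, len(small)):
        total = sum(weight(t, s) if swapped else weight(s, t)
                    for s, t in zip(small, chosen))
        best = max(best, total)
    return best / 2


def dist2(q_labels, q_edges, g_labels, g_edges):
    return len(q_edges) + len(g_edges) - 2 * emcs2(q_labels, q_edges, g_labels, g_edges)


# Toy graphs: q has two separate edges A-C and C-B, g is the path A-C-B.
q_labels, q_edges = {0: "A", 1: "C", 2: "C", 3: "B"}, [(0, 1), (2, 3)]
g_labels, g_edges = {0: "A", 1: "C", 2: "B"}, [(0, 1), (1, 2)]
print(emcs2(q_labels, q_edges, g_labels, g_edges))  # 1.5
print(dist2(q_labels, q_edges, g_labels, g_edges))  # 1.0
```

In this toy case the edge frequency bound would give a distance of 0, because both graphs contain one A-C and one B-C edge, while the adjacency list bound gives 1 because the single C node of g can be matched to only one of the two C nodes of q.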


Ex:using the 𝒅𝒊𝒔𝒕𝟐(𝒒, 𝒈) (1)
𝑒𝑚𝑐𝑠2(𝑞, 𝑔1) = (2 + 2 + 2 + 1) ÷ 2 = 3.5
𝑑𝑖𝑠𝑡2(𝑞, 𝑔1) = 4 + 5 − 2 × 3.5 = 2
[Figure: 𝑞, 𝑔1 and the bipartite graph 𝐵(𝑞, 𝑔1); the matched edges have weights 2, 2, 2, 1]

If we use 𝑒𝑚𝑐𝑠1 instead, 𝑒𝑚𝑐𝑠1(𝑞, 𝑔1) = 1 + 1 + 1 + 1 = 4
𝑑𝑖𝑠𝑡1(𝑞, 𝑔1) = 4 + 5 − 2 × 4 = 1
[Figure: same example as on the previous slide]

Given two graphs 𝑞 and 𝑔, we have 𝑑𝑖𝑠𝑡2(𝑞, 𝑔) ≥ 𝑑𝑖𝑠𝑡1(𝑞, 𝑔), so 𝑑𝑖𝑠𝑡2 is the tighter lower bound
[Figure: same example as on the previous slide]

Algorithm using 𝒅𝒊𝒔𝒕𝟏, 𝒅𝒊𝒔𝒕𝟐
Computational cost: 𝑑𝑖𝑠𝑡 > 𝑑𝑖𝑠𝑡2 > 𝑑𝑖𝑠𝑡1
Use 𝑑𝑖𝑠𝑡1 as much as possible


Triangle property of distance
Given three graphs 𝑔1, 𝑔2, 𝑔3, 𝑑𝑖𝑠𝑡(𝑔1, 𝑔3) ≤ 𝑑𝑖𝑠𝑡(𝑔1, 𝑔2) + 𝑑𝑖𝑠𝑡(𝑔2, 𝑔3)
 If 𝑔2 and 𝑔3 are very near, 𝑑𝑖𝑠𝑡(𝑔1, 𝑔2) ≈ 𝑑𝑖𝑠𝑡(𝑔1, 𝑔3)
If we know 𝑑𝑖𝑠𝑡(𝑔, 𝑔′), we can compute the following lower bounds
 𝑑𝑖𝑠𝑡3(𝑞, 𝑔 | 𝑔′) = 𝑑𝑖𝑠𝑡(𝑞, 𝑔′) − 𝑑𝑖𝑠𝑡(𝑔, 𝑔′)
 𝑑𝑖𝑠𝑡4(𝑞, 𝑔 | 𝑔′) = 𝑑𝑖𝑠𝑡(𝑞, 𝑔′) − 𝑑𝑖𝑠𝑡(𝑔, 𝑔′)
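The idea behind these bounds follows from the triangle inequality and symmetry alone: for any reference graph 𝑔′, 𝑑𝑖𝑠𝑡(𝑞, 𝑔) ≥ |𝑑𝑖𝑠𝑡(𝑞, 𝑔′) − 𝑑𝑖𝑠𝑡(𝑔, 𝑔′)|. The sketch below illustrates this general bound rather than the paper's exact 𝑑𝑖𝑠𝑡3 and 𝑑𝑖𝑠𝑡4. As an assumption, the size of the symmetric difference of two sets stands in for the graph distance, since it is a genuine metric and needs no MCS solver; all names are hypothetical.

```python
def sym_diff_distance(a, b):
    """Toy metric: number of elements that belong to exactly one of the sets."""
    return len(a ^ b)


def triangle_lower_bound(dist_q_ref, dist_g_ref):
    """Lower bound on dist(q, g) from the distances to a shared reference."""
    return abs(dist_q_ref - dist_g_ref)


q = frozenset({1, 2, 3})
ref = frozenset({7, 8, 9})    # plays the role of g'
g = frozenset({7, 8, 9, 10})

bound = triangle_lower_bound(sym_diff_distance(q, ref),
                             sym_diff_distance(g, ref))
print(bound, sym_diff_distance(q, g))  # 5 7
```

Here 𝑑𝑖𝑠𝑡(𝑔, 𝑔′) would come from a precomputed table, so the bound costs one exact distance computation for the reference instead of one per graph.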

Indexing
𝑑𝑖𝑠𝑡(𝑔, 𝑔′) can be precomputed
 But computing it for all pairs requires 𝑂(|𝐷|²) MCS computations
Define a set of groups 𝐼 = {𝐺1, 𝐺2, …, 𝐺|𝐼|}, where 𝐺𝑖 ⊆ 𝐷 and 𝐺1 ∪ 𝐺2 ∪ ⋯ ∪ 𝐺|𝐼| = 𝐷
 Each group has a center graph 𝑐𝑖 ∈ 𝐺𝑖
 Precompute 𝑑𝑖𝑠𝑡(𝑔, 𝑐𝑖) for each 𝑔 ∈ 𝐺𝑖
[Figure: graphs 𝑔1 to 𝑔7 arranged into groups 𝐺1 and 𝐺2]

Algorithm using 𝒅𝒊𝒔𝒕𝟑, 𝒅𝒊𝒔𝒕𝟒, index
Once the real 𝑑𝑖𝑠𝑡(𝑞, 𝑔) is obtained, use it to update the lower bounds on 𝑑𝑖𝑠𝑡

Three indexing strategies(1)
DPIndex
 Given a number 𝑚, randomly pick 𝑚 graphs as the centers of 𝑚 groups. Assign each non-center graph 𝑔 ∈ 𝐷 to its nearest center
 Each graph belongs to only one group

OPIndex
 After selecting 𝑚 graphs in 𝐷 as centers, assign each non-center graph 𝑔 ∈ 𝐷 to its 𝑙 nearest centers
 Allows each graph to belong to multiple groups

GSIndex
 Treat each graph in 𝐷 as a center
 For each center, find its 𝑙 nearest graphs in 𝐷, and put these 𝑙 + 1 graphs together as a group
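The three grouping strategies on these slides differ only in how centers are chosen and how many groups a graph may join. A minimal sketch, assuming a generic `dist` function and a database given as a list; the integer "graphs", the absolute difference distance and all function names are stand-ins for illustration. Groups are returned as a dict from the center's index to the list of member indices.

```python
import random


def pick_centers(n, m, seed=0):
    """Randomly choose m distinct indices out of n as group centers."""
    return random.Random(seed).sample(range(n), m)


def dp_index(db, m, dist, seed=0):
    """DPIndex: every non-center graph joins only its nearest center."""
    centers = pick_centers(len(db), m, seed)
    groups = {c: [c] for c in centers}
    for i in range(len(db)):
        if i not in groups:
            nearest = min(centers, key=lambda c: dist(db[i], db[c]))
            groups[nearest].append(i)
    return groups


def op_index(db, m, l, dist, seed=0):
    """OPIndex: every non-center graph joins its l nearest centers."""
    centers = pick_centers(len(db), m, seed)
    groups = {c: [c] for c in centers}
    for i in range(len(db)):
        if i not in groups:
            ranked = sorted(centers, key=lambda c: dist(db[i], db[c]))
            for c in ranked[:l]:
                groups[c].append(i)
    return groups


def gs_index(db, l, dist):
    """GSIndex: every graph is a center grouped with its l nearest graphs."""
    groups = {}
    for c in range(len(db)):
        others = sorted((i for i in range(len(db)) if i != c),
                        key=lambda i: dist(db[c], db[i]))
        groups[c] = [c] + others[:l]
    return groups


def toy_dist(a, b):
    return abs(a - b)


db = [1, 2, 3, 10, 11, 12, 20]
print(gs_index(db, 2, toy_dist)[0])  # [0, 1, 2]
```

Only the distances between a member and its own center need to be stored, which is what keeps the index cheaper than precomputing all pairs.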


Overview of experiments
Similarity measures evaluation
 Show why the query results of subgraph/supergraph similarity queries are not good
Query performance evaluation
 Compare against noIndex and SeqScan, and compare the three indexing techniques with one another
Indexing cost evaluation
 Compare the cost of the three indexing techniques

Environment
All the algorithms were implemented in Visual C++ 2005
Tested on a PC with a 2.66 GHz CPU and 3.43 GB of memory, running Windows XP

Parameters
The approaches are evaluated by varying five parameters
 𝑘: the top-k value
 |𝑉(𝑞)|: the size of the query graph
 |𝐷|: the number of graphs in the graph database
 𝑚: the number of groups used in DPIndex and OPIndex
 𝑙: the number of nearest centers (OPIndex) or nearest graphs (GSIndex) used when forming groups

Similarity measures comparison
Experiments compare three measures
 Subsim: |𝐸(𝑞)| − |𝐸(𝑚𝑐𝑠(𝑞, 𝑔))|
 Supersim: |𝐸(𝑔)| − |𝐸(𝑚𝑐𝑠(𝑞, 𝑔))|
 Fullsim: |𝐸(𝑞)| + |𝐸(𝑔)| − 2 × |𝐸(𝑚𝑐𝑠(𝑞, 𝑔))|
The closer the answers are to the query graph in size, the better the answers are

Power of pruning strategy
SeqScan needs around 7000 MCS computations for graphs with size larger than 10
noIndex needs no more than 500

Scalability testing
Comparing the three indexing techniques

Index testing
Comparing the cost of the three indexing techniques


Conclusion
Existing solutions (subgraph/supergraph similarity search) cannot solve the problem properly
The authors introduced a new graph distance based on the maximum common subgraph (MCS)
To reduce the number of MCS computations, they proposed two distance lower bounds
They further used the triangle property to derive additional lower bounds
They conducted extensive performance studies
