Cobinatorial Algorithms for Nearest Neighbors, Near-Duplicates and Small World Design - Yury Lifshits - SODA 2009
1. Combinatorial Algorithms
for Nearest Neighbors, Near-Duplicates
and Small-World Design
Yury Lifshits
Yahoo! Research
Shengyu Zhang
The Chinese University of Hong Kong
SODA 2009
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 1 / 29
3. Similarity Search: an Example
Input: Set of objects
Task: Preprocess it
Query: New object
Task: Find the most
similar one in the dataset
4. Similarity Search: an Example
Input: Set of objects
Task: Preprocess it
Most similar
Query: New object
Task: Find the most
similar one in the dataset
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 2 / 29
5. Similarity Search
Search space: object domain U, similarity
function σ
Input: database S = {p1 , . . . , pn } ⊆ U
Query: q ∈ U
Task: find argmax σ(pi , q)
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 3 / 29
6. Nearest Neighbors in Theory
Orchard’s Algorithm k-d-B tree
Sphere Rectangle Tree
Geometric near-neighbor access tree Excluded
middle vantage point forest mvp-tree Fixed-height
Vantage-point
fixed-queries tree AESA
tree LAESA R∗ -tree Burkhard-Keller tree BBD tree
Navigating Nets Voronoi tree Balanced aspect ratio
M-tree
tree vps -tree
Metric tree
Locality-Sensitive Hashing SS-tree
R-tree Spatial approximation tree
mb-tree Cover
Multi-vantage point tree Bisector tree
tree Generalized hyperplane tree
Hybrid tree Slim tree
k-d tree
X-tree
Spill Tree Fixed queries tree Balltree
Quadtree Octree Post-office tree
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 4 / 29
8. Revision: Basic Assumptions
In theory:
Triangle inequality
Doubling dimension is o(log n)
Typical web dataset has separation effect
1/ 2 ≤ d(pi , pj ) ≤ 1
For almost all i, j :
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 5 / 29
9. Revision: Basic Assumptions
In theory:
Triangle inequality
Doubling dimension is o(log n)
Typical web dataset has separation effect
1/ 2 ≤ d(pi , pj ) ≤ 1
For almost all i, j :
Classic methods fail:
Branch and bound algorithms visit every object
Doubling dimension is at least log n/ 2
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 5 / 29
10. Contribution
Navin Goyal, YL, Hinrich Schütze, WSDM 2008:
Combinatorial framework: new approach to data
mining problems that does not require triangle
inequality
Nearest neighbor algorithm
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 6 / 29
11. Contribution
Navin Goyal, YL, Hinrich Schütze, WSDM 2008:
Combinatorial framework: new approach to data
mining problems that does not require triangle
inequality
Nearest neighbor algorithm
This work:
Better nearest neighbor search
Detecting near-duplicates
Navigability design for small worlds
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 6 / 29
12. Outline
Combinatorial Framework
1
New Algorithms
2
Combinatorial Nets
3
Directions for Further Research
4
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 7 / 29
14. Comparison Oracle
Dataset p1 , . . . , pn
Objects and distance (or similarity)
function are NOT given
Instead, there is a comparison oracle
answering queries of the form:
Who is closer to A: B or C?
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 9 / 29
15. Disorder Inequality
Sort all objects by their similarity to p:
rankp (r)
p s
r
rankp (s)
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 10 / 29
16. Disorder Inequality
Sort all objects by their similarity to p:
rankp (r)
p s
r
rankp (s)
Then by similarity to r:
rankr (s)
s
r
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 10 / 29
17. Disorder Inequality
Sort all objects by their similarity to p:
rankp (r)
p s
r
rankp (s)
Then by similarity to r:
rankr (s)
s
r
Dataset has disorder D if
∀p, r, s : rankr (s) ≤ D(rankp (r) + rankp (s))
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 10 / 29
18. Combinatorial Framework
=
Comparison oracle
Who is closer to A: B or C?
+
Disorder inequality
rankr (s) ≤ D(rankp(r) + rankp(s))
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 11 / 29
19. Combinatorial Framework: FAQ
Disorder of a metric space? Disorder of
Rk ?
In what cases disorder is relatively small?
Experimental values of D for some
practical datasets?
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 12 / 29
20. Disorder vs. Others
If expansion rate is c, disorder constant is
at most c2
Doubling dimension and disorder
dimension are incomparable
Disorder inequality implies combinatorial
form of “doubling effect”
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 13 / 29
21. Combinatorial Framework: Pro & Contra
Advantages:
Does not require triangle inequality for distances
Applicable to any data model and any similarity
function
Require only comparative training information
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 14 / 29
22. Combinatorial Framework: Pro & Contra
Advantages:
Does not require triangle inequality for distances
Applicable to any data model and any similarity
function
Require only comparative training information
Limitation: worst-case form of disorder inequality
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 14 / 29
24. Nearest Neighbor Search
Assume S ∪ {q} has disorder constant D
Theorem
There is a deterministic and exact algorithm
for nearest neighbor search:
Preprocessing: O(D7 n log2 n)
Search: O(D4 log n)
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 16 / 29
25. Near-Duplicates
Assume, comparison oracle can also tell us
whether σ(x, y) > T for some similarity
threshold T
Theorem
All pairs with over-T similarity can be found
deterministically in time
poly(D)(n log2 n + |Output|)
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 17 / 29
26. Visibility Graph
Theorem
For any dataset S with disorder D there exists
a visibility graph:
poly(D)n log2 n construction time
O(D4 log n) out-degrees
Naïve greedy routing
deterministically reaches
exact nearest neighbor of the given target q
in at most log n steps
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 18 / 29
34. Combinatorial Ball
B(x, r) = {y : rankx (y) < r}
In other words, it is a subset of dataset S: the
object x itself and r − 1 its nearest neighbors
B(x, 10)
x
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 21 / 29
35. Combinatorial Net
A subset R ⊆ S is called a combinatorial
r-net iff the following two properties holds:
Covering: ∀y ∈ S, ∃x ∈ R, s.t. rankx (y) < r.
Separation: ∀xi , xj ∈ R, rankxi (xj ) ≥ r OR rankxj (xi ) ≥ r
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 22 / 29
36. Combinatorial Net
A subset R ⊆ S is called a combinatorial
r-net iff the following two properties holds:
Covering: ∀y ∈ S, ∃x ∈ R, s.t. rankx (y) < r.
Separation: ∀xi , xj ∈ R, rankxi (xj ) ≥ r OR rankxj (xi ) ≥ r
How to construct a combinatorial net?
What upper bound on its size can we guarantee?
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 22 / 29
37. Basic Data Structure
Combinatorial nets:
n
For every 0 ≤ i ≤ log n, construct a -net
2i
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 23 / 29
38. Basic Data Structure
Combinatorial nets:
n
For every 0 ≤ i ≤ log n, construct a -net
2i
Pointers, pointers, pointers:
Direct & inverted indices: links between centers
and members of their balls
Cousin links: for every center keep pointers to
close centers on the same level
Navigation links: for every center keep pointers to
close centers on the next level
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 23 / 29
40. Up’n’Down Trick
Assume your have 2r-net for the dataset
To compute an r-ball around some object p:
Take a center p of 2r ball that is covering p
1
Take all centers of 2r-balls nearby p
2
For all of them write down all members of theirs
3
2r-balls
Sort all written objects with respect to p and keep r
4
most similar ones.
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 25 / 29
41. 4
Directions for Further Research
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 26 / 29
42. Future of Combinatorial Framework
Other problems in combinatorial framework:
Low-distortion embeddings
Closest pairs
Community discovery
Linear arrangement
Distance labelling
Dimensionality reduction
What if disorder inequality has exceptions?
Insertions, deletions, changing metric
Experiments & implementation
Unification challenge: disorder + doubling = ?
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 27 / 29