Architecture of a search engine
1. Architecture of a Search Engine
Paris Tech Talks #7 - April ’14
@sylvainutard - @algolia
2. • Today Search means Google
• Search is a daily activity
• Search is complex
• Databases are (probably) not built to handle text queries
• Speed and relevance are key
• Fuzzy matching: typos!
Search
3. • Databases
• Optimized for INSERT/UPDATE/DELETE/
SELECT (that's a lot)
• Strong query syntax (mostly SQL)
• Some operations scan all your documents
(missing index?)
Why Search engines?
4. • Search engines
• HIGHLY optimized for “SELECT” (only)
• Full-text queries: understand what a word is
• Query execution time driven by the number of
matching documents
• And obviously, "LIKE '%foo bar%'" is not full-text search
Why Search engines?
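The difference between a substring test and a word-level match can be sketched as follows. This is only an illustration: `tokenize`, `like_match` and `word_match` are hypothetical helper names, and the tokenizer is a deliberately naive assumption (lowercase, split on non-word characters), not how any particular engine does it.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words (naive assumption)."""
    return re.findall(r"\w+", text.lower())

def like_match(text, pattern):
    """Mimics LIKE '%pattern%': a plain, case-insensitive substring test."""
    return pattern.lower() in text.lower()

def word_match(text, query):
    """Word-level match: every query word must appear as a whole word in the text."""
    words = set(tokenize(text))
    return all(word in words for word in tokenize(query))

# The words are present but in a different order: the substring test misses them.
print(like_match("bar and foo", "foo bar"))   # False
print(word_match("bar and foo", "foo bar"))   # True

# The substring test matches inside unrelated words: a false positive.
print(like_match("snafoo barn", "foo bar"))   # True
print(word_match("snafoo barn", "foo bar"))   # False
```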
5. Why Search engines?
[Diagram: Application, Primary storage (DB, files, ...) and Search engine. Data is pushed to the search engine periodically or in realtime; the application sends its full-text search queries to the search engine.]
6. • Input = documents
• Composed of multiple attributes (textual,
numerical, geo)
• Output = documents
• Full-text query and/or numerical filters
• Understandable results: match score (ranking) +
highlighting
How it works
7. • 2 distinct processes
• Indexing: storing documents in a highly
optimized way to answer queries
• Query
• Matching documents
• Ranking matched documents
Implementation
8. • Indexing means building an “index” or “inverted
lists”
• A dedicated data structure optimized for search
• Input = a set of documents containing words
• Output = a set of words associated with
documents
Implementation: Indexing process
9. Implementation: Indexing process
Documents:
Doc 1: foo bar baz
Doc 2: bar foo
Doc 3: baz baz qux
Indexing produces the index (inverted lists):
foo: Doc 1, Doc 2
bar: Doc 1, Doc 2
baz: Doc 1, Doc 3
qux: Doc 3
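The indexing step for these three documents can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and integer document ids; `build_inverted_index` is a hypothetical name, and a real engine would store far more than document ids.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():   # naive whitespace tokenization
            index[word].add(doc_id)          # a set ignores repeated words ("baz baz")
    return {word: sorted(doc_ids) for word, doc_ids in index.items()}

documents = {
    1: "foo bar baz",
    2: "bar foo",
    3: "baz baz qux",
}

index = build_inverted_index(documents)
for word in sorted(index):
    print(word, index[word])
# bar [1, 2]
# baz [1, 3]
# foo [1, 2]
# qux [3]
```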
10. • Queries
• Goal = Retrieve all documents matching a user
query
• Order results from the highest ranked to the
lowest
Implementation: Query process
11. Implementation: Query process
Index (inverted lists):
foo: Doc 1, Doc 2
bar: Doc 1, Doc 2
baz: Doc 1, Doc 3
qux: Doc 3
User query "baz" -> look up the inverted list of "baz" -> sort matching documents -> pagination
• 1-word query = a single inverted list lookup; multi-word query = inverted lists intersection
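The query steps (lookup, intersection, sort, pagination) can be sketched as below. This is a sketch under stated assumptions: the `search` function and the `scores` table are hypothetical, the score stands in for whatever ranking the engine computes, and every query word is required to match.

```python
def search(index, query, scores, page=0, per_page=10):
    """Return one page of document ids matching every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    # One inverted list per word; a multi-word query intersects them.
    matching = set(index.get(words[0], []))
    for word in words[1:]:
        matching &= set(index.get(word, []))
    # Highest score first; ties are broken by document id for a stable order.
    ranked = sorted(matching, key=lambda doc_id: (-scores.get(doc_id, 0), doc_id))
    start = page * per_page
    return ranked[start:start + per_page]

index = {"foo": [1, 2], "bar": [1, 2], "baz": [1, 3], "qux": [3]}
scores = {1: 10, 2: 30, 3: 20}   # hypothetical per-document scores

print(search(index, "baz", scores))                       # [3, 1]
print(search(index, "foo bar", scores))                   # [2, 1]
print(search(index, "foo baz", scores))                   # [1]
print(search(index, "baz", scores, page=1, per_page=1))   # [1]
```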
13. • But how do you handle typing mistakes?
• Edit-distance algorithms (e.g. Levenshtein)
• levenshtein(bar, baz) = 1 (substitution)
• levenshtein(bar, br) = 1 (deletion)
• levenshtein(bar, foobar) = 3 (addition)
• Comparing a word with all known words
would be too costly
Implementation: Query process
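For reference, the three distances above can be reproduced with the classic dynamic-programming formulation of Levenshtein distance. This is a minimal sketch that keeps only two rows of the table; it is not meant as the implementation used by any particular engine.

```python
def levenshtein(a, b):
    """Number of single-character substitutions, deletions and insertions turning a into b."""
    previous = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # delete char_a
                current[j - 1] + 1,                     # insert char_b
                previous[j - 1] + (char_a != char_b),   # substitute, or free match
            ))
        previous = current
    return previous[-1]

print(levenshtein("bar", "baz"))     # 1
print(levenshtein("bar", "br"))      # 1
print(levenshtein("bar", "foobar"))  # 3
```

Each call costs on the order of len(a) × len(b) operations, which is why running it against every word of the dictionary does not scale.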
14. Implementation: Query process
• The words dictionary is stored in a trie to enable
Levenshtein-based lookups (recursive traversal)
[Figure: a trie over the dictionary (node labels: b, a, r, z, c, o, o, f) whose word nodes point to the index, i.e. inverted lists with positions: Doc 1 (pos=1, 3), Doc 2 (pos=3); Doc 1 (pos=2), Doc 3 (pos=1); Doc 1 (pos=4), Doc 3 (pos=2).]
15. Implementation: Query process
Example: faz
[Figure: the same trie and inverted lists as on the previous slide, with nodes annotated by their distance to the query "faz" (values from 0 to 3).]
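One common way to combine a trie with Levenshtein distance is to carry one row of the distance table down each branch and abandon a branch as soon as every value in the row exceeds the allowed distance. The sketch below illustrates that idea; it is an assumption about how such a lookup might work, not the engine's actual code. `TrieNode`, `insert`, `fuzzy_lookup` and the sample postings are hypothetical, and the dictionary (bar, baz, foo) is taken from the earlier indexing example rather than from the figure.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.postings = None   # set only on nodes that end a dictionary word

def insert(root, word, postings):
    """Add a word to the trie and attach its inverted list to the final node."""
    node = root
    for char in word:
        node = node.children.setdefault(char, TrieNode())
    node.postings = postings

def fuzzy_lookup(root, query, max_distance):
    """Return {word: (distance, postings)} for words within max_distance of query."""
    results = {}

    def visit(node, prefix, previous_row):
        # previous_row[j] = edit distance between prefix and query[:j]
        if node.postings is not None and previous_row[-1] <= max_distance:
            results[prefix] = (previous_row[-1], node.postings)
        for char, child in node.children.items():
            row = [previous_row[0] + 1]
            for j in range(1, len(query) + 1):
                row.append(min(
                    row[j - 1] + 1,                                # insertion
                    previous_row[j] + 1,                           # deletion
                    previous_row[j - 1] + (query[j - 1] != char),  # substitution
                ))
            # Prune: no word below this node can get back under the threshold.
            if min(row) <= max_distance:
                visit(child, prefix + char, row)

    visit(root, "", list(range(len(query) + 1)))
    return results

root = TrieNode()
insert(root, "bar", [1, 2])
insert(root, "baz", [1, 3])
insert(root, "foo", [1, 2])

print(fuzzy_lookup(root, "faz", max_distance=1))
# {'baz': (1, [1, 3])}
```

With `max_distance=2` the same lookup also returns "bar" and "foo", both at distance 2.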
16. • How are the matching documents ranked?
• Number of match occurrences? TF-IDF?
• Numerical value reflecting popularity?
• Number of typing mistakes?
• Proximity between matched words?
• …
Implementation: Query process
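One possible way to combine several of these criteria is to sort on a tuple, so that each criterion only breaks ties left by the previous one. The sketch below is purely illustrative: the `Match` fields, the sample values and above all the order of the criteria (typos, then proximity, then popularity) are assumptions, not a ranking formula given in the slides.

```python
from dataclasses import dataclass

@dataclass
class Match:
    doc_id: int
    typos: int        # number of typing mistakes needed to match the query
    proximity: int    # distance between matched words (smaller is better)
    popularity: int   # numerical value reflecting popularity (larger is better)

def rank(matches):
    """Fewest typos first, then closest words, then most popular."""
    return sorted(matches, key=lambda m: (m.typos, m.proximity, -m.popularity))

matches = [
    Match(doc_id=1, typos=1, proximity=1, popularity=50),
    Match(doc_id=2, typos=0, proximity=3, popularity=10),
    Match(doc_id=3, typos=0, proximity=1, popularity=5),
    Match(doc_id=4, typos=0, proximity=1, popularity=90),
]

print([m.doc_id for m in rank(matches)])  # [4, 3, 2, 1]
```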