Architecture of a Search Engine
Paris Tech Talks #7 - April ’14
@sylvainutard - @algolia
• Today, search means Google
• Search is a daily activity
• Search is complex
• Databases are (probably) not built to handle text queries
• Speed and relevance are key
• Fuzzy matching: typos!
2
Search
• Databases
• Optimized for INSERT/UPDATE/DELETE/SELECT (that's a lot)
• Strong query syntax (mostly SQL)
• Some operations scan all your documents
(missing index?)
3
Why Search engines?
• Search engines
• HIGHLY optimized for “SELECT” (only)
• Full-text queries: understand what a word is
• Query execution time driven by the number of
matching documents
• And obviously, "LIKE '%foo bar%'" is not full-text search
4
Why Search engines?
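To make the last point concrete, here is a minimal Python sketch contrasting a LIKE '%foo bar%'-style substring test with a word-based match. The sample documents and the helper names (`like_match`, `tokenize`, `word_match`) are assumptions made up for this illustration, and the substring test is treated as case-sensitive, which in a real database depends on the collation.

```python
import re

# Hypothetical sample documents, chosen only for illustration.
DOCS = {1: "foo bar baz", 2: "bar foo", 3: "Foo, bar!", 4: "foobar"}

def like_match(text, pattern):
    """Mimic LIKE '%pattern%' as a plain, case-sensitive substring test."""
    return pattern in text

def tokenize(text):
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"\w+", text.lower())

def word_match(text, query):
    """Match when every query word appears as a whole word of the text."""
    words = set(tokenize(text))
    return all(w in words for w in tokenize(query))

like_hits = [d for d, t in DOCS.items() if like_match(t, "foo bar")]
word_hits = [d for d, t in DOCS.items() if word_match(t, "foo bar")]
print(like_hits)  # [1]
print(word_hits)  # [1, 2, 3]
```

The substring test misses documents where the words are reordered or separated by punctuation, while the word-based match finds them and still rejects "foobar", which contains neither word on its own.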
5
Why Search engines?
[Diagram: Application, Primary storage (DB, files, ...) and Search engine. Data is pushed to the search engine periodically or in real time; full-text search goes to the search engine.]
• Input = documents
• Composed of multiple attributes (textual,
numerical, geo)
• Output = documents
• Full-text query and/or numerical filters
• Understandable results: match score (ranking) +
highlighting
6
How it works
• 2 distinct processes
• Indexing: storing documents in a highly
optimized way to answer queries
• Query
• Matching documents
• Ranking matched documents
7
Implementation
• Indexing means building an “index” or “inverted lists”
• A dedicated data structure optimized for search
• Input = a set of documents containing words
• Output = a set of words associated with
documents
8
Implementation: Indexing process
9
Implementation: Indexing process
Documents:
Doc 1: foo bar baz
Doc 2: bar foo
Doc 3: baz baz qux
Indexing -> Index (inverted lists):
foo: Doc 1, Doc 2
bar: Doc 1, Doc 2
baz: Doc 1, Doc 3
qux: Doc 3
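The indexing step on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and integer document ids; the name `build_index` is hypothetical and not taken from the talk.

```python
from collections import defaultdict

# The three documents from the slide, keyed by document id.
DOCS = {1: "foo bar baz", 2: "bar foo", 3: "baz baz qux"}

def build_index(docs):
    """Map each word to the sorted list of ids of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for word in docs[doc_id].split():
            postings = index[word]
            # Documents are visited in id order, so checking the last
            # entry is enough to avoid listing a document twice.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return dict(index)

index = build_index(DOCS)
for word in sorted(index):
    print(word, index[word])
```

Keeping each inverted list sorted by document id is a design choice that makes later intersections cheap.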
• Queries
• Goal = Retrieve all documents matching a user
query
• Order results from the highest ranked to the
lowest
10
Implementation: Query process
11
Implementation: Query process
Index (inverted lists):
foo: Doc 1, Doc 2
bar: Doc 1, Doc 2
baz: Doc 1, Doc 3
qux: Doc 3
User query "baz" -> Sort matching documents -> Pagination
• 1-word query = single inverted list lookup
12
Implementation: Query process
• N-word query = inverted lists intersection
Index (inverted lists):
foo: Doc 1, Doc 2
bar: Doc 1, Doc 2
baz: Doc 1, Doc 3
qux: Doc 3
User query "baz qux" -> Intersect inverted lists -> Sort matching documents -> Pagination
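The query pipeline (look up the inverted lists, intersect them, sort, paginate) could look roughly like the Python sketch below. The index literal is derived from the three example documents; the function names, the `score` callback and the `POPULARITY` values are assumptions introduced only for this example.

```python
# Inverted lists for the example documents (sorted doc ids per word).
INDEX = {"foo": [1, 2], "bar": [1, 2], "baz": [1, 3], "qux": [3]}

def intersect(a, b):
    """Intersect two sorted lists of doc ids with a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def search(index, query, score=None, page=0, per_page=10):
    """Return one page of ids of documents containing every query word."""
    words = query.split()
    if not words:
        return []
    # Start from the shortest list so intermediate results stay small.
    lists = sorted((index.get(w, []) for w in words), key=len)
    hits = lists[0]
    for other in lists[1:]:
        hits = intersect(hits, other)
    if score is not None:
        hits = sorted(hits, key=score, reverse=True)
    start = page * per_page
    return hits[start:start + per_page]

# Hypothetical popularity values used as a ranking score.
POPULARITY = {1: 5, 2: 9, 3: 7}

print(search(INDEX, "baz"))                        # [1, 3]
print(search(INDEX, "baz qux"))                    # [3]
print(search(INDEX, "baz", score=POPULARITY.get))  # [3, 1]
```

With a single query word there is nothing to intersect: the result is simply that word's inverted list.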
• But how do you handle typing mistakes?
• Edit-distance algorithms (e.g. Levenshtein)
• levenshtein(bar, baz) = 1 (substitution)
• levenshtein(bar, br) = 1 (deletion)
• levenshtein(bar, foobar) = 3 (insertions)
• Comparing a word with all known words
would be too costly
13
Implementation: Query process
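For reference, the Levenshtein distance used in these examples can be computed with the classic dynamic-programming recurrence. The sketch below is a standard textbook version in Python, not code from the talk.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or keep
        prev = cur
    return prev[-1]

print(levenshtein("bar", "baz"))     # 1
print(levenshtein("bar", "br"))      # 1
print(levenshtein("bar", "foobar"))  # 3
```

Each call costs time proportional to the product of the two word lengths, which is why running it against every word of the dictionary does not scale.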
14
Implementation: Query process
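• The word dictionary is stored in a trie to enable
Levenshtein-based lookups (recursive traversal)
[Figure: trie over the word dictionary (node letters shown: b, c, a, o, r, z, o, f) linked to the index. Index entries list documents with word positions: Doc 1 (pos=1, 3), Doc 2 (pos=3), Doc 1 (pos=2), Doc 3 (pos=1), Doc 1 (pos=4), Doc 3 (pos=2).]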
15
Implementation: Query process
Example: faz
[Figure: the trie and index, with each trie node annotated with a distance for the query "faz" (values from 0 to 3).]
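A recursive trie traversal of this kind can be sketched as follows in Python. This is one common way to do it, assuming the dictionary words foo, bar, baz and qux from the earlier slides rather than the exact trie drawn here; `TrieNode`, `build_trie` and `fuzzy_lookup` are hypothetical names. Each step down the trie computes one more row of the Levenshtein table, and a branch is abandoned as soon as no cell of its row is within the allowed distance.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set on nodes that end a dictionary word

def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word
    return root

def fuzzy_lookup(root, query, max_dist):
    """Return {word: distance} for dictionary words within max_dist of query."""
    results = {}
    first_row = list(range(len(query) + 1))

    def walk(node, ch, prev_row):
        # Extend the Levenshtein table by one row for the character ch.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1,
                           prev_row[j] + 1,
                           prev_row[j - 1] + cost))
        if node.word is not None and row[-1] <= max_dist:
            results[node.word] = row[-1]
        # Prune: deeper rows can never go below the minimum of this row.
        if min(row) <= max_dist:
            for next_ch, child in node.children.items():
                walk(child, next_ch, row)

    for ch, child in root.children.items():
        walk(child, ch, first_row)
    return results

trie = build_trie(["foo", "bar", "baz", "qux"])
print(fuzzy_lookup(trie, "faz", 1))  # {'baz': 1}
```

Because words sharing a prefix share the rows computed for that prefix, the traversal avoids recomputing the distance from scratch for every dictionary word.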
• How are the matching documents ranked?
• Number of match occurrences? TF-IDF?
• Numerical value reflecting popularity?
• Number of typing mistakes?
• Proximity between matched words?
• …
16
Implementation: Query process
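One possible way to combine several of these criteria is a tie-breaking sort, sketched below in Python. The `Match` fields, their order of priority and the sample values are all assumptions made for illustration; the slides list the criteria as open questions and do not prescribe a formula.

```python
from dataclasses import dataclass

@dataclass
class Match:
    doc_id: int
    typos: int       # typing mistakes tolerated to match (lower is better)
    proximity: int   # distance between matched words (lower is better)
    popularity: int  # numerical popularity value (higher is better)

def rank(matches):
    """Order by fewest typos, then closest words, then highest popularity."""
    return sorted(matches, key=lambda m: (m.typos, m.proximity, -m.popularity))

# Hypothetical matches for one query.
matches = [
    Match(doc_id=1, typos=1, proximity=1, popularity=50),
    Match(doc_id=2, typos=0, proximity=3, popularity=10),
    Match(doc_id=3, typos=0, proximity=1, popularity=5),
    Match(doc_id=4, typos=0, proximity=1, popularity=80),
]
ranked_ids = [m.doc_id for m in rank(matches)]
print(ranked_ids)  # [4, 3, 2, 1]
```

A tuple key makes the priority explicit: a later criterion only matters when all earlier ones are tied.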
17
Several implementations
• What I didn’t cover:
• Numerical/Geo queries (Including operators)
• Advanced query syntax (boolean operators, proximity
operators)
• Faceting & Aggregations (Categorization)
• Sharding (Horizontal scalability)
• Incremental indexing (Generational data structures)
• … (see you next time)
18
Missing subjects
Q/A
Now or later: sylvain@algolia.com
