Architecture of a search engine

My Paris Tech Talk #7 slides, April 2014. Architecture of a search engine, full-text search from my technical point of view.

Architecture of a Search Engine
Paris Tech Talks #7 - April ’14
@sylvainutard - @algolia

Search
• Today, search means Google
• Search is a daily activity
• Search is complex
• Databases (probably) do not handle text queries well
• Speed and relevance are key
• Fuzzy matching: typos!

Why search engines?
• Databases
  • Optimized for INSERT/UPDATE/DELETE/SELECT (that's a lot)
  • Strong query syntax (mostly SQL)
  • Some operations scan all your documents (missing index?)

Why search engines?
• Search engines
  • HIGHLY optimized for "SELECT" (only)
  • Full-text queries: they understand what a word is
  • Query execution time is driven by the number of matching documents
  • And obviously, "LIKE '%foo bar%'" is not full-text search

Why search engines?
[Diagram with three boxes: Application, Search engine, Primary storage (DB, files, ...). Labels: "Search", "Full-text search", "Push data periodically or in real time".]

How it works
• Input = documents
  • Composed of multiple attributes (textual, numerical, geo)
• Output = documents
  • Full-text query and/or numerical filters
  • Understandable results: match score (ranking) + highlighting

Implementation
• 2 distinct processes
  • Indexing: storing documents in a highly optimized way to answer queries
  • Query
    • Matching documents
    • Ranking matched documents

Implementation: Indexing process
• Indexing means building an "index" or "inverted lists"
• A dedicated data structure optimized for search
• Input = a set of documents containing words
• Output = a set of words associated with documents

Implementation: Indexing process
Documents:
Doc 1: foo bar baz
Doc 2: bar foo
Doc 3: baz baz qux
Index (inverted lists), produced by indexing:
foo → Doc 1, Doc 2
bar → Doc 1, Doc 2
baz → Doc 1, Doc 3
qux → Doc 3
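
The indexing step above can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not how any particular engine does it: the function name `build_inverted_index` is made up for this example, and tokenization is a naive lowercase-and-split on whitespace.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the sorted list of ids of the documents containing it.

    `documents` is a dict {doc_id: text}. Hypothetical helper for illustration.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive tokenization (assumption): lowercase, split on whitespace.
        for word in text.lower().split():
            index[word].add(doc_id)
    # Sorted lists make later intersections cheap.
    return {word: sorted(doc_ids) for word, doc_ids in index.items()}


# The three documents from the slide.
documents = {1: "foo bar baz", 2: "bar foo", 3: "baz baz qux"}
index = build_inverted_index(documents)
```

Note that "baz" appears twice in Doc 3 but is listed once: an inverted list records which documents contain a word, not how many times.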

Implementation: Query process
• Queries
  • Goal = retrieve all documents matching a user query
  • Order results from the highest ranked to the lowest

Implementation: Query process
• 1-word query = a single inverted list lookup
[Diagram: user query "baz" → inverted list of "baz" in the index (Doc 1, Doc 3) → sort matching documents → pagination]

Implementation: Query process
• N-words query = inverted lists intersection
[Diagram: user query "baz qux" → intersect the inverted lists of "baz" (Doc 1, Doc 3) and "qux" (Doc 3) → sort matching documents → pagination]
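
A minimal sketch of the matching step, assuming inverted lists are kept as sorted lists of document ids. The names `intersect` and `search` are illustrative only, and the index literal reproduces the lists of the example documents.

```python
def intersect(a, b):
    """Intersect two sorted lists of document ids with a linear merge."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def search(index, query):
    """Return the ids of the documents containing every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    # A word missing from the index has an empty inverted list.
    lists = sorted((index.get(word, []) for word in words), key=len)
    # Start from the shortest list so intermediate results stay small.
    result = list(lists[0])
    for other in lists[1:]:
        result = intersect(result, other)
    return result


index = {"foo": [1, 2], "bar": [1, 2], "baz": [1, 3], "qux": [3]}
matches = search(index, "baz qux")
```

With a single word, the loop never runs and the result is just that word's inverted list; ranking and pagination would then be applied to the returned ids.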

Implementation: Query process
• But how do you handle typing mistakes?
• Edit-distance algorithms (e.g. Levenshtein)
  • levenshtein(bar, baz) = 1 (substitution)
  • levenshtein(bar, br) = 1 (deletion)
  • levenshtein(bar, foobar) = 3 (insertion)
• Comparing a word with all known words would be too costly
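
For reference, the Levenshtein distance can be computed with the classic dynamic-programming recurrence, keeping only one row of the table at a time. This is a textbook sketch, not necessarily what a production engine uses.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning `a` into `b`."""
    # previous[j] = distance between the current prefix of `a` and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]
```

Each call costs O(len(a) × len(b)), which is why running it against every word of the dictionary does not scale.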

Implementation: Query process
• The words dictionary is stored in a TRIE to enable Levenshtein-based lookups (recursive traversal)
[Figure: a trie with nodes b, c, a, o, r, z, o, f; its terminal nodes point into the index, to posting lists that store word positions: "Doc 1 (pos=1, 3), Doc 2 (pos=3)", "Doc 1 (pos=2), Doc 3 (pos=1)" and "Doc 1 (pos=4), Doc 3 (pos=2)".]

Implementation: Query process
Example: faz
[Figure: the same trie and index, with each node annotated with a "faz (distance=N)" label; N ranges from 0 to 3.]
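
One well-known way to do a Levenshtein-bounded lookup in a trie is to carry one row of the edit-distance table down each branch and to abandon a branch as soon as every value in the row exceeds the allowed distance. The sketch below assumes that approach; the class and function names are hypothetical, and the dictionary is the foo/bar/baz/qux vocabulary of the indexing example rather than the exact trie drawn on the slide.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # character -> TrieNode
        self.word = None    # set on nodes that end a dictionary word


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for char in word:
            node = node.children.setdefault(char, TrieNode())
        node.word = word
    return root


def fuzzy_lookup(root, query, max_distance):
    """Return {word: distance} for dictionary words within max_distance."""
    matches = {}
    # Row for the empty prefix: distance to query[:j] is j insertions.
    first_row = list(range(len(query) + 1))
    for char, child in root.children.items():
        _walk(child, char, query, first_row, max_distance, matches)
    return matches


def _walk(node, char, query, previous_row, max_distance, matches):
    # Extend the edit-distance table by one row for the prefix ending at `node`.
    current_row = [previous_row[0] + 1]
    for j in range(1, len(query) + 1):
        current_row.append(min(
            current_row[j - 1] + 1,                      # insertion
            previous_row[j] + 1,                         # deletion
            previous_row[j - 1] + (query[j - 1] != char),  # substitution or match
        ))
    if node.word is not None and current_row[-1] <= max_distance:
        matches[node.word] = current_row[-1]
    # Row minima never decrease deeper in the trie, so pruning here is safe.
    if min(current_row) <= max_distance:
        for next_char, child in node.children.items():
            _walk(child, next_char, query, current_row, max_distance, matches)


root = build_trie(["foo", "bar", "baz", "qux"])
close_words = fuzzy_lookup(root, "faz", max_distance=1)
```

Words sharing a prefix share the rows computed for that prefix, and whole subtrees are skipped once they cannot match, which avoids comparing the query with every known word.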

Implementation: Query process
• How are the matching documents ranked?
  • Number of match occurrences? TF-IDF?
  • Numerical value reflecting popularity?
  • Number of typing mistakes?
  • Proximity between matched words?
  • …
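
As an illustration of the first criterion only, here is a small TF-IDF scoring sketch. It assumes the basic formulation (raw term frequency multiplied by log(N / document frequency)), naive whitespace tokenization, and scores any document containing at least one query word; `tf_idf_rank` is a made-up name, and real engines usually combine several of the criteria listed above.

```python
import math


def tf_idf_rank(documents, query):
    """Return [(doc_id, score)] sorted from the highest score to the lowest."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    n_docs = len(tokenized)
    scores = {}
    for word in query.lower().split():
        matching = [doc_id for doc_id, words in tokenized.items() if word in words]
        if not matching:
            continue
        # Rare words weigh more than words found in most documents.
        idf = math.log(n_docs / len(matching))
        for doc_id in matching:
            tf = tokenized[doc_id].count(word)
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf
    # Ties are broken by document id to keep the order deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


documents = {1: "foo bar baz", 2: "bar foo", 3: "baz baz qux"}
ranked = tf_idf_rank(documents, "baz")
```

For the query "baz", Doc 3 comes before Doc 1 because it contains the word twice.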

Missing subjects
• What I didn't cover:
  • Numerical/geo queries (including operators)
  • Advanced query syntax (boolean operators, proximity operators)
  • Faceting & aggregations (categorization)
  • Sharding (horizontal scalability)
  • Incremental indexing (generational data structures)
  • … (see you next time)

Several implementations

Q/A
Now or later: sylvain@algolia.com
