SearchLand:
search quality for beginners




Valeria de Paiva
Santa Clara University
Nov 2010
Check http://www.parc.com/event/934/adventures-in-searchland.html
Outline
    SearchLand
    Search engine basics
    Measuring search quality
    Conclusions...
    and Opportunities
SearchLand?
“Up until now most search
  engine development has
  gone on at companies with
  little publication of technical
  details. This causes search
  engine technology to remain
  largely a black art and to be
  advertising oriented.” Brin
  and Page, “The anatomy of
  a search engine”, 1998
  Disclaimer: This talk presents the guesswork of the author.
  It does not reflect the views of my previous employers or practices at work.

                                                          Thanks kxdc2007!
SearchLand
    Twelve years later the
     complaint remains…
    Gap between research and
     practice widened
    Measuring SE quality is
     `adversarial computing’
    Many dimensions of quality:
     pictures, snippets, categories,
     suggestions, timeliness, speed,
     etc…
SearchLand: Draft Map
Based on slides for Croft,
 Metzler and Strohman’s
 “Search Engines:
 Information Retrieval in
 Practice”, 2009, the tutorial
 `Web Search Engine
 Metrics’ by Dasdan,
 Tsioutsiouliklis and
 Velipasaoglu for WWW09/10
and Hugh Williams slides for
 ACM SIG on Data Mining

          THANKS GUYS!!…
Search Engine Basics...
Web search engines don’t
 search the web:
  They search a copy of the web
  They crawl documents from the
   web
  They index the documents, and
   provide a search interface
   based on that index
  They present short snippets that
   allow users to judge relevance
  Users click on links to visit the
   actual web document
Search Engines Basics
       Basic architecture




From Web Search Engine Metrics for Measuring User Satisfaction
Tutorial at WWW conference, by Dasdan et al 2009,2010
Crawling
“you don’t need a lot of thinking to do crawling; you need bandwidth”




         CRAWLERS
         Fetch new resources from new domains or pages
         Fetch new resources from existing pages
         Re-fetch existing resources that have changed

     Prioritization is essential:
        There are far more URLs than available fetching bandwidth
        For large sites, it’s difficult to fetch all resources
        Essential to balance re-fetch and discovery
        Essential to balance new site exploration with old site
           exploration

         Snapshot or incremental? How broad? How seeded?
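The prioritization above can be pictured as a crawl frontier kept in a priority queue. This is only a minimal sketch under assumptions: the `Frontier` class, the `crawl_order` helper, the priority scores and the example URLs are all hypothetical, and a production crawler would combine many more signals (re-fetch schedules, per-site budgets, politeness delays).

```python
import heapq
import itertools


class Frontier:
    """Toy crawl frontier (hypothetical): hands out the highest-priority URL next."""

    def __init__(self):
        self._heap = []                    # entries are (-priority, tie_breaker, url)
        self._counter = itertools.count()  # preserves insertion order on equal priority
        self._seen = set()                 # URLs already queued, to avoid duplicates

    def add(self, url, priority):
        """Queue a URL once; a higher priority means it is fetched sooner."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (-priority, next(self._counter), url))
        return True

    def pop(self):
        """Return the best URL still waiting, or None when the frontier is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


def crawl_order(candidates, budget):
    """Return the URLs a crawler limited to `budget` fetches would visit, in order."""
    frontier = Frontier()
    for url, priority in candidates:
        frontier.add(url, priority)
    order = []
    while len(order) < budget:
        url = frontier.pop()
        if url is None:
            break
        order.append(url)
    return order
```

With more candidate URLs than fetch budget, only the top-scored URLs are visited, which is the point of the slide: the scoring function decides what the engine ever gets to see.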
Writing/running a crawler
      isn’t straightforward….


                Crawler Challenges
Crawlers shouldn’t overload or overvisit sites
Must respect robots.txt exclusion standard

Many URLs exist for the same resource
URLs redirect to other resources (often)
Dynamic pages can generate loops, unending lists, and
 other traps
Not Found pages often return ok HTTP codes
DNS failures
Pages can look different to end-user browsers and crawlers
Pages can require JavaScript processing
Pages can require cookies
Pages can be built in non-HTML environments
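Two of the challenges above, respecting robots.txt and collapsing many URLs for the same resource, can be sketched with Python's standard library. This is an illustration under assumptions: the robots.txt content, the `ToyCrawler` agent name and the helper names are hypothetical, and the canonicalization shown is deliberately simplistic (real crawlers also handle redirects, session parameters, default ports and more).

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_robots(text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def canonicalize(url):
    """Lower-case scheme and host, drop the fragment, and use '/' for an empty path."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


robots = make_robots(ROBOTS_TXT)
allowed = robots.can_fetch("ToyCrawler", "http://example.com/index.html")
blocked = robots.can_fetch("ToyCrawler", "http://example.com/private/data.html")
```

Here `allowed` is true and `blocked` is false, and `canonicalize` maps spelling variants of a URL to one key so the crawler does not fetch the same page twice.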
To index or not to index…
After crawling, you need to convert documents into
  index terms to create an index.




You need to decide how much of the page you
 index and which `features or signals' you care
 about.
Also which kinds of file to index? Text, sure.
Pdfs, jpegs, audio, torrents, videos, Flash???
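Converting documents into index terms can be illustrated with a toy inverted index. This is a sketch under assumptions: the tokenizer, the function names and the three sample documents are hypothetical, and real indexes store positions, fields and many other features rather than bare document ids.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lower-case alphanumeric terms (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return the ids of documents that contain every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical mini-collection.
docs = {
    1: "Web search engines search a copy of the web",
    2: "Crawlers fetch documents from the web",
    3: "An index maps terms to documents",
}
index = build_index(docs)
```

A query is then answered from the index alone, without touching the documents, which is why the choice of what goes into the index matters so much.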
The obvious, the ugly and the
                  clever…



Some obvious:
hundreds of billions of web pages…
Neither practical nor desirable to index all. Must remove:
spam pages, illegal pages, malware, repetitive or duplicate pages,
crawler traps, pages that no longer exist, pages that have
substantially changed, etc…
Most search engines index in the range of 20 to 50 billion
documents (Williams)
How many pages each engine indexes, and how many pages are
on the web are hard research problems…
Indexing: the obvious…
   which are the right pages?
   pages that users want (duh…)
   Pages: popular in the web link graph
         match queries
         from popular sites
         clicked on in search results
         shown by competitors
         in the language or market of the users
         distinct from other pages
         change at a moderate rate, etc…
The head is stable. The tail consists of billions of
 candidate pages with similar scores
Indexing: more obvious…features!
How to index pages so that we
 can find them?
A feature (signal) is an attribute of a
  document that a computer can
  detect, and that we think may
  indicate relevance.

Some features are obvious, some are
 secret sauce and how to use them
 is definitely a trade secret for each
 search engine.
Features: the obvious…
Term matching
The system should prefer documents that contain the query terms.

Term frequency
The system should prefer documents that contain the query terms many times.

Inverse document frequency
Rare words are more important than frequent words.

TF-IDF (Karen Spärck Jones)
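Term frequency and inverse document frequency are usually combined into a single weight. The sketch below uses one common textbook variant, raw term count times log(N/df); this is an assumption, since many other weightings exist, and the function name and sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, docs):
    """Score each tokenized document against the query with a basic TF-IDF sum.

    docs maps a document id to its list of tokens.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term]:
                idf = math.log(n_docs / df[term])  # rare terms get a larger weight
                score += tf[term] * idf
        scores[doc_id] = score
    return scores


# Hypothetical mini-collection, already tokenized.
docs = {
    "d1": ["web", "search", "web"],
    "d2": ["web", "crawler"],
    "d3": ["rare", "search"],
}
scores = tf_idf_scores(["web"], docs)
```

For the query "web", d1 outranks d2 because the term occurs twice, and d3 scores zero because the term is absent.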
Indexing: some ugly…
Proximity
Words that are close together in the query should be close together in relevant documents.

Term location
Prefer documents that contain query words in the title or headings.




      Prefer documents that contain query words in
       the URL.
      www.whitehouse.gov
      www.linkedin.com/in/valeriadepaiva
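A proximity feature can be as simple as the smallest distance between two query terms in a document. This is a sketch under assumptions: the function name and the sample sentence are hypothetical, and real engines use richer proximity measures over positional indexes.

```python
def min_distance(tokens, term_a, term_b):
    """Smallest gap in token positions between term_a and term_b.

    Returns None if either term does not occur in tokens.
    """
    best = None
    last_a = last_b = None  # most recent position seen for each term
    for pos, tok in enumerate(tokens):
        if tok == term_a:
            last_a = pos
        if tok == term_b:
            last_b = pos
        if last_a is not None and last_b is not None:
            gap = abs(last_a - last_b)
            if best is None or gap < best:
                best = gap
    return best


# Hypothetical document text.
tokens = "white house press office near the house".split()
gap = min_distance(tokens, "white", "house")
```

A smaller gap suggests the document talks about the terms together, so a ranker might prefer it over one where the terms are far apart.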
Indexing: some cleverness?


Prefer documents that are
 authoritative and popular.

 HOW?
PageRank
 in 3 easy bullets:
1.  The random surfer is a
    hypothetical user that clicks
    on web links at random.
2. Popular pages connect…
3. Leverage connections?
  Think of a gigantic
    graph (connected
    pages) and its
    transition matrix
  Make the matrix
    probabilistic
  Those correspond to
    the random surfer
    choices
  Find its principal
    eigenvector = PageRank
PageRank is the proportion of time that a random
surfer would spend on a particular page.
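The random-surfer picture can be sketched as power iteration over the link graph. This is a minimal illustration under assumptions: the damping factor of 0.85 (the chance of following a link rather than jumping to a random page), the handling of pages without out-links, the function name and the three-page graph are choices made for the example, not details taken from the slides.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict mapping each page to its rank; ranks sum to 1.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution

    for _ in range(iterations):
        # Every page receives the random-jump share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # The surfer follows one of the out-links uniformly at random.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dangling page: assume the surfer jumps to a random page.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: a links to b and c, b links to c, c links back to a.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In this toy graph page c collects links from both other pages and ends up with the highest rank. The iteration converges towards the principal eigenvector of the probabilistic transition matrix without ever building the matrix explicitly.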
Indexing: summing up
Store (multiple copies of?) the web in a
   document store
Iterate over the document store to choose
   documents
Create an index, and ship it to the index
   serving nodes
Repeat…
Sounds easy? It isn’t!

Three words: scale-up, parallelism, time
Selection and Ranking



Quality of Search: what do we want to do?
Optimize for user satisfaction in each
 component of the pipeline
Need to measure user satisfaction
Evaluating Relevance is HARD!
  Effectiveness, efficiency and
cost are related - difficult trade-off
  Two main kinds of approach:


     IR traditional evaluations
     click data evaluation & log
     analysis
    Many books on IR, many
     patents/trade secrets from
     search engines…
    Another talk
IR evaluation vs Search Engine
    TREC competitions (NIST and
     U.S. Department of Defense),
     since 1992
    Goal: provide the infrastructure
     necessary for large-scale
     evaluation of text retrieval
     methodologies
    several tracks, including a Web
     track, using ClueWeb09, one
     billion webpages
    TREC results are baseline, do
     they work for search engines?
Evaluating Relevance in IR…
If the universe is small, you can easily check precision and recall
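For a small collection where the relevant documents are known, both measures are a few lines of code. This is a sketch; the function name and the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved ids against known relevant ids.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: the engine returned 4 documents, 3 documents are relevant.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
```

Here two of the four results are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3). On the web the set of all relevant documents is unknown, which is why recall is so hard to measure for search engines.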
IR vs Search Engines
Evaluating the whole system…
        Relevance metrics not based on users'
         experiences or tasks
        Some attempts:

Coverage metrics
Latency and Discovery metrics
Diversity metrics
Freshness metrics
Freshness of snippets?
Measuring change of all internet?
Presentation metrics:
suggestions, spelling corrections, snippets, tabs, categories, definitions, images,
videos, timelines, maplines, streaming results, social results, …
Conclusions
    Relevance is elusive, but
     essential.
    Improvement requires
     metrics and analysis,
     continuously
    Gone over a rough map of
     the issues and some
     proposed solutions
    Many thanks to Dasdan, Tsioutsiouliklis and
     Velipasaoglu, Croft, Metzler and Strohman,
     (especially Strohman!) and Hugh Williams
     for slides/pictures.
Coda
    There are many search engines.
     Their results tend to be very
     similar. (how similar?)
    Are we seeing everything?
     Reports estimate we can see
     only 15% of the existing web.
    Probing the web is mostly
     popularity based. You're likely to
     see what others have seen
     before. But your seeing
     increases the popularity of what
     you saw, thereby reducing the
     pool of available stuff. Vicious or
     virtuous circle? How to measure?
References
Croft, Metzler and Strohman, “Search Engines: Information Retrieval in Practice”, Addison-Wesley, 2009
Dasdan, Tsioutsiouliklis and Velipasaoglu, tutorial “Web Search Engine Metrics”, WWW09/10, available from http://www2010.org/www/program/tutorials/
Hugh Williams, slides for ACM Data Mining, May 2010
Anna Patterson, “Why Writing Your Own Search Engine Is Hard”, ACM Queue, 2004

Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 

Searchland: Search quality for Beginners

  • 1. SearchLand: search quality for beginners Valeria de Paiva Santa Clara University Nov 2010 Check http://www.parc.com/event/934/adventures-in-searchland.html
  • 3. Outline   SearchLand   Search engine basics   Measuring search quality   Conclusions...   and Opportunities
  • 4. SearchLand? “Up until now most search engine development has gone on at companies with little publication of technical details. This causes search engine technology to remain largely a black art and to be advertising oriented.” Brin and Page, “The anatomy of a search engine”, 1998 Disclaimer: This talk presents the guesswork of the author. It does not reflect the views of my previous employers or practices at work. Thanks kxdc2007!
  • 5. SearchLand   Twelve years later the complaint remains…   Gap between research and practice widened   Measuring SE quality is `adversarial computing’   Many dimensions of quality: pictures, snippets, categories, suggestions, timeliness, speed, etc…
  • 6. SearchLand: Draft Map Based on slides for Croft, Metzler and Strohman’s “Search Engines: Information Retrieval in Practice”, 2009, the tutorial “Web Search Engine Metrics” by Dasdan, Tsioutsiouliklis and Velipasaoglu for WWW09/10, and Hugh Williams’ slides for ACM SIG on Data Mining. THANKS GUYS!!…
  • 7. Search Engine Basics... Web search engines don’t search the web: They search a copy of the web They crawl documents from the web They index the documents, and provide a search interface based on that index They present short snippets that allow users to judge relevance Users click on links to visit the actual web document
  • 8. Search Engines Basics Basic architecture From Web Search Engine Metrics for Measuring User Satisfaction Tutorial at WWW conference, by Dasdan et al 2009,2010
  • 9. Crawling “you don’t need a lot of thinking to do crawling; you need bandwidth” CRAWLERS Fetch new resources from new domains or pages Fetch new resources from existing pages Re-fetch existing resources that have changed Prioritization is essential: There are far more URLs than available fetching bandwidth For large sites, it’s difficult to fetch all resources Essential to balance re-fetch and discovery Essential to balance new site exploration with old site exploration Snapshot or incremental? How broad? How seeded?
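The prioritization point on this slide can be illustrated with a toy crawl frontier. This is only a sketch under stated assumptions: the class name `CrawlFrontier`, the numeric priority scores, and the `fetch_budget` helper are hypothetical and not taken from the talk. A real crawler would compute priorities from signals such as expected change rate or link popularity.

```python
import heapq
import itertools


class CrawlFrontier:
    """Toy URL frontier (hypothetical): pops the highest-priority URL first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: earlier insertions win
        self._seen = set()

    def add(self, url, priority):
        """Queue a URL once; a larger priority means 'fetch sooner'."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))
        return True

    def pop(self):
        """Return the next URL to fetch, or None if the frontier is empty."""
        if not self._heap:
            return None
        _, _, url = heapq.heappop(self._heap)
        return url


def fetch_budget(frontier, budget):
    """Take at most `budget` URLs, modelling limited fetching bandwidth."""
    batch = []
    for _ in range(budget):
        url = frontier.pop()
        if url is None:
            break
        batch.append(url)
    return batch
```

With far more URLs than bandwidth, only the top of the queue is ever fetched, which is why the choice of priority (re-fetch versus discovery, new sites versus old sites) matters so much.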
  • 10. Writing/running a crawler isn’t straightforward…. Crawler Challenges Crawlers shouldn’t overload or overvisit sites Must respect robots.txt exclusion standard Many URLs exist for the same resource URLs redirect to other resources (often) Dynamic pages can generate loops, unending lists, and other traps Not Found pages often return ok HTTP codes DNS failures Pages can look different to end-user browsers and crawlers Pages can require JavaScript processing Pages can require cookies Pages can be built in non-HTML environments
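Two of the challenges above, respecting robots.txt and many URLs naming the same resource, can be sketched with the Python standard library. The robots.txt content, the `toy-crawler` user agent, and the `canonicalize` rules below are illustrative assumptions; real canonicalization needs many more rules (tracking parameters, trailing slashes, redirects) and this sketch deliberately drops fragments and any user-info part.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def canonicalize(url):
    """Map trivially different URLs for one resource to a single form."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    default_port = {"http": 80, "https": 443}.get(scheme)
    # Keep the port only when it is not the default for the scheme.
    if parts.port in (None, default_port):
        netloc = host
    else:
        netloc = f"{host}:{parts.port}"
    path = parts.path or "/"
    # The fragment is dropped: it never reaches the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch("toy-crawler", "http://example.com/index.html")
blocked = parser.can_fetch("toy-crawler", "http://example.com/private/x.html")
delay = parser.crawl_delay("toy-crawler")
```

A polite crawler would check `can_fetch` before every request and wait at least `delay` seconds between requests to the same host.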
  • 11. To index or not to index… After crawling need to convert documents into index terms… to create an index. You need to decide how much of the page you index and which `features or signals' you care about. Also which kinds of file to index? Text, sure. Pdfs, jpegs, audio, torrents, videos, Flash???
  • 12. The obvious, the ugly and the clever… Some obvious: hundreds of billions of web pages… Neither practical nor desirable to index all. Must remove: spam pages, illegal pages, malware, repetitive or duplicate pages, crawler traps, pages that no longer exist, pages that have substantially changed, etc… Most search engines index in the range of 20 to 50 billion documents (Williams). How many pages each engine indexes, and how many pages are on the web are hard research problems…
  • 13. Indexing: the obvious… which are the right pages? pages that users want (duh…) Pages: popular in the web link graph match queries from popular sites clicked on in search results shown by competitors in the language or market of the users distinct from other pages change at a moderate rate, etc… The head is stable, The tail consists of billions of candidate pages with similar scores
  • 14. Indexing: more obvious…features! How to index pages so that we can find them? A feature (signal) is an attribute of a document that a computer can detect, and that we think may indicate relevance. Some features are obvious, some are secret sauce and how to use them is definitely a trade secret for each search engine.
  • 15. Features: the obvious… Term matching: The system should prefer documents that contain the query terms. Term frequency: The system should prefer documents that contain the query terms many times. Inverse document frequency: Rare words are more important than frequent words. TF/IDF (Karen Spärck Jones)
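The three features on this slide combine into the classic TF-IDF score. The sketch below is one common textbook variant, assuming whitespace tokenization and idf = log(N / df); the function names and the sample documents are made up for illustration, and production engines use more refined weightings.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tf_idf_scores(query, docs):
    """Score each document as the sum over query terms of tf * idf."""
    doc_tokens = [tokenize(d) for d in docs]
    n_docs = len(doc_tokens)
    scores = []
    for tokens in doc_tokens:
        counts = Counter(tokens)  # term frequency within this document
        score = 0.0
        for term in tokenize(query):
            df = sum(1 for toks in doc_tokens if term in toks)
            if df == 0:
                continue  # term matches nothing, so it contributes nothing
            idf = math.log(n_docs / df)  # rare terms get a larger weight
            score += counts[term] * idf
        scores.append(score)
    return scores


def rank(query, docs):
    """Return document indices, best score first."""
    scores = tf_idf_scores(query, docs)
    return sorted(range(len(docs)), key=lambda i: -scores[i])


DOCS = [
    "the white house",
    "the house of the rising sun",
    "search engines index the web",
]
ranking = rank("white house", DOCS)
```

Note that a word appearing in every document, such as "the" here, gets an idf of zero and so cannot influence the ranking, which is exactly the "rare words are more important" intuition.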
  • 16. Indexing: some ugly… Proximity: Words that are close together in the query should be close together in relevant documents. Term location: Prefer documents that contain query words in the title or headings. Prefer documents that contain query words in the URL. www.whitehouse.gov www.linkedin.com/in/valeriadepaiva
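Proximity and term location can both be turned into simple numeric features. The following is a rough sketch: the function names and the weights `title_weight` and `url_weight` are arbitrary assumptions for illustration, not values from the talk or from any real engine.

```python
def min_term_distance(tokens, term_a, term_b):
    """Smallest gap (in token positions) between two terms, or None."""
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)


def location_bonus(query_terms, title, url, title_weight=2.0, url_weight=1.0):
    """Add a bonus for each query term found in the title or in the URL."""
    title_tokens = set(title.lower().split())
    url_lower = url.lower()
    bonus = 0.0
    for term in query_terms:
        if term in title_tokens:
            bonus += title_weight
        if term in url_lower:  # substring match, so "whitehouse" counts
            bonus += url_weight
    return bonus


near = min_term_distance("the white house press office".split(), "white", "house")
far = min_term_distance("white paint for the old house".split(), "white", "house")
bonus = location_bonus(["white", "house"], "The White House", "www.whitehouse.gov")
```

A ranker could then prefer documents with a small distance and a large bonus, which is how the slide's www.whitehouse.gov example would win for the query "white house".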
  • 17. Indexing: some cleverness? Prefer documents that are authoritative and popular. HOW? PageRank in 3 easy bullets: 1.  The random surfer is a hypothetical user that clicks on web links at random. 2. Popular pages connect…
  • 18. 3: leverage connections? Think of a gigantic graph (connected pages) and its transition matrix. Make the matrix probabilistic. Those correspond to the random surfer choices. Find its principal eigenvector = PageRank. PageRank is the proportion of time that a random surfer would jump to a particular page.
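The eigenvector computation described here is usually approximated by power iteration. The sketch below makes several assumptions that the slides do not state: a damping factor of 0.85 (the random surfer follows a link with that probability and otherwise jumps to a random page), a fixed number of iterations instead of a convergence test, and pages without outgoing links spreading their rank evenly. The tiny link graph is invented for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration on a dict: page -> out-links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from the uniform distribution
    for _ in range(iterations):
        # Every page receives the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no out-links jumps anywhere.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked from everywhere, "d" from nowhere.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(LINKS)
```

The ranks sum to one, so each value can be read as the long-run share of the random surfer's time spent on that page.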
  • 19. Indexing: summing up Store (multiple copies of?) the web in a document store Iterate over the document store to choose documents Create an index, and ship it to the index serving nodes Repeat… Sounds easy? It isn’t! Three words: scale-up, parallelism, time
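The "create an index" step can be pictured as building an inverted index, a map from each term to the documents that contain it. This is a single-machine toy under obvious assumptions (whitespace tokenization, everything in memory, invented document ids); the slide's point is precisely that doing this at web scale, in parallel and on time, is the hard part.

```python
from collections import defaultdict


def build_index(docs):
    """Build term -> set of document ids from a dict of id -> text."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical document store.
STORE = {
    1: "search engines crawl the web",
    2: "search engines index documents",
    3: "users click on links",
}
INDEX = build_index(STORE)
hits = search(INDEX, "search engines")
```

Lookups touch only the posting sets of the query terms, never the documents themselves, which is what makes serving from an index fast.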
  • 20. Selection and Ranking Quality of Search: what do we want to do? Optimize for user satisfaction in each component of the pipeline Need to measure user satisfaction
  • 21. Evaluating Relevance is HARD!   Effectiveness, efficiency and cost are related - difficult trade-off   Two main kinds of approach: IR traditional evaluations click data evaluation & log analysis   Many books on IR, many patents/trade secrets from search engines…   Another talk
  • 22. IR evaluation vs Search Engine   TREC competitions (NIST and U.S. Department of Defense), since 1992   Goal: provide the infrastructure necessary for large-scale evaluation of text retrieval methodologies   several tracks, including a Web track, using ClueWeb09, one billion webpages   TREC results are baseline, do they work for search engines?
  • 23. Evaluating Relevance in IR… if universe small you can check easily precision and recall
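When the universe is small enough to judge every document, precision and recall follow directly from two sets. A minimal sketch, assuming the relevance judgments are available as a set of document ids; the function name and the example ids are illustrative.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# The system returned documents 1-4; judges marked 2, 4 and 5 as relevant.
precision, recall = precision_recall([1, 2, 3, 4], {2, 4, 5})
```

On the web the set of relevant documents is unknown, so recall cannot be computed this way, which is one reason search engine evaluation differs from traditional IR evaluation.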
  • 24. IR vs Search Engines
  • 25. Evaluating the whole system… Relevance metrics not based on users' experiences or tasks. Some attempts: Coverage metrics; Latency and Discovery metrics; Diversity metrics; Freshness metrics. Freshness of snippets? Measuring change of all internet? Presentation metrics: suggestions, spelling corrections, snippets, tabs, categories, definitions, images, videos, timelines, maplines, streaming results, social results, …
  • 26. Conclusions   Relevance is elusive, but essential.   Improvement requires metrics and analysis, continuously.   Gone over a rough map of the issues and some proposed solutions.   Many thanks to Dasdan, Tsioutsiouliklis and Velipasaoglu, Croft, Metzler and Strohman (especially Strohman!), and Hugh Williams for slides/pictures.
  • 27. Coda   There are many search engines. Their results tend to be very similar. (how similar?)   Are we seeing everything? Reports estimate we can see only 15% of the existing web.   Probing the web is mostly popularity based. You're likely to see what others have seen before. But your seeing increases the popularity of what you saw, thereby reducing the pool of available stuff. Vicious or virtuous circle? How to measure?
  • 29. References: Croft, Metzler and Strohman, “Search Engines: Information Retrieval in Practice”, Addison-Wesley, 2009; Dasdan, Tsioutsiouliklis and Velipasaoglu, tutorial “Web Search Engine Metrics” for WWW09/10, available from http://www2010.org/www/program/tutorials/; Hugh Williams, slides for ACM Data Mining, May 2010; Anna Patterson, “Why Writing Your Own Search Engine Is Hard”, ACM Queue, 2004