Search 101 Workshop: 1.2 Query Understanding
Context, Similarity, and Recommendations
Matt Corkum
May 2022
Query understanding?
impact of NLP ?
Predict, Execute, Communicate ?
Query understanding?
●communication channel [searcher query & search engine]
●FOCUS on WHAT searcher wants
●Measure & optimize query performance
●perfect “search probe”
○Conversation with search engine
○independent of index
■ Focus less on results
■ don’t expect irrelevant results to be filtered out automatically
Figure out what the user wants
Clinical Phase (Query ?)
●“My patient is not responding adequately to methotrexate, and I need to
consider the next steps”
Searcher?
● Persona
● User context
○ Current characters entered -> query
○ Location
○ Time
○ Date
○ Language
○ This session’s queries
○ History of past queries
● Predict the user’s intent right now?
Spelling corrections, Autocomplete, & Instant Search
●aka: autosuggest / query completion
●Feedback loops (good here, not good in DS)
Initial focus – Just NLP?
●Or - Focus PRIOR to calling Search
●Or - Strings to Suggesters ?
●When ?
○Before search engine scores
○important – nail it =
Intent ~= Predict
●2 stage predict ?
○characters
■entered
■correct misspellings
■Influence of letter frequency
○Phrases
■Influence of phrases / knowledge
○Counterfactual estimators
■Logs to predict (offline)
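A minimal sketch of the character stage of prediction, assuming suggestions come from an offline query log and are ranked by frequency. The log contents and the function names `build_suggester` and `suggest` are hypothetical, chosen only for illustration.

```python
from collections import Counter

def build_suggester(query_log):
    """Count how often each normalized query appears in the log."""
    return Counter(q.strip().lower() for q in query_log if q.strip())

def suggest(counts, prefix, k=3):
    """Return up to k logged queries starting with prefix, most frequent first."""
    prefix = prefix.strip().lower()
    matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
    # Sort by descending frequency, then alphabetically for stable output.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in matches[:k]]

# Hypothetical query log, for illustration only.
log = ["machine learning", "machine learning", "machine translation",
       "mac repair", "machine learning jobs"]
counts = build_suggester(log)
print(suggest(counts, "mach"))
# ['machine learning', 'machine learning jobs', 'machine translation']
```

A production suggester would likely use a prefix index rather than a linear scan; the scan keeps the sketch short.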
Predict
Search Suggest Components
Communication
●Chapter 13 of AI-Powered Search (http://aipoweredsearch.com): “Semantic Search with Dense Vectors”.
Also strike out what is missing
Search Query Thoughts
EVERY INSTANCE OF A WORD / PHRASE ... HAS A UNIQUE MEANING AT THIS TIME
YOU CAN’T IMPROVE WHAT YOU CAN’T MEASURE
Similar queries = similar results ?
●Latest breakthroughs in artificial intelligence
●Latest advancements in artificial intelligence
●Latest advancements on artificial intelligence
●Latest breakthroughs in AI
●Piu’ recenti sviluppi di ricercar sull’’intelligenza artificiale
○DYM: “Most recent research developments on artificial intelligence”
●Results differ for the same basic query intent
●Search engines are “statistical frequency hunters” out of the box
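One way to see why term-matching engines treat these queries differently is to measure their word overlap. The sketch below uses Jaccard similarity over lowercased tokens; this measure is an illustrative assumption, not something the slides prescribe.

```python
def jaccard(q1, q2):
    """Share of distinct lowercase tokens that two queries have in common."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

base = "Latest breakthroughs in artificial intelligence"
print(jaccard(base, "Latest breakthroughs in AI"))  # 0.5
print(jaccard(base, "Piu’ recenti sviluppi di ricercar sull’’intelligenza artificiale"))  # 0.0
```

The intent is the same in every case, yet the lexical overlap ranges from partial to none, which is why results differ.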
What does Love mean?
Query Semantic graph: “Love”
Query Love – Child Context
Same query + intent / different results
●Query: Buy tickets Boston Garden tomorrow
●Date processing may be incorrect, or different events fall on different days
○Celtics Tuesday
○circus Wednesday
○sold out, suggest Thursday?
●Conversational AI
○next night tickets (context of past queries)
○Useful to disambiguate sometimes
Processes: Search / AI pipelines
Lucidworks Fusion – AI/Search Product (2018)
Query Text Analysis -> Autocomplete / Instant Search
● Character filtering: diacritics, unicode, capitalization
● Terms: tokenization (punctuation - commas, hyphens, apostrophes,…)
● Spell correction: 10-15% of queries are misspelled; Solr 7.5 SolrTextTagger
● Inflection: stemming / lemmatizing
● Syntactic analysis: NLP = POS (noun phrases), NER (entity recognition/extraction)
■ hierarchies (taxonomies and ontologies)
● Semantic analysis: vectors (word2vec, node2vec), deep parse-dependency, BERT (context sensitive)
■ Query segmentation, tagging, & scoping
■ Lexical databases - knowledge graphs
■ Disambiguation
● Query re-writing: add tokens to help
● Query relaxation: do not consider all tokens [when no results – risky - careful]
● LTR / Adjustments: make $$, crowd source signals
● Autocomplete: yippy !
● Instant search: very cool if all aligns with confidence -> run new search query / results
User Query
●Machine Learning research and development Portland, OR software engineer Hadoop, java
●Traditional Lucene Query Parser
●(Machine and Learning and research and development and Portland)
●OR
●(software and engineer and Hadoop and java)
YUK !
Character filtering
●Capitalization, diacritics, Unicode, Numbers, alphanumeric processing
●Careful
○lowercasing conflates abbreviations with ordinary words (e.g. US vs. us)
○Consider: process the same content many ways into fields
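A small sketch of character filtering using only Python's standard `unicodedata` module. The function names are hypothetical; the `lowercase` switch illustrates processing the same content more than one way.

```python
import unicodedata

def strip_diacritics(text):
    """Decompose characters, then drop the combining marks (accents)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def char_filter(text, lowercase=True):
    """Normalize a query or field value; lowercasing is optional."""
    text = strip_diacritics(text)
    return text.lower() if lowercase else text

print(char_filter("Café Zürich"))          # cafe zurich
print(char_filter("US"))                   # us (now identical to the pronoun)
print(char_filter("US", lowercase=False))  # US
```

Indexing both variants into separate fields lets a case-sensitive match on abbreviations coexist with a forgiving lowercase match.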
Query => Collection of Terms
●Language tokenization
○(punctuation - commas, hyphens, apostrophes,…)
●N-gram tokenization
○(unigram) 1 word, (bi-gram) 2 word, (trigrams) 3 word
○n-gram char tokenization (2 letter, 3 letter grams) – useful for computing priors for spell checking
●Inflection = conflated strings (shorter term index, more documents)
○increases recall, precision may suffer
○Stemming and lemmatization
■ Both inflection styles may need customization to handle your words
■ (corporals, corporation, corporations 🡪 corp, for a stemmer ---hmmm)
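The word and character n-grams described above can be sketched in a few lines. Whitespace tokenization is a simplifying assumption here; real analyzers also handle punctuation.

```python
def word_ngrams(tokens, n):
    """Sliding window of n consecutive words (n=2 gives bi-grams)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(term, n):
    """Sliding window of n consecutive characters within one term."""
    return [term[i:i + n] for i in range(len(term) - n + 1)]

tokens = "software engineer hadoop java".split()
print(word_ngrams(tokens, 2))  # ['software engineer', 'engineer hadoop', 'hadoop java']
print(char_ngrams("java", 2))  # ['ja', 'av', 'va']
```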
Spell correction = Candidate Generation
● 10-15% of queries are misspelled
○ missing, extra, swapped, or replaced letters
○ leverage n-gram letter language models
■ priors of letter frequency
■ bi-grams and tri-grams
● https://norvig.com/spell-correct.html
● https://en.wikipedia.org/wiki/Noisy_channel_model
● https://web.stanford.edu/~jurafsky/slp3/3.pdf
● https://solr.apache.org/guide/6_6/suggester.html
● Solr 7.5 TextTagger
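Candidate generation can be sketched in the spirit of the Norvig article linked above: generate every string one edit away, keep those found in a vocabulary, and pick the most frequent. The tiny vocabulary and its counts are made up, and term frequency stands in for the letter n-gram priors mentioned on the slide.

```python
from collections import Counter
import string

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, term_counts):
    """Return word if known, else the most frequent known candidate, else word."""
    if word in term_counts:
        return word
    candidates = [w for w in edits1(word) if w in term_counts]
    return max(candidates, key=term_counts.get) if candidates else word

# Hypothetical vocabulary with made-up frequencies.
term_counts = Counter({"hadoop": 40, "java": 90, "lava": 5})
print(correct("hadop", term_counts))  # hadoop
print(correct("jaav", term_counts))   # java
```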
Query re-writing
●Improve query performance
●Process: add abbreviations, synonyms, and common misspellings -> proper spelling (hash)
●additional tokens
○ For recall (adding synonyms, abbreviations, alternate phrasings, misspellings)
○ For precision (pairwise boost of query terms, consider POS tags)
●4-term query with an abbreviation: ABC_abbrev
○ A B C D
●Simple re-written query – abbrev + pairwise boost
○ (A B C D) or (ABC_abbrev ) and ( “A B”~1000 or “B C”~1000 or “C D”~1000)
○ Even better when leveraging related semantic items
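The abbreviation plus pairwise-boost rewrite can be assembled mechanically. This sketch builds a Lucene-style query string; the explicit parenthesization and the helper name `rewrite_query` are assumptions, and the output has not been checked against any particular query parser.

```python
def rewrite_query(terms, abbreviations=(), slop=1000):
    """Build a recall clause (terms OR abbreviations) AND a pairwise proximity clause."""
    base = "(" + " ".join(terms) + ")"
    recall = " OR ".join([base] + [f"({a})" for a in abbreviations])
    # Adjacent term pairs as sloppy phrases, e.g. "A B"~1000.
    pairs = [f'"{a} {b}"~{slop}' for a, b in zip(terms, terms[1:])]
    if not pairs:
        return recall
    return f"({recall}) AND ({' OR '.join(pairs)})"

print(rewrite_query(["A", "B", "C", "D"], ["ABC_abbrev"]))
# ((A B C D) OR (ABC_abbrev)) AND ("A B"~1000 OR "B C"~1000 OR "C D"~1000)
```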
Query relaxation
●When
○no results ?
○Careful - riskier
●Consider this approach – highlight matching terms
○may want to show results that match only part of the query (cross out the missing terms when presenting each result)
○ignore stop words
○drop specificity: white HDMI cable -> HDMI cable (color may or may not be important)
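A minimal sketch of relaxation, assuming the caller supplies an ordered list of terms that are safe to drop. The in-memory catalog and the `search` function are stand-ins for a real engine, used only for illustration.

```python
def relax_query(terms, search, droppable):
    """Drop droppable terms one at a time until search returns results."""
    current = list(terms)
    dropped = []
    results = search(current)
    for term in droppable:
        if results:
            break
        if term in current:
            current.remove(term)
            dropped.append(term)
            results = search(current)
    return results, dropped

# Hypothetical catalog standing in for a search index.
CATALOG = ["black hdmi cable 2m", "hdmi cable 1m", "usb cable white"]

def search(terms):
    """Return documents containing every query term."""
    return [doc for doc in CATALOG if all(t in doc.split() for t in terms)]

results, dropped = relax_query(["white", "hdmi", "cable"], search, droppable=["white"])
print(results)  # ['black hdmi cable 2m', 'hdmi cable 1m']
print(dropped)  # ['white']
```

Returning the dropped terms lets the interface cross them out next to each result, so the searcher can see what was relaxed.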
Entity recognition
●strings to things
○ synonyms (isa = equivalent terms, careful)
○ hypernyms (e.g. the hypernym footwear instead of the more specific hyponyms sneakers or hiking boots)
○ taxonomies (hierarchical lists)
○ ontologies ( = taxonomy with relations , this drug treats this condition)
○ knowledge graphs
○Related? = weighted synonyms? [ if language based ]
■ vectors (word, paragraph, doc, graph): language-based vectors are sometimes better (on longer queries, not on short queries); word2vec (captures placement, not synonymy), GloVe, BERT (language)
■ Nearest neighbor search similarity
■ effective AI-driven autocomplete (semantic search) (http://aipoweredsearch.com)
■ “Semantic Search with Dense Vectors” https://www.manning.com/books/ai-powered-search?a_aid=1&a_bid=e47ada24&chan=aips
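Nearest-neighbor similarity over vectors can be illustrated with cosine similarity and a brute-force scan. The three-dimensional vectors below are invented for the example; real embeddings would come from a trained model such as word2vec or BERT.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(query, vectors, k=2):
    """Return the k terms whose vectors are closest to the query term's vector."""
    scored = [(cosine(vectors[query], vec), term)
              for term, vec in vectors.items() if term != query]
    scored.sort(reverse=True)
    return [term for _, term in scored[:k]]

# Made-up vectors, for illustration only.
VECTORS = {
    "sneakers": [0.9, 0.1, 0.0],
    "hiking boots": [0.8, 0.3, 0.1],
    "footwear": [0.85, 0.2, 0.05],
    "hdmi cable": [0.0, 0.1, 0.9],
}
print(nearest("footwear", VECTORS))
```

With these vectors the two footwear hyponyms rank far above the unrelated cable. At scale the brute-force scan would be replaced by an approximate nearest neighbor (ANN) index.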
Syntactic analysis
●NLP POS (Natural Language Processing /Part of speech)
○shallow parse
○Noun Phrases
●NER (named entities, relations, semantic types, weighted, implied/direct matches)
Machine Learning research and development Portland, OR software engineer Hadoop, java
NLP noun phrases
• machine learning
• research and development
• Portland, OR
• software engineer
• Hadoop
• java
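The slide obtains these phrases from an NLP parser. As a simpler stand-in, the sketch below segments the same query with a phrase dictionary and greedy longest match; the dictionary contents are assumed, and a POS-based chunker would replace this in practice.

```python
def segment(query, phrases, max_len=3):
    """Greedy longest-match segmentation against a set of known phrases."""
    tokens = query.lower().replace(",", " ").split()
    segments, i = [], 0
    while i < len(tokens):
        # Try the longest window first, then fall back to shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if n == 1 or candidate in phrases:
                segments.append(candidate)
                i += n
                break
    return segments

# Hypothetical phrase dictionary (could be mined from content or query logs).
PHRASES = {"machine learning", "research and development",
           "portland or", "software engineer"}
query = ("Machine Learning research and development "
         "Portland, OR software engineer Hadoop, java")
print(segment(query, PHRASES))
# ['machine learning', 'research and development', 'portland or',
#  'software engineer', 'hadoop', 'java']
```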
Semantics
●Vectors (word, doc, graph parse/traversal)
○not the actual word but an embedding [computable space]
■ (footwear -> high heels, foot wear -> white socks)
○query classification
○query semantic matches (relaxation to different concepts where it makes sense)
■ Disambiguation
■ spelling correction
●Influence of business rules (more $$ result) (careful $ != happy users)
Search Suggestions 2 jobs
●Reduce searcher effort
●Improve query performance
Autosuggest ranked candidates
Query segmentation - semantic units
●(classes noun phrases / relations) -dictionary or statistical
●One Suggestion
○from content (initial proxy for NP frequency)
○query processing (query logs)
○combination (covers queries aligned with content)
●Learn Related/Similar (Deep Learning) – Neural Search
○Word representations (learn relatedness = synonyms, antonyms)
○Document representations (classifier, level of writing, etc)
○DL excels here:
■ Image representations – binary – not caption text
■ Query can be an image – finds similar images to it
Semantic Knowledge Graph
Long quiz question (query)
●Query: Geriatric diabetic male concerned about wound care pain options?
●Require concepts for: Geriatric, male, pain – get result count ?
○If zero results – query relax process until results exist
■ (over specified to the content)
■ Reduce required concepts (context to respond)
■ some concepts are not needed (created a list)
●Precision: Rank order leveraging the non-concept words
○(“wound”, “care”, “concern”, “options”)
●Why: Boosting context results = generic context
○(Content 1.5M diabetic documents responses)
Lexical Databases
●knowledge graphs
●hierarchies (taxonomies / ontologies)
●cool how to leverage knowledge graphs
○ https://wiki.pathmind.com/graph-analysis
●Graph Matching Networks for Learning the Similarity of Graph Structured Objects
Words / Meanings
●Words are ambiguous
●Meanings are not
Meaning manifolds – Computable Space (Semantic) -
https://arxiv.org/pdf/2011.09413.pdf
Noun Phrases -> Semantically close manifolds
Machine learning
Noun Phrase              | Near match         | Another near match
Machine learning         | Data scientist     | AI or ML or DL
Research and development | R&D                | Applied research
Portland, OR             | Portland, Oregon   | Geo_location
Software engineer        | Software developer | AI programmer
hadoop                   | Big data           | Hive or Spark or Databricks or SnowFlake or Google Big Table
Query, Parsing, Semantics, Expanded
Other Similarity Ranked Response techniques
●BRF: Blind Relevance Feedback
○ short queries to long queries – dip into longer docs to find an MLT vector, then use it to find similar items in other short content (Sherpath {Nursing 101 Courseware} leveraged this initially)
○ Topic – docs as bag of topics (top TF/IDF terms from top n-results)
○ Ex. sterilization – topic to cover – determine the top MLT query terms; the results are matched to questions to ask a student
■ Proper hand washing
■ Planned parenthood (vasectomy, i.e. male sterilization)
●MLT (More Like This): word-based, or semantic (not a word matcher = ANN, SKG)
●LLT (Less Like This) – UMass (B. Croft, 1990s) – very long Boolean expressions
●Recommenders – ranked response predicting the user's next action
○ huge in what to purchase/listen to next (Amazon, Netflix, Spotify, …)
○ Multi-armed, stochastic bandits (A/B testing)
○ https://www.aaai.org/AAAI21Papers/AAAI-8642.LuS.pdf
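Blind relevance feedback can be sketched as: take the top results of a first-pass search, score their terms by TF-IDF, and append the best ones to the query. The documents are invented, `top_docs` is assumed to come from an earlier search for "sterilization", and this particular TF-IDF weighting is one common choice rather than the one the slides used.

```python
import math
from collections import Counter

def expansion_terms(query_terms, top_docs, all_docs, k=2):
    """Pick the k highest TF-IDF terms from top_docs that are not already in the query."""
    n_docs = len(all_docs)
    # Document frequency over the whole collection; top_docs must be part of all_docs.
    doc_freq = Counter(t for doc in all_docs for t in set(doc.split()))
    term_freq = Counter(t for doc in top_docs for t in doc.split())
    scores = {t: tf * math.log(n_docs / doc_freq[t])
              for t, tf in term_freq.items() if t not in query_terms}
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Hypothetical collection, for illustration only.
DOCS = [
    "sterilization proper hand washing hand hygiene",
    "hand washing steps before sterilization",
    "instrument storage and labeling",
    "patient intake and scheduling",
    "billing codes and insurance",
]
top_docs = DOCS[:2]  # assumed top results for the query "sterilization"
print(expansion_terms(["sterilization"], top_docs, DOCS))  # ['hand', 'washing']
```

The short query "sterilization" is expanded toward the hand washing sense purely from the content of its top results.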
Instant search
● Skip autocomplete –
○ Run the query (search probe)
○ Visible communication mechanisms
■ highlight what is important in a search result
■ Cross out what a given result is missing from the search probe
■ Finding Love in content
Test
●New Technology
○search better unless
■ HRT/SRF – proven better
■ combine
● Evaluated performance – total engagement (search value)
● Low query performance ( )
● Overall better
■ cost effective
References
● Max Irwin - BERT Search experience https://opensourceconnections.com/blog/2021/09/01/the-bert-search-experience
● Daniel Tunkelang - Query Understanding https://queryunderstanding.com/introduction-c98740502103
● Trey Grainger
○ User Intent - AI powered search https://www.youtube.com/watch?v=sTWMn0LSoiA
○ Semantic Knowledge Graph https://www.treygrainger.com/posts/page/2/ (27 mins in is the step by step technology)
○ code https://github.com/careerbuilder/semantic-knowledge-graph
● State of the Art in NLP nlpprogress.com/
● Neural Search – Neural Info retrieval - SIGIR 2016 – www.Microsoft.com/en-us/research/event/neuir2016
● Deep Learning 4 search https://github.com/dl4s/dl4s
● Solr Graph Traversal https://solr.apache.org/guide/6_6/graph-traversal.html
● Graph neural networks (GNNs)
○ https://distill.pub/2021/gnn-intro/
○ https://distill.pub/2021/understanding-gnns/
○ https://quantdare.com/understanding-neural-networks-with-graphs/
○ There is more nuance than just putting things in graphs for retrieval – user intent
Author Biographies – Thank you for listening
● Matt Corkum
● corkum@gmail.com 904 772 5383
● Founding member Elsevier Labs (2001) and Altavista (1995)
● Interests: Search, NLP, ML/DL, 3D graphics, Big Data Computing/Distributed Processing
● LinkedIn: https://www.linkedin.com/in/matt-corkum-347b0b/
● Twitter: https://twitter.com/matt_corkum
Slides not making the cut
Docs -> terms, query -> terms

Contenu connexe

Similaire à Query Understanding

Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemTrey Grainger
 
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...Lucidworks
 
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools CUHK intern PPT. Machine Translation Evaluation: Methods and Tools
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools Lifeng (Aaron) Han
 
Data science nlp_resume-2018-abridged
Data science nlp_resume-2018-abridgedData science nlp_resume-2018-abridged
Data science nlp_resume-2018-abridgedRangarajan Chari
 
Find it, possibly also near you!
Find it, possibly also near you!Find it, possibly also near you!
Find it, possibly also near you!Paul Borgermans
 
Reflected intelligence evolving self-learning data systems
Reflected intelligence  evolving self-learning data systemsReflected intelligence  evolving self-learning data systems
Reflected intelligence evolving self-learning data systemsTrey Grainger
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingGeeks Anonymes
 
Relevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search TechnologiesRelevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search Technologiesenterprisesearchmeetup
 
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...Neo4j
 
Dice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank TalkDice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank TalkSimon Hughes
 
Disambiguating Polysemous Queries For Document Retrieval
Disambiguating Polysemous Queries For Document RetrievalDisambiguating Polysemous Queries For Document Retrieval
Disambiguating Polysemous Queries For Document RetrievalMadhusudan Daad
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Trey Grainger
 
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)Kai Chan
 
Pycon India 2018 Natural Language Processing Workshop
Pycon India 2018   Natural Language Processing WorkshopPycon India 2018   Natural Language Processing Workshop
Pycon India 2018 Natural Language Processing WorkshopLakshya Sivaramakrishnan
 
Improving Search in Workday Products using Natural Language Processing
Improving Search in Workday Products using Natural Language ProcessingImproving Search in Workday Products using Natural Language Processing
Improving Search in Workday Products using Natural Language ProcessingDataWorks Summit
 
OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)
OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)
OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)Nicolas Van Labeke
 

Similaire à Query Understanding (20)

Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data system
 
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
 
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools CUHK intern PPT. Machine Translation Evaluation: Methods and Tools
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools
 
Data science nlp_resume-2018-abridged
Data science nlp_resume-2018-abridgedData science nlp_resume-2018-abridged
Data science nlp_resume-2018-abridged
 
Tool criticism
Tool criticismTool criticism
Tool criticism
 
Natural Language Processing using Java
Natural Language Processing using JavaNatural Language Processing using Java
Natural Language Processing using Java
 
Find it, possibly also near you!
Find it, possibly also near you!Find it, possibly also near you!
Find it, possibly also near you!
 
Reflected intelligence evolving self-learning data systems
Reflected intelligence  evolving self-learning data systemsReflected intelligence  evolving self-learning data systems
Reflected intelligence evolving self-learning data systems
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Relevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search TechnologiesRelevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search Technologies
 
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
 
Dice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank TalkDice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank Talk
 
Sentiment Analysis
Sentiment AnalysisSentiment Analysis
Sentiment Analysis
 
Disambiguating Polysemous Queries For Document Retrieval
Disambiguating Polysemous Queries For Document RetrievalDisambiguating Polysemous Queries For Document Retrieval
Disambiguating Polysemous Queries For Document Retrieval
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...
 
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
 
Human-Centric Machine Learning
Human-Centric Machine LearningHuman-Centric Machine Learning
Human-Centric Machine Learning
 
Pycon India 2018 Natural Language Processing Workshop
Pycon India 2018   Natural Language Processing WorkshopPycon India 2018   Natural Language Processing Workshop
Pycon India 2018 Natural Language Processing Workshop
 
Improving Search in Workday Products using Natural Language Processing
Improving Search in Workday Products using Natural Language ProcessingImproving Search in Workday Products using Natural Language Processing
Improving Search in Workday Products using Natural Language Processing
 
OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)
OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)
OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)
 

Dernier

Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...aditisharan08
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyFrank van der Linden
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 

Dernier (20)

Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The Ugly
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 

Query Understanding

  • 1. Search 101 Workshop: 1.2 Query Understanding 1 Context, Similarity, and Recommendations Matt Corkum May 2022
  • 2. Query understanding? impact of NLP ? Predict, Execute, Communicate ?
  • 3. Query understanding? ●communication channel [searcher query & search engine] ●FOCUS on WHAT searcher wants ●Measure & optimize query performance ●perfect “search probe” ○Conversation with search engine ○independent of index ■ Focus Less about results ■ don’t expect auto filter irrelevant results Figure out what the user wants 3
  • 4. Clinical Phase (Query ?) ●“My patient is not responding adequately to methotrexate, and I need to consider the next steps” 4
  • 5. 5
  • 6. Searcher? ● Persona ● User context ○ Current characters entered -> query ○ Location ○ Time ○ Date ○ Language ○ This session’s queries ○ History past queries ● Predict User’s intent right now? 6
  • 7. Spelling corrections, Autocomplete, & Instant Search ●aka: autosuggest / query completion ●Feedback loops (good here, not good in DS)
  • 8. Initial focus – Just NLP? ● Or focus PRIOR to calling search ● Or strings to suggesters? ● When? ○ Before the search engine scores ○ Important – nail it =
  • 9. Intent ~= Predict ● 2-stage predict? ○ Characters ■ entered ■ correct misspellings ■ influence of letter frequency ○ Phrases ■ influence of phrases / knowledge ○ Counterfactual estimators ■ logs to predict (offline)
  • 12. Communication ● Chapter 13 of AI-Powered Search (http://aipoweredsearch.com), “Semantic Search with Dense Vectors” ● Also strike out what is missing
  • 13. Search Query Thoughts ● EVERY INSTANCE OF A WORD / PHRASE … HAS A UNIQUE MEANING AT THIS TIME ● YOU CAN’T IMPROVE WHAT YOU CAN’T MEASURE
  • 14. Similar queries = similar results? ● Latest breakthroughs in artificial intelligence ● Latest advancements in artificial intelligence ● Latest advancements on artificial intelligence ● Latest breakthroughs in AI ● Piu’ recenti sviluppi di ricercar sull’’intelligenza artificiale ○ DYM: “Most recent research developments on artificial intelligence” ● Results differ for the same basic query intent ● Search engines are “statistical frequency hunters” out of the box
  • 15. What does Love mean?
  • 16. Query Semantic graph: “Love”
  • 17. Query Love – Child Context
  • 18. Same query + intent / different results ● Query: Buy tickets Boston Garden tomorrow ● Date processing is incorrect, or different events fall on different days ○ Celtics Tuesday ○ Circus Wednesday ○ Sold out, suggest Thursday? ● Conversational AI ○ Next night tickets (context of past queries) ○ Useful to disambiguate sometimes
  • 19. Processes: Search / AI pipelines
  • 20. Lucidworks Fusion – AI/Search Product (2018)
  • 21. Query Text Analysis -> Autocomplete / Instant Search ● Character filtering: diacritics, Unicode, capitalization ● Terms: tokenization (punctuation: commas, hyphens, apostrophes, …) ● Spell correction: 10-15% of queries are misspelled; Solr 7.5 SolrTextTagger ● Inflection: stemming / lemmatizing ● Syntactic analysis: NLP = POS (noun phrases), NER (entity recognition/extraction) ■ Hierarchies (taxonomies and ontologies) ● Semantic analysis: vectors (word2vec, node2vec), deep parse-dependency, BERT (context sensitive) ■ Query segmentation, tagging, & scoping ■ Lexical databases, knowledge graphs ■ Disambiguation ● Query re-writing: add tokens to help ● Query relaxation: do not consider all tokens [when no results; risky, be careful] ● LTR / adjustments: make $$, crowd-source signals ● Autocomplete: yippy! ● Instant search: very cool if all aligns with confidence -> run new search query / results
  • 22. User Query ● Machine Learning research and development Portland, OR software engineer Hadoop, java ● Traditional Lucene query parser reads this as: ○ (Machine and Learning and research and development and Portland) ○ OR ○ (software and engineer and Hadoop and java) ● YUK!
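The failure on this slide can be imitated with a toy parser. This is a minimal sketch, not Lucene's actual query parser: `naive_boolean_parse` is a hypothetical function that treats the uppercase token `OR` as an operator and drops the lowercase word "and", which is enough to reproduce the split shown above.

```python
def naive_boolean_parse(query):
    """Toy imitation: uppercase OR splits clauses, remaining words are ANDed."""
    words = query.replace(",", " ").split()
    clauses, current = [], []
    for word in words:
        if word == "OR":          # the state abbreviation is mistaken for an operator
            clauses.append(current)
            current = []
        elif word == "and":       # assumption: lowercase "and" is treated as noise
            continue
        else:
            current.append(word)
    clauses.append(current)
    return " OR ".join("(" + " AND ".join(c) + ")" for c in clauses)

parsed = naive_boolean_parse(
    "Machine Learning research and development Portland, OR software engineer Hadoop, java"
)
print(parsed)
```

The state abbreviation in "Portland, OR" ends up splitting the query into two unrelated clauses, which is the motivation for recognizing phrases and entities before the query reaches the engine.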
  • 23. Character filtering ● Capitalization, diacritics, Unicode, numbers, alphanumeric processing ● Careful ○ Lowercasing conflates abbreviations with ordinary words ○ Consider: process the same content several ways into separate fields
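A minimal sketch of character filtering using only Python's standard `unicodedata` module. The function names and the three output "fields" are illustrative assumptions, meant to show the "same content, many ways" idea rather than any particular engine's analyzer chain.

```python
import unicodedata

def strip_diacritics(text):
    """Decompose characters (NFKD) and drop combining marks, e.g. 'é' -> 'e'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize_query(text):
    """Return several normalized forms of the same text, one per hypothetical field."""
    folded = strip_diacritics(text)
    return {
        "exact": text,                       # untouched: keeps 'US' distinct from 'us'
        "folded": folded,                    # diacritics removed, case preserved
        "folded_lower": folded.casefold(),   # most aggressive form, favors recall
    }

forms = normalize_query("Café near US embassy")
print(forms)
```

Keeping the `exact` form alongside the folded ones lets a ranker prefer case-sensitive matches for abbreviations while still matching on the lowercased field.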
  • 24. Query => Collection of Terms ● Language tokenization ○ (punctuation: commas, hyphens, apostrophes, …) ● N-gram tokenization ○ unigram (1 word), bigram (2 words), trigram (3 words) ○ Character n-gram tokenization (2-letter, 3-letter grams): useful for computing priors for spell checking ● Inflection = conflated strings (shorter term index, more documents) ○ Increases recall; precision may suffer ○ Stemming and lemmatization ■ Both inflection styles may need customization to handle your words ■ (corporals, corporation, corporations -> corp, for a stemmer ... hmmm)
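Word and character n-grams from this slide can be sketched in a few lines of plain Python. The tokenizer regex is a simplifying assumption (lowercase, split on anything that is not a letter, digit, or apostrophe); real analyzers handle hyphens and language-specific rules more carefully.

```python
import re

def word_tokens(query):
    """Lowercase, then keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", query.lower())

def word_ngrams(tokens, n):
    """Sliding window of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(term, n):
    """Sliding window of n consecutive characters within one term."""
    return [term[i:i + n] for i in range(len(term) - n + 1)]

tokens = word_tokens("Machine Learning research and development")
bigrams = word_ngrams(tokens, 2)
letter_trigrams = char_ngrams("hadoop", 3)
print(bigrams)
print(letter_trigrams)
```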
  • 25. Spell correction = Candidate Generation ● 10-15% of queries are misspelled ○ Missing / extra / swapped / replaced letters ○ Leverage n-gram letter language models ■ Priors of letter frequency ■ Bigrams and trigrams ● https://norvig.com/spell-correct.html ● https://en.wikipedia.org/wiki/Noisy_channel_model ● https://web.stanford.edu/~jurafsky/slp3/3.pdf ● https://solr.apache.org/guide/6_6/suggester.html ● Solr 7.5 TextTagger
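Candidate generation for the four error types (missing, extra, swapped, replaced letters) can be sketched in the spirit of the Norvig article linked on this slide. The term counts below are made-up toy data; in practice they would come from content or query logs, and this sketch ranks by term frequency only, without an error model.

```python
from collections import Counter

# Toy term frequencies (hypothetical values).
TERM_COUNTS = Counter({"diabetes": 50, "diabetic": 30, "dietetic": 2})
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def candidates(word):
    """Known terms at edit distance 0, else 1; fall back to the word itself."""
    if word in TERM_COUNTS:
        return {word}
    known = {w for w in edits1(word) if w in TERM_COUNTS}
    return known or {word}

def correct(word):
    """Pick the most frequent candidate (a unigram prior only)."""
    return max(candidates(word), key=lambda w: TERM_COUNTS[w])

print(correct("diabtes"), correct("diabetci"))
```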
  • 26. Query re-writing ● Improve query performance ● Process: add abbreviations and synonyms; map common misspellings -> proper spelling (hash) ● Additional tokens ○ For recall (adding synonyms, abbreviations, alternate phrasings, misspellings) ○ For precision (pairwise boost of query terms, consider POS tags) ● 4-term query that has an abbreviation, ABC_abbrev ○ A B C D ● Simple re-written query: abbreviation + pairwise boost ○ (A B C D) or (ABC_abbrev ) and ( “A B”~1000 or “B C”~1000 or “C D”~1000) ○ Even better when leveraging related semantic items
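The re-written query on this slide can be assembled mechanically. The sketch below is one reading of the slide's expression: the grouping of the recall clause against the proximity clause is an assumption, and `rewrite_query` and the `abbreviations` mapping are hypothetical names standing in for the abbreviation hash.

```python
def rewrite_query(terms, abbreviations):
    """Build a Lucene-style string: (recall clause) AND (pairwise proximity clause).

    `abbreviations` maps a tuple of consecutive terms to an abbreviation token.
    """
    recall = ["(" + " ".join(terms) + ")"]
    for i in range(len(terms)):
        for j in range(i + 1, len(terms) + 1):
            abbrev = abbreviations.get(tuple(terms[i:j]))
            if abbrev:
                recall.append("(" + abbrev + ")")
    # Adjacent term pairs with a generous slop, as on the slide.
    pairs = ['"%s %s"~1000' % (a, b) for a, b in zip(terms, terms[1:])]
    recall_clause = " OR ".join(recall)
    if not pairs:  # single-term query: nothing to pair
        return recall_clause
    return "(%s) AND (%s)" % (recall_clause, " OR ".join(pairs))

query = rewrite_query(["A", "B", "C", "D"], {("A", "B", "C"): "ABC_abbrev"})
print(query)
```

Whether the proximity pairs should be required (AND) or only boost the score is a design choice; making them optional boosts would protect recall at the cost of a weaker precision signal.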
  • 27. Query relaxation ● When ○ No results? ○ Careful: riskier ● Consider this approach: highlight matching terms ○ May want to show results while holding back the part of the query that was not matched (cross out the missing terms when presenting the result documents) ○ Ignore stop words ○ Drop specificity: white HDMI cable -> HDMI cable; color may or may not be important
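A minimal sketch of relaxation against a toy in-memory catalog. The documents, the stop-word list, and the `drop_order` parameter (which terms are least important, tried first) are all illustrative assumptions; a real system would derive the drop order from term types or statistics.

```python
STOP_WORDS = {"a", "an", "the", "for", "of"}

# Toy catalog standing in for a real index (hypothetical data).
DOCS = {
    1: "black hdmi cable 2m",
    2: "hdmi to dvi adapter",
    3: "white usb cable",
}

def search(terms):
    """Return ids of documents containing every term (strict AND)."""
    return [doc_id for doc_id, text in DOCS.items()
            if all(t in text.split() for t in terms)]

def relax(query, drop_order):
    """Drop stop words, then drop terms in `drop_order` until something matches.

    Returns (hits, dropped) so the UI can cross out what was not matched.
    """
    terms = [t for t in query.lower().split() if t not in STOP_WORDS]
    dropped = []
    hits = search(terms)
    for term in drop_order:
        if hits or len(terms) <= 1:
            break
        if term in terms:
            terms.remove(term)
            dropped.append(term)
            hits = search(terms)
    return hits, dropped

hits, dropped = relax("white HDMI cable", drop_order=["white"])
print(hits, dropped)
```

Returning the dropped terms alongside the hits is what makes the "cross out what is missing" presentation possible.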
  • 28. Entity recognition ● Strings to things ○ Synonyms (isa = equivalent terms, careful) ○ Hypernyms (hypernym = footwear, instead of the more specific hyponyms sneakers or hiking boots) ○ Taxonomies (hierarchical lists) ○ Ontologies (= taxonomy with relations, e.g. this drug treats this condition) ○ Knowledge graphs ○ Related? = weighted synonyms? [if language based] ■ Vectors (word, paragraph, doc, graph) to vector (language based is better sometimes: on longer queries, not on short queries): word2vec (not synonyms, but placement), GloVe, BERT (language) ■ Nearest neighbor search similarity ■ Effective AI-driven autocomplete (semantic search) (http://aipoweredsearch.com) ■ “Semantic Search with Dense Vectors” https://www.manning.com/books/ai-powered-search?a_aid=1&a_bid=e47ada24&chan=aips
  • 29. Syntactic analysis ●NLP POS (Natural Language Processing /Part of speech) ○shallow parse ○Noun Phrases ●NER (named entities, relations, semantic types, weighted, implied/direct matches)
  • 30. Query: Machine Learning research and development Portland, OR software engineer Hadoop, java ● NLP noun phrases ○ machine learning ○ research and development ○ Portland, OR ○ software engineer ○ Hadoop ○ java
  • 31. Semantics ●Vectors (word, doc, graph parse/traversal) ○not the actual word but an embedding [computable space] ■ (footwear -> high heels, foot wear -> white socks) ○query classification ○query semantic matches (relaxation to different concepts where it makes sense) ■ Disambiguation ■ spelling correction ●Influence of business rules (more $$ result) (careful $ != happy users)
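The idea of matching in an embedding space rather than on the literal word can be sketched with cosine similarity. The 3-dimensional vectors below are hand-made for illustration only; real embeddings come from a trained model such as word2vec or BERT and have hundreds of dimensions.

```python
import math

# Hand-made vectors purely for illustration (hypothetical values).
EMBEDDINGS = {
    "footwear":   [0.9, 0.1, 0.0],
    "sneakers":   [0.8, 0.2, 0.1],
    "high heels": [0.7, 0.0, 0.3],
    "hdmi cable": [0.0, 0.9, 0.1],
}

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest(term, k=2):
    """Rank the other vocabulary terms by cosine similarity to `term`."""
    query_vec = EMBEDDINGS[term]
    scored = [(cosine(query_vec, vec), other)
              for other, vec in EMBEDDINGS.items() if other != term]
    return [other for _, other in sorted(scored, reverse=True)[:k]]

neighbors = nearest("footwear")
print(neighbors)
```

This brute-force scan is exact; at scale an approximate nearest neighbor (ANN) index would typically replace it.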
  • 32. Search Suggestions: 2 jobs ● Reduce searcher effort ● Improve query performance
  • 34. Query segmentation: semantic units ● (classes, noun phrases / relations): dictionary or statistical ● One suggestion ○ From content (initial proxy for NP frequency) ○ Query processing (query logs) ○ Combination (covers queries aligned with content) ● Learn related/similar (Deep Learning): Neural Search ○ Word representations (learn relatedness = synonyms, antonyms) ○ Document representations (classifier, level of writing, etc.) ○ DL excels here: ■ Image representations: binary, not caption text ■ Query can be an image: finds images similar to it
  • 36. Long quiz question (query) ● Query: Geriatric diabetic male concerned about wound care pain options? ● Require concepts for: geriatric, male, pain; get the result count ○ If zero results, relax the query until results exist ■ (the query is over-specified relative to the content) ■ Reduce the required concepts (context to respond) ■ Some concepts are not needed (created a list) ● Precision: rank order leveraging the non-concept words ○ (“wound”, “care”, “concern”, “options”) ● Why: boosting context results = generic context ○ (Content: 1.5M diabetic documents responses)
  • 37. Lexical Databases ●knowledge graphs ●hierarchies (taxonomies / ontologies) ●cool how to leverage knowledge graphs ○ https://wiki.pathmind.com/graph-analysis ●Graph Matching Networks for Learning the Similarity of Graph Structured Objects
  • 38. Words / Meanings ● Words are ambiguous ● Meanings are not
  • 39. Meaning manifolds – Computable Space (Semantic) - https://arxiv.org/pdf/2011.09413.pdf
  • 40. Noun Phrases -> Semantically close manifolds
      Noun phrase | Near match | Another near match
      Machine learning | Data scientist | AI or ML or DL
      Research and development | R&D | Applied research
      Portland, OR | Portland, Oregon | Geo_location
      Software engineer | Software developer | AI programmer
      Hadoop | Big data | Hive or Spark or Databricks or Snowflake or Google Bigtable
  • 42. Other Similarity Ranked Response techniques ● BRF: Blind Relevance Feedback ○ Short queries to long queries: dip into longer docs to find an MLT vector, then use it to find similar things in other short content (Sherpath {Nursing 101 Courseware} leveraged this initially) ○ Topic: docs as a bag of topics (top TF/IDF terms from the top n results) ○ Ex. sterilization as the topic to cover: determined the top MLT query terms, and the results were used to match questions to ask a student ■ Proper hand washing ■ Planned parenthood (vasectomy, i.e. making a male sterile) ● MLT (More Like This, word based) or MLT (semantic, not a word matcher: ANN, SKG) ● LLT (Less Like This): UMass (B. Croft, 1990s), very long Boolean expressions ● Recommenders: ranked response predicting the user's next action ○ Huge in what to purchase / listen to next (Amazon, Netflix, Spotify, …) ○ Multi-armed, stochastic bandits (A/B testing) ○ https://www.aaai.org/AAAI21Papers/AAAI-8642.LuS.pdf
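Blind relevance feedback as described here (top TF/IDF terms from the top n results, fed back into the query) can be sketched in plain Python. The corpus, the stop-word list, and the overlap-count first pass are all toy assumptions standing in for a real engine; the TF-IDF weighting is one common variant, not necessarily the one used in the work the slide mentions.

```python
import math
from collections import Counter

# Tiny hypothetical corpus; ids and texts are illustrative only.
CORPUS = {
    "d1": "sterilization of instruments requires proper hand washing and autoclave",
    "d2": "hand washing technique before sterilization of surgical instruments",
    "d3": "patient scheduling and billing codes",
}
STOP_WORDS = {"of", "and", "before"}

def tokenize(text):
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def first_pass(query_terms, n=2):
    """Rank docs by how many query terms they contain (stand-in for the engine)."""
    scores = {d: sum(t in tokenize(text) for t in query_terms)
              for d, text in CORPUS.items()}
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return [d for d in ranked if scores[d] > 0][:n]

def expansion_terms(query_terms, feedback_docs, k=3):
    """Top-k TF-IDF terms from the feedback docs, excluding the original terms."""
    df = Counter(t for text in CORPUS.values() for t in set(tokenize(text)))
    tf = Counter(t for d in feedback_docs for t in tokenize(CORPUS[d]))
    weights = {t: tf[t] * math.log(1 + len(CORPUS) / df[t])
               for t in tf if t not in query_terms}
    return sorted(weights, key=lambda t: (-weights[t], t))[:k]

query = ["sterilization"]
top_docs = first_pass(query)
expanded = query + expansion_terms(query, top_docs)
print(expanded)
```

The feedback is "blind" because the top results are assumed relevant without any user judgment, so a poor first pass can drift the expanded query off topic.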
  • 43. Instant search ● Skip autocomplete ○ Run the query (search probe) ○ Visible communication mechanisms ■ Highlight what is important in a search result ■ Cross out what a given result is missing from the search probe ■ Finding Love in content
  • 44. Test ●New Technology ○search better unless ■ HRT/SRF – proven better ■ combine ● Evaluated performance – total engagement (search value) ● Low query performance ( ) ● Overall better ■ cost effective
  • 45. References ● Max Irwin - BERT Search experience https://opensourceconnections.com/blog/2021/09/01/the-bert-search-experience ● Daniel Tunkelang - Query Understanding https://queryunderstanding.com/introduction-c98740502103 ● Trey Grainger ○ User Intent - AI powered search https://www.youtube.com/watch?v=sTWMn0LSoiA ○ Semantic Knowledge Graph https://www.treygrainger.com/posts/page/2/ (27 mins in is the step by step technology) ○ code https://github.com/careerbuilder/semantic-knowledge-graph ● State of the Art in NLP nlpprogress.com/ ● Neural Search – Neural Info retrieval - SIGIR 2016 – www.Microsoft.com/en-us/research/event/neuir2016 ● Deep Learning 4 search https://github.com/dl4s/dl4s ● Solr Graph Traversal https://solr.apache.org/guide/6_6/graph-traversal.html ● Graph neural networks (GNNs) ○ https://distill.pub/2021/gnn-intro/ ○ https://distill.pub/2021/understanding-gnns/ ○ https://quantdare.com/understanding-neural-networks-with-graphs/ ○ There is more nuance than just putting things in graphs for retrieval – user intent
  • 46. Author Biographies – Thank you for listening ● Matt Corkum ● corkum@gmail.com 904 772 5383 ● Founding member Elsevier Labs (2001) and Altavista (1995) ● Interests: Search, NLP, ML/DL, 3D graphics, Big Data Computing/Distributed Processing ● LinkedIn: https://www.linkedin.com/in/matt-corkum-347b0b/ ● Twitter: https://twitter.com/matt_corkum
  • 47. Slides not making the cut
  • 48.
  • 49. Docs -> terms, query -> terms