3. Query understanding?
●Communication channel [between the searcher's query & the search engine]
●FOCUS on WHAT the searcher wants
●Measure & optimize query performance
●The perfect "search probe"
○A conversation with the search engine
○Independent of the index
■ Focus less on the results themselves
■ Don't expect it to auto-filter irrelevant results
Figure out what the user wants
4. Clinical Phase (Query?)
●"My patient is not responding adequately to methotrexate, and I need to consider the next steps"
6. Searcher?
● Persona
● User context
○ Current characters entered -> query
○ Location
○ Time
○ Date
○ Language
○ This session’s queries
○ History of past queries
● Predict the user's intent right now?
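The context features above can be gathered into one per-request object. A minimal sketch (the class name and fields are assumptions, not the deck's implementation):

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class QueryContext:
    """Hypothetical container for the per-request searcher context."""
    partial_query: str                                  # characters typed so far
    location: str = ""
    timestamp: datetime = field(default_factory=datetime.now)
    language: str = "en"
    session_queries: list = field(default_factory=list)  # this session's queries
    history_queries: list = field(default_factory=list)  # past sessions

ctx = QueryContext(partial_query="mach", session_queries=["python jobs"])
print(ctx.language)  # en
```

An intent-prediction model would consume exactly these fields as features.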
12. Communication
●Chapter 13 of AI-Powered Search (http://aipoweredsearch.com), "Semantic Search with Dense Vectors"
●Also strike out what is missing from each result
13. Search Query Thoughts
EVERY INSTANCE OF A WORD / PHRASE … HAS A UNIQUE MEANING AT THIS TIME
YOU CAN'T IMPROVE WHAT YOU CAN'T MEASURE
14. Similar queries = similar results ?
●Latest breakthroughs in artificial intelligence
●Latest advancements in artificial intelligence
●Latest advancements on artificial intelligence
●Latest breakthroughs in AI
●Più recenti sviluppi di ricerca sull'intelligenza artificiale
○DYM: "Most recent research developments on artificial intelligence"
●Results differ for the same basic query intent
●Search engines are "statistical frequency hunters" out of the box
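One way to make these surface variants behave alike is to normalize them to a canonical form before frequency statistics take over. A toy sketch (the phrase and synonym maps are invented for illustration; a real system would be far more careful, e.g. blindly folding "on" to "in" is risky):

```python
import re

# Assumed, tiny maps: fold a known phrase, then fold word-level variants.
PHRASES = {"artificial intelligence": "ai"}
SYNONYMS = {"breakthroughs": "advancements", "on": "in"}

def normalize(query: str) -> str:
    q = query.lower()
    q = re.sub(r"[^\w\s]", " ", q)          # drop punctuation
    for phrase, canon in PHRASES.items():
        q = q.replace(phrase, canon)         # phrase folding first
    tokens = [SYNONYMS.get(t, t) for t in q.split()]
    return " ".join(tokens)

queries = [
    "Latest breakthroughs in artificial intelligence",
    "Latest advancements in artificial intelligence",
    "Latest advancements on artificial intelligence",
    "Latest breakthroughs in AI",
]
print({normalize(q) for q in queries})  # {'latest advancements in ai'}
```

All four variants collapse to one canonical query, so they can share one result set and one set of performance measurements.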
18. Same query + intent / different results
●Query: Buy tickets Boston Garden tomorrow
●Date processing may be incorrect, or different events run on different days
○Celtics on Tuesday
○Circus on Wednesday
○Sold out? Suggest Thursday
●Conversational AI
○"Next night tickets" (context of past queries)
○Sometimes useful for disambiguation
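The date-processing piece can be sketched as a toy resolver (not the deck's implementation): relative words resolve against "now", and the previous query's resolved date serves as conversational context for "next night":

```python
from datetime import date, timedelta

# Assumed mapping of relative date words to day offsets.
RELATIVE = {"today": 0, "tomorrow": 1}

def resolve_date(query, today, prior=None):
    q = query.lower()
    for word, offset in RELATIVE.items():
        if word in q:
            return today + timedelta(days=offset)
    if "next night" in q and prior is not None:
        return prior + timedelta(days=1)   # shift off the prior query's date
    return None

today = date(2024, 6, 4)  # a Tuesday, chosen arbitrarily
d1 = resolve_date("Buy tickets Boston Garden tomorrow", today)
d2 = resolve_date("next night tickets", today, prior=d1)
print(d1, d2)  # 2024-06-05 2024-06-06
```

Without the `prior` context, "next night tickets" is unanswerable, which is exactly the disambiguation value of session history.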
21. Query Text Analysis -> Autocomplete / Instant Search
● Character filtering: diacritics, Unicode, capitalization
● Term tokenization: punctuation (commas, hyphens, apostrophes, …)
● Spell correction: 10-15% of queries are misspelled; Solr 7.5 SolrTextTagger
● Inflection: stemming / lemmatizing
● Syntactic analysis: NLP = POS tagging (noun phrases), NER (entity recognition/extraction)
■ hierarchies (taxonomies and ontologies)
● Semantic analysis: vectors (word2vec, node2vec), deep parse dependency, BERT (context-sensitive)
■ Query segmentation, tagging, & scoping
■ Lexical databases - knowledge graphs
■ Disambiguation
● Query re-writing: add tokens to help
● Query relaxation: do not require all tokens [when there are no results - risky - be careful]
● LTR / adjustments: make $$, crowd-source signals
● Autocomplete: yippee!
● Instant search: very cool if everything aligns with confidence -> run the new search query / show results
22. User Query
●Machine Learning research and development Portland, OR software engineer Hadoop, java
●Traditional Lucene query parser:
●(Machine AND Learning AND research AND development AND Portland)
●OR
●(software AND engineer AND Hadoop AND java)
YUK!
23. Character filtering
●Capitalization, diacritics, Unicode, numbers, alphanumeric processing
●Be careful:
○Lowercasing conflates abbreviations (e.g., "US" the country vs. "us" the pronoun)
○Consider processing the same content several ways into separate fields
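A minimal character-filtering sketch using the standard library: NFKD-decompose, strip combining marks (diacritics), and lowercase, while an assumed abbreviation whitelist avoids the conflation problem above:

```python
import unicodedata

# Assumed whitelist: tokens whose case carries meaning.
ABBREVIATIONS = {"US", "IT", "OR"}

def fold(token: str) -> str:
    if token in ABBREVIATIONS:
        return token                       # keep case to avoid conflation
    decomposed = unicodedata.normalize("NFKD", token)
    # Combining characters are the detached diacritic marks; drop them.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()

print(fold("Café"), fold("US"))  # cafe US
```

Indexing the same field both folded and unfolded (the "many ways into fields" advice) lets ranking prefer exact matches while folded matches still recall.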
24. Query => Collection of Terms
●Language tokenization
○(punctuation - commas, hyphens, apostrophes, …)
●N-gram tokenization
○unigrams (1 word), bigrams (2 words), trigrams (3 words)
○character n-gram tokenization (2- and 3-letter grams) - useful for computing priors for spell checking
●Inflection = conflated strings (shorter term index, more documents per term)
○increases recall; precision may suffer
○Stemming and lemmatization
■ Both inflection styles may need customization to handle your vocabulary
■ (corporals, corporation, corporations -> corp for a stemmer - hmmm)
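The character n-gram idea can be sketched in a few lines: generate 2- and 3-letter grams over a toy vocabulary and count them, giving the frequency priors a spell checker would consult (the boundary marker `#` is a common convention, assumed here):

```python
from collections import Counter

def char_ngrams(word: str, n: int):
    padded = f"#{word}#"                  # mark word boundaries
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

counts = Counter()
for word in ["search", "searcher", "research"]:
    counts.update(char_ngrams(word, 2))   # 2-letter grams
    counts.update(char_ngrams(word, 3))   # 3-letter grams

print(counts["ear"])  # "ear" occurs in all three words -> 3
```

A candidate correction containing only high-frequency grams is more plausible than one containing grams never seen in the corpus.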
25. Spell correction = Candidate Generation
● 10-15% of queries are misspelled
○ missing/extra/swapped/replaced letters
○ leverage character n-gram language models
■ priors from letter frequencies
■ bigrams and trigrams
● https://norvig.com/spell-correct.html
● https://en.wikipedia.org/wiki/Noisy_channel_model
● https://web.stanford.edu/~jurafsky/slp3/3.pdf
● https://solr.apache.org/guide/6_6/suggester.html
● Solr 7.5 SolrTextTagger
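Candidate generation in the Norvig style linked above: every string one edit (delete, transpose, replace, insert) away from the query term, filtered against a known vocabulary (the toy `VOCAB` is an assumption):

```python
import string

def edits1(word):
    """All strings exactly one edit away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

VOCAB = {"search", "query", "relevance"}   # stand-in for the index vocabulary

def candidates(word):
    return (VOCAB & edits1(word)) or {word}

print(candidates("serch"))  # {'search'}
```

The noisy-channel model then ranks these candidates by prior frequency times error likelihood; the n-gram priors from the previous slide feed that ranking.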
26. Query re-writing
●Improve query performance
●Process: map abbreviations, synonyms, and common misspellings to the proper spelling (hash lookup)
●Additional tokens
○ For recall (adding synonyms, abbreviations, alternate phrasings, misspellings)
○ For precision (pairwise boosts of query terms; consider POS tags)
●A 4-term query containing an abbreviation: ABC_abbrev
○ A B C D
●Simple re-written query - abbreviation + pairwise boosts
○ (A B C D) OR (ABC_abbrev) AND ("A B"~1000 OR "B C"~1000 OR "C D"~1000)
○ Even better when leveraging related semantic items
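Building that rewritten query string is mechanical; a sketch (the function name and exact grouping are assumptions, following the slide's example):

```python
def rewrite(tokens, abbrev):
    """Rewrite a token list as: original terms OR abbreviation,
    AND'd with pairwise phrase boosts at a large slop (~1000)."""
    base = " ".join(tokens)
    pairs = [f'"{a} {b}"~1000' for a, b in zip(tokens, tokens[1:])]
    return f"(({base}) OR ({abbrev})) AND ({' OR '.join(pairs)})"

print(rewrite(["A", "B", "C", "D"], "ABC_abbrev"))
# (((A B C D) OR (ABC_abbrev)) AND ("A B"~1000 OR "B C"~1000 OR "C D"~1000))
# minus the outermost parens shown on the slide
```

The loose-slop phrase clauses reward documents where adjacent query terms appear near each other, sharpening precision without excluding anything.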
27. Query relaxation
●When?
○No results
○Careful - riskier
●Consider this approach - highlight matching terms
○You may want to show results while holding back the unmatched part of the query (cross out the missing terms when presenting each result)
○Ignore stop words
○Drop specificity: "white HDMI cable" -> "HDMI cable" (the color may or may not be important)
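The relaxation ladder above can be sketched as a loop: require everything first, then drop stop words, then modifiers, stopping as soon as anything matches and reporting which terms were held back (the stop/modifier lists and `run_search` are stand-ins):

```python
STOP = {"the", "a", "in"}
MODIFIERS = {"white", "cheap", "new"}   # assumed low-importance qualifiers

def relax(terms, run_search):
    attempts = [terms,                                        # all terms
                [t for t in terms if t.lower() not in STOP],  # minus stop words
                [t for t in terms if t.lower() not in STOP | MODIFIERS]]
    for attempt in attempts:
        results = run_search(attempt)
        if results:
            dropped = [t for t in terms if t not in attempt]
            return results, dropped      # dropped terms -> crossed out in UI
    return [], terms

# Toy one-document index: only "hdmi cable" matches.
def run_search(terms):
    doc = {"hdmi", "cable"}
    return ["doc1"] if {t.lower() for t in terms} <= doc else []

print(relax(["white", "HDMI", "cable"], run_search))  # (['doc1'], ['white'])
```

Returning the dropped terms is what enables the "cross out the missing terms" presentation from the bullet above.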
28. Entity recognition
●Strings to things
○ synonyms (is-a / equivalent terms; careful)
○ hypernyms (e.g., the hypernym "footwear" instead of the more specific hyponyms "sneakers" or "hiking boots")
○ taxonomies (hierarchical lists)
○ ontologies (= a taxonomy with relations, e.g., this drug treats this condition)
○ knowledge graphs
○Related? = weighted synonyms? [if language-based]
■ word/paragraph/document/graph vectors (language-based is sometimes better - it helps on longer queries, not on short ones); word2vec (placement, not synonymy), GloVe, BERT (language)
■ Nearest-neighbor similarity search
■ Effective AI-driven autocomplete (semantic search) (http://aipoweredsearch.com)
■ "Semantic Search with Dense Vectors" https://www.manning.com/books/ai-powered-search?a_aid=1&a_bid=e47ada24&chan=aips
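Nearest-neighbor similarity over embeddings reduces to cosine similarity; a sketch with made-up 3-dimensional vectors (real embeddings are hundreds of dimensions and learned, not hand-written):

```python
import math

# Toy embeddings, invented for illustration only.
EMBEDDINGS = {
    "sneakers":     [0.9, 0.1, 0.0],
    "hiking boots": [0.8, 0.2, 0.1],
    "footwear":     [0.7, 0.3, 0.1],
    "laptop":       [0.0, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def nearest(term, k=2):
    q = EMBEDDINGS[term]
    scored = [(cosine(q, v), w) for w, v in EMBEDDINGS.items() if w != term]
    return [w for _, w in sorted(scored, reverse=True)[:k]]

print(nearest("sneakers"))  # ['hiking boots', 'footwear']
```

Because placement in the space reflects distributional context rather than synonymy, "sneakers" sits near its hypernym "footwear" as well as its sibling "hiking boots", which is exactly the weighted-synonym behavior the bullet describes.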
30. Machine Learning research and development Portland, OR software engineer Hadoop, java
NLP noun phrases:
• machine learning
• research and development
• Portland, OR
• software engineer
• Hadoop
• java
31. Semantics
●Vectors (word, doc, graph parse/traversal)
○not the actual word but an embedding [a computable space]
■ (footwear -> high heels, foot wear -> white socks)
○query classification
○query semantic matches (relaxation to different concepts where it makes sense)
■ Disambiguation
■ Spelling correction
●Influence of business rules (more $$ per result) (careful: $ != happy users)
36. Long quiz question (query)
●Query: Geriatric diabetic male concerned about wound care pain options?
●Require concepts for: geriatric, male, pain - get the result count
○If zero results, run the query-relaxation process until results exist
■ (the query is over-specified relative to the content)
■ Reduce the required concepts (keep enough context to respond)
■ Some concepts are not needed (create a list)
●Precision: rank order leveraging the non-concept words
○("wound", "care", "concern", "options")
●Why: boosting context results = generic context
○(Content: 1.5M diabetic-document responses)
37. Lexical Databases
●knowledge graphs
●hierarchies (taxonomies / ontologies)
●A cool way to leverage knowledge graphs:
○ https://wiki.pathmind.com/graph-analysis
●"Graph Matching Networks for Learning the Similarity of Graph Structured Objects"
39. Meaning manifolds - Computable Space (Semantic)
●https://arxiv.org/pdf/2011.09413.pdf
40. Noun Phrases -> Semantically close manifolds

Noun Phrase              | Near match         | Another near match
Machine learning         | Data scientist     | AI or ML or DL
Research and development | R&D                | Applied research
Portland, OR             | Portland, Oregon   | Geo_location
Software engineer        | Software developer | AI programmer
hadoop                   | Big data           | Hive or Spark or Databricks or SnowFlake or Google Big Table
42. Other Similarity-Ranked Response techniques
●BRF: Blind Relevance Feedback
○ Turns short queries into long queries - dipping into longer docs to build an MLT vector that finds similar items in other short content (Sherpath {Nursing 101 courseware} leveraged this initially)
○ Topic - docs as a bag of topics (top TF-IDF terms from the top-n results)
○ Example: sterilization - a topic to cover - determine the top MLT query terms, with results matching questions to ask a student
■ Proper hand washing
■ Planned Parenthood (e.g., a vasectomy - making a male sterile)
●MLT (More Like This - word matching) or semantic MLT (not a word matcher: ANN, SKG)
●LLT (Less Like This) - UMass (B. Croft, 1990s) - very long Boolean expressions
●Recommenders - ranked responses predicting the user's next action
○ Huge for what to purchase / listen to next (Amazon, Netflix, Spotify, …)
○ Multi-armed, stochastic bandits (A/B testing)
○ https://www.aaai.org/AAAI21Papers/AAAI-8642.LuS.pdf
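The BRF "bag of topics" step above can be sketched as: score the terms of the top-n results by TF-IDF against the whole collection and keep the top terms as the expanded MLT probe (toy documents and whitespace tokenization, purely for illustration):

```python
import math
from collections import Counter

def top_terms(top_docs, all_docs, k=3):
    """Top-k TF-IDF terms of the top-ranked docs, for query expansion."""
    df = Counter()                         # document frequency over collection
    for doc in all_docs:
        df.update(set(doc.split()))
    tf = Counter()                         # term frequency over top results
    for doc in top_docs:
        tf.update(doc.split())
    n = len(all_docs)
    scored = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return sorted(scored, key=scored.get, reverse=True)[:k]

docs = ["sterile technique hand washing",
        "hand washing before surgery",
        "hospital cafeteria menu"]
print(top_terms(docs[:2], docs))
```

Terms common to the whole collection (low IDF) fall away, leaving the topical vocabulary of the top results to drive the expanded query.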
43. Instant search
● Skip autocomplete:
○ Run the query (search probe)
○ Visible communication mechanisms:
■ Highlight what is important in a search result
■ Cross out what a given result is missing from the search probe
■ Finding love in content
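Both communication mechanisms reduce to annotating each result against the probe terms; a sketch using `**…**` as a stand-in for highlight markup (a real UI would use HTML tags and strike-through styling):

```python
def annotate(result_text, probe_terms):
    """Return (highlighted_text, missing_terms) for one search result."""
    found = {t for t in probe_terms if t.lower() in result_text.lower()}
    missing = [t for t in probe_terms if t not in found]  # -> cross these out
    highlighted = result_text
    for t in found:
        highlighted = highlighted.replace(t, f"**{t}**")  # -> highlight these
    return highlighted, missing

text = "HDMI cable, 2m, gold plated"
print(annotate(text, ["white", "HDMI", "cable"]))
# ('**HDMI** **cable**, 2m, gold plated', ['white'])
```

The `missing` list is what gets struck out in the presentation layer, telling the searcher exactly which part of their probe each result failed to satisfy.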
45. References
● Max Irwin - BERT Search experience https://opensourceconnections.com/blog/2021/09/01/the-bert-search-experience
● Daniel Tunkelang - Query Understanding https://queryunderstanding.com/introduction-c98740502103
● Trey Grainger
○ User Intent - AI powered search https://www.youtube.com/watch?v=sTWMn0LSoiA
○ Semantic Knowledge Graph https://www.treygrainger.com/posts/page/2/ (27 mins in is the step by step technology)
○ code https://github.com/careerbuilder/semantic-knowledge-graph
● State of the Art in NLP: http://nlpprogress.com/
● Neural Search - Neural Information Retrieval - SIGIR 2016 - https://www.microsoft.com/en-us/research/event/neuir2016
● Deep Learning 4 search https://github.com/dl4s/dl4s
● Solr Graph Traversal https://solr.apache.org/guide/6_6/graph-traversal.html
● Graph neural networks (GNNs)
○ https://distill.pub/2021/gnn-intro/
○ https://distill.pub/2021/understanding-gnns/
○ https://quantdare.com/understanding-neural-networks-with-graphs/
○ There is more nuance than just putting things in graphs for retrieval – user intent
46. Author Biographies - Thank you for listening
● Matt Corkum
● corkum@gmail.com 904 772 5383
● Founding member of Elsevier Labs (2001) and AltaVista (1995)
● Interests: search, NLP, ML/DL, 3D graphics, big data computing / distributed processing
● LinkedIn: https://www.linkedin.com/in/matt-corkum-347b0b/
● Twitter: https://twitter.com/matt_corkum