3. Query understanding?
●Communication channel [between the searcher's query & the search engine]
●FOCUS on WHAT the searcher wants
●Measure & optimize query performance
●The perfect "search probe"
○A conversation with the search engine
○Independent of the index
■ Focus less on the results themselves
■ Don't expect it to auto-filter irrelevant results
Figure out what the user wants
4. Clinical Phase (Query?)
●"My patient is not responding adequately to methotrexate, and I need to consider the next steps"
6. Searcher?
● Persona
● User context
○ Current characters entered -> query
○ Location
○ Time
○ Date
○ Language
○ This session’s queries
○ History of past queries
● Predict the user's intent right now?
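The context features above can be gathered into one per-request object. A minimal sketch (the class name and fields are assumptions, not the deck's implementation):

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class QueryContext:
    """Hypothetical container for the per-request searcher context."""
    partial_query: str                                  # characters typed so far
    location: str = ""
    timestamp: datetime = field(default_factory=datetime.now)
    language: str = "en"
    session_queries: list = field(default_factory=list)  # this session's queries
    history_queries: list = field(default_factory=list)  # past sessions

ctx = QueryContext(partial_query="mach", session_queries=["python jobs"])
print(ctx.language)  # en
```

An intent-prediction model would consume exactly these fields as features.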
12. Communication
●Chapter 13 of AI-Powered Search (http://aipoweredsearch.com), "Semantic Search with Dense Vectors"
●Also strike out what is missing from each result
13. Search Query Thoughts
EVERY INSTANCE OF A WORD / PHRASE … HAS A UNIQUE MEANING AT THIS TIME
YOU CAN'T IMPROVE WHAT YOU CAN'T MEASURE
14. Similar queries = similar results ?
●Latest breakthroughs in artificial intelligence
●Latest advancements in artificial intelligence
●Latest advancements on artificial intelligence
●Latest breakthroughs in AI
●Più recenti sviluppi di ricerca sull'intelligenza artificiale
○DYM: "Most recent research developments on artificial intelligence"
●Results differ for the same basic query intent
●Search engines are "statistical frequency hunters" out of the box
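One way to make these surface variants behave alike is to normalize them to a canonical form before frequency statistics take over. A toy sketch (the phrase and synonym maps are invented for illustration; a real system would be far more careful, e.g. blindly folding "on" to "in" is risky):

```python
import re

# Assumed, tiny maps: fold a known phrase, then fold word-level variants.
PHRASES = {"artificial intelligence": "ai"}
SYNONYMS = {"breakthroughs": "advancements", "on": "in"}

def normalize(query: str) -> str:
    q = query.lower()
    q = re.sub(r"[^\w\s]", " ", q)          # drop punctuation
    for phrase, canon in PHRASES.items():
        q = q.replace(phrase, canon)         # phrase folding first
    tokens = [SYNONYMS.get(t, t) for t in q.split()]
    return " ".join(tokens)

queries = [
    "Latest breakthroughs in artificial intelligence",
    "Latest advancements in artificial intelligence",
    "Latest advancements on artificial intelligence",
    "Latest breakthroughs in AI",
]
print({normalize(q) for q in queries})  # {'latest advancements in ai'}
```

All four variants collapse to one canonical query, so they can share one result set and one set of performance measurements.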
18. Same query + intent / different results
●Query: Buy tickets Boston Garden tomorrow
●Date processing may be incorrect, or different events run on different days
○Celtics on Tuesday
○Circus on Wednesday
○Sold out? Suggest Thursday
●Conversational AI
○"Next night tickets" (context of past queries)
○Sometimes useful for disambiguation
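The date-processing piece can be sketched as a toy resolver (not the deck's implementation): relative words resolve against "now", and the previous query's resolved date serves as conversational context for "next night":

```python
from datetime import date, timedelta

# Assumed mapping of relative date words to day offsets.
RELATIVE = {"today": 0, "tomorrow": 1}

def resolve_date(query, today, prior=None):
    q = query.lower()
    for word, offset in RELATIVE.items():
        if word in q:
            return today + timedelta(days=offset)
    if "next night" in q and prior is not None:
        return prior + timedelta(days=1)   # shift off the prior query's date
    return None

today = date(2024, 6, 4)  # a Tuesday, chosen arbitrarily
d1 = resolve_date("Buy tickets Boston Garden tomorrow", today)
d2 = resolve_date("next night tickets", today, prior=d1)
print(d1, d2)  # 2024-06-05 2024-06-06
```

Without the `prior` context, "next night tickets" is unanswerable, which is exactly the disambiguation value of session history.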
21. Query Text Analysis -> Autocomplete / Instant Search
● Character filtering: diacritics, Unicode, capitalization
● Term tokenization: punctuation (commas, hyphens, apostrophes, …)
● Spell correction: 10-15% of queries are misspelled; Solr 7.5 SolrTextTagger
● Inflection: stemming / lemmatizing
● Syntactic analysis: NLP = POS tagging (noun phrases), NER (entity recognition/extraction)
■ hierarchies (taxonomies and ontologies)
● Semantic analysis: vectors (word2vec, node2vec), deep parse dependency, BERT (context-sensitive)
■ Query segmentation, tagging, & scoping
■ Lexical databases - knowledge graphs
■ Disambiguation
● Query re-writing: add tokens to help
● Query relaxation: do not require all tokens [when there are no results - risky - be careful]
● LTR / adjustments: make $$, crowd-source signals
● Autocomplete: yippee!
● Instant search: very cool if everything aligns with confidence -> run the new search query / show results
22. User Query
●Machine Learning research and development Portland, OR software engineer Hadoop, java
●Traditional Lucene query parser:
●(Machine AND Learning AND research AND development AND Portland)
●OR
●(software AND engineer AND Hadoop AND java)
YUK!
23. Character filtering
●Capitalization, diacritics, Unicode, numbers, alphanumeric processing
●Be careful:
○Lowercasing conflates abbreviations (e.g., "US" the country vs. "us" the pronoun)
○Consider processing the same content several ways into separate fields
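A minimal character-filtering sketch using the standard library: NFKD-decompose, strip combining marks (diacritics), and lowercase, while an assumed abbreviation whitelist avoids the conflation problem above:

```python
import unicodedata

# Assumed whitelist: tokens whose case carries meaning.
ABBREVIATIONS = {"US", "IT", "OR"}

def fold(token: str) -> str:
    if token in ABBREVIATIONS:
        return token                       # keep case to avoid conflation
    decomposed = unicodedata.normalize("NFKD", token)
    # Combining characters are the detached diacritic marks; drop them.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()

print(fold("Café"), fold("US"))  # cafe US
```

Indexing the same field both folded and unfolded (the "many ways into fields" advice) lets ranking prefer exact matches while folded matches still recall.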
24. Query => Collection of Terms
●Language tokenization
○(punctuation - commas, hyphens, apostrophes, …)
●N-gram tokenization
○unigrams (1 word), bigrams (2 words), trigrams (3 words)
○character n-gram tokenization (2- and 3-letter grams) - useful for computing priors for spell checking
●Inflection = conflated strings (shorter term index, more documents per term)
○increases recall; precision may suffer
○Stemming and lemmatization
■ Both inflection styles may need customization to handle your vocabulary
■ (corporals, corporation, corporations -> corp for a stemmer - hmmm)
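The character n-gram idea can be sketched in a few lines: generate 2- and 3-letter grams over a toy vocabulary and count them, giving the frequency priors a spell checker would consult (the boundary marker `#` is a common convention, assumed here):

```python
from collections import Counter

def char_ngrams(word: str, n: int):
    padded = f"#{word}#"                  # mark word boundaries
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

counts = Counter()
for word in ["search", "searcher", "research"]:
    counts.update(char_ngrams(word, 2))   # 2-letter grams
    counts.update(char_ngrams(word, 3))   # 3-letter grams

print(counts["ear"])  # "ear" occurs in all three words -> 3
```

A candidate correction containing only high-frequency grams is more plausible than one containing grams never seen in the corpus.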
25. Spell correction = Candidate Generation
● 10-15% of queries are misspelled
○ missing/extra/swapped/replaced letters
○ leverage character n-gram language models
■ priors from letter frequencies
■ bigrams and trigrams
● https://norvig.com/spell-correct.html
● https://en.wikipedia.org/wiki/Noisy_channel_model
● https://web.stanford.edu/~jurafsky/slp3/3.pdf
● https://solr.apache.org/guide/6_6/suggester.html
● Solr 7.5 SolrTextTagger
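Candidate generation in the Norvig style linked above: every string one edit (delete, transpose, replace, insert) away from the query term, filtered against a known vocabulary (the toy `VOCAB` is an assumption):

```python
import string

def edits1(word):
    """All strings exactly one edit away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

VOCAB = {"search", "query", "relevance"}   # stand-in for the index vocabulary

def candidates(word):
    return (VOCAB & edits1(word)) or {word}

print(candidates("serch"))  # {'search'}
```

The noisy-channel model then ranks these candidates by prior frequency times error likelihood; the n-gram priors from the previous slide feed that ranking.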
26. Query re-writing
●Improve query performance
●Process: map abbreviations, synonyms, and common misspellings to the proper spelling (hash lookup)
●Additional tokens
○ For recall (adding synonyms, abbreviations, alternate phrasings, misspellings)
○ For precision (pairwise boosts of query terms; consider POS tags)
●A 4-term query containing an abbreviation: ABC_abbrev
○ A B C D
●Simple re-written query - abbreviation + pairwise boosts
○ (A B C D) OR (ABC_abbrev) AND ("A B"~1000 OR "B C"~1000 OR "C D"~1000)
○ Even better when leveraging related semantic items
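Building that rewritten query string is mechanical; a sketch (the function name and exact grouping are assumptions, following the slide's example):

```python
def rewrite(tokens, abbrev):
    """Rewrite a token list as: original terms OR abbreviation,
    AND'd with pairwise phrase boosts at a large slop (~1000)."""
    base = " ".join(tokens)
    pairs = [f'"{a} {b}"~1000' for a, b in zip(tokens, tokens[1:])]
    return f"(({base}) OR ({abbrev})) AND ({' OR '.join(pairs)})"

print(rewrite(["A", "B", "C", "D"], "ABC_abbrev"))
# (((A B C D) OR (ABC_abbrev)) AND ("A B"~1000 OR "B C"~1000 OR "C D"~1000))
# minus the outermost parens shown on the slide
```

The loose-slop phrase clauses reward documents where adjacent query terms appear near each other, sharpening precision without excluding anything.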
27. Query relaxation
●When?
○No results
○Careful - riskier
●Consider this approach - highlight matching terms
○You may want to show results while holding back the unmatched part of the query (cross out the missing terms when presenting each result)
○Ignore stop words
○Drop specificity: "white HDMI cable" -> "HDMI cable" (the color may or may not be important)
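The relaxation ladder above can be sketched as a loop: require everything first, then drop stop words, then modifiers, stopping as soon as anything matches and reporting which terms were held back (the stop/modifier lists and `run_search` are stand-ins):

```python
STOP = {"the", "a", "in"}
MODIFIERS = {"white", "cheap", "new"}   # assumed low-importance qualifiers

def relax(terms, run_search):
    attempts = [terms,                                        # all terms
                [t for t in terms if t.lower() not in STOP],  # minus stop words
                [t for t in terms if t.lower() not in STOP | MODIFIERS]]
    for attempt in attempts:
        results = run_search(attempt)
        if results:
            dropped = [t for t in terms if t not in attempt]
            return results, dropped      # dropped terms -> crossed out in UI
    return [], terms

# Toy one-document index: only "hdmi cable" matches.
def run_search(terms):
    doc = {"hdmi", "cable"}
    return ["doc1"] if {t.lower() for t in terms} <= doc else []

print(relax(["white", "HDMI", "cable"], run_search))  # (['doc1'], ['white'])
```

Returning the dropped terms is what enables the "cross out the missing terms" presentation from the bullet above.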
28. Entity recognition
●Strings to things
○ synonyms (is-a / equivalent terms; careful)
○ hypernyms (e.g., the hypernym "footwear" instead of the more specific hyponyms "sneakers" or "hiking boots")
○ taxonomies (hierarchical lists)
○ ontologies (= a taxonomy with relations, e.g., this drug treats this condition)
○ knowledge graphs
○Related? = weighted synonyms? [if language-based]
■ word/paragraph/document/graph vectors (language-based is sometimes better - it helps on longer queries, not on short ones); word2vec (placement, not synonymy), GloVe, BERT (language)
■ Nearest-neighbor similarity search
■ Effective AI-driven autocomplete (semantic search) (http://aipoweredsearch.com)
■ "Semantic Search with Dense Vectors" https://www.manning.com/books/ai-powered-search?a_aid=1&a_bid=e47ada24&chan=aips
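Nearest-neighbor similarity over embeddings reduces to cosine similarity; a sketch with made-up 3-dimensional vectors (real embeddings are hundreds of dimensions and learned, not hand-written):

```python
import math

# Toy embeddings, invented for illustration only.
EMBEDDINGS = {
    "sneakers":     [0.9, 0.1, 0.0],
    "hiking boots": [0.8, 0.2, 0.1],
    "footwear":     [0.7, 0.3, 0.1],
    "laptop":       [0.0, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def nearest(term, k=2):
    q = EMBEDDINGS[term]
    scored = [(cosine(q, v), w) for w, v in EMBEDDINGS.items() if w != term]
    return [w for _, w in sorted(scored, reverse=True)[:k]]

print(nearest("sneakers"))  # ['hiking boots', 'footwear']
```

Because placement in the space reflects distributional context rather than synonymy, "sneakers" sits near its hypernym "footwear" as well as its sibling "hiking boots", which is exactly the weighted-synonym behavior the bullet describes.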
30. Machine Learning research and development Portland, OR software engineer Hadoop, java
NLP noun phrases:
• machine learning
• research and development
• Portland, OR
• software engineer
• Hadoop
• java
31. Semantics
●Vectors (word, doc, graph parse/traversal)
○not the actual word but an embedding [a computable space]
■ (footwear -> high heels, foot wear -> white socks)
○query classification
○query semantic matches (relaxation to different concepts where it makes sense)
■ Disambiguation
■ Spelling correction
●Influence of business rules (more $$ per result) (careful: $ != happy users)
36. Long quiz question (query)
●Query: Geriatric diabetic male concerned about wound care pain options?
●Require concepts for: geriatric, male, pain - get the result count
○If zero results, run the query-relaxation process until results exist
■ (the query is over-specified relative to the content)
■ Reduce the required concepts (keep enough context to respond)
■ Some concepts are not needed (create a list)
●Precision: rank order leveraging the non-concept words
○("wound", "care", "concern", "options")
●Why: boosting context results = generic context
○(Content: 1.5M diabetic-document responses)
37. Lexical Databases
●knowledge graphs
●hierarchies (taxonomies / ontologies)
●A cool way to leverage knowledge graphs:
○ https://wiki.pathmind.com/graph-analysis
●"Graph Matching Networks for Learning the Similarity of Graph Structured Objects"
39. Meaning manifolds - Computable Space (Semantic)
●https://arxiv.org/pdf/2011.09413.pdf
40. Noun Phrases -> Semantically close manifolds

Noun Phrase              | Near match         | Another near match
Machine learning         | Data scientist     | AI or ML or DL
Research and development | R&D                | Applied research
Portland, OR             | Portland, Oregon   | Geo_location
Software engineer        | Software developer | AI programmer
hadoop                   | Big data           | Hive or Spark or Databricks or SnowFlake or Google Big Table
42. Other Similarity-Ranked Response techniques
●BRF: Blind Relevance Feedback
○ Turns short queries into long queries - dipping into longer docs to build an MLT vector that finds similar items in other short content (Sherpath {Nursing 101 courseware} leveraged this initially)
○ Topic - docs as a bag of topics (top TF-IDF terms from the top-n results)
○ Example: sterilization - a topic to cover - determine the top MLT query terms, with results matching questions to ask a student
■ Proper hand washing
■ Planned Parenthood (e.g., a vasectomy - making a male sterile)
●MLT (More Like This - word matching) or semantic MLT (not a word matcher: ANN, SKG)
●LLT (Less Like This) - UMass (B. Croft, 1990s) - very long Boolean expressions
●Recommenders - ranked responses predicting the user's next action
○ Huge for what to purchase / listen to next (Amazon, Netflix, Spotify, …)
○ Multi-armed, stochastic bandits (A/B testing)
○ https://www.aaai.org/AAAI21Papers/AAAI-8642.LuS.pdf
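The BRF "bag of topics" step above can be sketched as: score the terms of the top-n results by TF-IDF against the whole collection and keep the top terms as the expanded MLT probe (toy documents and whitespace tokenization, purely for illustration):

```python
import math
from collections import Counter

def top_terms(top_docs, all_docs, k=3):
    """Top-k TF-IDF terms of the top-ranked docs, for query expansion."""
    df = Counter()                         # document frequency over collection
    for doc in all_docs:
        df.update(set(doc.split()))
    tf = Counter()                         # term frequency over top results
    for doc in top_docs:
        tf.update(doc.split())
    n = len(all_docs)
    scored = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return sorted(scored, key=scored.get, reverse=True)[:k]

docs = ["sterile technique hand washing",
        "hand washing before surgery",
        "hospital cafeteria menu"]
print(top_terms(docs[:2], docs))
```

Terms common to the whole collection (low IDF) fall away, leaving the topical vocabulary of the top results to drive the expanded query.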
43. Instant search
● Skip autocomplete:
○ Run the query (search probe)
○ Visible communication mechanisms:
■ Highlight what is important in a search result
■ Cross out what a given result is missing from the search probe
■ Finding love in content
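Both communication mechanisms reduce to annotating each result against the probe terms; a sketch using `**…**` as a stand-in for highlight markup (a real UI would use HTML tags and strike-through styling):

```python
def annotate(result_text, probe_terms):
    """Return (highlighted_text, missing_terms) for one search result."""
    found = {t for t in probe_terms if t.lower() in result_text.lower()}
    missing = [t for t in probe_terms if t not in found]  # -> cross these out
    highlighted = result_text
    for t in found:
        highlighted = highlighted.replace(t, f"**{t}**")  # -> highlight these
    return highlighted, missing

text = "HDMI cable, 2m, gold plated"
print(annotate(text, ["white", "HDMI", "cable"]))
# ('**HDMI** **cable**, 2m, gold plated', ['white'])
```

The `missing` list is what gets struck out in the presentation layer, telling the searcher exactly which part of their probe each result failed to satisfy.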
45. References
● Max Irwin - BERT Search experience https://opensourceconnections.com/blog/2021/09/01/the-bert-search-experience
● Daniel Tunkelang - Query Understanding https://queryunderstanding.com/introduction-c98740502103
● Trey Grainger
○ User Intent - AI powered search https://www.youtube.com/watch?v=sTWMn0LSoiA
○ Semantic Knowledge Graph https://www.treygrainger.com/posts/page/2/ (27 mins in is the step by step technology)
○ code https://github.com/careerbuilder/semantic-knowledge-graph
● State of the Art in NLP: http://nlpprogress.com/
● Neural Search - Neural Information Retrieval - SIGIR 2016 - https://www.microsoft.com/en-us/research/event/neuir2016
● Deep Learning 4 search https://github.com/dl4s/dl4s
● Solr Graph Traversal https://solr.apache.org/guide/6_6/graph-traversal.html
● Graph neural networks (GNNs)
○ https://distill.pub/2021/gnn-intro/
○ https://distill.pub/2021/understanding-gnns/
○ https://quantdare.com/understanding-neural-networks-with-graphs/
○ There is more nuance than just putting things in graphs for retrieval – user intent
46. Author Biographies - Thank you for listening
● Matt Corkum
● corkum@gmail.com 904 772 5383
● Founding member of Elsevier Labs (2001) and AltaVista (1995)
● Interests: search, NLP, ML/DL, 3D graphics, big data computing / distributed processing
● LinkedIn: https://www.linkedin.com/in/matt-corkum-347b0b/
● Twitter: https://twitter.com/matt_corkum