Using Topic
Modelling To
Win Big with
NLP &
Semantic
Search
Dawn Anderson
@DawnieAndo from
@MoveItMarketing
If I said to you…
“I’ve got a new
jaguar”
“It’s in the garage”
(sidenote: this is not my garage)
You probably
wouldn’t expect
to see this
Who said anything
about cars
“I’ve got a
new jag”
“Jag is
neither a car
nor a cat”
Polysemy in linguistics is
problematic
DISCLAIMER
I am NOT a data
scientist
But I will be talking about some concepts
covering:
Data Science
01
Information
Retrieval
02
Algorithms
03
Linguistics
04
Information
Architecture
05
Library
Science
Which are areas
relevant to search
industry
These are all connected
to how search engines
find the right
information, for the right
informational need at
the right time
‘information
retrieval’
To extract informational resources to meet a
search engine user’s information need at time of
query.
CRAWL INDEX SERVE
Crawl (and render) the haystack = Crawling frontier
Organise the straw into bales – indexing (2 wave)
Chunking and Tokenization
Inverted
Index: Text to
Doc ID
Mapping
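As a rough illustration of the "text to doc ID mapping" idea, here is a minimal sketch of an inverted index. The three toy documents and the whitespace tokeniser are assumptions made for the example; real indexes also handle chunking, normalisation and term positions.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the sorted list of doc IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenisation: lowercase and split on whitespace (an assumption)
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


# Hypothetical mini corpus, invented for illustration
docs = {
    1: "the jaguar is in the garage",
    2: "the jaguar is a big cat",
    3: "my car is in the garage",
}

index = build_inverted_index(docs)
print(index["jaguar"])
print(index["garage"])
```

Looking up a query term is then a dictionary read rather than a scan of every document, which is the point of building the index before serving.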
To then…
But there is
so much
hay
8 YEARS
AGO
Every day there
are huge volumes
of new indexable
data
But we only want to return one (or a few) needle(s) from the hay
Accelerating
technological
developments have
made search even more
complicated
One example … Google’s
mobile-first indexing plans
MOBILE-FIRST GOES FAR BEYOND WEBSITES
Time and space, distance, speed of
movement come into play
Contextual
Search
Exacerbates
everything
further
A lot of users might be
interested in topical foraging too
(information foraging theory)
They might want to learn about whole topic of hay or straw
900
Or they may be researching to
buy a car and want lots of
different types of information
on cars
You think
there are
several
types of
SEO?
Local SEO
Technical SEO
Schema Specialist
Content Marketer
Outreach Specialist
Digital PR
There are at
least as
many niche
areas of
Information
Retrieval
Mobile IR
Contextual
Search
Natural
Language
Processing
Conversational
Search
Similarity
Search
Recommender
Systems
Library Science
The problem is… words are
hard
Every other word
in the English
language has
multiple
meanings
But…If we understood a topic is
about cats we would recognize a
jaguar
On their own single words
have no semantic meaning
How can we understand
these word meanings?
Using structured
data is an obvious
way to disambiguate
Structured versus unstructured data
• Structured data – high
degree of organization
• Readily searchable by
simple search engine
algorithms or known search
operators (e.g. SQL)
• Logically organized
• Often stored in a relational
database
Relational
database
systems
Knowledge Graphs
Mapping RDF
Triples
Entities
Conversational search
The knowledge graph
is checked first
Ontology Driven Natural Language Processing
Image credit: IBM
https://www.ibm.com/developerworks/community/blogs/nlp/entry/ontology_driven_nlp
Even named entities can
be ambiguous /
polysemic
• Amadeus Mozart (composer)
• Mozart Street
• Mozart Cafe
Australian towns using English names
(NSW example only)
An area of IR dedicated to understanding the ambiguous
needs for queries with multiple meanings
How can we fill in
the gaps between
named entities?
When there
is so much
noise
There are still many
open challenges in
natural language
processing
Text cohesion
• Cohesion is
the grammatical and lexical linking
within a text or sentence that holds a
text together and gives it meaning.
• It is related to the broader concept
of coherence. (Wikipedia)
‘Topic Modelling’
According to Wikipedia:
“In machine learning and natural language processing, a
topic model is a type of statistical model for discovering the
abstract "topics" that occur in a collection of documents.
Topic modeling is a frequently used text-mining tool for
discovery of hidden semantic structures in a text body”.
A collection of
text based web
pages (corpus)
A collection of
text based
documents
(corpus)
Term
Clarification
Machine learning ->
dataset
Information retrieval
-> corpora / corpus
We can disambiguate
through co-occurrence
“You shall know a
word by the company
it keeps” (John Rupert
Firth, 1957)
Using similarity &
relatedness
Relatedness is NOT
about structured
data
2 words are similar if they
co-occur with similar words
First Level Relatedness –
Words that appear together
in the same sentence
2 words are similar if they occur in a given
grammatical relation with the same words
Harvest Peel Eat Slice
Second Level Relatedness –
words that co-occur with the
same ‘other’ words
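A small sketch of first and second level relatedness, assuming very short toy "sentences" built from the Harvest / Peel / Eat / Slice words above. The words "apple" and "orange" and the helper names are hypothetical, added only to make the example concrete.

```python
from collections import defaultdict
from itertools import combinations


def first_level(sentences):
    """For each word, collect the words it shares a sentence with."""
    neighbours = defaultdict(set)
    for sentence in sentences:
        words = set(sentence.lower().split())
        for a, b in combinations(words, 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours


def second_level(neighbours, a, b):
    """Words that both a and b co-occur with (their shared 'other' words)."""
    return (neighbours[a] & neighbours[b]) - {a, b}


# Toy sentences, invented for illustration
sentences = [
    "harvest apple",
    "peel apple",
    "slice apple",
    "harvest orange",
    "peel orange",
    "eat orange",
]

nb = first_level(sentences)
shared = sorted(second_level(nb, "apple", "orange"))
print("orange" in nb["apple"])
print(shared)
```

Here "apple" and "orange" never appear in the same sentence, so they have no first level relatedness, but they share "harvest" and "peel" as neighbours, which is the second level signal.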
We need more words
WordSim353
Dataset
Some words
with high
similarity or
relatedness
Word 1 Word 2 Human (mean)
tiger tiger 10
fuck sex 9.44
journey voyage 9.29
midday noon 9.29
dollar buck 9.22
money cash 9.15
coast shore 9.1
money cash 9.08
money currency 9.04
football soccer 9.03
magician wizard 9.02
type kind 8.97
gem jewel 8.96
car automobile 8.94
street avenue 8.88
asylum madhouse 8.87
boy lad 8.83
environment ecology 8.81
furnace stove 8.79
seafood lobster 8.7
mile kilometer 8.66
Maradona football 8.62
OPEC oil 8.59
king queen 8.58
murder manslaughter 8.53
money bank 8.5
computer software 8.5
Jerusalem Israel 8.46
vodka gin 8.46
planet star 8.45
A Moving Word ‘Context
Window’
Typical window size might be 5
Source Text
Writing a list of random sentences is harder than I initially thought it would be
11 words (the moving target word plus 5 to its left and 5 to its right)
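The moving window can be sketched as below. This is only an illustration, assuming whitespace tokenisation and a window of 5 words either side; the function name is hypothetical.

```python
def context_windows(text, size=5):
    """Yield (target, left context, right context) for every word in the text."""
    tokens = text.split()
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - size):i]
        right = tokens[i + 1:i + 1 + size]
        yield target, left, right


source = "Writing a list of random sentences is harder than I initially thought it would be"
windows = list(context_windows(source))

for target, left, right in windows:
    print(left, f"[{target}]", right)
```

Near the start and end of the text the window is simply truncated, so only words in the middle see the full 11-word span.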
To build vector
space models
Vector representations of words (Word Vectors)
Vector space
models
Word embeddings example
Nearest Neighbours (Similarity)
Evaluations
KNN – K-Nearest-Neighbour
Tensorflow &
Word2Vec
Continuous Bag of Words (CBOW)
CBOW treats the words inside a
context window of size n as an
unordered bag and learns to predict
the target word from them. Words
that are similar or related end up
close together (for example by
Euclidean or cosine distance) in
the resulting vector space models
and word embeddings
The opposite of CBOW (continuous bag of words)
Skip-gram model
Feed it WordPairs
Both models learn
the weights of the
similarity and
relatedness
distances
Vector
space
models are
being
expanded
beyond
Word2Vec
Word2Vec
Doc2Vec
Sentence2Vec
Paragraph2Vec
Word2Vec
Single words
Word embeddings
01
Doc2Vec
Words & meta data
Word embeddings
Document embeddings
02
Sentence2Vec
Chunks of words
More context available
03
Paragraph2Vec
Full paragraphs
Even more context and
semantics
04
In order to understand what
words in documents constitute
‘relevance’ to a query
Testing Similarity and Relatedness
http://ws4jdemo.appspot.com
GloVe: Global
Vectors for
Word
Representation
• What is GloVe?
• “GloVe is an unsupervised learning
algorithm for obtaining vector
representations for words. Training is
performed on aggregated global word-word
co-occurrence statistics from a corpus, and
the resulting representations showcase
interesting linear substructures of the word
vector space.”
• (https://nlp.stanford.edu/projects/glove/)
Linear
Substructures
in GloVe
• Sometimes more than one word pair is needed to understand
meaning
• Particularly when the words are opposites of each other (e.g.
man and woman)
• By adding additional word pairs, further semantic hints provide
context to understand the meaning of concepts
GloVe: Nearest Neighbour
Cosine Similarity
• Nearest words to Frog
• https://nlp.stanford.edu/projects/glove/
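A minimal sketch of nearest-neighbour lookup by cosine similarity. The 3-dimensional vectors are invented for illustration and are not real GloVe values (published GloVe vectors have 50 to 300 dimensions); the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(word, vectors, k=3):
    """Rank every other word by cosine similarity to the given word."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors, invented for illustration only
vectors = {
    "frog": [0.9, 0.8, 0.1],
    "toad": [0.85, 0.75, 0.2],
    "lizard": [0.7, 0.9, 0.3],
    "car": [0.1, 0.2, 0.95],
}

neighbours = nearest_neighbours("frog", vectors)
for word, score in neighbours:
    print(word, round(score, 3))
```

Cosine similarity ignores vector length and compares direction only, which is one reason it is a common choice for comparing word embeddings.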
Glove2Vec
Concept2Vec
Concept2Vec
Ontological
concepts
Concept Graphs Using Relatedness
Wikipedia is a gold mine for IR researchers – each page is considered a concept
Similarity and
Relatedness
Similarity – words that mean the
same or nearly the same
Relatedness - Words that live
together within a topic / co-occur
in the same corpora / collection /
sub-section of a collection
Part of speech
tagging (POS)
‘Part of
Speech’
(POS)
tagging
A website is NOT unstructured data
It has a hierarchy
It has weighted
sections
It has metadata
It (often) has a
tree like
structure
• BM25
• BM25+
• BM25L
• OKAPI BM25
BM = BEST MATCH
On long documents
BM25 fails
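To see where document length enters the score, here is a minimal sketch of the classic Okapi BM25 formula. The parameters k1=1.5 and b=0.75 are commonly quoted defaults, and the IDF variant (with +1 inside the log) and the toy documents are assumptions for illustration, not a description of what any particular search engine runs.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    avg_len = sum(len(tokens) for tokens in tokenised) / n_docs
    scores = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Length normalisation: longer-than-average documents get a larger penalty
        norm = k1 * (1 - b + b * len(tokens) / avg_len)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for other in tokenised if term in other)
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


# Toy corpus, invented for illustration
docs = [
    "jaguar cat",
    "jaguar cat and many other unrelated words about the zoo",
    "garage car",
]

scores = bm25_scores("jaguar", docs)
print(scores)
```

The first two documents mention "jaguar" once each, but the longer one scores lower because of the length term, which hints at why very long pages can be treated harshly by plain BM25.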
Probably
BM25F is
used for
web pages
BM25F allows for web pages which have
structure compared with normal flat text
output (e.g. from text files) (additional fields)
Takes into consideration elements such as
page title, meta data, sections, footers,
headers, anchor text
Adds weights for different elements on a
page
Anchor text is
included in
BM25F
Semi-
structured
data
• Hierarchical nature of a
website
• Tree structure
• Well sectioned and
including clear containers
and meta headings
• An ontology map between
semi and structured
Lexical
‘nyms’
Antonym – The
opposite meaning
Synonym – The same
meaning
Meronym – Part of
something else (whole)
– e.g. finger (hand)
(Part / whole relations)
Hyponym – A subset of
something else – e.g.
fork (cutlery)
Hypernym – A superset
(superordinate) – e.g.
colour hypernym
TF:IDF
Term frequency:
Inverse document
frequency
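A minimal TF-IDF sketch, assuming the simplest textbook weighting (relative term frequency multiplied by log of N over document frequency). The toy documents are invented; other TF and IDF variants exist and would give different numbers.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Toy collection, invented for illustration
docs = [
    "the jaguar is a cat",
    "the jaguar is a car",
    "the garage is full",
]

weights = tf_idf(docs)
print(weights[0])
```

A word that appears in every document ("the") gets a weight of zero, while a word unique to one document ("cat") gets the highest weight. Because the IDF part depends on which collection is counted, the same page can get different weights when measured across one site versus a whole topic.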
TF:IDF LOCAL v GLOBAL?
(The whole document
collection)
Across your site??
Across all documents
relevant for the topic??
Keyword
Stuffing or
TF:IDF Weights
Query Intent Shift
“Easter” Query Intent Shift
Predicting the future
with Web Dynamics
• The journey to predict the future: Kira Radinsky at
TEDxHiriya
Find out what correlates and when
How can we improve
our topical relatedness?
Tell Me About Your Haystack
Cancel your noise
Unstructured data is
voluminous
Filled with
irrelevance
Lacks focus
Riddled with
stopwords
Lots of meaningless
text and further
ambiguating jabber
Disambiguate lean content
with powerful structured
data
And strong linking nearest
neighbour topically rich pages
Use well
organised
hyponyms
(Hyponymy and
Hypernymy)
• Cutlery (hypernym)
  • Spoons (hyponym + hypernym)
    • Dessert ((co) hyponym)
    • Tea ((co) hyponym)
    • Table ((co) hyponym)
  • Forks (hyponym + hypernym)
  • Knives (hyponym + hypernym)
    • Carving ((co) hyponym)
    • Steak ((co) hyponym)
    • Butchers ((co) hyponym)
Simple unordered list with children
Image alt tags (and image title
tags) help with disambiguation too
Stemming &
Lemmatization
Both aim to take a word back to its common
base form
Avoid keyword
stuffing… be aware of
stemming and
lemmatization
Tables are relational databases
too – use liberally (with headers)
ID  Event Name          Event Type  Event City  Event Country
1   Ungagged Las Vegas  Conference  Las Vegas   US
2   Ungagged London     Conference  London      UK
3   State of Digital    Conference  London      UK
4   Brighton SEO        Conference  Brighton    UK
Widget Logic / Widget Context
Stay in your
topical lane
Topical drift / dilution is
a big problem
Explore topical siloes
Merge content but watch out for topical
dilution – what did Wikipedia redirect?
The whole is greater
than the sum of its
parts
Check Wikipedia
redirects for your niche
Wikipedia redirects
• dbo:wikiPageRedirects
Ludwig Van Beethoven
Ludwig Van
Beethoven
In theory… the consolidated page should rank
higher… but…
Extract the conversations
Throw the words into a word cloud
So the most
prominent
topics &
nuances appear
Watch out for topic dilution /
drift in user generated content
Educate crazy taggers but not before you’ve used their topic tags to fix
dilution
All the anchors &
contextual &
navigational
internal linking
Even if it is just a breadcrumb trail
Sources and References
• Kira Radinsky Tedx Talk -
https://www.youtube.com/watch?v=gAifa_CVGCY
• Stop Word Library Example -
https://sites.google.com/site/kevinbouge/stopwords-lists
• Image credit: Bird, Steven, Edward Loper and Ewan Klein
(2009), Natural Language Processing with Python. O’Reilly Media Inc.
• The work of - Radinsky, K., 2012, December. Learning to predict the
future using Web knowledge and dynamics. In ACM SIGIR Forum (Vol.
46, No. 2, pp. 114-115). ACM.
• http://9ol.es/porter_js_demo.html
• Lohar, P., Ganguly, D., Afli, H., Way, A. and Jones, G.J., 2016. FaDA:
Fast document aligner using word embedding. The Prague Bulletin of
Mathematical Linguistics, 106(1), pp.169-179.
• https://en.wikipedia.org/wiki/List_of_locations_in_Australia_with_an_English_name
• Barbara Plank | Keynote - Natural Language Processing: -
https://www.youtube.com/watch?v=Wl6c0OpF6Ho
Further Reading
• https://github.com/Hironsan/awesome-embedding-models
• https://nlp.stanford.edu/IR-book/html/htmledition/document-representations-and-measures-of-relatedness-in-vector-spaces-1.html
• https://www.youtube.com/watch?time_continue=790&v=wI5O-lYLBCw
• https://en.wikipedia.org/wiki/Euclidean_distance
• Ibrahim, O.A.S. and Landa-Silva, D., 2016. Term frequency with
average term occurrences for textual information retrieval. Soft
Computing, 20(8), pp.3045-3061.
• Lotfi, A., Bouchachia, H., Gegov, A., Langensiepen, C. and McGinnity,
M., 2018. Advances in Computational Intelligence
Systems. Intelligence.
• https://www.researchgate.net/post/What_is_the_difference_between_TFIDF_and_term_distribution_for_feature_selection
• https://radimrehurek.com/gensim/models/word2vec.html
• McDonald, R., Brokos, G.I. and Androutsopoulos, I., 2018. Deep
relevance ranking using enhanced document-query interactions. arXiv
preprint arXiv:1809.01682.
Boyd-Graber, J., Hu, Y. and Mimno, D., 2017. Applications of topic
models. Foundations and Trends® in Information Retrieval, 11(2-3),
pp.143-296.
https://nlp.stanford.edu/projects/glove/
https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html
Sherkat, E. and Milios, E.E., 2017, June. Vector embedding of wikipedia
concepts and entities. In International conference on applications of
natural language to information systems (pp. 418-428). Springer, Cham.
Appendix
Precision and Recall
GOLD
STANDARD
Lots of results inaccurately
deemed highly relevant
retrieved. Lots of results
inaccurately deemed
irrelevant not retrieved
Maybe not enough relevant
documents to fetch much
here
Lots of documents came back
but not enough highly
relevant
Many highly relevant docs to
meet informational need
returned
Results were highly relevant
but not enough came back
Maybe being ‘too picky’
A wide net was cast but not
many of the right type of
fishes caught
https://medium.com/@klintcho/explaining-precision-and-recall-c770eb9c69e9
Manning, C.D. and Schütze, H., 1999. Foundations of Statistical Natural Language Processing. MIT Press.
Canadian towns using English names
US towns using English names
You should be aware
‘indexing’ and
‘ranking’ are two very
separate things
Example context window size 3
Source Text (target word in brackets)            Training Samples
[The] quick brown fox jumps over the lazy dog    (the, quick) (the, brown) (the, fox)
The [quick] brown fox jumps over the lazy dog    (quick, the) (quick, brown) (quick, fox) (quick, jumps)
The quick [brown] fox jumps over the lazy dog    Etcetera
The quick brown [fox] jumps over the lazy dog    Etcetera
Stemming
(Popular stemmer is PorterStemmer)
• https://github.com/johnpcarty/mysql-porter-stemmer/blob/master/porterstemmer.sql
Conditions Suffix Replacement Examples
-------------------------- ------- ------------- -----------------------
(m>1) al NULL revival -> reviv
(m>1) ance NULL allowance -> allow
(m>1) ence NULL inference -> infer
(m>1) er NULL airliner-> airlin
(m>1) ic NULL gyroscopic -> gyroscop
(m>1) able NULL adjustable -> adjust
(m>1) ible NULL defensible -> defens
(m>1) ant NULL irritant -> irrit
(m>1) ement NULL replacement -> replac
(m>1) ment NULL adjustment -> adjust
(m>1) ent NULL dependent -> depend
(m>1 and (*<S> or *<T>)) ion NULL adoption -> adopt
(m>1) ou NULL homologou-> homolog
(m>1) ism NULL communism-> commun
(m>1) ate NULL activate -> activ
(m>1) iti NULL angulariti -> angular
(m>1) ous NULL homologous -> homolog
(m>1) ive NULL effective -> effect
(m>1) ize NULL bowdlerize -> bowdler
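The rules in the table above can be sketched roughly as follows. This covers only this one step of the Porter algorithm (the suffix list is taken from the table), so it is not a full stemmer; the function names are hypothetical. The measure m counts vowel-then-consonant sequences in the stem that would remain.

```python
VOWELS = "aeiou"

# Suffixes from the table above; all are removed only when m > 1
STEP4_SUFFIXES = [
    "al", "ance", "ence", "er", "ic", "able", "ible", "ant", "ement",
    "ment", "ent", "ion", "ou", "ism", "ate", "iti", "ous", "ive", "ize",
]


def is_consonant(word, i):
    """Porter's definition: 'y' counts as a vowel only after a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-to-consonant transitions: the m in [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def strip_suffix(word):
    """Apply the table's rules: longest matching suffix, removed only if m > 1."""
    matches = [s for s in STEP4_SUFFIXES if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    stem = word[:-len(suffix)]
    if measure(stem) <= 1:
        return word
    # 'ion' has the extra condition that the stem ends in s or t
    if suffix == "ion" and not stem.endswith(("s", "t")):
        return word
    return stem


for word in ["revival", "replacement", "adjustment", "adoption", "metal"]:
    print(word, "->", strip_suffix(word))
```

The m > 1 condition is what stops short words such as "metal" from being chopped, even though they end in a listed suffix.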
WordSim353
Dataset
Some words
with low
similarity or
relatedness
Word 1 Word 2 Human (mean)
king cabbage 0.23
professor cucumber 0.31
chord smile 0.54
noon string 0.54
rooster voyage 0.62
sugar approach 0.88
stock jaguar 0.92
stock life 0.92
monk slave 0.92
lad wizard 0.92
delay racism 1.19
stock CD 1.31
drink ear 1.31
stock phone 1.62
holy sex 1.62
production hike 1.75
precedent group 1.77
stock egg 1.81
energy secretary 1.81
month hotel 1.81
forest graveyard 1.85
cup substance 1.92
possibility girl 1.94
cemetery woodland 2.08
glass magician 2.08
cup entity 2.15
Wednesday news 2.22
direction combination 2.25
Coast and
Shore Example
• Coast and shore have a similar meaning
• They co-occur in first and second level
relatedness documents in a collection
• They would receive a high score in
similarity
SPARQL Editor &
WikiPageRedirects
There are 68
variants for
mobile
phone
redirected
on Wikipedia
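A rough sketch of how such a redirect lookup could be phrased against DBpedia. The public endpoint URL, the `format` parameter and the resource name `Mobile_phone` are assumptions for illustration; the code only builds the query and the request URL and does not call the network, so the number of variants returned is not shown.

```python
from urllib.parse import urlencode

# Assumed public DBpedia SPARQL endpoint
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def redirect_query(resource):
    """Build a SPARQL query listing pages that redirect to the given resource."""
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "PREFIX dbr: <http://dbpedia.org/resource/>\n"
        "SELECT ?variant WHERE {\n"
        f"  ?variant dbo:wikiPageRedirects dbr:{resource} .\n"
        "}"
    )


def query_url(resource):
    """URL that could be pasted into a browser or fetched with an HTTP client."""
    params = {
        "query": redirect_query(resource),
        "format": "application/sparql-results+json",  # assumed result format
    }
    return f"{DBPEDIA_ENDPOINT}?{urlencode(params)}"


print(redirect_query("Mobile_phone"))
print(query_url("Mobile_phone"))
```

The same query text can be pasted into a SPARQL editor; each `?variant` returned is a name that Wikipedia treats as pointing at the same concept.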
Word’s Context
Context window (n = window size = 2) (2 words either side)
Source Text (target word in brackets)            Training Samples
[The] quick brown fox jumps over the lazy dog    (the, quick) (the, brown)
The [quick] brown fox jumps over the lazy dog    (quick, the) (quick, brown) (quick, fox)
The quick [brown] fox jumps over the lazy dog    (brown, the) (brown, quick) (brown, fox) (brown, jumps)
The quick brown [fox] jumps over the lazy dog    (fox, quick) (fox, brown) (fox, jumps) (fox, over)
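The training samples above can be generated with a few lines of code. This is a minimal sketch assuming lowercased whitespace tokens and a window of 2 words either side; the function name is hypothetical.

```python
def skipgram_pairs(text, window=2):
    """Return (target, context) word pairs for every word in the text."""
    tokens = text.lower().split()
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


pairs = skipgram_pairs("The quick brown fox jumps over the lazy dog")
for pair in pairs[:13]:
    print(pair)
```

These word pairs are the kind of input a skip-gram model is fed; widening the window produces more pairs per word and captures looser, more topical relationships.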
Other areas of IR Including… but not limited to
Mobile IR
Contextual IR
Natural language processing
Conversational search
Similarity search
Recommender systems
Image IR
Music IR
Recall and Precision in IR
Precision is the
share of returned
results that are
relevant for the query
Recall is the share
of all relevant
documents that were
returned for the query
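A small worked example of the two measures, using the standard definitions (precision = relevant retrieved / all retrieved, recall = relevant retrieved / all relevant). The document IDs are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical doc IDs: 4 results came back, 5 documents were truly relevant
retrieved = [1, 2, 3, 4]
relevant = [2, 4, 5, 6, 7]

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Two of the four results are relevant (precision 0.5), but only two of the five relevant documents were found (recall 0.4). Returning more results can raise recall while lowering precision, which is the trade-off the appendix slides describe.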
Likely based on
co-occurrence
data
https://slideplayer.com/slide/13138343/ - Query Expansion and Relevance Feedback
Increases recall (more
results) but may reduce
precision
Query Expansion Example
Increase Recall – 2 Main Methods
Query Expansion / Query Rewriting
Add bits to the query
• Broaden the query
• Expand abbreviations
• Stemming and lemmatization (in reverse)
• Use Word2Vec to find abbreviations in a vector space using semantic
similarity
• Use synonyms (same / very similar meanings)
• Use minimum cosine similarity from Word2Vec as safety net
Query Relaxation
Take bits away from the query
• Ignore stop words in query
• Relax specificity (remove specific)
• Use lexical database (a “knowledge graph”, e.g. WordNet) to find a more
general term (hypernym - superset)
• Use Part of Speech Tagger to identify structure of query & expand nouns
(things)
• Preserve head noun and strip modifiers (e.g. dog)
• Use Word2Vec to identify semantics from a vector space using word
embeddings from the query – find semantics (related and similar)
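A toy sketch of the two ideas: relaxation takes words away (here, stop words) and expansion adds words (here, synonyms). The stop word set and the synonym table are tiny hand-written stand-ins; a real system might draw on WordNet or Word2Vec neighbours instead, as the bullets suggest.

```python
# Tiny illustrative lists, assumed for the example
STOP_WORDS = {"a", "an", "the", "in", "of", "for", "to"}
SYNONYMS = {"car": ["automobile"], "coast": ["shore"]}


def relax(query):
    """Query relaxation: drop stop words so fewer terms must match."""
    return [term for term in query.lower().split() if term not in STOP_WORDS]


def expand(query):
    """Query expansion: relax first, then add known synonyms."""
    terms = relax(query)
    expanded = list(terms)
    for term in terms:
        for synonym in SYNONYMS.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded


print(relax("the best car in London"))
print(expand("the best car in London"))
```

Both steps tend to increase recall because more documents can match the rewritten query; precision may drop if the added or loosened terms drift from the original intent.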
Wikipedia redirects
• dbo:wikiPageRedirects
Most popular word embedding
tool – probably Word2Vec (new
ones are emerging)
Stemming &
Lemmatization
Stemming
• Runs a series of rules to
chop known ‘stems’ off the
end of words
• Suffix stripping algorithm
• Often leaves incorrect
endings/ crude performance
(understemming)
• Popular – PorterStemmer
(Martin Porter)
• Example: “alumnus” ->
“alumnu”
Lemmatization
• Aims to do things properly
• Tool from natural language
processing
• Needs a complete
vocabulary and
morphological analysis to
work well
• Also not perfect
• Relies on lexical knowledge
base like WordNet to correct
base form
Gensim
Part of Speech Tags (Python NLTK Library)
• CC coordinating conjunction
• CD cardinal digit
• DT determiner
• EX existential there (like: “there is” … think of it like “there exists”)
• FW foreign word
• IN preposition/subordinating conjunction
• JJ adjective ‘big’
• JJR adjective, comparative ‘bigger’
• JJS adjective, superlative ‘biggest’
• LS list marker 1)
• MD modal could, will
• NN noun, singular ‘desk’
• NNS noun plural ‘desks’
• NNP proper noun, singular ‘Harrison’
• NNPS proper noun, plural ‘Americans’
• PDT predeterminer ‘all the kids’
• POS possessive ending parent’s
• PRP personal pronoun I, he, she
• PRP$ possessive pronoun my, his, hers
• RB adverb very, silently
• RBR adverb, comparative better
• RBS adverb, superlative best
• RP particle give up
• TO to, as in go ‘to’ the store
• UH interjection errrrrrrrm
• VB verb, base form take
• VBD verb, past tense took
• VBG verb, gerund/present participle taking
• VBN verb, past participle taken
• VBP verb, sing. present, non-3rd person take
• VBZ verb, 3rd person sing. present takes
• WDT wh-determiner which
• WP wh-pronoun who, what
• WP$ possessive wh-pronoun whose
• WRB wh-adverb where, when
Stop Word Libraries are Huge
a a's able about above according accordingly across actually after
afterwards again against ain't all allow allows almost alone along
already also although always am among amongst an and another
any anybody anyhow anyone anything anyway anyways anywhere apart appear
appreciate appropriate are aren't around as aside ask asking associated
at available away awfully
‘A’ words in an EN stop word list
• Anaphora resolution (AR), which most commonly appears as pronoun
resolution, is the problem of resolving references to earlier or later
items in the discourse.
• Example: "John found the love of his life", where 'his' refers to 'John'
• This is easily understood by humans but not so much by machines
Example and definition from: https://nlp.stanford.edu/courses/cs224n/2003/fp/iqsayed/project_report.pdf
Anaphora
Cataphora –
According to
Wikipedia
In linguistics cataphora is the use of an
expression or word that co-refers with a later,
more specific, expression in the discourse.
EXAMPLE: “When he arrived home, John went
to sleep”
HE IS JOHN, BUT JOHN WAS NOT KNOWN
WHEN ‘HE’ WAS REFERRED TO – SO CAUSES
CONFUSION REGARDING WHO ‘HE’ IS
Anaphora and
Coreference
Resolution
• There are some algorithms in place to handle
anaphora resolution
• In conversational search this still struggles after
a few multi-turn questions
NLTK Toolkit
Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.
Chunk =
Usually chunks
of words
Token =
Usually a word
Digital Marketing
Events – Using Tables
• https://research.google.com
/tables
Information Architecture for the World Wide Web, 3rd Edition by Louis Rosenfeld, Peter Morville
How can all this help you as an SEO?
• Consider exploring your topical models by using noindex admin only
tag clouds to visualise the topics you see
• Utilise relatedness, considering words likely to appear in WordSim353 or
other data sources
• Pass supporting topical hints to emphasise the meanings in 1st and 2nd
level relatedness
• Consider query intent shift and mobile IR / contextual search as niche
fields of IR
• Utilise semi-structured elements to strengthen unstructured pages in
noisy (particularly longer) pages
How can all this help you as an SEO?
• Further disambiguation measures on locations used in different
countries with same name
• Be consistent in your naming conventions. Refer to Wikipedia if in
doubt – check their redirects on terms
• Other semantic clues for entities with same name – e.g. gender / role
/ location / geographic clues
• Utilise anchors to emphasise further from 2nd level relatedness pages
• Utilise co-occurring terms from databases considered similar /
connected / related

Contenu connexe

Tendances

Natural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Graph Databases and Neo4jNatural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Graph Databases and Neo4jWilliam Lyon
 
Smart Data Webinar: Advances in Natural Language Processing I - Understanding
Smart Data Webinar: Advances in Natural Language Processing I - UnderstandingSmart Data Webinar: Advances in Natural Language Processing I - Understanding
Smart Data Webinar: Advances in Natural Language Processing I - UnderstandingDATAVERSITY
 
Building Smarter Search Applications Using Built-In Knowledge Graphs and Quer...
Building Smarter Search Applications Using Built-In Knowledge Graphs and Quer...Building Smarter Search Applications Using Built-In Knowledge Graphs and Quer...
Building Smarter Search Applications Using Built-In Knowledge Graphs and Quer...Lucidworks
 
Breaking Down NLP for SEOs - SMX Advanced Europe 2019 - Paul Shapiro
Breaking Down NLP for SEOs - SMX Advanced Europe 2019 - Paul ShapiroBreaking Down NLP for SEOs - SMX Advanced Europe 2019 - Paul Shapiro
Breaking Down NLP for SEOs - SMX Advanced Europe 2019 - Paul ShapiroPaul Shapiro
 
SearchLove London - Analysing the SERPs for SEO, Content & Customer Insights
SearchLove London - Analysing the SERPs for SEO, Content & Customer InsightsSearchLove London - Analysing the SERPs for SEO, Content & Customer Insights
SearchLove London - Analysing the SERPs for SEO, Content & Customer InsightsRory Truesdale
 
Natural language search using Neo4j
Natural language search using Neo4jNatural language search using Neo4j
Natural language search using Neo4jKenny Bastani
 
Natural Language Processing with Neo4j
Natural Language Processing with Neo4jNatural Language Processing with Neo4j
Natural Language Processing with Neo4jKenny Bastani
 
SearchLove London 2019 - Rory Truesdale - Using the SERPs to Know Your Audience
SearchLove London 2019 - Rory Truesdale - Using the SERPs to Know Your AudienceSearchLove London 2019 - Rory Truesdale - Using the SERPs to Know Your Audience
SearchLove London 2019 - Rory Truesdale - Using the SERPs to Know Your AudienceDistilled
 
Conversational AI for Real Estate
Conversational AI for Real EstateConversational AI for Real Estate
Conversational AI for Real EstateInman News
 
Natural lanaguage processing
Natural lanaguage processingNatural lanaguage processing
Natural lanaguage processinggulshan kumar
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingIla Group
 
Bearish SEO: Defining the User Experience for Google’s Panda Search Landscape
Bearish SEO: Defining the User Experience for Google’s Panda Search LandscapeBearish SEO: Defining the User Experience for Google’s Panda Search Landscape
Bearish SEO: Defining the User Experience for Google’s Panda Search LandscapeMarianne Sweeny
 
Natural Language Processing in AI
Natural Language Processing in AINatural Language Processing in AI
Natural Language Processing in AISaurav Shrestha
 
New Concepts: Timespan and Place (Transcript)
New Concepts: Timespan and Place (Transcript)New Concepts: Timespan and Place (Transcript)
New Concepts: Timespan and Place (Transcript)ALAeLearningSolutions
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language ProcessingPranav Gupta
 
New Concepts: Fictitious and Non-human Personages (Transcript)
New Concepts: Fictitious and Non-human Personages (Transcript)New Concepts: Fictitious and Non-human Personages (Transcript)
New Concepts: Fictitious and Non-human Personages (Transcript)ALAeLearningSolutions
 
Natural language procssing
Natural language procssing Natural language procssing
Natural language procssing Rajnish Raj
 
Special Topics: Recording Methods and Transcription Guidelines--Transcript (J...
Special Topics: Recording Methods and Transcription Guidelines--Transcript (J...Special Topics: Recording Methods and Transcription Guidelines--Transcript (J...
Special Topics: Recording Methods and Transcription Guidelines--Transcript (J...ALAeLearningSolutions
 
NLP Deep Learning with Tensorflow
NLP Deep Learning with TensorflowNLP Deep Learning with Tensorflow
NLP Deep Learning with Tensorflowseungwoo kim
 

Tendances (20)

Natural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Graph Databases and Neo4jNatural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Graph Databases and Neo4j
 
Smart Data Webinar: Advances in Natural Language Processing I - Understanding
Smart Data Webinar: Advances in Natural Language Processing I - UnderstandingSmart Data Webinar: Advances in Natural Language Processing I - Understanding
Smart Data Webinar: Advances in Natural Language Processing I - Understanding
 
Building Smarter Search Applications Using Built-In Knowledge Graphs and Quer...
Building Smarter Search Applications Using Built-In Knowledge Graphs and Quer...Building Smarter Search Applications Using Built-In Knowledge Graphs and Quer...
Building Smarter Search Applications Using Built-In Knowledge Graphs and Quer...
 
Breaking Down NLP for SEOs - SMX Advanced Europe 2019 - Paul Shapiro
Breaking Down NLP for SEOs - SMX Advanced Europe 2019 - Paul ShapiroBreaking Down NLP for SEOs - SMX Advanced Europe 2019 - Paul Shapiro
Breaking Down NLP for SEOs - SMX Advanced Europe 2019 - Paul Shapiro
 
SearchLove London - Analysing the SERPs for SEO, Content & Customer Insights
SearchLove London - Analysing the SERPs for SEO, Content & Customer InsightsSearchLove London - Analysing the SERPs for SEO, Content & Customer Insights
SearchLove London - Analysing the SERPs for SEO, Content & Customer Insights
 
Natural language search using Neo4j
Natural language search using Neo4jNatural language search using Neo4j
Natural language search using Neo4j
 
Natural Language Processing with Neo4j
Natural Language Processing with Neo4jNatural Language Processing with Neo4j
Natural Language Processing with Neo4j
 
SearchLove London 2019 - Rory Truesdale - Using the SERPs to Know Your Audience
SearchLove London 2019 - Rory Truesdale - Using the SERPs to Know Your AudienceSearchLove London 2019 - Rory Truesdale - Using the SERPs to Know Your Audience
Using topic modelling frameworks for NLP and semantic search

  • 1. Using Topic Modelling To Win Big with NLP & Semantic Search. Dawn Anderson (@DawnieAndo) from @MoveItMarketing
  • 2. If I said to you… “I’ve got a new jaguar”
  • 3. “It’s in the garage” (sidenote: this is not my garage)
  • 7. “Jag is neither a car nor a cat”
  • 8. Enable gzip compression via caching plugins, .htaccess or via compression plugins MOBILE NO INTERSTITIALS ON MOBILE THANK YOU Polysemy in linguistics is problematic
  • 9. DISCLAIMER: I am NOT a data scientist
  • 11. But I will be talking about some concepts covering: data science, information retrieval, algorithms, linguistics, information architecture and library science
  • 12. Which are areas relevant to the search industry
  • 13. These are all connected to how search engines find the right information, for the right informational need at the right time
  • 14. ‘Information retrieval’: to extract informational resources to meet a search engine user’s information need at the time of the query.
  • 16. Crawl (and render) the haystack = Crawling frontier
  • 17. Organise the straw into bales – indexing (2 wave)
  • 21. But there is so much hay
  • 23. Every day there are huge volumes of new indexable data
  • 24. But we only want to return one (or a few) needle(s) of hay
  • 27. One example … Google’s mobile-first indexing plans
  • 29. MOBILE-FIRST GOES FAR BEYOND WEBSITES
  • 31. Time and space, distance, speed of movement come into play
  • 33. A lot of users might be interested in topical foraging too (information foraging theory)
  • 34. They might want to learn about the whole topic of hay or straw
  • 35. 900 Or they may be researching to buy a car and want lots of different types of information on cars
  • 37. You think there are several types of SEO? Local SEO, Technical SEO, Schema Specialist, Content Marketer, Outreach Specialist, Digital PR
  • 38. There are at least as many niche areas of Information Retrieval: Mobile IR, Contextual Search, Natural Language Processing, Conversational Search, Similarity Search, Recommender Systems, Library Science
  • 40. The problem is… words are hard
  • 41. Every other word in the English language has multiple meanings
  • 43. But… if we understood that a topic is about cats, we would recognize a jaguar
  • 44. On their own, single words have no semantic meaning
  • 45. How can we understand these word meanings?
  • 46. Using structured data is an obvious way to disambiguate
  • 47. Structured versus unstructured data • Structured data – high degree of organization • Readily searchable by simple search engine algorithms or known search operators (e.g. SQL) • Logically organized • Often stored in a relational database
  • 52. Conversational search: the knowledge graph is checked first
  • 53. Ontology Driven Natural Language Processing (Image credit: IBM, https://www.ibm.com/developerworks/community/blogs/nlp/entry/ontology_driven_nlp)
  • 54. Even named entities can be ambiguous / polysemic • Amadeus Mozart (composer) • Mozart Street • Mozart Cafe
  • 55. Australian towns using English names (NSW example only)
  • 57. An area of IR dedicated to understanding the ambiguous needs for queries with multiple meanings
  • 58. How can we fill in the gaps between named entities?
  • 59. When there is so much noise
  • 60. There are still many open challenges in natural language processing
  • 62. Text cohesion • Cohesion is the grammatical and lexical linking within a text or sentence that holds a text together and gives it meaning. • It is related to the broader concept of coherence. (Wikipedia)
  • 63. ‘Topic Modelling’ According to Wikipedia: “In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body”.
  • 64. A collection of text based web pages (corpus)
  • 65. A collection of text based documents (corpus)
  • 68. “You shall know a word by the company it keeps” (John Rupert Firth, 1957)
  • 71. Relatedness is NOT about structured data
  • 72. 2 words are similar if they co-occur with similar words
  • 73. First Level Relatedness – Words that appear together in the same sentence
  • 74. 2 words are similar if they occur in a given grammatical relation with the same words (e.g. harvest, peel, eat, slice)
  • 75. Second Level Relatedness – words that co-occur with the same ‘other’ words
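The two levels of relatedness described in the slides above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the toy sentences and the function names `first_level` and `second_level` are invented for the example and are not taken from the deck or from any search engine.

```python
from collections import defaultdict
from itertools import combinations

# Toy corpus (hypothetical sentences, invented for illustration).
sentences = [
    "harvest the apple",
    "peel the apple",
    "harvest the orange",
    "peel the orange",
]

def first_level(sentences):
    """Map each word to the set of words it shares a sentence with."""
    neighbours = defaultdict(set)
    for sentence in sentences:
        words = set(sentence.split())
        for a, b in combinations(words, 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours

def second_level(neighbours, a, b):
    """Words that both a and b co-occur with (excluding a and b themselves)."""
    return (neighbours[a] & neighbours[b]) - {a, b}

neighbours = first_level(sentences)
print("apple" in neighbours["harvest"])   # first level: they share a sentence
print("orange" in neighbours["apple"])    # these two never share a sentence
print(sorted(second_level(neighbours, "apple", "orange")))  # shared company
```

Here "apple" and "orange" never appear together, yet they keep the same company ("harvest", "peel", "the"), which is the second-level signal. A real pipeline would likely filter stop words such as "the" first.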
  • 76. We need more words
  • 78. WordSim353 Dataset: some words with high similarity or relatedness
        Word 1        Word 2         Human (mean)
        tiger         tiger          10
        fuck          sex            9.44
        journey       voyage         9.29
        midday        noon           9.29
        dollar        buck           9.22
        money         cash           9.15
        coast         shore          9.1
        money         cash           9.08
        money         currency       9.04
        football      soccer         9.03
        magician      wizard         9.02
        type          kind           8.97
        gem           jewel          8.96
        car           automobile     8.94
        street        avenue         8.88
        asylum        madhouse       8.87
        boy           lad            8.83
        environment   ecology        8.81
        furnace       stove          8.79
        seafood       lobster        8.7
        mile          kilometer      8.66
        Maradona      football       8.62
        OPEC          oil            8.59
        king          queen          8.58
        murder        manslaughter   8.53
        money         bank           8.5
        computer      software       8.5
        Jerusalem     Israel         8.46
        vodka         gin            8.46
        planet        star           8.45
  • 79. A Moving Word ‘Context Window’
  • 80. Typical window size might be 5. Source text: “Writing a list of random sentences is harder than I initially thought it would be”. The window covers 11 words (5 left and 5 right of the moving target word)
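A moving context window like the one on this slide can be sketched as follows. This is a minimal sketch, assuming whitespace tokenization and a window of 5 words on each side; the function name `context_windows` is hypothetical.

```python
def context_windows(text, size=5):
    """Yield (target, left_context, right_context) for each word in text."""
    words = text.split()
    for i, target in enumerate(words):
        left = words[max(0, i - size):i]       # up to `size` words before the target
        right = words[i + 1:i + 1 + size]      # up to `size` words after the target
        yield target, left, right

source = ("Writing a list of random sentences is harder than "
          "I initially thought it would be")
windows = list(context_windows(source, size=5))

# Look at the window centred on the sixth word of the sentence.
target, left, right = windows[5]
print(left, "[" + target + "]", right)
```

Near the start and end of the text the window is simply truncated, so the first word has no left context at all.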
  • 82. Vector representations of words (Word Vectors)
  • 87. Continuous Bag of Words (CBOW): takes a continuous bag of words, with no regard to word order, and uses a context window of size n (n-gram) to ascertain which words are similar or related, using Euclidean distances to create vector models and word embeddings
  • 88. Skip-gram model: the opposite of CBOW (continuous bag of words)
  • 90. Both models learn the weights of the similarity and relatedness distances
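The difference between the two models is the direction of prediction: CBOW predicts a target word from its surrounding context, while skip-gram predicts the context words from the target. The sketch below only builds the training examples each model would see; it does not train anything. The function names, the toy sentence and the window size of 2 are assumptions made for illustration.

```python
def cbow_examples(words, size=2):
    """CBOW: the surrounding context words are the input, the target is the label."""
    examples = []
    for i, target in enumerate(words):
        context = words[max(0, i - size):i] + words[i + 1:i + 1 + size]
        examples.append((context, target))
    return examples

def skipgram_examples(words, size=2):
    """Skip-gram: the target word is the input, each context word is a label."""
    examples = []
    for context, target in cbow_examples(words, size):
        for context_word in context:
            examples.append((target, context_word))
    return examples

words = "the jaguar is in the garage".split()
print(cbow_examples(words)[1])        # (context words, target) for "jaguar"
print(skipgram_examples(words)[:3])   # first few (target, context word) pairs
```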
  • 92. 01 Word2Vec: single words; word embeddings • 02 Doc2Vec: words & meta data; word embeddings and document embeddings • 03 Sentence2Vec: chunks of words; more context available • 04 Paragraph2Vec: full paragraphs; even more context and semantics
  • 93. In order to understand what words in documents constitute ‘relevance’ to a query
  • 94. Testing Similarity and Relatedness http://ws4jdemo.appspot.com
  • 95. GloVe: Global Vectors for Word Representation • What is GloVe? • “GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.” • (https://nlp.stanford.edu/projects/glove/)
  • 96. Linear Substructures in GloVe • Sometimes more than one word pair is needed to understand meaning • Particularly when the words are opposites of each other (e.g. man and woman) • By adding additional word pairs, further semantic hints provide context to understand the meaning of concepts
  • 97. GloVe: Nearest Neighbour Cosine Similarity • Nearest words to Frog • https://nlp.stanford.edu/projects/glove/ Glove2Vec
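A nearest-neighbour lookup by cosine similarity can be sketched as below. The 3-dimensional vectors are hypothetical values made up for this example (real GloVe vectors are much longer and come from the downloadable files on the GloVe project page), and `nearest_neighbours` is an illustrative helper, not part of any GloVe tooling.

```python
import math

# Hypothetical toy vectors, invented for illustration only.
vectors = {
    "frog":   [0.90, 0.80, 0.10],
    "toad":   [0.85, 0.75, 0.20],
    "lizard": [0.70, 0.90, 0.30],
    "car":    [0.10, 0.20, 0.95],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbours(word, vectors, top_n=3):
    """Rank every other word by cosine similarity to `word`."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

for other, score in nearest_neighbours("frog", vectors):
    print(f"{other}: {score:.3f}")
```

With these made-up values "toad" ranks closest to "frog" and "car" furthest, which mirrors the idea of the slide rather than reproducing its actual results.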
  • 100. Concept Graphs Using Relatedness
  • 101. Wikipedia is a gold mine for IR researchers – each page is considered a concept
  • 102. Similarity and Relatedness Similarity – words that mean the same or nearly the same Relatedness - Words that live together within a topic / co-occur in the same corpora / collection / sub-section of a collection
  • 105. A website is NOT unstructured data It has a hierarchy It has weighted sections It has metadata It (often) has a tree like structure
  • 106.
  • 107. • BM25 • BM25+ • BM25L • OKAPI BM25
  • 108.
  • 109. BM = BEST MATCH
  • 110.
  • 111.
  • 113. Probably BM25F is used for web pages • BM25F allows for web pages which have structure (additional fields), compared with normal flat text output (e.g. from text files) • Takes into consideration elements such as page title, meta data, sections, footers, headers, anchor text • Adds weights for different elements on a page
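The idea of weighting fields before scoring can be sketched as follows. This is a deliberately simplified, BM25F-style illustration: it combines weighted term frequencies across fields and then applies BM25-style saturation, but it omits the per-field length normalisation of full BM25F. The field weights, the `K1` value and the documents are all assumptions chosen for the example.

```python
import math

# Hypothetical field weights and saturation parameter (assumptions).
FIELD_WEIGHTS = {"title": 3.0, "anchor": 2.0, "body": 1.0}
K1 = 1.2

def idf(term, docs):
    """Inverse document frequency over the whole collection."""
    n = sum(1 for d in docs if any(term in tokens for tokens in d.values()))
    total = len(docs)
    return math.log((total - n + 0.5) / (n + 0.5) + 1)

def weighted_tf(term, doc):
    """Sum the term's frequency in each field, scaled by the field weight."""
    return sum(
        FIELD_WEIGHTS.get(field, 1.0) * tokens.count(term)
        for field, tokens in doc.items()
    )

def score(query_terms, doc, docs):
    """Simplified BM25F-style score (no length normalisation)."""
    total = 0.0
    for term in query_terms:
        tf = weighted_tf(term, doc)
        total += idf(term, docs) * tf / (K1 + tf)
    return total

# Hypothetical three-page collection.
docs = [
    {"title": ["jaguar", "cars"], "body": ["new", "jaguar", "models", "for", "sale"]},
    {"title": ["big", "cats"], "body": ["the", "jaguar", "is", "a", "big", "cat"]},
    {"title": ["garage", "doors"], "body": ["fit", "a", "new", "garage", "door"]},
]

for d in docs:
    print(d["title"], round(score(["jaguar"], d, docs), 3))
```

The page with "jaguar" in its title outscores the page that only mentions it in the body, which is the effect of adding weights for different elements on a page.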
  • 115.
  • 116. Semi-structured data • Hierarchical nature of a website • Tree structure • Well sectioned and including clear containers and meta headings • An ontology map between semi and structured
  • 117. Lexical ‘nyms’ • Antonym – The opposite meaning • Synonym – The same meaning • Meronym – Part of something else (whole) – e.g. finger (hand) (part / whole relations) • Hyponym – A subset of something else – e.g. fork (cutlery) • Hypernym – A superset (superordinate) – e.g. colour
  • 118.
  • 120.
  • 121. TF:IDF LOCAL v GLOBAL? (The whole document collection) Across your site?? Across all documents relevant for the topic??
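The local versus global question matters because the IDF part of TF:IDF depends entirely on which collection is counted. A minimal sketch, using hypothetical documents represented as sets of terms and a plain `log(N / df)` IDF (one of several common variants):

```python
import math

def idf(term, collection):
    """Plain inverse document frequency: log(N / df); 0.0 if the term is absent."""
    df = sum(1 for doc in collection if term in doc)
    return math.log(len(collection) / df) if df else 0.0

# Hypothetical 'local' collection: pages on one site, all about hay.
site_docs = [
    {"hay", "bale", "straw"},
    {"hay", "farm"},
    {"hay", "needle"},
]

# Hypothetical 'global' collection: the same pages plus unrelated ones.
global_docs = site_docs + [
    {"car", "garage"},
    {"jaguar", "cat"},
    {"mobile", "phone"},
]

print("local IDF for 'hay': ", idf("hay", site_docs))
print("global IDF for 'hay':", round(idf("hay", global_docs), 3))
```

Within the site every page contains "hay", so the term has no discriminating weight locally, while across the wider collection it does.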
  • 122. A website is NOT unstructured data It has a hierarchy It has weighted sections It has metadata It (often) has a tree like structure
  • 124.
  • 127. Predicting the future with Web Dynamics • The journey to predict the future: Kira Radinsky at TEDxHiriya
  • 128. Find out what correlates and when
  • 129. How can we improve our topical relatedness?
  • 130. Tell Me About Your Haystack
  • 131. Cancel your noise Unstructured data is voluminous Filled with irrelevance Lacks focus Riddled with stopwords Lots of meaningless text and further ambiguating jabber
  • 132. Disambiguate lean content with powerful structured data
  • 133. And strong linking nearest neighbour topically rich pages
  • 134. Use well organised hyponyms (Hyponymy and Hypernymy) – a simple unordered list with children
      • Cutlery (Hypernym)
          • Spoons (Hyponym + Hypernym)
              • Dessert ((co) Hyponym)
              • Tea ((co) Hyponym)
              • Table ((co) Hyponym)
          • Forks (Hyponym + Hypernym)
          • Knives (Hyponym + Hypernym)
              • Carving ((co) Hyponym)
              • Steak ((co) Hyponym)
              • Butchers ((co) Hyponym)
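One way to keep such a hierarchy well organised is to hold it as nested data and generate the unordered list from it. The sketch below assumes the nesting implied by the slide (Spoons and Knives with three children each); the `render` helper and the exact markup are illustrative choices, not a prescribed template.

```python
# Hypernym -> hyponym hierarchy, assumed from the cutlery example.
cutlery = {
    "Cutlery": {
        "Spoons": {"Dessert": {}, "Tea": {}, "Table": {}},
        "Forks": {},
        "Knives": {"Carving": {}, "Steak": {}, "Butchers": {}},
    }
}

def render(tree, indent=0):
    """Render a nested dict as an HTML unordered list with children."""
    pad = "  " * indent
    lines = [f"{pad}<ul>"]
    for name, children in tree.items():
        if children:
            # A term with children is both a hyponym and a hypernym.
            lines.append(f"{pad}  <li>{name}")
            lines.extend(render(children, indent + 2))
            lines.append(f"{pad}  </li>")
        else:
            lines.append(f"{pad}  <li>{name}</li>")
    lines.append(f"{pad}</ul>")
    return lines

html = "\n".join(render(cutlery))
print(html)
```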
  • 135. Image alt tags (and image title tags) help with disambiguation too
  • 136. Stemming & Lemmatization Both aim to take a word back to its common base form Avoid keyword stuffing… be aware of stemming and lemmatization
  • 137.
  • 138. Tables are relational databases too – use liberally (with headers)
      ID   Event Name           Event Type   Event City   Event Country
      1    Ungagged Las Vegas   Conference   Las Vegas    US
      2    Ungagged London      Conference   London       UK
      3    State of Digital     Conference   London       UK
      4    Brighton SEO         Conference   Brighton     UK
  • 139. Widget Logic / Widget Context
  • 140.
  • 141.
  • 142.
  • 143.
  • 144.
  • 145. Stay in your topical lane Topical drift / dilution is a big problem
  • 147. Merge content but watch out for topical dilution – what did Wikipedia redirect? The whole is greater than the sum of its parts
  • 152. In theory… the consolidated page should rank higher… but…
  • 154. Throw the words into a word cloud
  • 155. So the most prominent topics & nuances appear
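A word cloud is essentially a picture of term frequencies, so the same check can be sketched with a simple count. The page text and the small stop word set below are hypothetical; a real stop word library is far larger.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list (assumption).
STOP_WORDS = {"a", "and", "are", "at", "is", "the"}

def top_terms(text, n=5):
    """Count non-stop-word tokens; a word cloud would size words by these counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

# Hypothetical page copy, invented for illustration.
page = (
    "Hay bales and straw bales are stored at the farm. "
    "Hay is dried grass; straw is a stalk."
)

for term, count in top_terms(page):
    print(term, count)
```

If an off-topic term shows up near the top of the counts, that is a hint of topical dilution on the page.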
  • 156. Watch out for topic dilution / drift in user generated content
  • 157. Educate crazy taggers but not before you’ve used their topic tags to fix dilution
  • 158. All the anchors & contextual & navigational internal linking
  • 159. Even if it is just a breadcrumb trail
  • 160.
  • 161.
  • 162.
  • 163. Sources and References • Kira Radinsky TEDx Talk - https://www.youtube.com/watch?v=gAifa_CVGCY • Stop Word Library Example - https://sites.google.com/site/kevinbouge/stopwords-lists • Image credit: Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc. • The work of - Radinsky, K., 2012, December. Learning to predict the future using Web knowledge and dynamics. In ACM SIGIR Forum (Vol. 46, No. 2, pp. 114-115). ACM. • http://9ol.es/porter_js_demo.html
  • 164. Sources and References • Lohar, P., Ganguly, D., Afli, H., Way, A. and Jones, G.J., 2016. FaDA: Fast document aligner using word embedding. The Prague Bulletin of Mathematical Linguistics, 106(1), pp.169-179. • https://en.wikipedia.org/wiki/List_of_locations_in_Australia_with_an_English_name • Barbara Plank | Keynote - Natural Language Processing: - https://www.youtube.com/watch?v=Wl6c0OpF6Ho
  • 165. Further Reading • https://github.com/Hironsan/awesome-embedding-models • https://nlp.stanford.edu/IR-book/html/htmledition/document-representations-and-measures-of-relatedness-in-vector-spaces-1.html • https://www.youtube.com/watch?time_continue=790&v=wI5O-lYLBCw • https://en.wikipedia.org/wiki/Euclidean_distance • Ibrahim, O.A.S. and Landa-Silva, D., 2016. Term frequency with average term occurrences for textual information retrieval. Soft Computing, 20(8), pp.3045-3061.
  • 166. Further Reading • Lotfi, A., Bouchachia, H., Gegov, A., Langensiepen, C. and McGinnity, M., 2018. Advances in Computational Intelligence Systems. Intelligence. • https://www.researchgate.net/post/What_is_the_difference_between_TFIDF_and_term_distribution_for_feature_selection • https://radimrehurek.com/gensim/models/word2vec.html • McDonald, R., Brokos, G.I. and Androutsopoulos, I., 2018. Deep relevance ranking using enhanced document-query interactions. arXiv preprint arXiv:1809.01682.
  • 167. Further Reading • Boyd-Graber, J., Hu, Y. and Mimno, D., 2017. Applications of topic models. Foundations and Trends® in Information Retrieval, 11(2-3), pp.143-296. • https://nlp.stanford.edu/projects/glove/ • https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html • Sherkat, E. and Milios, E.E., 2017, June. Vector embedding of wikipedia concepts and entities. In International conference on applications of natural language to information systems (pp. 418-428). Springer, Cham.
  • 169. Precision and Recall (GOLD STANDARD) • Lots of results inaccurately deemed highly relevant retrieved • Lots of results inaccurately deemed irrelevant not retrieved • Maybe not enough relevant documents to fetch much here • Lots of documents came back but not enough highly relevant • Many highly relevant docs to meet informational need returned • Results were highly relevant but not enough came back • Maybe being ‘too picky’ • A wide net was cast but not many of the right type of fishes caught • https://medium.com/@klintcho/explaining-precision-and-recall-c770eb9c69e9 • Manning, C.D. and Schütze, H., 1999. Foundations of statistical natural language processing. MIT Press.
  • 170. Canadian towns using English names
  • 171. US towns using English names
  • 172. You should be aware ‘indexing’ and ‘ranking’ are two very separate things
  • 173. Example context window size 3. Source text: “The quick brown fox jumps over the lazy dog”
      Target word      Training samples
      the              (the, quick) (the, brown) (the, fox)
      quick            (quick, the) (quick, brown) (quick, fox) (quick, jumps)
      (later words)    Etcetera
  • 174.
  • 175. Stemming (Popular stemmer is PorterStemmer) • https://github.com/johnpcarty/mysql-porter-stemmer/blob/master/porterstemmer.sql
      Conditions                   Suffix   Replacement   Examples
      --------------------------   ------   -----------   -----------------------
      (m>1)                        al       NULL          revival -> reviv
      (m>1)                        ance     NULL          allowance -> allow
      (m>1)                        ence     NULL          inference -> infer
      (m>1)                        er       NULL          airliner -> airlin
      (m>1)                        ic       NULL          gyroscopic -> gyroscop
      (m>1)                        able     NULL          adjustable -> adjust
      (m>1)                        ible     NULL          defensible -> defens
      (m>1)                        ant      NULL          irritant -> irrit
      (m>1)                        ement    NULL          replacement -> replac
      (m>1)                        ment     NULL          adjustment -> adjust
      (m>1)                        ent      NULL          dependent -> depend
      (m>1 and (*<S> or *<T>))     ion      NULL          adoption -> adopt
      (m>1)                        ou       NULL          homologou -> homolog
      (m>1)                        ism      NULL          communism -> commun
      (m>1)                        ate      NULL          activate -> activ
      (m>1)                        iti      NULL          angulariti -> angular
      (m>1)                        ous      NULL          homologous -> homolog
      (m>1)                        ive      NULL          effective -> effect
      (m>1)                        ize      NULL          bowdlerize -> bowdler
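The rule table above can be sketched in plain Python. This covers only this one suffix-stripping step, not the full PorterStemmer, and the helper names (`is_consonant`, `measure`, `strip_suffix`) are invented for the example. Here m is the Porter "measure": the number of vowel-consonant sequences in the remaining stem.

```python
SUFFIXES = [
    "al", "ance", "ence", "er", "ic", "able", "ible", "ant", "ement",
    "ment", "ent", "ion", "ou", "ism", "ate", "iti", "ous", "ive", "ize",
]

def is_consonant(word, i):
    """A letter is a consonant unless it is a, e, i, o, u, or y after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_vowel:
            m += 1
        prev_vowel = not consonant
    return m

def strip_suffix(word):
    """Apply the longest matching suffix rule if its condition holds."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # 'ion' is only removed when the stem ends in s or t.
            ion_ok = suffix != "ion" or stem.endswith(("s", "t"))
            if measure(stem) > 1 and ion_ok:
                return stem
            return word  # longest match found but condition failed
    return word

for w in ["revival", "replacement", "adoption", "gyroscopic", "rate"]:
    print(w, "->", strip_suffix(w))
```

A short word such as "rate" is left alone because the stem "r" has a measure of 0, which is why the (m>1) condition is there.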
  • 176. WordSim353 Dataset: some words with low similarity or relatedness
      Word 1        Word 2        Human (mean)
      king          cabbage       0.23
      professor     cucumber      0.31
      chord         smile         0.54
      noon          string        0.54
      rooster       voyage        0.62
      sugar         approach      0.88
      stock         jaguar        0.92
      stock         life          0.92
      monk          slave         0.92
      lad           wizard        0.92
      delay         racism        1.19
      stock         CD            1.31
      drink         ear           1.31
      stock         phone         1.62
      holy          sex           1.62
      production    hike          1.75
      precedent     group         1.77
      stock         egg           1.81
      energy        secretary     1.81
      month         hotel         1.81
      forest        graveyard     1.85
      cup           substance     1.92
      possibility   girl          1.94
      cemetery      woodland      2.08
      glass         magician      2.08
      cup           entity        2.15
      Wednesday     news          2.22
      direction     combination   2.25
  • 177. Coast and Shore Example • Coast and shore have a similar meaning • They co-occur in first and second level relatedness documents in a collection • They would receive a high score in similarity
  • 179. There are 68 variants for mobile phone redirected on Wikipedia
  • 181. Context window (n = window size = 2) (2 words either side). Source text: “The quick brown fox jumps over the lazy dog”
      Target word   Training samples
      the           (the, quick) (the, brown)
      quick         (quick, the) (quick, brown) (quick, fox)
      brown         (brown, the) (brown, quick) (brown, fox) (brown, jumps)
      fox           (fox, quick) (fox, brown) (fox, jumps) (fox, over)
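The training samples for a moving context window can be generated with a short loop. A minimal sketch: `training_pairs` is an illustrative name, and lower-casing plus whitespace splitting stands in for proper tokenization.

```python
def training_pairs(text, window=2):
    """Pair each target word with every word up to `window` positions either side."""
    words = text.lower().split()
    pairs = []
    for i, target in enumerate(words):
        start = max(0, i - window)
        end = min(len(words), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, words[j]))
    return pairs

pairs = training_pairs("The quick brown fox jumps over the lazy dog", window=2)
for pair in pairs[:13]:
    print(pair)
```

Changing `window` to 3 gives the wider set of pairs from the size-3 example, such as (the, fox).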
  • 182. Other areas of IR Including… but not limited to Mobile IR Contextual IR Natural language processing Conversational search Similarity search Recommender systems Image IR Music IR
  • 183. Recall and Precision in IR • Precision is about the best results: how many of the results returned for the query are actually relevant • Recall is about coverage: how many of all the relevant results were returned for the query
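In the standard IR definitions, precision is the share of retrieved results that are relevant, and recall is the share of relevant documents that were retrieved. A minimal sketch with hypothetical document IDs:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 documents came back, 5 are relevant overall.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d2", "d5", "d6", "d7"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four results are relevant (precision 0.5), but only two of the five relevant documents were found (recall 0.4).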
  • 184. Query Expansion Example • Likely based on co-occurrence data • Increases recall (more results) but may reduce precision • Source: https://slideplayer.com/slide/13138343/ - Query Expansion and Relevance Feedback
  • 185. Increase Recall – 2 Main Methods
      Query Relaxation (take bits away from the query) • Ignore stop words in query • Relax specificity (remove specific) • Use lexical database (a “knowledge graph”, e.g. WordNet) to find a more general term (hypernym - superset) • Use Part of Speech Tagger to identify structure of query & expand nouns (things) • Preserve head noun and strip modifiers (e.g. dog) • Use Word2Vec to identify semantics from a vector space using word embeddings from the query – find semantics (related and similar)
      Query Expansion / Query Rewriting (add bits to the query) • Broaden the query • Expand abbreviations • Stemming and lemmatization (in reverse) • Use Word2Vec to find abbreviations in a vector space using semantic similarity • Use synonyms (same / very similar meanings) • Use minimum cosine similarity from Word2Vec as safety net
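Both methods can be sketched on a toy query. The stop word set and the synonym table below are hypothetical stand-ins: in practice the synonyms might come from a lexical database such as WordNet or from Word2Vec neighbours above a minimum cosine similarity.

```python
# Hypothetical tiny stop word list and synonym table (assumptions).
STOP_WORDS = {"a", "the", "for", "in", "of"}
SYNONYMS = {"mobile": ["cell"], "phone": ["handset"]}

def relax(query):
    """Query relaxation: take bits away, here by ignoring stop words."""
    return [t for t in query.lower().split() if t not in STOP_WORDS]

def expand(tokens):
    """Query expansion: add bits, here by appending known synonyms."""
    expanded = list(tokens)
    for token in tokens:
        for synonym in SYNONYMS.get(token, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded

relaxed = relax("the best mobile phone for a student")
print(relaxed)
print(expand(relaxed))
```

Either step widens the set of documents that can match, which is why recall tends to go up while precision may go down.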
  • 187. Most popular word embedding tool – probably Word2Vec (new ones are emerging)
  • 188. Stemming & Lemmatization
      Stemming • Runs a series of rules to chop known ‘stems’ off the end of words • Suffix stripping algorithm • Often leaves incorrect endings / crude performance (understemming) • Popular – PorterStemmer (Martin Porter) • Example: “alumnus” -> “alumnu”
      Lemmatization • Aims to do things properly • Tool from natural language processing • Needs a complete vocabulary and morphological analysis to work well • Also not perfect • Relies on a lexical knowledge base like WordNet to find the correct base form
  • 189. Gensim
  • 190. Part of Speech Tags (Python NLTK Library) • CC coordinating conjunction • CD cardinal digit • DT determiner • EX existential there (like: “there is” … think of it like “there exists”) • FW foreign word • IN preposition/subordinating conjunction • JJ adjective ‘big’ • JJR adjective, comparative ‘bigger’ • JJS adjective, superlative ‘biggest’ • LS list marker 1) • MD modal could, will • NN noun, singular ‘desk’ • NNS noun, plural ‘desks’ • NNP proper noun, singular ‘Harrison’ • NNPS proper noun, plural ‘Americans’ • PDT predeterminer ‘all the kids’ • POS possessive ending parent’s • PRP personal pronoun I, he, she • PRP$ possessive pronoun my, his, hers • RB adverb very, silently • RBR adverb, comparative better • RBS adverb, superlative best • RP particle give up • TO to, as in go ‘to’ the store • UH interjection errrrrrrrm • VB verb, base form take • VBD verb, past tense took • VBG verb, gerund/present participle taking • VBN verb, past participle taken • VBP verb, sing. present, non-3rd person take • VBZ verb, 3rd person sing. present takes • WDT wh-determiner which • WP wh-pronoun who, what • WP$ possessive wh-pronoun whose • WRB wh-adverb where, when
  • 191. Stop Word Libraries are Huge
  • 192. ‘A’ words in an EN stop word list: a, a's, able, about, above, according, accordingly, across, actually, after, afterwards, again, against, ain't, all, allow, allows, almost, alone, along, already, also, although, always, am, among, amongst, an, and, another, any, anybody, anyhow, anyone, anything, anyway, anyways, anywhere, apart, appear, appreciate, appropriate, are, aren't, around, as, aside, ask, asking, associated, at, available, away, awfully
  • 193. Anaphora • Anaphora resolution (AR), which most commonly appears as pronoun resolution, is the problem of resolving references to earlier or later items in the discourse. • Example: "John found the love of his life", where 'his' refers to 'John' (easily understood by humans but not so much by machines) • Example and definition from: https://nlp.stanford.edu/courses/cs224n/2003/fp/iqsayed/project_report.pdf
  • 194. Cataphora – According to Wikipedia: in linguistics, cataphora is the use of an expression or word that co-refers with a later, more specific, expression in the discourse. Example: “When he arrived home, John went to sleep”. ‘He’ is John, but John was not known when ‘he’ was referred to, so this causes confusion regarding who ‘he’ is
  • 195. Anaphora and Coreference Resolution • There are some algorithms in place to handle anaphora resolution • In conversational search this still struggles after a few multi-turn questions
  • 196.
  • 197. NLTK Toolkit Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.
  • 198.
  • 199.
  • 200. Chunk = Usually chunks of words Token = Usually a word
  • 201. Digital Marketing Events – Using Tables • https://research.google.com/tables
  • 202. Information Architecture for the World Wide Web, 3rd Edition by Louis Rosenfeld, Peter Morville
  • 203. How can all this help you as an SEO? • Consider exploring your topical models by using noindex, admin-only tag clouds to visualise the topics you see • Utilise relatedness, considering words likely in WordSim353 or other data sources • Pass supporting topical hints to emphasise the meanings in 1st and 2nd level relatedness • Consider query intent shift and mobile IR / contextual search as niche fields of IR • Utilise semi-structured elements to strengthen unstructured content in noisy (particularly longer) pages
  • 204. How can all this help you as an SEO? • Further disambiguation measures on locations used in different countries with same name • Be consistent in your naming conventions. Refer to Wikipedia if in doubt – check their redirects on terms • Other semantic clues for entities with same name – e.g. gender / role / location / geographic clues • Utilise anchors to emphasise further from 2nd level relatedness pages • Utilise co-occurring terms from databases considered similar / connected / related