Using topic modelling frameworks for NLP and semantic search

Using Topic
Modelling To
Win Big with
NLP &
Semantic
Search
Dawn Anderson
@DawnieAndo from
@MoveItMarketing

If I said to you…
“I’ve got a new
jaguar”

“It’s in the garage”
(sidenote: this is not my garage)

You probably
wouldn’t expect
to see this

“Jag is
neither a car
nor a cat”

Polysemy in linguistics is
problematic

DISCLAIMER
I am NOT a data
scientist

But I will be talking about some concepts
covering:
Data Science
01
Information
Retrieval
02
Algorithms
03
Linguistics
04
Information
Architecture
05
Library
Science

Which are all areas
relevant to the search
industry

These are all connected
to how search engines
find the right
information, for the right
informational need at
the right time

‘information
retrieval’
To extract informational resources to meet a
search engine user’s information need at time of
query.

Crawl (and render) the haystack = Crawling frontier

Organise the straw into bales – indexing (2 wave)

Inverted
Index: Text to
Doc ID
Mapping
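As a rough illustration of the text to doc ID mapping, here is a minimal sketch of an inverted index. The three documents and the function name are made up for the example, and a real search index stores far more (positions, fields, weights).

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lower-cased term to the sorted list of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical three-document collection.
docs = {
    1: "the jaguar is in the garage",
    2: "the jaguar is a big cat",
    3: "my car is in the garage",
}
index = build_inverted_index(docs)
print(index["jaguar"])  # doc IDs of every document mentioning "jaguar"
```

Looking up a query term is then a dictionary lookup rather than a scan of every document.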

Every day there
are huge volumes
of new indexable
data

But we only want to return one (or a few) needle(s) from the hay

Accelerating
technological
developments have
made search even more
complicated

One example … Google’s
mobile-first indexing plans

MOBILE-FIRST GOES FAR BEYOND WEBSITES

Time and space, distance, speed of
movement come into play

Contextual
Search
Exacerbates
everything
further

A lot of users might be
interested in topical foraging too
(information foraging theory)

They might want to learn about the whole topic of hay or straw

Or they may be researching to
buy a car and want lots of
different types of information
on cars

You think
there are
several
types of
SEO?
Local SEO
Technical SEO
Schema Specialist
Content Marketer
Outreach Specialist
Digital PR

There are at
least as
many niche
areas of
Information
Retrieval
Mobile IR
Contextual
Search
Natural
Language
Processing
Conversational
Search
Similarity
Search
Recommender
Systems
Library Science

The problem is… words are
hard

Every other word
in the English
language has
multiple
meanings

But…If we understood a topic is
about cats we would recognize a
jaguar

On their own single words
have no semantic meaning

How can we understand
these word meanings?

Using structured
data is an obvious
way to disambiguate

Structured versus unstructured data
• Structured data – high
degree of organization
• Readily searchable by
simple search engine
algorithms or known search
operators (e.g. SQL)
• Logically organized
• Often stored in a relational
database

Conversational search
The knowledge graph
is checked first

Ontology Driven Natural Language Processing
Image credit: IBM
https://www.ibm.com/developerworks/community/blogs/nlp/entry/ontology_driven_nlp

Even named entities can
be ambiguous /
polysemic
• Amadeus Mozart (composer)
• Mozart Street
• Mozart Cafe

Australian towns using English names
(NSW example only)

An area of IR dedicated to understanding the ambiguous
needs for queries with multiple meanings

How can we fill in
the gaps between
named entities?

There are still many
open challenges in
natural language
processing

Text cohesion
• Cohesion is
the grammatical and lexical linking
within a text or sentence that holds a
text together and gives it meaning.
• It is related to the broader concept
of coherence. (Wikipedia)

‘Topic Modelling’
According to Wikipedia:
“In machine learning and natural language processing, a
topic model is a type of statistical model for discovering the
abstract "topics" that occur in a collection of documents.
Topic modeling is a frequently used text-mining tool for
discovery of hidden semantic structures in a text body”.

A collection of
text based web
pages (corpus)

A collection of
text based
documents
(corpus)

Term
Clarification
Machine learning ->
dataset
Information retrieval
-> corpora / corpus

We can disambiguate
through co-occurrence

“You shall know a
word by the company
it keeps” (John Rupert
Firth, 1957)

Using similarity &
relatedness

Relatedness is NOT
about structured
data

2 words are similar if they
co-occur with similar words

First Level Relatedness –
Words that appear together
in the same sentence

2 words are similar if they occur in a given
grammatical relation with the same words
Harvest Peel Eat Slice

Second Level Relatedness –
words that co-occur with the
same ‘other’ words
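The two levels of relatedness can be sketched with plain sets. This is only an illustration on four made-up sentences: first level is sharing a sentence, second level is sharing the same neighbouring words without ever appearing together.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence(sentences):
    """First level: for each word, the set of words sharing a sentence with it."""
    neighbours = defaultdict(set)
    for sentence in sentences:
        words = set(sentence.lower().split())
        for a, b in combinations(words, 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours

def shared_neighbours(neighbours, a, b):
    """Second level: words that co-occur with both a and b."""
    return (neighbours[a] & neighbours[b]) - {a, b}

# Hypothetical mini corpus.
sentences = [
    "peel the banana",
    "peel the orange",
    "slice the banana",
    "slice the orange",
]
neighbours = cooccurrence(sentences)
print(shared_neighbours(neighbours, "banana", "orange"))
```

Here "banana" and "orange" never share a sentence, yet they keep the same company ("peel", "slice"), which is the second level signal.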

WordSim353
Dataset
Some words
with high
similarity or
relatedness
Word 1 Word 2 Human (mean)
tiger tiger 10
fuck sex 9.44
journey voyage 9.29
midday noon 9.29
dollar buck 9.22
money cash 9.15
coast shore 9.1
money cash 9.08
money currency 9.04
football soccer 9.03
magician wizard 9.02
type kind 8.97
gem jewel 8.96
car automobile 8.94
street avenue 8.88
asylum madhouse 8.87
boy lad 8.83
environment ecology 8.81
furnace stove 8.79
seafood lobster 8.7
mile kilometer 8.66
Maradona football 8.62
OPEC oil 8.59
king queen 8.58
murder manslaughter 8.53
money bank 8.5
computer software 8.5
Jerusalem Israel 8.46
vodka gin 8.46
planet star 8.45

A Moving Word ‘Context
Window’

Typical window size might be 5
Source Text
Writing a list of random sentences is harder than I initially thought it would be
11 words (5 left and 5 right of the moving target word)

Vector representations of words (Word Vectors)

Nearest Neighbours (Similarity)
Evaluations
KNN – K-Nearest-Neighbour
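A nearest neighbour lookup over word vectors can be sketched as below. The three-dimensional vectors are invented for the example (real embeddings have hundreds of dimensions and are learned from a corpus), and cosine similarity is used as the closeness measure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, vectors, k=2):
    """Return the k words whose vectors are most similar to `word`."""
    scored = [
        (cosine(vectors[word], vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    return [other for _, other in sorted(scored, reverse=True)[:k]]

# Hypothetical toy embeddings, not taken from any trained model.
vectors = {
    "coast": [0.9, 0.1, 0.0],
    "shore": [0.8, 0.2, 0.0],
    "beach": [0.7, 0.3, 0.1],
    "cabbage": [0.0, 0.1, 0.9],
}
print(nearest("coast", vectors, k=2))
```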

Continuous Bag of Words (CBOW)
Takes a continuous bag of
words with no inherent context and uses a
context window of size n (an n-gram)
to ascertain which words are
similar or related, using Euclidean
distances to create vector
models and word embeddings

The opposite of CBOW (continuous bag of words)
Skip-gram model

Both models learn
the weights of the
similarity and
relatedness
distances

Vector
space
models are
being
expanded
beyond
Word2Vec
Word2Vec
Doc2Vec
Sentence2Vec
Paragraph2Vec

Word2Vec
Single words
Word embeddings
01
Doc2Vec
Words & meta data
Word embeddings
Document embeddings
02
Sentence2Vec
Chunks of words
More context available
03
Paragraph2Vec
Full paragraphs
Even more context and
semantics
04

In order to understand what
words in documents constitute
‘relevance’ to a query

Testing Similarity and Relatedness
http://ws4jdemo.appspot.com

GloVe: Global
Vectors for
Word
Representation
• What is GloVe?
• “GloVe is an unsupervised learning
algorithm for obtaining vector
representations for words. Training is
performed on aggregated global word-word
co-occurrence statistics from a corpus, and
the resulting representations showcase
interesting linear substructures of the word
vector space.”
• (https://nlp.stanford.edu/projects/glove/)

Linear
Substructures
in GloVe
• Sometimes more than one word pair is needed to understand
meaning
• Particularly when the words are opposites of each other (e.g.
man and woman)
• By adding additional word pairs, further semantic hints provide
context to understand the meaning of concepts
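The idea of linear substructures is often shown with vector arithmetic on word pairs. The sketch below uses hand-made two-dimensional vectors (one axis loosely standing for royalty, the other for gender), so it illustrates the arithmetic only and is not GloVe output.

```python
import math

# Hypothetical toy vectors: (royalty, gender).
vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "cabbage": (-1.0, 0.1),
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c, vectors):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the inputs."""
    target = tuple(
        vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    )
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# "man is to king as woman is to ...?"
result = analogy("man", "king", "woman", vectors)
print(result)
```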

GloVe: Nearest Neighbour
Cosine Similarity
• Nearest words to Frog
• https://nlp.stanford.edu/projects/glove/
Glove2Vec

Concept2Vec
Ontological
concepts

Concept Graphs Using Relatedness

Wikipedia is a gold mine for IR researchers – each page is considered a concept

Similarity and
Relatedness
Similarity – words that mean the
same or nearly the same
Relatedness - Words that live
together within a topic / co-occur
in the same corpora / collection /
sub-section of a collection

‘Part of
Speech’
(POS)
tagging

A website is NOT unstructured data
It has a hierarchy
It has weighted
sections
It has metadata
It (often) has a
tree like
structure

• BM25
• BM25+
• BM25L
• OKAPI BM25
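For reference, a minimal sketch of Okapi BM25 scoring over a tiny made-up collection. The parameter values k1=1.5 and b=0.75 are common defaults rather than anything stated in the deck, and the IDF uses a commonly seen smoothed form that never goes negative.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenised document against the query terms with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates, and long documents are penalised.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical collection.
docs = [
    "the jaguar is a big cat".split(),
    "the jaguar car is in the garage".split(),
    "the garage is empty".split(),
]
scores = bm25_scores(["jaguar", "cat"], docs)
print(scores)
```

The first document matches both query terms and is shorter than the second, so it should come out on top; the third matches nothing and scores zero.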

Probably
BM25F is
used for
web pages
BM25F allows for web pages which have
structure compared with normal flat text
output (e.g. from text files) (additional fields)
Takes into consideration elements such as
page title, meta data, sections, footers,
headers, anchor text
Adds weights for different elements on a
page
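The field weighting idea behind BM25F can be sketched as follows: term frequencies from each field are length-normalised, multiplied by a field weight, summed, and only then passed through the saturation step. The field names, weights and lengths here are assumptions for illustration, and a single b is used for every field although BM25F allows one per field.

```python
def bm25f_term_weight(field_tfs, field_lens, avg_field_lens, weights, b=0.75):
    """Combine per-field term frequencies into one weighted pseudo-frequency."""
    total = 0.0
    for field, tf in field_tfs.items():
        # Length normalisation is applied per field.
        norm = 1 - b + b * field_lens[field] / avg_field_lens[field]
        total += weights[field] * tf / norm
    return total

def bm25f_term_score(pseudo_tf, idf, k1=1.2):
    """Saturate the combined frequency and scale it by the term's IDF."""
    return idf * pseudo_tf / (k1 + pseudo_tf)

# Hypothetical weights and lengths: a title hit counts three times a body hit.
weights = {"title": 3.0, "body": 1.0, "anchor": 2.0}
avg_lens = {"title": 8, "body": 400, "anchor": 20}
page_lens = {"title": 8, "body": 400, "anchor": 20}

in_title = bm25f_term_weight({"title": 1}, page_lens, avg_lens, weights)
in_body = bm25f_term_weight({"body": 1}, page_lens, avg_lens, weights)
print(bm25f_term_score(in_title, idf=1.0), bm25f_term_score(in_body, idf=1.0))
```

With these assumed weights, one occurrence in the title contributes more than one occurrence in the body, which is the behaviour the slide describes.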

Anchor text is
included in
BM25F

Semi-
structured
data
• Hierarchical nature of a
website
• Tree structure
• Well sectioned and
including clear containers
and meta headings
• An ontology map between
semi and structured

Lexical
‘nyms’
Antonym – The
opposite meaning
Synonym – The same
meaning
Meronym – Part of
something else (whole)
– e.g. finger (hand)
(Part / whole relations)
Hyponym – A subset of
something else – e.g.
fork (cutlery)
Hypernym – A superset
(superordinate) – e.g.
colour hypernym

TF:IDF
Term frequency:
Inverse document
frequency
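A minimal sketch of the calculation, assuming the simplest textbook variant (relative term frequency multiplied by the log of inverse document frequency). The documents are invented; whichever collection is passed in decides what counts as a common word.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    weighted = []
    for d in docs:
        counts = Counter(d)
        weighted.append({
            term: (count / len(d)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weighted

# Hypothetical collection.
docs = [
    "the jaguar is a big cat".split(),
    "the jaguar is in the garage".split(),
    "the garage is empty".split(),
]
weights = tf_idf(docs)
print(weights[0])
```

Words that appear in every document ("the", "is") end up with a weight of zero, while a word unique to one document ("cat") gets the highest weight there.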

TF:IDF LOCAL v GLOBAL?
(The whole document
collection)
Across your site??
Across all documents
relevant for the topic??

Keyword
Stuffing or
TF:IDF Weights

“Easter” Query Intent Shift

Predicting the future
with Web Dynamics
• The journey to predict the future: Kira Radinsky at
TEDxHiriya

Find out what correlates and when

How can we improve
our topical relatedness?

Cancel your noise
Unstructured data is
voluminous
Filled with
irrelevance
Lacks focus
Riddled with
stopwords
Lots of meaningless
text and further
ambiguating jabber

Disambiguate lean content
with powerful structured
data

And strong linking nearest
neighbour topically rich pages

Use well
organised
hyponyms
(Hyponymy and
Hypernymy)
Simple unordered list with children:
• Cutlery (hypernym)
  • Spoons (hyponym + hypernym)
    • Dessert (co-hyponym)
    • Tea (co-hyponym)
    • Table (co-hyponym)
  • Forks (hyponym + hypernym)
  • Knives (hyponym + hypernym)
    • Carving (co-hyponym)
    • Steak (co-hyponym)
    • Butchers (co-hyponym)

Image alt tags (and image title
tags) help with disambiguation too

Stemming &
Lemmatization
Both aim to take a word back to its common
base form
Avoid keyword
stuffing… be aware of
stemming and
lemmatization

Tables are relational databases
too – use liberally (with headers)
ID  Event Name          Event Type  Event City  Event Country
--  ------------------  ----------  ----------  -------------
1   Ungagged Las Vegas  Conference  Las Vegas   US
2   Ungagged London     Conference  London      UK
3   State of Digital    Conference  London      UK
4   Brighton SEO        Conference  Brighton    UK

Stay in your
topical lane
Topical drift / dilution is
a big problem

Merge content but watch out for topical
dilution – what did Wikipedia redirect?
The whole is greater
than the sum of its
parts

Check Wikipedia
redirects for your niche

Wikipedia redirects
• Dbo:wikiPageRedirects

In theory… the consolidated page should rank
higher… but…

Throw the words into a word cloud

So the most
prominent
topics &
nuances appear

Watch out for topic dilution /
drift in user generated content

Educate crazy taggers but not before you’ve used their topic tags to fix
dilution

All the anchors &
contextual &
navigational
internal linking

Even if it is just a breadcrumb trail

Sources and References
• Kira Radinsky Tedx Talk -
https://www.youtube.com/watch?v=gAifa_CVGCY
• Stop Word Library Example -
https://sites.google.com/site/kevinbouge/stopwords-lists
• Image credit: Bird, Steven, Edward Loper and Ewan Klein
(2009), Natural Language Processing with Python. O’Reilly Media Inc.
• The work of - Radinsky, K., 2012, December. Learning to predict the
future using Web knowledge and dynamics. In ACM SIGIR Forum(Vol.
46, No. 2, pp. 114-115). ACM.
• http://9ol.es/porter_js_demo.html

Sources and References
• Lohar, P., Ganguly, D., Afli, H., Way, A. and Jones, G.J., 2016. FaDA:
Fast document aligner using word embedding. The Prague Bulletin of
Mathematical Linguistics, 106(1), pp.169-179.
• https://en.wikipedia.org/wiki/List_of_locations_in_Australia_with_an_English_name
• Barbara Plank | Keynote - Natural Language Processing: -
https://www.youtube.com/watch?v=Wl6c0OpF6Ho

Further Reading
• https://github.com/Hironsan/awesome-embedding-models
• https://nlp.stanford.edu/IR-book/html/htmledition/document-representations-and-measures-of-relatedness-in-vector-spaces-1.html
• https://www.youtube.com/watch?time_continue=790&v=wI5O-lYLBCw
• https://en.wikipedia.org/wiki/Euclidean_distance
• Ibrahim, O.A.S. and Landa-Silva, D., 2016. Term frequency with
average term occurrences for textual information retrieval. Soft
Computing, 20(8), pp.3045-3061.

Further Reading
• Lotfi, A., Bouchachia, H., Gegov, A., Langensiepen, C. and McGinnity,
M., 2018. Advances in Computational Intelligence
Systems. Intelligence.
• https://www.researchgate.net/post/What_is_the_difference_between_TFIDF_and_term_distribution_for_feature_selection
• https://radimrehurek.com/gensim/models/word2vec.html
• McDonald, R., Brokos, G.I. and Androutsopoulos, I., 2018. Deep
relevance ranking using enhanced document-query interactions. arXiv
preprint arXiv:1809.01682.

Further Reading
Boyd-Graber, J., Hu, Y. and Mimno, D., 2017. Applications of topic
models. Foundations and Trends® in Information Retrieval, 11(2-3),
pp.143-296.
https://nlp.stanford.edu/projects/glove/
https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html
Sherkat, E. and Milios, E.E., 2017, June. Vector embedding of wikipedia
concepts and entities. In International conference on applications of
natural language to information systems (pp. 418-428). Springer, Cham.

Precision and Recall
GOLD
STANDARD
Lots of results inaccurately
deemed highly relevant
retrieved. Lots of results
inaccurately deemed
irrelevant not retrieved
Maybe not enough relevant
documents to fetch much
here
Lots of documents came back
but not enough highly
relevant
Many highly relevant docs to
meet informational need
returned
Results were highly relevant
but not enough came back
Maybe being ‘too picky’
A wide net was cast but not
many of the right type of
fishes caught
https://medium.com/@klintcho/explaining-precision-and-recall-c770eb9c69e9
Manning, C.D. and Schütze, H., 1999. Foundations of Statistical Natural Language Processing. MIT Press.

Canadian towns using English names

You should be aware
‘indexing’ and
‘ranking’ are two very
separate things

Example context window size 3
Source text: The quick brown fox jumps over the lazy dog
Target  Training samples
------  ----------------
the     (the, quick) (the, brown) (the, fox)
quick   (quick, the) (quick, brown) (quick, fox) (quick, jumps)
(and so on for each subsequent target word)

Stemming
(Popular stemmer is PorterStemmer)
• https://github.com/johnpcarty/mysql-porter-stemmer/blob/master/porterstemmer.sql
Conditions Suffix Replacement Examples
-------------------------- ------- ------------- -----------------------
(m>1) al NULL revival -> reviv
(m>1) ance NULL allowance -> allow
(m>1) ence NULL inference -> infer
(m>1) er NULL airliner -> airlin
(m>1) ic NULL gyroscopic -> gyroscop
(m>1) able NULL adjustable -> adjust
(m>1) ible NULL defensible -> defens
(m>1) ant NULL irritant -> irrit
(m>1) ement NULL replacement -> replac
(m>1) ment NULL adjustment -> adjust
(m>1) ent NULL dependent -> depend
(m>1 and (*<S> or *<T>)) ion NULL adoption -> adopt
(m>1) ou NULL homologou -> homolog
(m>1) ism NULL communism -> commun
(m>1) ate NULL activate -> activ
(m>1) iti NULL angulariti -> angular
(m>1) ous NULL homologous -> homolog
(m>1) ive NULL effective -> effect
(m>1) ize NULL bowdlerize -> bowdler
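The rule table above can be sketched in a few lines. This covers only this one suffix-stripping step, not the full Porter stemmer; m is the "measure" of the remaining stem (roughly, how many vowel-then-consonant sequences it contains), and the function names are made up for the example.

```python
SUFFIXES = ["al", "ance", "ence", "er", "ic", "able", "ible", "ant", "ement",
            "ment", "ent", "ion", "ou", "ism", "ate", "iti", "ous", "ive", "ize"]

def is_consonant(word, i):
    """Porter's definition: 'y' is a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-to-consonant transitions, i.e. the m in (m>1)."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        vowel = not is_consonant(stem, i)
        if prev_vowel and not vowel:
            m += 1
        prev_vowel = vowel
    return m

def strip_suffix(word):
    """Remove the longest matching suffix if the remaining stem has m > 1."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # 'ion' is only removed when the stem ends in 's' or 't'.
            ion_ok = suffix != "ion" or stem.endswith(("s", "t"))
            if measure(stem) > 1 and ion_ok:
                return stem
            return word  # longest suffix matched but the condition failed
    return word

for word in ["revival", "allowance", "adoption", "adjustment"]:
    print(word, "->", strip_suffix(word))
```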

WordSim353
Dataset
Some words
with low
similarity or
relatedness
Word 1 Word 2 Human (mean)
king cabbage 0.23
professor cucumber 0.31
chord smile 0.54
noon string 0.54
rooster voyage 0.62
sugar approach 0.88
stock jaguar 0.92
stock life 0.92
monk slave 0.92
lad wizard 0.92
delay racism 1.19
stock CD 1.31
drink ear 1.31
stock phone 1.62
holy sex 1.62
production hike 1.75
precedent group 1.77
stock egg 1.81
energy secretary 1.81
month hotel 1.81
forest graveyard 1.85
cup substance 1.92
possibility girl 1.94
cemetery woodland 2.08
glass magician 2.08
cup entity 2.15
Wednesday news 2.22
direction combination 2.25

Coast and
Shore Example
• Coast and shore have a similar meaning
• They co-occur in first and second level
relatedness documents in a collection
• They would receive a high score in
similarity

SPARQL Editor &
WikiPageRedirects

There are 68
variants for
mobile
phone
redirected
on Wikipedia
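One way to list redirects yourself is a SPARQL query against DBpedia using the dbo:wikiPageRedirects property. The sketch below only builds the query and request URL; the public endpoint address, the JSON format parameter and the resource name Mobile_phone are assumptions, and the number of results returned today may differ from the figure on the slide.

```python
from urllib.parse import urlencode

def redirects_query(resource):
    """Build a SPARQL query listing pages that redirect to a DBpedia resource."""
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "PREFIX dbr: <http://dbpedia.org/resource/>\n"
        "SELECT ?redirect WHERE {\n"
        f"  ?redirect dbo:wikiPageRedirects dbr:{resource} .\n"
        "}"
    )

query = redirects_query("Mobile_phone")

# Assumed endpoint: the public DBpedia SPARQL service.
url = "https://dbpedia.org/sparql?" + urlencode(
    {"query": query, "format": "application/sparql-results+json"}
)
print(query)
```

The same query text can be pasted straight into a SPARQL editor instead of being sent from code.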

Context window (n = window size = 2) (2 words either side)
Source text: The quick brown fox jumps over the lazy dog
Target  Training samples
------  ----------------
the     (the, quick) (the, brown)
quick   (quick, the) (quick, brown) (quick, fox)
brown   (brown, the) (brown, quick) (brown, fox) (brown, jumps)
fox     (fox, quick) (fox, brown) (fox, jumps) (fox, over)
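The training samples for a moving context window can be generated with a short loop. This is a sketch of the pairing step only (the part that feeds a skip-gram style model), not of the model training itself; the function name is made up.

```python
def training_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
pairs = training_pairs(tokens, window=2)
for pair in pairs[:5]:
    print(pair)
```

Changing `window` to 3 widens the net to three words either side of each target word.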

Other areas of IR Including… but not limited to
Mobile IR
Contextual IR
Natural language processing
Conversational search
Similarity search
Recommender systems
Image IR
Music IR

Recall and Precision in IR
Precision is the
share of returned
results that are
relevant to the query
Recall is the share
of all relevant
documents that were
returned for the query
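Both measures reduce to simple set arithmetic once you know which documents were returned and which were actually relevant. The document IDs below are invented for the example.

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved docs that are relevant.
    Recall: share of relevant docs that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result set and relevance judgements.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5", "d6", "d7"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.5 0.4
```

Two of the four returned documents are relevant (precision 0.5), but only two of the five relevant documents came back (recall 0.4).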

Likely based on
co-occurrence
data
https://slideplayer.com/slide/13138343/ - Query Expansion and Relevance Feedback
Increases recall (more
results) but may reduce
precision
Query Expansion Example

Increase Recall – 2 Main Methods
Query Relaxation
Take bits away from the query
• Ignore stop words in query
• Relax specificity (remove specific)
• Use a lexical database (a “knowledge
graph”, e.g. WordNet) to find a more
general term (hypernym - superset)
• Use Part of Speech Tagger to identify
structure of query & expand nouns
(things)
• Preserve head noun and strip
modifiers (e.g. dog)
• Use Word2Vec to identify semantics
from a vector space using word
embeddings from the query – find
semantics (related and similar)
Query Expansion / Query Rewriting
Add bits to the query
• Broaden the query
• Expand abbreviations
• Stemming and lemmatization (in
reverse)
• Use Word2Vec to find abbreviations
in a vector space using semantic
similarity
• Use synonyms (same / very similar
meanings)
• Use minimum cosine similarity from
Word2Vec as safety net
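A toy sketch of the two ideas: relaxation removes stop words from the query, and expansion adds synonyms. The stop word set and synonym map are tiny hypothetical stand-ins for a real stop word library and a lexical database or embedding model.

```python
# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"a", "an", "the", "in", "of", "for", "to"}
SYNONYMS = {"car": ["automobile"], "cheap": ["inexpensive"]}

def relax(query):
    """Query relaxation: drop stop words from the query."""
    return [t for t in query.lower().split() if t not in STOP_WORDS]

def expand(terms):
    """Query expansion: add known synonyms after each term."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

terms = expand(relax("a cheap car in London"))
print(terms)
```

A document that only says "inexpensive automobile" can now match the query, which raises recall, at the risk of pulling in less precise results.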

Most popular word embedding
tool – probably Word2Vec (new
ones are emerging)

Stemming &
Lemmatization
Stemming
• Runs a series of rules to
chop known ‘stems’ off the
end of words
• Suffix stripping algorithm
• Often leaves incorrect
endings/ crude performance
(understemming)
• Popular – PorterStemmer
(Martin Porter)
• Example: “alumnus” ->
“alumnu”
Lemmatization
• Aims to do things properly
• Tool from natural language
processing
• Needs a complete
vocabulary and
morphological analysis to
work well
• Also not perfect
• Relies on lexical knowledge
base like WordNet to correct
base form

Part of Speech Tags (Python NLTK Library)
• NNPS proper noun, plural
‘Americans’
• PDT predeterminer ‘all the kids’
• POS possessive ending parent’s
• PRP personal pronoun I, he, she
• PRP$ possessive pronoun my, his,
hers
• RB adverb very, silently,
• RBR adverb, comparative better
• RBS adverb, superlative best
• RP particle give up
• TO to, go ‘to’ the store
• UH interjection, errrrrrrrm
• CC coordinating conjunction
• CD cardinal digit
• DT determiner
• EX existential there (like: “there is” … think
of it like “there exists”)
• FW foreign word
• IN preposition/subordinating conjunction
• JJ adjective ‘big’
• JJR adjective, comparative ‘bigger’
• JJS adjective, superlative ‘biggest’
• LS list marker 1)
• MD modal could, will
• NN noun, singular ‘desk’
• NNS noun plural ‘desks’
• NNP proper noun, singular ‘Harrison’
• VB verb, base form take
• VBD verb, past tense took
• VBG verb, gerund/present
participle taking
• VBN verb, past participle taken
• VBP verb, sing. present, non-3rd
take
• VBZ verb, 3rd person sing.
present takes
• WDT wh-determiner which
• WP wh-pronoun who, what
• WP$ possessive wh-pronoun
whose
• WRB wh-adverb where, when

Stop Word Libraries are Huge

a a's able about above according accordingly across actually after
afterwards again against ain't all allow allows almost alone along
already also although always am among amongst an and another
any anybody anyhow anyone anything anyway anyways anywhere apart appear
appreciate appropriate are aren't around as aside ask asking associated
at available away awfully
‘A’ words in an EN stop word list

• Anaphora resolution (AR) which most commonly appears as pronoun
resolution is the problem of resolving references to earlier or later
items in the discourse.
• Example: "John found the love of his life", where 'his' refers to 'John'
(easily understood by humans but not so much by machines)
Example and definition from: https://nlp.stanford.edu/courses/cs224n/2003/fp/iqsayed/project_report.pdf
Anaphora

Cataphora –
According to
Wikipedia
In linguistics cataphora is the use of an
expression or word that co-refers with a later,
more specific, expression in the discourse.
EXAMPLE: “When he arrived home, John went
to sleep”
HE IS JOHN, BUT JOHN WAS NOT KNOWN
WHEN ‘HE’ WAS REFERRED TO – SO CAUSES
CONFUSION REGARDING WHO ‘HE’ IS

Anaphora and
Coreference
Resolution
• There are some algorithms in place to handle
anaphora resolution
• In conversational search this still struggles after
a few multi-turn questions

NLTK Toolkit
Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.

Chunk =
Usually chunks
of words
Token =
Usually a word

Digital Marketing
Events – Using Tables
• https://research.google.com/tables

Information Architecture for the World Wide Web, 3rd Edition by Louis Rosenfeld, Peter Morville

How can all this help you as an SEO?
• Consider exploring your topical models by using noindex admin only
tag clouds to visualise the topics you see
• Utilise relatedness, considering word pairs likely to appear in WordSim353 or other data
sources
• Pass supporting topical hints to emphasise the meanings in 1st and 2nd
level relatedness
• Consider query intent shift and mobile IR / contextual search as niche
fields of IR
• Utilise semi-structured elements to strengthen unstructured pages in
noisy (particularly longer) pages

How can all this help you as an SEO?
• Further disambiguation measures on locations used in different
countries with same name
• Be consistent in your naming conventions. Refer to Wikipedia if in
doubt – check their redirects on terms
• Other semantic clues for entities with same name – e.g. gender / role
/ location / geographic clues
• Utilise anchors to emphasise further from 2nd level relatedness pages
• Utilise co-occurring terms from databases considered similar /
connected / related
