Natural Language Understanding and Word Sense Disambiguation remain among the prevailing challenges for both spoken and written language. Natural language understanding attempts to untangle the 'hot mess' of words in content into more structured data, but the challenge is far from trivial because language is full of polysemy. Recent developments in machine learning have produced significant leaps forward in understanding context more clearly (and therefore user intent and informational need at the time of a query). Here we will explore these developments and some of their implementations, and seek to understand what this means for search strategists and the brands they support, both now and into the future.
12. #pubcon
Word’s Context
• ”The meaning of a word is its use in a language” (Ludwig Wittgenstein, Philosopher, 1953)
• Image attribution: Moritz Nähr [Public domain]
13. #pubcon
Word’s Context Changes As A Sentence Evolves
• The meaning of a word changes (literally) as a sentence develops
• This is due to the multiple parts of speech a word could have in a given piece of content
14. #pubcon
Like “like”
We can see, in just this short sentence alone, using the Stanford Part of Speech Tagger online, that the word “like” is considered to be two separate parts of speech.
http://nlp.stanford.edu:8080/parser/index.jsp
15. #pubcon
Like “like”
For example: the word ”like” has several possible parts of speech (including ‘verb’, ‘noun’, and ‘adjective’).
POS = Part of Speech
21. #pubcon
Example of Part of Speech (POS) Tagging (sketched in code below)
• Pubcon → NNP (proper noun, singular)
• is → VBZ (verb, 3rd person singular, present)
• a → DT (determiner)
• great → JJ (adjective)
• conference → NN (noun)
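As an illustration of the tagging above, here is a minimal sketch using NLTK's default Penn Treebank tagger, an assumed alternative to the Stanford tagger mentioned earlier (the resource name may vary slightly by NLTK version).

```python
# Minimal POS-tagging sketch with NLTK (assumed setup: nltk installed,
# tagger resource downloaded; resource name may differ in newer versions).
import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = "Pubcon is a great conference".split()
print(nltk.pos_tag(tokens))
# Typically yields Penn Treebank tags:
# [('Pubcon', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('conference', 'NN')]
```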
22. #pubcon
Popular POS (Part of Speech) Tagsets
• Penn Treebank Tagset -> 36 different part-of-speech tags
• CLAWS 7 (C7) Tagset -> 146 different part-of-speech tags
• Brown Corpus Tagset -> 81 different part-of-speech tags
39. #pubcon
They Need ‘Text Cohesion’
Cohesion is the grammatical and lexical linking within a text or sentence that holds a text together and gives it meaning. Without surrounding words, the word “bucket” could mean anything in a sentence.
40. #pubcon
Word’s Company
“You shall know a word by the company it keeps” (John Rupert Firth, Linguist, 1957)
Image attribution: Wikimedia Commons, Public Domain
41. #pubcon
Words That Live Together Are Strongly Connected (see the sketch below)
• Co-occurrence
• Co-occurrence provides context
• Co-occurrence changes a word’s meaning
• Words that share similar neighbours are also strongly connected
• Similarity & relatedness
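To make the co-occurrence idea concrete, here is a minimal sketch that counts which words appear within a small window of one another; the toy corpus and window size are illustrative assumptions, not from the slides.

```python
# Count co-occurrences of words within a fixed window around each target word.
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens)):
            # Look at neighbours up to `window` positions left and right.
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[(tokens[i], tokens[j])] += 1
    return counts

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox sleeps",
]
for pair, n in sorted(cooccurrence_counts(corpus).items()):
    print(pair, n)
```

Words that share similar neighbours (here, "jumps" and "sleeps" both co-occur with "fox") end up connected in exactly the sense the slide describes.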
43. #pubcon
Natural Language Recognition is NOT Understanding
• Natural language understanding requires understanding of context and common-sense reasoning. This is VERY challenging for machines, but largely straightforward for humans.
44. #pubcon
Language models are trained on very large text corpora or collections (loads of words) to learn distributional similarity.
48. #pubcon
Typical window size might be 5
Source Text: “Writing a list of random sentences is harder than I initially thought it would be”
11 words (5 left and 5 right of the moving target word)
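A minimal sketch of that moving 11-word window (5 words to the left and 5 to the right of the target); the sentence is the slide's example, everything else is illustrative.

```python
# Slide the context window across the sentence, one target word at a time.
sentence = ("Writing a list of random sentences is harder "
            "than I initially thought it would be").split()

window = 5
for i, target in enumerate(sentence):
    left = sentence[max(0, i - window):i]     # up to 5 words to the left
    right = sentence[i + 1:i + 1 + window]    # up to 5 words to the right
    print(f"{target}: {left + right}")
```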
49. #pubcon
Example context window size 3
Source Text | Training Samples
The quick brown fox jumps over the lazy dog | (the, quick), (the, brown), (the, fox)
The quick brown fox jumps over the lazy dog | (quick, the), (quick, brown), (quick, fox), (quick, jumps)
The quick brown fox jumps over the lazy dog | Etcetera
The quick brown fox jumps over the lazy dog | Etcetera
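The table above can be reproduced with a few lines of Python. This is a sketch of how (target, context) training samples are generated for a window of size 3, not any particular library's implementation.

```python
# Generate (target, context) pairs for a context window of size 3.
sentence = "the quick brown fox jumps over the lazy dog".split()

window = 3
samples = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            samples.append((target, sentence[j]))

print(samples[:7])
# [('the', 'quick'), ('the', 'brown'), ('the', 'fox'),
#  ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox'), ('quick', 'jumps')]
```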
52. #pubcon
Continuous Bag of Words (CBoW) Method or Skip-gram (the Opposite of CBoW)
Continuous Bag of Words: take a bag of words with no inherent order and use a context window of size n (an n-gram) to ascertain which words are similar or related, using Euclidean distances between vectors to create vector models and word embeddings. Skip-gram inverts this: CBoW predicts a target word from its surrounding context, while Skip-gram predicts the surrounding context words from the target word.
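For a hands-on view, here is a minimal sketch using gensim's Word2Vec, which implements both methods; the toy corpus, vector size, and window are illustrative assumptions.

```python
# Train tiny CBoW and Skip-gram embeddings with gensim (assumes gensim is installed).
from gensim.models import Word2Vec

corpus = [
    "the quick brown fox jumps over the lazy dog".split(),
    "writing a list of random sentences is harder than i thought".split(),
]

cbow = Word2Vec(corpus, vector_size=50, window=5, min_count=1, sg=0)      # sg=0 -> CBoW
skipgram = Word2Vec(corpus, vector_size=50, window=5, min_count=1, sg=1)  # sg=1 -> Skip-gram

# Nearest neighbours in the learned embedding space (meaningless on a toy
# corpus this small, but it shows the API).
print(cbow.wv.most_similar("fox", topn=3))
```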
59. #pubcon
Most language modellers are uni-directional
Source Text: “Writing a list of random sentences is harder than I initially thought it would be”
They can traverse a word’s context window only from left to right or from right to left: in one direction, but not both at the same time.
60. #pubcon
They can only look at the words in the context window before the target word, not at the words in the rest of the sentence, nor at the sentence that follows.
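A toy sketch of the difference: a uni-directional (left-to-right) model only sees the words before the target, while a bi-directional model can use both sides. Purely illustrative.

```python
tokens = "writing a list of random sentences".split()
i = 3  # target word: "of"

left_to_right_context = tokens[:i]                   # what a uni-directional model sees
bidirectional_context = tokens[:i] + tokens[i + 1:]  # what a bi-directional model sees

print(left_to_right_context)   # ['writing', 'a', 'list']
print(bidirectional_context)   # ['writing', 'a', 'list', 'random', 'sentences']
```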
65. #pubcon
NER Example
• E.g. Sentence: “Taylor Swift will launch her new album in Apple Music.”
• NER result: “Taylor[B-PER] Swift[I-PER] will[O] launch[O] her[O] new[O] album[O] in[O] Apple[B-ORG] Music[I-ORG].[O]”
• PS:
[O] means the token is outside any named entity
[B-PER]/[I-PER] mean the beginning/inside of a person name
[B-ORG]/[I-ORG] mean the beginning/inside of an organization name
Source: https://medium.com/@yingbiao/ner-with-bert-in-action-936ff275bc73
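For comparison, here is a minimal NER sketch using spaCy rather than BERT (it assumes spaCy and its small English model 'en_core_web_sm' are installed); the entity labels differ from the BIO tags above, but the idea is the same.

```python
# Run named entity recognition over the example sentence with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Taylor Swift will launch her new album in Apple Music.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. spans labelled PERSON / ORG
```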
71. #pubcon
BERT is different. BERT uses bi-directional language modelling, and was the FIRST to do this.
Source Text: “Writing a list of random sentences is harder than I initially thought it would be”
BERT can see both the left-hand and the right-hand side of the target word.
76. #pubcon
BERT can see the WHOLE sentence on either side of a word (contextual language modelling), and all of the words almost at once.
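One way to see this whole-sentence, bi-directional view in practice is BERT's masked-word prediction. Here is a minimal sketch with the Hugging Face transformers fill-mask pipeline (it assumes transformers is installed and the bert-base-uncased model can be downloaded).

```python
# BERT predicts the [MASK] token using the words on BOTH sides of it.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("Writing a list of random [MASK] is harder than I thought."):
    print(prediction["token_str"], round(prediction["score"], 3))
```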
77. #pubcon
BERT has been pre-trained on a lot of words … on the whole of the English Wikipedia (2,500 million words).
78. #pubcon
Previously Uni-Directional
Previously, all language models were uni-directional, so they could only move the context window in one direction.
A moving window of ‘n’ words (either left or right of a target word) is used to understand a word’s context.
79. #pubcon
Google BERT Paper
• Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
97. #pubcon
MS MARCO: A Human Generated MAchine Reading Comprehension Dataset
• Rajpurkar, P., Zhang, J., Lopyrev, K. and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
107. #pubcon
Algorithmic Bias Concerns
• Ricardo Baeza-Yates' work, “Bias on the Web”
• NoBIAS Project
• IBM initiatives to prevent bias
• BERT does not know why it makes decisions
• BERT is considered a ‘black box’ algorithm
• Programmatic bias is a concern
• The Algorithmic Justice League is active
111. #pubcon
References
• Rajpurkar, P., Zhang, J., Lopyrev, K. and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
• Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).