BERT: Pre-training of Deep
Bidirectional Transformers for
Language Understanding
Devlin et al., 2018 (Google AI Language)
Presenter
Phạm Quang Nhật Minh
NLP Researcher
Alt Vietnam
al+ AI Seminar No. 7
2018/12/21
Outline
• Research context
• Main ideas
• BERT
• Experiments
• Conclusions
Research context
• Language model pre-training has been used to
improve many NLP tasks
• ELMo (Peters et al., 2018)
• OpenAI GPT (Radford et al., 2018)
• ULMFiT (Howard and Ruder, 2018)
• Two existing strategies for applying pre-trained
language representations to downstream tasks
• Feature-based: include pre-trained representations as
additional features (e.g., ELMo)
• Fine-tuning: introduce task-specific parameters and
fine-tune the pre-trained parameters (e.g., OpenAI GPT,
ULMFiT)
Limitations of current techniques
• Language models used in pre-training are unidirectional,
which restricts the power of the pre-trained
representations
• OpenAI GPT uses a left-to-right architecture
• ELMo concatenates separately trained forward and backward language
models
• Solution: BERT (Bidirectional Encoder
Representations from Transformers)
BERT: Bidirectional Encoder
Representations from Transformers
• Main ideas
• Propose a new pre-training objective so that a deep
bidirectional Transformer can be trained
• The “masked language model” (MLM): the objective is to
predict the original word of a masked token based only on its
context
• “Next sentence prediction” (NSP)
• Merits of BERT
• Just fine-tune the BERT model on specific tasks to achieve
state-of-the-art performance
• BERT advances the state-of-the-art for eleven NLP tasks
BERT: Bidirectional Encoder
Representations from Transformers
Model architecture
• BERT’s model architecture is a multi-layer
bidirectional Transformer encoder
• (Vaswani et al., 2017) “Attention is all you need”
• Two models with different sizes were investigated
• BERT_BASE: L=12, H=768, A=12, total parameters = 110M
• (L: number of layers (Transformer blocks), H: hidden size,
A: number of self-attention heads)
• BERT_LARGE: L=24, H=1024, A=16, total parameters = 340M
Differences in pre-training model architectures: BERT,
OpenAI GPT, and ELMo
Transformer Encoders
• The Transformer is an attention-based architecture for NLP
• The Transformer is composed of two parts: an encoding
component and a decoding component
• BERT is a multi-layer bidirectional Transformer encoder
Vaswani et al. (2017) Attention is all you need
(Figure: a stack of encoder blocks applied to the input sequence)
Inside an Encoder Block
Source: https://medium.com/dissecting-bert/dissecting-bert-part-1-d3c3d495cdb3
In the BERT experiments, the number of encoder blocks N was
12 (BERT_BASE) or 24 (BERT_LARGE).
Blocks do not share weights with each other.
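To make the block structure concrete, here is a minimal PyTorch sketch of one encoder block: multi-head self-attention followed by a position-wise feed-forward network, each wrapped in a residual connection and layer normalization. The sizes follow BERT_BASE (H=768, A=12, FFN size 3072 with GELU); the class and variable names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block (illustrative sketch, not the official BERT code)."""
    def __init__(self, hidden_size=768, num_heads=12, ff_size=3072, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hidden_size, num_heads, dropout=dropout)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, ff_size),
            nn.GELU(),
            nn.Linear(ff_size, hidden_size),
        )
        self.norm1 = nn.LayerNorm(hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (seq_len, batch, hidden) -- nn.MultiheadAttention expects sequence-first input
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))   # residual connection + layer norm
        ffn_out = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_out))    # residual connection + layer norm
        return x

# A BERT_BASE-sized encoder stacks 12 such blocks with independent weights
encoder = nn.Sequential(*[EncoderBlock() for _ in range(12)])
x = torch.randn(128, 2, 768)      # (seq_len=128, batch=2, hidden=768)
print(encoder(x).shape)           # torch.Size([128, 2, 768])
```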
Transformer Encoders: Key Concepts
• Self-attention and multi-head self-attention
• Position encoding
• Residual connections and layer normalization
• Position-wise feed-forward network
Self-Attention
Image source: https://jalammar.github.io/illustrated-transformer/
Self-Attention in Detail
• Attention maps a query and a set of key-value pairs
to an output
• The query, keys, values, and output are all vectors
• Use matrices W^Q, W^K, and W^V to project each input
vector x_i into query, key, and value vectors (q_i, k_i, v_i)
• Scaled dot-product attention:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$
where $d_k$ is the dimension of the key vectors
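The formula maps directly to a few lines of code. The sketch below (plain PyTorch; the function name and toy sizes are ours) computes scaled dot-product attention for a set of queries, keys, and values.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V  (illustrative sketch)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (..., seq_q, seq_k)
    weights = F.softmax(scores, dim=-1)                 # attention weights sum to 1 over keys
    return weights @ V                                   # weighted sum of value vectors

# Toy example: 2 tokens, d_k = d_v = 4
X = torch.randn(2, 4)
W_Q, W_K, W_V = (torch.randn(4, 4) for _ in range(3))   # projection matrices
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(scaled_dot_product_attention(Q, K, V).shape)       # torch.Size([2, 4])
```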
Multi-Head Attention
• Self-attention is run several times in parallel with different
projections (e.g., 8 heads: Head #0, Head #1, ..., Head #7)
• The head outputs are concatenated and passed through a linear
projection with a weight matrix W^O
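A compact, self-contained sketch of the multi-head mechanism (head count and sizes follow BERT_BASE; class and variable names are illustrative):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Split the hidden vector into heads, attend per head, concatenate, project with W^O."""
    def __init__(self, hidden_size=768, num_heads=12):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads, self.d_k = num_heads, hidden_size // num_heads
        self.W_Q = nn.Linear(hidden_size, hidden_size)
        self.W_K = nn.Linear(hidden_size, hidden_size)
        self.W_V = nn.Linear(hidden_size, hidden_size)
        self.W_O = nn.Linear(hidden_size, hidden_size)   # output projection after the concat

    def forward(self, x):                                 # x: (batch, seq_len, hidden)
        b, t, h = x.shape
        def split(z):                                     # -> (batch, heads, seq_len, d_k)
            return z.view(b, t, self.num_heads, self.d_k).transpose(1, 2)
        Q, K, V = split(self.W_Q(x)), split(self.W_K(x)), split(self.W_V(x))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
        heads = F.softmax(scores, dim=-1) @ V             # scaled dot-product attention per head
        concat = heads.transpose(1, 2).reshape(b, t, h)   # concatenate the heads
        return self.W_O(concat)

x = torch.randn(2, 16, 768)
print(MultiHeadSelfAttention()(x).shape)                  # torch.Size([2, 16, 768])
```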
Position Encoding
• Position encoding is used to make use of the order of the
sequence
• It is needed because the model contains no recurrence and no convolution
• In Vaswani et al. (2017), the authors used sine and cosine functions of
different frequencies
!"($%&,()) = sin
/01
10000
()
4!"#$%
!"($%&,()56) = cos
/01
10000
()
4!"#$%
• pos is the position and 9 is the dimension
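A sketch of the sinusoidal encodings above (note that BERT itself replaces them with learned position embeddings, as the next slide explains; the function name is ours):

```python
import torch

def sinusoidal_position_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)    # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float)             # even dimensions
    angle = pos / (10000 ** (i / d_model))                         # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe                                                      # added to the token embeddings

print(sinusoidal_position_encoding(512, 768).shape)  # torch.Size([512, 768])
```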
Input Representation
• Token Embeddings: Use pretrained WordPiece embeddings
• Position Embeddings: Use learned Position Embeddings
• Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N.
Dauphin. Convolutional sequence to sequence learning. arXiv preprint
arXiv:1705.03122v2, 2017.
• A sentence (A/B) embedding is added to every token of each sentence
• Use [CLS] as the first token for classification tasks
• Separate sentences with the special token [SEP]
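The input representation is simply the element-wise sum of three embeddings per token. A hedged sketch (the vocabulary size, token ids, and variable names are illustrative, not the released model's values):

```python
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    """Input representation = token + sentence A/B (segment) + position embeddings (sketch)."""
    def __init__(self, vocab_size=30522, hidden_size=768, max_len=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden_size)      # WordPiece token embeddings
        self.segment = nn.Embedding(2, hidden_size)             # sentence A (0) / sentence B (1)
        self.position = nn.Embedding(max_len, hidden_size)      # learned position embeddings

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions))                     # broadcast over the batch

# "[CLS] my dog is cute [SEP] he likes play ##ing [SEP]" -> ids (illustrative numbers)
token_ids = torch.tensor([[101, 1996, 3899, 2003, 10140, 102, 2002, 7777, 2652, 2075, 102]])
segment_ids = torch.tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]])
emb = BertEmbeddings()(token_ids, segment_ids)
print(emb.shape)   # torch.Size([1, 11, 768])
```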
Task#1: Masked LM
• 15% of the tokens in each sequence are masked at random
• and the task is to predict the masked tokens based on their
left and right context
• Not all selected tokens are replaced in the same way
(example sentence: “My dog is hairy”)
• 80% are replaced by the [MASK] token: “My dog is
[MASK]”
• 10% are replaced by a random token: “My dog is
apple”
• 10% are left intact: “My dog is hairy”
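The 80/10/10 corruption scheme is easy to express in code. A hedged sketch (token strings instead of WordPiece ids, simple random draws; not the authors' data pipeline):

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Select ~15% of tokens; replace 80% with [MASK], 10% with a random token, keep 10%."""
    inputs, labels = list(tokens), [None] * len(tokens)   # None = not predicted in the loss
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]") or random.random() >= mask_prob:
            continue
        labels[i] = token                                  # the model must predict the original
        r = random.random()
        if r < 0.8:
            inputs[i] = MASK_TOKEN                         # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.choice(vocab)               # 10%: replace with a random token
        # else: 10% keep the original token unchanged
    return inputs, labels

vocab = ["my", "dog", "is", "hairy", "apple", "the", "store"]
print(mask_tokens("my dog is hairy".split(), vocab))
```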
Task#2: Next Sentence Prediction
• Motivation
• Many downstream tasks depend on understanding the
relationship between two text sentences
• e.g., Question Answering (QA) and Natural Language Inference (NLI)
• Language modeling does not directly capture that
relationship
• The task: pre-train on a binarized next-sentence
prediction task
Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon
[MASK] milk [SEP]
Label = isNext
Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are
flight ##less birds [SEP]
Label = NotNext
Pre-training procedure
• Training data: BooksCorpus (800M words) + English
Wikipedia (2,500M words)
• To generate each training input sequence, sample
two spans of text (A and B) from the corpus
• The combined length is ≤ 512 tokens
• 50% of the time B is the actual next sentence that follows A, and 50%
of the time it is a random sentence from the corpus
• The training loss is the sum of the mean masked
LM likelihood and the mean next-sentence
prediction likelihood
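A hedged sketch of how next-sentence-prediction pairs can be sampled (the document structure and function name are ours, not the paper's pipeline, and corner cases such as sampling B from the same document are ignored):

```python
import random

def make_nsp_example(documents):
    """Build one (sentence_a, sentence_b, is_next) training example for NSP."""
    doc = random.choice(documents)
    idx = random.randrange(len(doc) - 1)
    sentence_a = doc[idx]
    if random.random() < 0.5:                       # 50%: the true next sentence -> IsNext
        return sentence_a, doc[idx + 1], True
    other_doc = random.choice(documents)            # 50%: a random sentence -> NotNext
    return sentence_a, random.choice(other_doc), False

documents = [
    ["the man went to the store", "he bought a gallon of milk"],
    ["penguins are flightless birds", "they live in the southern hemisphere"],
]
print(make_nsp_example(documents))
```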
Fine-tuning procedure
• For sequence-level classification tasks
• Obtain the representation of the input sequence from the
final hidden state at the position of the special
token [CLS]: $C \in \mathbb{R}^{H}$
• Just add a classification layer with parameters $W \in \mathbb{R}^{K \times H}$
and use softmax to calculate label probabilities:
$P = \mathrm{softmax}(CW^{\top})$
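A minimal sketch of this fine-tuning head: take the final hidden state at the [CLS] position and apply a single linear layer plus softmax. The backbone output here is random; in practice it comes from the pre-trained BERT encoder.

```python
import torch
import torch.nn as nn

class SequenceClassifierHead(nn.Module):
    """P = softmax(C W^T): a single classification layer on top of the [CLS] vector."""
    def __init__(self, hidden_size=768, num_labels=2):
        super().__init__()
        self.W = nn.Linear(hidden_size, num_labels)   # parameters W (plus a bias)

    def forward(self, sequence_output):
        cls_vector = sequence_output[:, 0]            # hidden state at the [CLS] position
        return torch.softmax(self.W(cls_vector), dim=-1)

# sequence_output would come from the pre-trained BERT encoder; here it is random
sequence_output = torch.randn(8, 128, 768)            # (batch, seq_len, hidden)
probs = SequenceClassifierHead(num_labels=3)(sequence_output)
print(probs.shape)                                    # torch.Size([8, 3])
```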
Fine-tuning procedure
• For sequence-level classification tasks
• All of the parameters of BERT and W are fine-tuned
jointly
• Most model hyperparameters are the same as in
pre-training
• except the batch size, learning rate, and number of
training epochs
Fine-tuning procedure
• Token tagging task (e.g., Named Entity Recognition)
• Feed the final hidden representation $T_i \in \mathbb{R}^{H}$ for each
token $i$ into a classification layer over the tagset (NER
label set)
• To make the task compatible with WordPiece
tokenization
• Predict the tag for the first sub-token of each word
• No prediction is made for the remaining sub-tokens (marked X)
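A sketch of the tagging setup: a per-token classification layer, with the loss computed only at the first sub-token of each word (the "X" positions are handled with an ignore label here; the tagset size and all names are illustrative).

```python
import torch
import torch.nn as nn

IGNORE = -100   # positions marked "X" (non-first sub-tokens, special tokens) get no loss

class TokenTaggerHead(nn.Module):
    """Per-token classification layer over the NER tagset (illustrative sketch)."""
    def __init__(self, hidden_size=768, num_tags=9):       # e.g., CoNLL-2003 BIO tags
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_tags)

    def forward(self, sequence_output, labels=None):
        logits = self.classifier(sequence_output)           # (batch, seq_len, num_tags)
        if labels is None:
            return logits
        loss = nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), labels.view(-1),
            ignore_index=IGNORE)                             # skip X / special tokens
        return logits, loss

sequence_output = torch.randn(2, 6, 768)
# Label only the first sub-token of each word; other positions are ignored
labels = torch.tensor([[IGNORE, 3, 0, 0, 5, IGNORE],
                       [IGNORE, 0, 0, 0, 0, IGNORE]])
logits, loss = TokenTaggerHead()(sequence_output, labels)
print(logits.shape, loss.item())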
Fine-tuning procedure
Single Sentence Tagging Tasks: CoNLL-2003 NER
Fine-tuning procedure
• Span-level task: SQuAD v1.1
• Input Question:
Where do water droplets collide with ice
crystals to form precipitation?
• Input Paragraph:
.... Precipitation forms as smaller
droplets coalesce via collision with other rain
drops or ice crystals within a cloud. ...
• Output Answer:
within a cloud
Fine-tuning procedure
• Span-level task: SQuAD v1.1
• Represent the input question and paragraph as a single
packed sequence
• The question uses the A embedding and the paragraph uses
the B embedding
• New parameters to be learned in fine-tuning are a start
vector $S \in \mathbb{R}^{H}$ and an end vector $E \in \mathbb{R}^{H}$
• The probability of word $i$ being the start of the
answer span:
$P_i = \dfrac{e^{S \cdot T_i}}{\sum_j e^{S \cdot T_j}}$
(and analogously for the end of the span, using $E$)
• The training objective is the log-likelihood of the correct
start and end positions
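A sketch of the span-prediction head: dot the start and end vectors with every token representation and take a softmax over positions (variable names and the packed sequence length are illustrative).

```python
import torch
import torch.nn as nn

class SpanHead(nn.Module):
    """SQuAD-style head: start/end vectors S, E dotted with each token vector T_i."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.S = nn.Parameter(torch.randn(hidden_size))   # start vector
        self.E = nn.Parameter(torch.randn(hidden_size))   # end vector

    def forward(self, sequence_output):                    # (batch, seq_len, hidden)
        start_logits = sequence_output @ self.S            # S . T_i for every position i
        end_logits = sequence_output @ self.E
        start_probs = torch.softmax(start_logits, dim=-1)  # P_i = e^{S.T_i} / sum_j e^{S.T_j}
        end_probs = torch.softmax(end_logits, dim=-1)
        return start_probs, end_probs

sequence_output = torch.randn(1, 384, 768)   # packed "question [SEP] paragraph" sequence
start_probs, end_probs = SpanHead()(sequence_output)
print(start_probs.argmax(-1), end_probs.argmax(-1))   # predicted start / end positions
```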
Comparison of BERT and OpenAI GPT
• Training data: OpenAI GPT is trained on BooksCorpus (800M words);
BERT on BooksCorpus (800M words) + Wikipedia (2,500M words)
• Special tokens: GPT uses the sentence separator ([SEP]) and
classifier token ([CLS]) only at fine-tuning time; BERT learns
[SEP], [CLS] and sentence A/B embeddings during pre-training
• Pre-training: GPT is trained for 1M steps with a batch size of
32,000 words; BERT for 1M steps with a batch size of 128,000 words
• Fine-tuning learning rate: GPT uses the same learning rate of 5e-5
for all fine-tuning experiments; BERT chooses a task-specific
learning rate that performs best on the development set
Outline
• Research context
• Main ideas
• BERT
• Experiments
• Conclusions
Experiments
• GLUE (General Language Understanding Evaluation)
benchmark
• GLUE distributes canonical Train, Dev, and Test splits
• Labels for the Test set are not provided
• Datasets in GLUE:
• MNLI: Multi-Genre Natural Language Inference
• QQP: Quora Question Pairs
• QNLI: Question Natural Language Inference
• SST-2: Stanford Sentiment Treebank
• CoLA: The Corpus of Linguistic Acceptability
• STS-B: The Semantic Textual Similarity Benchmark
• MRPC: Microsoft Research Paraphrase Corpus
• RTE: Recognizing Textual Entailment
• WNLI: Winograd NLI
GLUE Results
SQuAD v1.1
Reference: https://rajpurkar.github.io/SQuAD-explorer
Named Entity Recognition
SWAG
• The Situations with Adversarial Generations (SWAG)
• The only task-specific parameter is a vector $V \in \mathbb{R}^{H}$
• The probability distribution is the softmax over the four
choices:
$P_i = \dfrac{e^{V \cdot C_i}}{\sum_{j=1}^{4} e^{V \cdot C_j}}$
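A sketch of the multiple-choice scoring: each choice is encoded separately, its [CLS] vector C_i is dotted with the shared vector V, and a softmax over the four scores gives the answer distribution (class and variable names are illustrative).

```python
import torch
import torch.nn as nn

class MultipleChoiceHead(nn.Module):
    """Score each choice with a shared vector V and softmax over the choices (sketch)."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.V = nn.Parameter(torch.randn(hidden_size))

    def forward(self, cls_vectors):             # (batch, num_choices, hidden): one C_i per choice
        scores = cls_vectors @ self.V            # V . C_i for each choice
        return torch.softmax(scores, dim=-1)     # P_i over the four choices

cls_vectors = torch.randn(8, 4, 768)             # [CLS] vectors for 4 choices per example
print(MultipleChoiceHead()(cls_vectors).shape)   # torch.Size([8, 4])
```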
SWAG Result
Ablation Studies
• To understand
• Effect of Pre-training Tasks
• Effect of model sizes
• Effect of number of training steps
• Feature-based approach with BERT
Ablation Studies
• Main findings:
• “Next sentence prediction” (NSP) pre-training is important
for sentence-pair classification tasks
• Bidirectionality (via the MLM pre-training task) contributes
significantly to the performance improvements
Ablation Studies
• Main findings
• Bigger model sizes are better even for small-scale tasks
Conclusions
• Unsupervised pre-training (pre-training language
model) is increasingly adopted in many NLP tasks
• The major contribution of the paper is to propose a
deep bidirectional architecture based on the Transformer encoder
• It advances the state-of-the-art for many important NLP tasks
Links
• TensorFlow code and pre-trained models for BERT:
https://github.com/google-research/bert
• PyTorch Pretrained Bert:
https://github.com/huggingface/pytorch-pretrained-BERT
• BERT-pytorch: https://github.com/codertimo/BERT-pytorch
• BERT-keras: https://github.com/Separius/BERT-keras
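For reference, extracting BERT features with the huggingface pytorch-pretrained-BERT package looks roughly like the sketch below, based on that repository's README at the time; check the current documentation, since the package has since been superseded, and treat the exact call signatures as an assumption.

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

# Load the pre-trained WordPiece tokenizer and the BERT-Base model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

text = "[CLS] my dog is hairy [SEP]"
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
segment_ids = [0] * len(tokens)                      # a single sentence A

tokens_tensor = torch.tensor([token_ids])
segments_tensor = torch.tensor([segment_ids])

with torch.no_grad():
    # encoded_layers: hidden states of all encoder blocks; pooled: the [CLS]-based vector
    encoded_layers, pooled = model(tokens_tensor, segments_tensor)
print(len(encoded_layers), pooled.shape)             # 12 layers, torch.Size([1, 768])
```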
Remark: Applying BERT for non-
English languages
• Pre-trained BERT models are provided for more than
100 languages (including Vietnamese)
• https://github.com/google-research/bert/blob/master/multilingual.md
• Be careful with tokenization!
• For Japanese (and Chinese): “spaces were added
around every character in the CJK Unicode range before
applying WordPiece”, which is not an ideal approach
• Consider using SentencePiece instead:
https://github.com/google/sentencepiece
• We may need to pre-train our own BERT model
References
1. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K.
(2018). BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. arXiv
preprint arXiv:1810.04805.
2. Vaswani et al. (2017). Attention Is All You Need. arXiv
preprint arXiv:1706.03762.
https://arxiv.org/abs/1706.03762
3. The Annotated Transformer:
http://nlp.seas.harvard.edu/2018/04/03/attention.html, by harvardnlp.
4. Dissecting BERT: https://medium.com/dissecting-bert