BERT: Pre-training of Deep
Bidirectional Transformers for
Language Understanding
Devlin et al., 2018 (Google AI Language)
Phạm Quang Nhật Minh
NLP Researcher
Alt Vietnam
al+ AI Seminar No. 7
• Research context
• Main ideas
• Experiments
• Conclusions
12/21/18 al+ AI Seminar No.7 2
Research context
• Language model pre-training has been used to
improve many NLP tasks
• ELMo (Peters et al., 2018)
• OpenAI GPT (Radford et al., 2018)
• ULMFiT (Howard and Ruder, 2018)
• Two existing strategies for applying pre-trained
language representations to downstream tasks
• Feature-based: include pre-trained representations as
additional features (e.g., ELMo)
• Fine-tuning: introduce task-specific parameters and
fine-tune the pre-trained parameters (e.g., OpenAI GPT)
Limitations of current techniques
• Language models used in pre-training are unidirectional,
which restricts the power of the pre-trained
representations
• OpenAI GPT uses a left-to-right architecture
• ELMo concatenates independently trained forward and backward language
models
• Solution: BERT (Bidirectional Encoder
Representations from Transformers)
BERT: Bidirectional Encoder
Representations from Transformers
• Main ideas
• Propose a new pre-training objective so that a deep
bidirectional Transformer can be trained
• The “masked language model” (MLM): the objective is to
predict the original word of a masked word based only on its
context
• “Next sentence prediction”
• Merits of BERT
• Just fine-tune BERT model for specific tasks to achieve
state-of-the-art performance
• BERT advances the state-of-the-art for eleven NLP tasks
BERT: Bidirectional Encoder
Representations from Transformers
Model architecture
• BERT’s model architecture is a multi-layer
bidirectional Transformer encoder
• (Vaswani et al., 2017) “Attention is all you need”
• Two models with different sizes were investigated
• BERT_BASE: L=12, H=768, A=12, Total Parameters=110M
• (L: the number of layers (Transformer blocks), H: the hidden size,
A: the number of self-attention heads)
• BERT_LARGE: L=24, H=1024, A=16, Total Parameters=340M
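The parameter totals above can be sanity-checked with a rough count. The sketch below is not taken from the slides: it assumes a feed-forward size of 4H, a 30,522-entry WordPiece vocabulary, 512 positions, two segment types, and a pooling layer on top of [CLS], which is the layout of the released English checkpoints as far as I know. The function name `bert_param_count` is hypothetical.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522,
                     max_positions=512, type_vocab=2):
    """Rough parameter count for a BERT-style encoder (assumed layout)."""
    ffn = 4 * hidden  # assumption: feed-forward size is 4H
    # Token, position and segment embedding tables, plus one LayerNorm
    # (scale and bias vectors).
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers (H -> 4H -> H), each with a bias.
    feed_forward = hidden * ffn + ffn + ffn * hidden + hidden
    # Two LayerNorms per block, each with scale and bias.
    layer_norms = 2 * (2 * hidden)
    per_layer = attention + feed_forward + layer_norms
    # Assumed dense pooling layer applied to the [CLS] vector.
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * per_layer + pooler


base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BERT_BASE:  {base:,} parameters")
print(f"BERT_LARGE: {large:,} parameters")
```

The number of attention heads A does not appear in the count, because the heads split the same H-dimensional projections rather than adding new ones. Under these assumptions the estimate lands near 110M for BERT_BASE and in the mid-330M range for BERT_LARGE, close to the rounded figures quoted above.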
Differences in pre-training model architectures: BERT,
OpenAI GPT, and ELMo
Transformer Encoders
• Transformer is an attention-based architecture for NLP
• The Transformer is composed of two parts: an encoding
component and a decoding component
• BERT is a multi-layer bidirectional Transformer encoder
Vaswani et al. (2017) Attention is all you need
[Figure: the input sequence is passed through a stack of Encoder Blocks]
Inside an Encoder Block
In the BERT experiments, the number
of blocks N was set to 12 (BERT_BASE)
or 24 (BERT_LARGE).
Blocks do not share weights with
each other.
Transformer Encoders: Key Concepts
[Figure: encoder block diagram; recoverable label: Feed Forward]
Self-Attention in Detail
• Attention maps a query and a set of key-value pairs
to an output
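• The query, keys, values, and output are all vectors
[Figure: each input X1, X2 is projected to a query (q1, q2), a key (k1, k2), and a value (v1, v2)]
Use matrices W^Q, W^K and W^V to project the input into
query, key and value vectors
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where d_k is the dimension of the key vectors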
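The projection and scaled dot-product steps can be sketched in a few lines of NumPy. This is a single-head illustration under simplifying assumptions (no masking, no batching, random weights); the names `self_attention`, `W_q`, `W_k` and `W_v` are chosen here for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over the rows of X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # project inputs
    d_k = K.shape[-1]                          # dimension of the key vectors
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # one row of weights per query
    return weights @ V, weights


rng = np.random.default_rng(0)
X = rng.normal(size=(2, 4))                    # two tokens, model size 4
W_q, W_k, W_v = (rng.normal(size=(4, 3)) for _ in range(3))
output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape)          # one output vector per input token
print(weights.sum(axis=1))   # each row of attention weights sums to 1
```

Each output row is a weighted average of the value vectors, with weights determined by how well that token's query matches every key.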
Multi-Head Attention
[Figure: the outputs of Head #0, Head #1, ..., Head #7 are concatenated and passed through a linear projection that uses a weight matrix W^O]
Position Encoding
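• Position encoding lets the model make use of the order of the
sequence
• It is needed because the model contains no recurrence and no convolution
• Vaswani et al. (2017) used sine and cosine functions of
different frequencies
$$PE_{(pos,\,2i)} = \sin\left(pos / 10000^{2i/d_{\mathrm{model}}}\right)$$
$$PE_{(pos,\,2i+1)} = \cos\left(pos / 10000^{2i/d_{\mathrm{model}}}\right)$$
• pos is the position, i is the dimension index, and d_model is the model (embedding) size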
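A minimal NumPy sketch of the sinusoidal encoding from Vaswani et al. (2017) follows. It assumes an even model size, and the function name is hypothetical. Note that BERT itself uses learned position embeddings, as the Input Representation slide states.

```python
import numpy as np


def sinusoidal_position_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sine/cosine position encodings.

    Assumes d_model is even, so sine and cosine columns pair up exactly.
    """
    positions = np.arange(max_len)[:, None]        # shape (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even columns: sine
    pe[:, 1::2] = np.cos(angles)                   # odd columns: cosine
    return pe


pe = sinusoidal_position_encoding(max_len=50, d_model=16)
print(pe.shape)
```

Position 0 gets sin(0) = 0 in every even column and cos(0) = 1 in every odd column; later positions rotate through the sinusoids at a different frequency per column pair.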
Input Representation
• Token Embeddings: Use pretrained WordPiece embeddings
• Position Embeddings: Use learned Position Embeddings
• Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N.
Dauphin. Convolutional sequence to sequence learning. arXiv preprint
arXiv:1705.03122v2, 2017.
• A sentence embedding is added to every token of each sentence
• Use [CLS] for the classification tasks
• Separate sentences by using a special token [SEP]
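The packing of a sentence pair into one input sequence can be sketched as follows. The helper `pack_pair` is hypothetical and works on already-tokenized input; in the model, each position's input vector would be the sum of the token, sentence (segment), and position embedding looked up from these three id lists.

```python
def pack_pair(tokens_a, tokens_b):
    """Build [CLS] A [SEP] B [SEP] with matching segment and position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 (sentence A) covers [CLS], A and the first [SEP];
    # segment 1 (sentence B) covers B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = pack_pair(
    ["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"]
)
for tok, seg, pos in zip(tokens, segment_ids, position_ids):
    print(f"{pos:2d}  segment {'AB'[seg]}  {tok}")
```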
Task#1: Masked LM
• 15% of the words are masked at random
• The task is to predict the masked words based on their
left and right context
• Not all selected tokens are masked in the same way
(example sentence: “My dog is hairy”)
• 80% are replaced by the [MASK] token: “My dog is [MASK]”
• 10% are replaced by a random token: “My dog is apple”
• 10% are left intact: “My dog is hairy”
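The 80/10/10 rule can be sketched as below. This is a simplified illustration, not the authors' data pipeline: it selects each token independently with probability 0.15 (so the masked fraction is 15% only on average), and the toy vocabulary and the function name `mask_tokens` are hypothetical.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted tokens, {position: original token to predict})."""
    rng = rng or random.Random()
    masked = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        # Never select the special tokens; select others with prob. mask_prob.
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue
        targets[i] = tok                   # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, targets


tokens = ["[CLS]", "my", "dog", "is", "hairy", "[SEP]"]
vocab = ["apple", "car", "run", "blue"]    # toy vocabulary for illustration
masked, targets = mask_tokens(tokens, vocab, rng=random.Random(0))
print(masked, targets)
```

Positions that are selected but left intact still appear in `targets`, so the model cannot tell from the input alone which tokens it will be asked to predict.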
Task#2: Next Sentence Prediction
• Motivation
• Many downstream tasks are based on understanding the
relationship between two text sentences
• Question Answering (QA) and Natural Language Inference (NLI)
• Language modeling does not directly capture that
• BERT is therefore also pre-trained on a binarized next sentence
prediction task
Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon
[MASK] milk [SEP]
Label = IsNext
Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are
flight ##less birds [SEP]
Label = NotNext
Pre-training procedure
• Training data: BooksCorpus (800M words) + English
Wikipedia (2,500M words)
• To generate each training input sequence: sample
two spans of text (A and B) from the corpus
• The combined length is ≤ 512 tokens
• 50% of the time B is the actual next sentence that follows A, and 50%
of the time it is a random sentence from the corpus
• The training loss is the sum of the mean masked
LM likelihood and the mean next sentence
prediction likelihood
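The 50/50 sampling of sentence pairs can be sketched as follows. This is an illustration only: it assumes the corpus is a list of documents, each with at least two sentences, and it draws the random sentence from a different document, which is a choice made here rather than something the slides specify. The function name and toy corpus are hypothetical.

```python
import random


def make_nsp_example(documents, rng):
    """Sample (sentence A, sentence B, label) for next sentence prediction.

    documents: list of documents, each a list of at least two sentences.
    """
    doc_idx = rng.randrange(len(documents))
    doc = documents[doc_idx]
    i = rng.randrange(len(doc) - 1)        # leave room for a next sentence
    sentence_a = doc[i]
    if rng.random() < 0.5:
        # Half of the time B is the sentence that actually follows A.
        return sentence_a, doc[i + 1], "IsNext"
    # Otherwise B is a random sentence from a different document.
    other_idx = rng.choice([j for j in range(len(documents)) if j != doc_idx])
    return sentence_a, rng.choice(documents[other_idx]), "NotNext"


documents = [
    ["the man went to the store", "he bought a gallon of milk"],
    ["penguins are flightless birds", "they live in the southern hemisphere"],
]
rng = random.Random(0)
for _ in range(3):
    print(make_nsp_example(documents, rng))
```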
Fine-tuning procedure
• For sequence-level classification task
• Obtain the representation of the input sequence by using the
final hidden state (hidden state at the position of the special
token [CLS]) $C \in \mathbb{R}^{H}$
• Just add a classification layer and use softmax to calculate
label probabilities. Parameters $W \in \mathbb{R}^{K \times H}$, where K is the number of labels
$$P = \mathrm{softmax}(C W^{T})$$
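The classification layer amounts to one matrix product followed by a softmax. The sketch below uses random stand-ins for the [CLS] vector and the weight matrix, so the probabilities are meaningless; it only illustrates the shapes. The function name `classify` is hypothetical.

```python
import numpy as np


def classify(C, W):
    """Label probabilities from the [CLS] vector C (H,) and weights W (K, H)."""
    logits = C @ W.T                 # one score per label, shape (K,)
    logits = logits - logits.max()   # stabilize the exponentials
    exp = np.exp(logits)
    return exp / exp.sum()


rng = np.random.default_rng(0)
H, K = 768, 3                        # hidden size and number of labels
C = rng.normal(size=H)               # stand-in for the final [CLS] hidden state
W = rng.normal(size=(K, H)) * 0.02   # stand-in for the new task parameters
P = classify(C, W)
print(P, P.sum())
```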
Fine-tuning procedure
• For sequence-level classification task
• All of the parameters of BERT and W are fine-tuned
• Most model hyperparameters are the same as in pre-training,
except the batch size, learning rate, and number of
training epochs
Fine-tuning procedure
• Token tagging task (e.g., Named Entity Recognition)
• Feed the final hidden representation $T_i \in \mathbb{R}^{H}$ for each
token i into a classification layer for the tagset (NER
label set)
• To make the task compatible with WordPiece tokenization
• Predict the tag for the first sub-token of a word
• No prediction is made for the remaining sub-tokens (labeled X)
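Aligning word-level tags with sub-tokens can be sketched as below. The `toy_tokenize` function is a hypothetical stand-in for a real WordPiece tokenizer, with a hard-coded split table for the example sentence used in the BERT paper.

```python
def align_labels(words, word_labels, tokenize):
    """Give each word's tag to its first sub-token and X to the rest."""
    tokens, labels = [], []
    for word, label in zip(words, word_labels):
        pieces = tokenize(word)
        tokens.extend(pieces)
        # First piece carries the tag; later pieces get X (no prediction).
        labels.extend([label] + ["X"] * (len(pieces) - 1))
    return tokens, labels


def toy_tokenize(word):
    # Hypothetical stand-in for WordPiece: fixed splits for two words.
    table = {"Henson": ["Hen", "##son"], "puppeteer": ["puppet", "##eer"]}
    return table.get(word, [word])


words = ["Jim", "Henson", "was", "a", "puppeteer"]
word_labels = ["I-PER", "I-PER", "O", "O", "O"]
tokens, labels = align_labels(words, word_labels, toy_tokenize)
for tok, lab in zip(tokens, labels):
    print(f"{tok:10s} {lab}")
```

During training and evaluation, positions labeled X would simply be excluded from the loss and from the predictions.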
Fine-tuning procedure
Single Sentence Tagging Tasks: CoNLL-2003 NER
Fine-tuning procedure
• Span-level task: SQuAD v1.1
• Input Question:
Where do water droplets collide with ice
crystals to form precipitation?
• Input Paragraph:
.... Precipitation forms as smaller
droplets coalesce via collision with other rain
drops or ice crystals within a cloud. ...
• Output Answer:
within a cloud
Fine-tuning procedure
• Span-level task: SQuAD v1.1
• Represent the input question and paragraph as a single
packed sequence
• The question uses the A embedding and the paragraph uses
the B embedding
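• New parameters to be learned in fine-tuning are a start
vector $S \in \mathbb{R}^{H}$ and an end vector $E \in \mathbb{R}^{H}$
• Calculate the probability of word i being the start of the
answer span, where T_i is the final hidden vector of token i
$$P_i = \frac{e^{S \cdot T_i}}{\sum_j e^{S \cdot T_j}}$$
• The same formula, with E in place of S, gives the probability of word i being the end of the span
• The training objective is the log-likelihood of the correct
start and end positions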
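The start and end scoring can be sketched with NumPy. The toy hidden states and vectors below are chosen so the result is easy to follow, and the `max_len` cap on span length is an assumption added here, not something stated in the slides. The function name `best_span` is hypothetical.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def best_span(T, S, E, max_len=30):
    """Pick the (start, end) pair with the highest combined score.

    T: (n_tokens, H) final hidden states; S, E: (H,) start and end vectors.
    """
    start_scores, end_scores = T @ S, T @ E      # S·T_i and E·T_i per token
    p_start, p_end = softmax(start_scores), softmax(end_scores)
    best, best_score = (0, 0), -np.inf
    for i in range(len(T)):
        # Only consider spans that end at or after their start.
        for j in range(i, min(i + max_len, len(T))):
            score = start_scores[i] + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, p_start, p_end


T = np.eye(5)                             # toy hidden states: 5 tokens, H = 5
S = np.array([0.0, 0.0, 3.0, 0.0, 0.0])   # start vector favours token 2
E = np.array([0.0, 0.0, 0.0, 2.0, 0.0])   # end vector favours token 3
span, p_start, p_end = best_span(T, S, E)
print(span)
```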
Comparison of BERT and OpenAI GPT
• Training data
OpenAI GPT: trained on BooksCorpus (800M words)
BERT: trained on BooksCorpus (800M words) + Wikipedia (2,500M words)
• Special tokens
OpenAI GPT: uses the sentence separator ([SEP]) and classifier token ([CLS]) only at fine-tuning time
BERT: learns [SEP], [CLS] and sentence A/B embeddings during pre-training
• Training steps and batch size
OpenAI GPT: trained for 1M steps with a batch size of 32,000 words
BERT: trained for 1M steps with a batch size of 128,000 words
• Fine-tuning learning rate
OpenAI GPT: uses the same learning rate of 5e-5 for all fine-tuning experiments
BERT: chooses a task-specific learning rate that performs best on the development set
• Research context
• Main ideas
• Experiments
• Conclusions
• GLUE (General Language Understanding Evaluation)
• Distributes canonical Train, Dev and Test splits
• Labels for Test set are not provided
• Datasets in GLUE:
• MNLI: Multi-Genre Natural Language Inference
• QQP: Quora Question Pairs
• QNLI: Question Natural Language Inference
• SST-2: Stanford Sentiment Treebank
• CoLA: The Corpus of Linguistic Acceptability
• STS-B: The Semantic Textual Similarity Benchmark
• MRPC: Microsoft Research Paraphrase Corpus
• RTE: Recognizing Textual Entailment
• WNLI: Winograd NLI
GLUE Results
SQuAD v1.1
Named Entity Recognition
• The Situations with Adversarial Generations (SWAG)
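• The only task-specific parameter is a vector $V \in \mathbb{R}^{H}$
• The probability distribution is the softmax over the four
choices, where C_i is the final [CLS] representation for choice i
$$P_i = \frac{e^{V \cdot C_i}}{\sum_{j=1}^{4} e^{V \cdot C_j}}$$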
SWAG Result
Ablation Studies
• To understand
• Effect of Pre-training Tasks
• Effect of model sizes
• Effect of number of training steps
• Feature-based approach with BERT
Ablation Studies
• Main findings:
• “Next sentence prediction” (NSP) pre-training is important
for sentence-pair classification tasks
• Bidirectionality (using the MLM pre-training task) contributes
significantly to the performance improvement
Ablation Studies
• Main findings
• Bigger model sizes are better even for small-scale tasks
• Unsupervised pre-training (pre-training a language
model) is increasingly adopted in many NLP tasks
• The major contribution of the paper is a
deep bidirectional architecture based on the Transformer
• It advances the state of the art for many important NLP tasks
• TensorFlow code and pre-trained models for BERT:
• PyTorch Pretrained Bert:
• BERT-pytorch:
• BERT-keras:
Remark: Applying BERT to non-English languages
• Pre-trained BERT models are provided for more than
100 languages (including Vietnamese)
• Be careful with tokenization!!
• For Japanese (and Chinese): “spaces were added
around every character in the CJK Unicode range before
applying WordPiece” => not a good approach
• Use SentencePiece:
• We may need to pre-train a BERT model
1. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K.
(2018). BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. arXiv
preprint arXiv:1810.04805.
2. Vaswani et al. (2017). Attention Is All You Need. arXiv
preprint arXiv:1706.03762.
3. The Annotated Transformer, by harvardnlp.
4. Dissecting BERT:
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdfAndrey Devyatkin
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolsosttopstonverter
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?Alexandre Beguel
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecturerahul_net
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldRoberto Pérez Alcolea
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfRTS corp
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesKrzysztofKkol1
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jNeo4j
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...OnePlan Solutions

Dernier (20)

Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration tools
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository world
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  • 1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Devlin et al., 2018 (Google AI Language) Presenter Phạm Quang Nhật Minh NLP Researcher Alt Vietnam al+ AI Seminar No. 7 2018/12/21
  • 2. Outline • Research context • Main ideas • BERT • Experiments • Conclusions
  • 3. Research context • Language model pre-training has been used to improve many NLP tasks • ELMo (Peters et al., 2018) • OpenAI GPT (Radford et al., 2018) • ULMFiT (Howard and Ruder, 2018) • Two existing strategies for applying pre-trained language representations to downstream tasks • Feature-based: include pre-trained representations as additional features (e.g., ELMo) • Fine-tuning: introduce task-specific parameters and fine-tune the pre-trained parameters (e.g., OpenAI GPT, ULMFiT)
  • 4. Limitations of current techniques • Language models used in pre-training are unidirectional, which restricts the power of the pre-trained representations • OpenAI GPT uses a left-to-right architecture • ELMo concatenates independently trained forward and backward language models • Solution: BERT (Bidirectional Encoder Representations from Transformers)
  • 5. BERT: Bidirectional Encoder Representations from Transformers • Main ideas • Propose a new pre-training objective so that a deep bidirectional Transformer can be trained • The “masked language model” (MLM): the objective is to predict the original word at a masked position based only on its context • “Next sentence prediction” • Merits of BERT • Fine-tuning the BERT model on a specific task is enough to achieve state-of-the-art performance • BERT advances the state of the art for eleven NLP tasks
  • 6. BERT: Bidirectional Encoder Representations from Transformers
  • 7. Model architecture • BERT’s model architecture is a multi-layer bidirectional Transformer encoder • (Vaswani et al., 2017) “Attention is all you need” • Two models with different sizes were investigated • $\mathrm{BERT_{BASE}}$: L=12, H=768, A=12, Total Parameters=110M • (L: the number of layers (Transformer blocks), H: the hidden size, A: the number of self-attention heads) • $\mathrm{BERT_{LARGE}}$: L=24, H=1024, A=16, Total Parameters=340M
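As a rough sanity check on the parameter totals quoted on this slide, the count can be estimated from L and H alone. This is a minimal sketch under assumptions that are not on the slide: a WordPiece vocabulary of 30,522 entries, 512 learned positions, two sentence (segment) embeddings, a feed-forward inner size of 4H, and a final pooling layer. The function name `bert_param_count` is hypothetical.

```python
def bert_param_count(L, H, vocab_size=30522, max_positions=512, type_vocab=2):
    """Estimate the parameter count of a BERT-style encoder.

    Assumptions (not from the slides): vocabulary size, number of positions,
    feed-forward inner size of 4H, and a pooling layer on top of [CLS].
    The number of heads A does not change the count.
    """
    ffn = 4 * H
    # Token, position, and sentence embeddings, plus one LayerNorm (scale + bias).
    embeddings = (vocab_size + max_positions + type_vocab) * H + 2 * H
    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (H * H + H)
    # Two position-wise feed-forward projections with biases.
    feed_forward = (H * ffn + ffn) + (ffn * H + H)
    # Two LayerNorms per block (after attention and after feed-forward).
    layer_norms = 2 * (2 * H)
    per_layer = attention + feed_forward + layer_norms
    pooler = H * H + H
    return embeddings + L * per_layer + pooler


if __name__ == "__main__":
    print("BASE :", bert_param_count(L=12, H=768))
    print("LARGE:", bert_param_count(L=24, H=1024))
```

Under these assumptions the estimates come out at about 109M and 335M, which is consistent with the rounded 110M and 340M figures above.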
  • 8. Differences in pre-training model architectures: BERT, OpenAI GPT, and ELMo
  • 9. Transformer Encoders • The Transformer is an attention-based architecture for NLP • The Transformer is composed of two parts: an encoding component and a decoding component • BERT is a multi-layer bidirectional Transformer encoder • Figure: an input sequence passes through a stack of encoder blocks (Vaswani et al., 2017, “Attention is all you need”)
  • 10. Inside an Encoder Block • Source: bert/dissecting-bert-part-1-d3c3d495cdb3 • In the BERT experiments, the number of blocks N was chosen to be 12 or 24 • Blocks do not share weights with each other
  • 11. Transformer Encoders: Key Concepts • Self-attention • Multi-head self-attention • Position Encoding • Layer Normalization • Residual Connections • Position-wise Feed Forward Network
  • 12. Self-Attention • Image source:
  • 13. Self-Attention in Detail • Attention maps a query and a set of key-value pairs to an output • The query, keys, values, and output are all vectors • Each input ($X_1$, $X_2$) is projected into a query ($q_1$, $q_2$), a key ($k_1$, $k_2$), and a value ($v_1$, $v_2$) vector using the matrices $W^Q$, $W^K$, and $W^V$ • $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ • $d_k$ is the dimension of the key vectors
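The scaled dot-product attention formula on this slide can be sketched in a few lines of plain Python. This is an illustrative sketch only: it works on small lists of lists rather than tensors, and it assumes the queries, keys, and values have already been produced by the projection matrices. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of row vectors; K and V must have the same number of rows.
    """
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return outputs


if __name__ == "__main__":
    Q = [[1.0, 0.0]]
    K = [[1.0, 0.0], [0.0, 1.0]]
    V = [[1.0, 2.0], [3.0, 4.0]]
    print(attention(Q, K, V))
```

When all keys are identical the weights are uniform, so the output is simply the mean of the value vectors; that is a convenient case to check by hand.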
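  • 14. Multi-Head Attention • The inputs X1, X2, ... are processed by parallel attention heads (Head #0, Head #1, ..., Head #7) • The head outputs are concatenated • A linear projection with the weight matrix $W^O$ is applied to the concatenation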
  • 15. Position Encoding • Position encoding lets the model make use of the order of the sequence, since the model contains no recurrence and no convolution • Vaswani et al. (2017) used sine and cosine functions of different frequencies: $PE_{(pos,2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)$ and $PE_{(pos,2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$ • $pos$ is the position and $i$ is the dimension
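The sinusoidal encoding of Vaswani et al. (2017) described on this slide can be computed directly from the two formulas. The sketch below is a plain-Python illustration with a hypothetical function name; note that BERT itself uses learned position embeddings, as the next slide states.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings.

    Even dimensions (2i) use sine and odd dimensions (2i + 1) use cosine,
    both with the same frequency 1 / 10000^(2i / d_model).
    """
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for even_dim in range(0, d_model, 2):  # even_dim plays the role of 2i
            angle = pos / (10000 ** (even_dim / d_model))
            table[pos][even_dim] = math.sin(angle)
            if even_dim + 1 < d_model:
                table[pos][even_dim + 1] = math.cos(angle)
    return table


if __name__ == "__main__":
    for row in positional_encoding(max_len=3, d_model=4):
        print([round(x, 4) for x in row])
```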
  • 16. Input Representation • Token Embeddings: use pretrained WordPiece embeddings • Position Embeddings: use learned position embeddings • Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122v2, 2017. • A sentence embedding is added to every token of each sentence • Use [CLS] for classification tasks • Separate sentences with the special token [SEP]
  • 17. Task#1: Masked LM • 15% of the tokens are masked at random, and the task is to predict the masked tokens based on their left and right context • Not all selected tokens are masked in the same way (example sentence: “My dog is hairy”) • 80% are replaced by the [MASK] token: “My dog is [MASK]” • 10% are replaced by a random token: “My dog is apple” • 10% are left intact: “My dog is hairy”
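The 80% / 10% / 10% masking rule on this slide can be sketched as follows. This is a simplified illustration, not the authors' data pipeline: each token is selected independently with probability 0.15 (rather than selecting a fixed share of positions), the mask token is written `[MASK]` as in the next-sentence examples, and the function name `mask_tokens` and the toy vocabulary are hypothetical.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Apply BERT-style masking to a list of tokens.

    Returns (masked_tokens, labels). labels[i] holds the original token at
    every selected position and None elsewhere; only selected positions
    contribute to the masked LM prediction task.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        r = rng.random()
        if r < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with the mask token
        elif r < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


if __name__ == "__main__":
    sentence = ["my", "dog", "is", "hairy"]
    toy_vocab = ["apple", "cat", "runs", "blue"]
    print(mask_tokens(sentence, toy_vocab, mask_prob=0.5, seed=0))
```

Keeping some selected tokens unchanged, or swapping in a random token, means the model cannot rely on seeing `[MASK]` at every position it must predict.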
  • 18. Task#2: Next Sentence Prediction • Motivation • Many downstream tasks are based on understanding the relationship between two text sentences • Question Answering (QA) and Natural Language Inference (NLI) • Language modeling does not directly capture that relationship • The pre-training task is a binarized next sentence prediction task • Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP] Label = IsNext • Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP] Label = NotNext
  • 19. Pre-training procedure • Training data: BooksCorpus (800M words) + English Wikipedia (2,500M words) • To generate each training input sequence, sample two spans of text (A and B) from the corpus • The combined length is ≤ 512 tokens • 50% of the time B is the actual next sentence that follows A, and 50% of the time it is a random sentence from the corpus • The training loss is the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood
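The 50/50 sampling of sentence pairs for next sentence prediction can be sketched as below. This is a minimal illustration under stated assumptions: the corpus is a list of documents, each a list of sentences; at least two documents are available; and the random sentence is drawn from a different document so that it cannot be the true continuation. The function name `make_nsp_example` is hypothetical, and the length limit and masking steps are left out.

```python
import random


def make_nsp_example(documents, rng):
    """Sample (sentence_a, sentence_b, label) for next sentence prediction.

    documents: list of documents, each a non-empty list of sentences.
    rng: a random.Random instance, passed in for reproducibility.
    """
    # Sentence A must have a successor, so its document needs >= 2 sentences.
    doc_idx = rng.choice([i for i, d in enumerate(documents) if len(d) >= 2])
    doc = documents[doc_idx]
    pos = rng.randrange(len(doc) - 1)
    sentence_a = doc[pos]
    if rng.random() < 0.5:
        # 50%: B is the sentence that actually follows A.
        return sentence_a, doc[pos + 1], "IsNext"
    # 50%: B is a random sentence taken from another document.
    other_idx = rng.choice([i for i in range(len(documents)) if i != doc_idx])
    return sentence_a, rng.choice(documents[other_idx]), "NotNext"


if __name__ == "__main__":
    corpus = [
        ["the man went to the store", "he bought a gallon of milk"],
        ["penguins are flightless birds", "they live in the southern hemisphere"],
    ]
    generator = random.Random(0)
    for _ in range(3):
        print(make_nsp_example(corpus, generator))
```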
  • 20. Fine-tuning procedure • For sequence-level classification tasks • Obtain the representation of the input sequence from the final hidden state at the position of the special token [CLS], $C \in \mathbb{R}^H$ • Just add a classification layer with parameters $W \in \mathbb{R}^{K \times H}$, where $K$ is the number of labels, and use softmax to calculate the label probabilities: $P = \mathrm{softmax}(CW^T)$
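The classification layer on this slide amounts to one matrix-vector product followed by a softmax. The sketch below illustrates that computation in plain Python with toy numbers; the function name `classify` and the example values are hypothetical, and in practice W is learned jointly with the BERT parameters during fine-tuning.

```python
import math


def classify(cls_vector, weights):
    """Compute P = softmax(C W^T).

    cls_vector: the [CLS] representation C, a list of H floats.
    weights: the classification matrix W, a list of K rows of H floats.
    Returns a list of K label probabilities.
    """
    # One logit per label: the dot product of C with the matching row of W.
    logits = [sum(w * c for w, c in zip(row, cls_vector)) for row in weights]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    C = [0.5, -1.0, 2.0]                     # toy [CLS] vector, H = 3
    W = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]  # toy weights, K = 2 labels
    print(classify(C, W))
```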
  • 21. Fine-tuning procedure • For sequence-level classification tasks • All of the parameters of BERT and W are fine-tuned jointly • Most model hyperparameters are the same as in pre-training, except the batch size, learning rate, and number of training epochs
  • 22. Fine-tuning procedure • Token tagging task (e.g., Named Entity Recognition) • Feed the final hidden representation $T_i \in \mathbb{R}^H$ for each token $i$ into a classification layer over the tagset (NER label set) • To make the task compatible with WordPiece tokenization • Predict the tag for the first sub-token of a word • No prediction is made for the remaining sub-tokens, which are marked X
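Aligning word-level NER tags with WordPiece sub-tokens, as described on this slide, can be sketched as follows. The sketch assumes that continuation pieces carry the `##` prefix (as in the “flight ##less” example earlier in the deck) and that the input has already been tokenized; the function name `align_labels` is hypothetical.

```python
def align_labels(wordpieces, word_labels):
    """Assign each word's tag to its first sub-token and "X" to the rest.

    wordpieces: WordPiece tokens, with continuation pieces prefixed by "##".
    word_labels: one tag per original word, in order.
    """
    aligned = []
    word_index = -1
    for piece in wordpieces:
        if piece.startswith("##"):
            aligned.append("X")  # continuation piece: no prediction is made
        else:
            word_index += 1      # first piece of the next word
            aligned.append(word_labels[word_index])
    return aligned


if __name__ == "__main__":
    pieces = ["Jim", "Hen", "##son", "was", "a", "puppet", "##eer"]
    labels = ["I-PER", "I-PER", "O", "O", "O"]
    print(list(zip(pieces, align_labels(pieces, labels))))
```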
  • 23. Fine-tuning procedure • Single Sentence Tagging Tasks: CoNLL-2003 NER
  • 24. Fine-tuning procedure • Span-level task: SQuAD v1.1 • Input Question: Where do water droplets collide with ice crystals to form precipitation? • Input Paragraph: .... Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals within a cloud. ... • Output Answer: within a cloud
  • 25. Fine-tuning procedure • Span-level task: SQuAD v1.1 • Represent the input question and paragraph as a single packed sequence • The question uses the A embedding and the paragraph uses the B embedding • New parameters to be learned in fine-tuning are a start vector $S \in \mathbb{R}^H$ and an end vector $E \in \mathbb{R}^H$ • Calculate the probability of word $i$ being the start of the answer span: $P_i = \frac{e^{S \cdot T_i}}{\sum_j e^{S \cdot T_j}}$ • The training objective is the log-likelihood of the correct start and end positions
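The start-position probability on this slide, and the choice of an answer span at prediction time, can be sketched in plain Python. This is an illustration under assumptions: token representations are given as small lists, the end probability is computed the same way with the end vector, and the predicted span is the pair (i, j) with j ≥ i that maximizes the sum of the start and end scores. The function names are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def position_probabilities(token_vectors, vec):
    """Softmax over positions of vec . T_i (vec is the start or end vector)."""
    scores = [dot(vec, t) for t in token_vectors]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(token_vectors, start_vec, end_vec):
    """Return (start, end) maximizing S . T_start + E . T_end with end >= start."""
    start_scores = [dot(start_vec, t) for t in token_vectors]
    end_scores = [dot(end_vec, t) for t in token_vectors]
    best = None
    for i, s in enumerate(start_scores):
        for j in range(i, len(end_scores)):
            score = s + end_scores[j]
            if best is None or score > best[0]:
                best = (score, i, j)
    return best[1], best[2]


if __name__ == "__main__":
    T = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]  # toy token representations
    S, E = [3.0, 0.0], [0.0, 3.0]             # toy start and end vectors
    print(position_probabilities(T, S))
    print(best_span(T, S, E))
```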
  • 26. Comparison of BERT and OpenAI GPT • Training data: OpenAI GPT was trained on BooksCorpus (800M words); BERT was trained on BooksCorpus (800M words) + Wikipedia (2,500M words) • Special tokens: OpenAI GPT uses the sentence separator ([SEP]) and classifier token ([CLS]) only at fine-tuning time; BERT learns [SEP], [CLS], and sentence A/B embeddings during pre-training • Training steps: OpenAI GPT was trained for 1M steps with a batch size of 32,000 words; BERT was trained for 1M steps with a batch size of 128,000 words • Fine-tuning learning rate: OpenAI GPT uses the same learning rate of 5e-5 for all fine-tuning experiments; BERT chooses a task-specific learning rate that performs best on the development set
  • 27. Outline • Research context • Main ideas • BERT • Experiments • Conclusions
  • 28. Experiments • GLUE (General Language Understanding Evaluation) benchmark • Distributes canonical Train, Dev, and Test splits • Labels for the Test set are not provided • Datasets in GLUE: • MNLI: Multi-Genre Natural Language Inference • QQP: Quora Question Pairs • QNLI: Question Natural Language Inference • SST-2: Stanford Sentiment Treebank • CoLA: The Corpus of Linguistic Acceptability • STS-B: The Semantic Textual Similarity Benchmark • MRPC: Microsoft Research Paraphrase Corpus • RTE: Recognizing Textual Entailment • WNLI: Winograd NLI
  • 29. GLUE Results
  • 30. SQuAD v1.1 • Reference:
  • 31. Named Entity Recognition
  • 32. SWAG • The Situations With Adversarial Generations (SWAG) dataset • The only task-specific parameter is a vector $V \in \mathbb{R}^H$ • The probability distribution is the softmax over the four choices: $P_i = \frac{e^{V \cdot C_i}}{\sum_{j=1}^{4} e^{V \cdot C_j}}$, where $C_i$ is the [CLS] representation of choice $i$
  • 33. SWAG Result
  • 34. Ablation Studies • To understand • Effect of Pre-training Tasks • Effect of model sizes • Effect of number of training steps • Feature-based approach with BERT
  • 35. Ablation Studies • Main findings: • “Next sentence prediction” (NSP) pre-training is important for sentence-pair classification tasks • Bidirectionality (using the MLM pre-training task) contributes significantly to the performance improvement
  • 36. Ablation Studies • Main findings • Bigger model sizes are better even for small-scale tasks
  • 37. Conclusions • Unsupervised pre-training (language model pre-training) is increasingly adopted in many NLP tasks • The major contribution of the paper is a deep bidirectional architecture based on the Transformer • BERT advances the state of the art for many important NLP tasks
  • 38. Links • TensorFlow code and pre-trained models for BERT: • PyTorch Pretrained Bert: pretrained-BERT • BERT-pytorch: pytorch • BERT-keras: keras
  • 39. Remark: Applying BERT to non-English languages • Pre-trained BERT models are provided for more than 100 languages (including Vietnamese) • research/bert/blob/master/ • Be careful with tokenization! • For Japanese (and Chinese): “spaces were added around every character in the CJK Unicode range before applying WordPiece”, which is not a good approach • Use SentencePiece: • We may need to pre-train a BERT model ourselves
  • 40. References 1. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805. 2. Vaswani et al. (2017). Attention Is All You Need. arXiv preprint arXiv:1706.03762. 3. The Annotated Transformer: ml, by harvardnlp. 4. Dissecting BERT: