BERT
Bidirectional Encoder Representations from Transformers
Multimodal Dialogue Systems Seminar
Abdallah Bashir
CS UdS
19 Sep 2019
Saarbrücken, Germany 1
2
Outline
● Introduction
● Related Works
● BERT | The Model
● Pre-training BERT
● Experiments
● Ablation Studies
3
Keywords!
● Transformers
● Contextual Embeddings
● Bidirectional Training
● Fine-Tuning
● SOTA, a lot of SOTA!
4
Outline
● Introduction
● Related Works
● BERT | The Model
● Pre-training BERT
● Experiments
● Ablation Studies
Introduction
● 2018 was a turning point for Natural Language
Processing
○ BERT
○ OpenAI GPT
○ ELMo
○ ULMFiT
[1]
5
Introduction
● BERT [2] (Bidirectional Encoder Representations
from Transformers) is an open-source language
representation model developed by researchers at
Google AI.
● At the time, the paper presented SOTA results on
eleven NLP tasks.
(Models that have since outperformed BERT are mentioned at the end.) 6
Main Contributions
The paper summarizes its contributions in three main
points:
● Demonstrating the importance of bidirectional pre-training
and why it is better than a unidirectional approach.
● Showing that, with fine-tuning, BERT outperforms task-
specific architectures.
● Showing the performance gains where BERT advances
the SOTA on eleven NLP tasks. 7
8
Outline
● Introduction
● Related Works
● BERT | The Model
● Pre-training BERT
● Experiments
● Ablation Studies
Related Work
● The paper briefly goes through previous approaches
to pre-training general language representations,
grouped under three main headings:
○ Unsupervised Feature-based Approaches
○ Unsupervised Fine-tuning Approaches
○ Transfer Learning from Supervised Data
9
Unsupervised Feature-based
Approaches
● Approaches to learning representations from
unlabeled text (i.e., word embeddings) include both
non-neural and neural methods.
● Using pre-trained word embeddings instead of
training them from scratch has yielded significant
improvements in performance.
10
11
The GloVe word embedding of the word "stick" - a vector of 200 floats (rounded to two decimals). It goes on
for two hundred values. [1]
Non Contextual Embeddings
● Unidirectional
○ Feature-based (ELMo)
○ Fine-tuning (OpenAI GPT)
● Bidirectional
○ BERT
Contextual Embeddings
12
ELMo: What about sentence
context?
13
14
[16]
15
● When ELMo contextual representations were used
in task-specific architectures, ELMo advanced the
SOTA on several benchmarks.
16
17
Transformer
Unsupervised Fine-tuning
Approaches
18
19
20
21
22
23
Self-Attention in a Nutshell!
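The self-attention slides above are figure-only, so here is a minimal NumPy sketch of scaled dot-product attention, the core operation inside each Transformer encoder layer [5]. This is purely illustrative; shapes and names are assumptions and it omits multi-head projections.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention (Vaswani et al. [5]).

    Q, K: (seq_len, d_k), V: (seq_len, d_v).
    Each output position is a weighted average of V, with weights given by
    the softmax of the scaled query-key dot products.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                                        # (seq_len, d_v)

# Toy usage: 4 tokens, 8-dimensional representations; self-attention means Q = K = V.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(x, x, x).shape)            # (4, 8)
```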
Unsupervised Fine-tuning
Approaches
● A perk of such approaches is that only a few
parameters need to be trained from scratch.
● OpenAI GPT [4] previously achieved SOTA on
many sentence-level tasks from the GLUE benchmark [19].
24
Transfer Learning from
Supervised Data
● Transfer learning is a means to extract knowledge
from a source setting and apply it to a different
target setting.
● Researchers in the field of Computer Vision have
long used transfer learning, fine-tuning models
pre-trained on ImageNet. 25
26
27
Outline
● Introduction
● Related Works
● BERT | The Model
● Pre-training BERT
● Experiments
● Ablation Studies
● BERT implementation:
○ Pretraining
○ Fine-tuning
BERT | The Model
28
[16]
29
● BERT uses almost the same model architecture
across different tasks, with only small changes
between the pre-trained architecture and the final
downstream architecture.
30
Model Architecture
● The BERT architecture consists of a multi-layer
bidirectional Transformer encoder [5].
● The BERT models provided in the paper come in two
sizes:
BERTBASE: Layers = 12, Hidden size = 768, Self-attention heads = 12, Total parameters = 110M
BERTLARGE: Layers = 24, Hidden size = 1024, Self-attention heads = 16, Total parameters = 340M
31
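As a compact reference for the two published sizes above, here is a small, purely illustrative Python config sketch; the field names are assumptions, not taken from the official codebase.

```python
from dataclasses import dataclass

@dataclass
class BertConfig:
    """Hypothetical container for the two published BERT sizes [2]."""
    num_layers: int           # Transformer encoder layers
    hidden_size: int          # hidden dimension of each layer
    num_attention_heads: int  # self-attention heads per layer

BERT_BASE = BertConfig(num_layers=12, hidden_size=768, num_attention_heads=12)    # ~110M parameters
BERT_LARGE = BertConfig(num_layers=24, hidden_size=1024, num_attention_heads=16)  # ~340M parameters
```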
● BERTBASE was chosen to have the same size as
OpenAI GPT for comparison purposes.
● The difference between the two models is that BERT
trains the Transformer language model in both
directions, while OpenAI GPT trains it
only left-to-right.
Comparisons with OpenAI GPT
32
Model Input/Output
Representation
● The paper defines a ‘sentence’ as any sequence of
text. As mentioned before, it could be one
sentence or two sentences packed together.
● BERT's input can be either a single
sentence or a pair of sentences (as in Question
Answering).
● BERT uses WordPiece embeddings [23] with a
30,000-token vocabulary. 33
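To illustrate what WordPiece segmentation does, here is a toy greedy longest-match-first tokenizer over a made-up mini-vocabulary. The real 30,000-entry vocabulary and its trained subwords come from [23]; this is only a conceptual sketch.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split, the scheme WordPiece uses [23].
    Continuation pieces are marked with a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:                        # try the longest remaining substring first
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:                           # no piece matched: unknown token
            return [unk]
        pieces.append(cur)
        start = end
    return pieces

# Toy vocabulary; the real BERT vocabulary has ~30,000 entries.
vocab = {"play", "##ing", "##ed", "un", "##believ", "##able"}
print(wordpiece_tokenize("playing", vocab))       # ['play', '##ing']
print(wordpiece_tokenize("unbelievable", vocab))  # ['un', '##believ', '##able']
```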
[24]
34
● The first token in each sequence is always a
[CLS] token, which is used for classification.
● Sentences are kept apart in two ways:
○ Separating the sentences with a [SEP] token,
○ Adding a learned embedding to each token in order to
indicate which sentence it belongs to.
Model Input/Output
Representation
35
● Each token is represented by summing the
corresponding token, segment, and position
embeddings, as seen in the figure above.
36
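A minimal PyTorch sketch of how the three embedding tables could be combined into the input representation described above. Dimensions and layer names are illustrative, not the reference implementation, and the real model also applies layer normalization and dropout after the sum (omitted here).

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Token + segment + position embeddings, summed per token (illustrative)."""
    def __init__(self, vocab_size=30000, hidden=768, max_len=512, n_segments=2):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)   # WordPiece token embeddings
        self.seg = nn.Embedding(n_segments, hidden)   # sentence A vs. sentence B
        self.pos = nn.Embedding(max_len, hidden)      # learned position embeddings

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.tok(token_ids)
                + self.seg(segment_ids)
                + self.pos(positions))                # position term broadcasts over the batch

# Example shape check: [CLS] sentence-A tokens [SEP] sentence-B tokens [SEP]
token_ids = torch.tensor([[101, 2023, 2003, 102, 2008, 2205, 102]])
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1]])
emb = BertInputEmbeddings()(token_ids, segment_ids)
print(emb.shape)  # torch.Size([1, 7, 768])
```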
37
Outline
● Introduction
● Related Works
● BERT | The Model
● Pre-training BERT
● Experiments
● Ablation Studies
Pre-training BERT
● In order to pre-train the BERT Transformer
bidirectionally, the paper proposes two
unsupervised tasks: Masked LM and
Next Sentence Prediction (NSP).
38
Pre-training Data
● One of the reasons for BERT's success is the
amount of data it was trained on.
● The two main corpora used to
pre-train the language model are:
○ the BooksCorpus (800M words)
○ English Wikipedia (2,500M words)
39
Masked Language Modeling
● Usually, language models can only be trained in
one specific direction.
● BERT handles this issue by using a “Masked
Language Model”, inspired by previous literature (the
Cloze task [24]).
● BERT does this by randomly masking 15% of the
WordPiece input tokens. Only the masked tokens
are predicted later.
40
[16]
41
● [MASK] tokens appear only during pre-training
and not during fine-tuning!
● To mitigate this mismatch, after randomly choosing the
15% of tokens, BERT either:
○ Switches the token to [MASK] (80% of the time).
○ Switches the token to a random other token (10% of the time).
○ Leaves the token unchanged (10% of the time).
● The original token is then predicted with a cross-entropy
loss.
Masked Language Modeling
42
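A sketch of the 80/10/10 masking rule described above, applied to a list of WordPiece tokens. The vocabulary handling and position sampling are simplified; this is not the official data pipeline.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Apply BERT-style masking: pick ~15% of positions; of those,
    80% -> [MASK], 10% -> random token, 10% -> unchanged.
    Returns the corrupted tokens and the prediction targets (original tokens)."""
    tokens = list(tokens)
    targets = {}                                   # position -> original token to predict
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() >= mask_prob:
            continue
        targets[i] = tok
        r = random.random()
        if r < 0.8:
            tokens[i] = "[MASK]"                   # 80%: replace with [MASK]
        elif r < 0.9:
            tokens[i] = random.choice(vocab)       # 10%: replace with a random token
        # else: 10% keep the original token unchanged
    return tokens, targets

vocab = ["the", "man", "went", "to", "store", "dog", "cute"]
corrupted, targets = mask_tokens(
    ["[CLS]", "the", "man", "went", "to", "the", "store", "[SEP]"], vocab)
print(corrupted, targets)   # the loss is computed only at the positions in `targets`
```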
Next Sentence Prediction (NSP)
● In the pre-training phase, the model receives
pairs of sentences and is trained to
predict whether the second sentence is the successor
of the first.
● In the training data, 50% of the inputs are actual pairs,
labeled [IsNext], while the other 50% pair the first
sentence with a random sentence from the corpus as its
successor, labeled [NotNext].
● NSP is crucial for downstream tasks like Question
Answering and Natural Language Inference. 43
Next Sentence Prediction (NSP)
44
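A sketch of how NSP training pairs could be assembled from a document corpus following the 50/50 rule above. The corpus handling is deliberately simplified (for instance, the NotNext branch does not rule out accidentally sampling the true next sentence).

```python
import random

def make_nsp_pair(documents):
    """Build one NSP example: 50% consecutive sentences (IsNext),
    50% a random sentence from elsewhere in the corpus (NotNext)."""
    doc = random.choice(documents)
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        sent_b, label = doc[i + 1], "IsNext"             # the real next sentence
    else:
        other = random.choice(documents)
        sent_b, label = random.choice(other), "NotNext"  # a random sentence
    return sent_a, sent_b, label

docs = [["the man went to the store .", "he bought a gallon of milk ."],
        ["penguins are flightless birds .", "they live in the southern hemisphere ."]]
print(make_nsp_pair(docs))
```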
● This stage, in total, is inexpensive
relative to the pre-training phase.
● For classification tasks (e.g. sentiment analysis), we
add a classification feed-forward layer on top of the final
output representation of the [CLS] token.
● For Question Answering-like tasks, BERT trains two
extra vectors that are responsible for marking the
beginning and the end of the answer span.
Fine-tuning BERT
45
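A rough PyTorch sketch of the two kinds of fine-tuning heads described above, assuming a pre-trained encoder that returns one hidden vector per token. The encoder itself and all names here are placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

HIDDEN = 768  # hidden size of BERT-base

class ClassificationHead(nn.Module):
    """Single linear layer on the final [CLS] representation (position 0)."""
    def __init__(self, num_labels):
        super().__init__()
        self.classifier = nn.Linear(HIDDEN, num_labels)

    def forward(self, hidden_states):                 # (batch, seq_len, HIDDEN)
        return self.classifier(hidden_states[:, 0])   # logits over the labels

class SpanHead(nn.Module):
    """Two vectors (start, end) dotted with every token for QA span extraction."""
    def __init__(self):
        super().__init__()
        self.start_end = nn.Linear(HIDDEN, 2)          # column 0 = start, column 1 = end

    def forward(self, hidden_states):
        logits = self.start_end(hidden_states)         # (batch, seq_len, 2)
        return logits[..., 0], logits[..., 1]          # start and end logits per token

# Usage with fake encoder output: a batch of 2 sequences of 16 tokens.
h = torch.randn(2, 16, HIDDEN)
print(ClassificationHead(num_labels=3)(h).shape)       # torch.Size([2, 3])
start, end = SpanHead()(h)
print(start.shape, end.shape)                          # torch.Size([2, 16]) each
```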
Classification Example
[16]
46
47
Outline
● Introduction
● Related Works
● BERT | The Model
● Pre-training BERT
● Experiments
● Ablation Studies
Experiments
48
49
50
Experiments
● GLUE
○ General Language Understanding
Evaluation benchmark
● SQuAD v1.1
○ Stanford Question Answering Dataset
● SQuAD v2.0
○ extends the SQuAD 1.1
● SWAG
○ Situations With Adversarial Generations
(SWAG) dataset
1. GLUE
● “The General Language Understanding Evaluation
benchmark (GLUE) [19] is a collection of various
natural language understanding tasks.”
● For fine-tuning, the same classification approach
as in pre-training is used.
51
Table 1: GLUE Test results [2]
52
2. SQuAD v1.1
● “SQuAD v1.1 [26], the Stanford Question
Answering Dataset, is a collection of 100k
crowdsourced question/answer pairs.”
● The input is represented as one sequence containing
the question and the passage containing the answer.
Table 2: SQuAD 1.1 results
2. SQuAD v1.1
54
● The SQuAD 2.0 task extends SQuAD 1.1 by:
○ allowing for the possibility that no short answer
exists in the provided paragraph.
○ marking questions that do not have an answer with an
answer span that starts and ends at the [CLS] token.
TriviaQA data was also used for this model.
3. SQuAD v2.0
55
Table 3: SQuAD 2.0 results
56
● “The Situations With Adversarial Generations
(SWAG) dataset contains 113k sentence-pair
completion examples that evaluate grounded
common-sense inference [28].
● Given a sentence, the task is to choose the most
plausible continuation among four choices.”
● The paper fine-tunes on the SWAG dataset by
creating four input sequences, each concatenating
the given sentence with a possible
continuation.
4. SWAG
57
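A small sketch of the SWAG fine-tuning setup described above: the given sentence is paired with each of the four candidate endings and the model scores each pair. The scoring function here is a placeholder (in BERT it would be a linear layer over the [CLS] vector), and the example endings are invented.

```python
def build_swag_inputs(context, endings):
    """Create one '[CLS] context [SEP] ending [SEP]' sequence per candidate."""
    return [f"[CLS] {context} [SEP] {ending} [SEP]" for ending in endings]

def choose_ending(context, endings, score_fn):
    """score_fn is assumed to map one input sequence to a scalar plausibility
    score; the predicted ending is simply the argmax over the four candidates."""
    scores = [score_fn(seq) for seq in build_swag_inputs(context, endings)]
    return max(range(len(endings)), key=lambda i: scores[i])

endings = ["she mounts the horse.", "she eats the saddle.",
           "the horse flies away.", "she falls asleep mid-air."]
print(build_swag_inputs("A girl is walking toward a horse,", endings)[0])
print(choose_ending("A girl is walking toward a horse,", endings, score_fn=len))
```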
Table 4: SWAG results
58
59
Outline
● Introduction
● Related Works
● BERT | The Model
● Pre-training BERT
● Experiments
● Ablation Studies
Ablation
Studies
60
1. Effect of Pre-training Tasks
● The next experiments will showcase the
importance of the bidirectionality of BERT.
● The same pre-training data, fine-tuning scheme, and
hyperparameters as BERTBASE are used
throughout these experiments.
61
Experiment | Description
No NSP | A bidirectional model trained only using the Masked Language Model (MLM), without the Next Sentence Prediction (NSP) task.
LTR & No NSP | A standard left-to-right (LTR) language model trained without MLM and without NSP. This model can be compared to OpenAI GPT.
1. Effect of Pre-training Tasks
62
Table 5: Ablation over the pre-training tasks using the BERTBASE architecture [2]
63
● The effect of BERT model size on fine-tuning tasks
was tested with different numbers of layers, hidden
units, and attention heads, while using the same
hyperparameters otherwise.
● Results from fine-tuning on GLUE are shown in
Table 6, which includes the average Dev set
accuracy.
● It is clear that the larger the model, the better the
accuracy.
2. Effect of Model Size
64
Table 6: Ablation over BERT model size [2]
65
● BERT shows for the first time that increasing the
model size can improve results even for
downstream tasks whose data is very small,
because they benefit from the larger, more expressive
pre-trained representations.
66
3. Feature-based Approach with
BERT
● There is another way to use BERT for downstream
tasks besides fine-tuning: taking the
contextualized word embeddings that are
generated by the pre-trained BERT, and then using
these fixed features in other models.
● Advantage of using such an approach:
○ Some tasks require a task-specific model architecture and
can't be modeled with a Transformer encoder
architecture. 67
● The two approaches were compared by applying BERT to
the CoNLL-2003 Named Entity Recognition (NER) task.
● The contextual embeddings are used as input to a
randomly initialized two-layer 768-dimensional BiLSTM
before the classification layer.
● The paper tries different ways of representing the
embeddings (all shown in Table 7) in the feature-based
approach; concatenating the outputs of the last four
hidden layers achieved the best results.
3. Feature-based Approach with
BERT
68
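A sketch of the best-performing feature-based variant described above: the hidden states from the last four encoder layers are concatenated per token and fed, as frozen features, into a small BiLSTM tagger. The layer sizes follow the slide, but the encoder outputs are faked and the module names are assumptions.

```python
import torch
import torch.nn as nn

HIDDEN = 768

def concat_last_four(all_layer_states):
    """all_layer_states: list of per-layer tensors, each (batch, seq_len, HIDDEN).
    Returns (batch, seq_len, 4 * HIDDEN) frozen features."""
    return torch.cat(all_layer_states[-4:], dim=-1)

class FeatureBasedNERTagger(nn.Module):
    """Two-layer 768-dimensional BiLSTM over frozen BERT features, followed by a
    per-token classifier (mirroring the CoNLL-2003 experiment)."""
    def __init__(self, num_tags):
        super().__init__()
        self.bilstm = nn.LSTM(input_size=4 * HIDDEN, hidden_size=HIDDEN,
                              num_layers=2, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * HIDDEN, num_tags)

    def forward(self, features):                     # (batch, seq_len, 4 * HIDDEN)
        out, _ = self.bilstm(features)
        return self.classifier(out)                  # (batch, seq_len, num_tags)

# Fake layer outputs from a 12-layer encoder: batch of 1, 10 tokens.
layers = [torch.randn(1, 10, HIDDEN) for _ in range(12)]
logits = FeatureBasedNERTagger(num_tags=9)(concat_last_four(layers))
print(logits.shape)                                  # torch.Size([1, 10, 9])
```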
Table 7: CoNLL-2003 Named Entity Recognition results [1]
69
In Conclusion
● BERT's major contribution is adding more
generalization to existing Transfer Learning
methods by using a bidirectional architecture.
● The BERT model also contributes more broadly to the
field: fine-tuning BERT can now tackle a wide range
of NLP tasks.
70
71
Papers that outperformed BERT
● RoBERTa, FAIR
● XLNet, CMU + Google AI
● CTRL, Salesforce Research
72
References
1. http://jalammar.github.io/illustrated-bert/
2. J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: pretraining of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
3. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In NAACL.
4. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving
language understanding with unsupervised learning. Technical report, OpenAI.
5. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural
Information Processing Systems, pages 6000–6010.
6. Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai.
1992. Class-based n-gram models of natural language. Computational linguistics, 18(4):467–
479.
7. Rie Kubota Ando and Tong Zhang. 2005. A framework for learning predictive structures from
multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853.
73
8. John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation
with structural correspondence learning. In Proceedings of the 2006 conference on empirical
methods in natural language processing, pages 120–128. Association for Computational
Linguistics.
9. Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp
Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in
statistical language modeling. arXiv preprint arXiv:1312.3005.
10. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove:
Global vectors for word representation. In Empirical Methods in Natural Language
Processing (EMNLP), pages 1532– 1543.
11. Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A
simple and general method for semi-supervised learning. In Proceedings of the 48th Annual
Meeting of the Association for Computational Linguistics, ACL ’10, pages 384–394.
12. Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun,
Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in neural
information processing systems, pages 3294–3302.
13. Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for
learning sentence representations. In International Conference on Learning Representations.
74
14. Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and
documents. In International Conference on Machine Learning, pages 1188–1196.
15. Yacine Jernite, Samuel R. Bowman, and David Sontag. 2017. Discourse-based
objectives for fast unsupervised sentence representation learning. CoRR, abs/1705.00557.
16. http://jalammar.github.io/illustrated-bert/
17. Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances
in neural information processing systems, pages 3079–3087.
18. Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for
text classification. In ACL. Association for Computational Linguistics.
19. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel
Bowman. 2018a. Glue: A multi-task benchmark and analysis platform for natural language
understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and
Interpreting Neural Networks for NLP, pages 353–355.
20. Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes.
2017. Supervised learning of universal sentence representations from natural language inference
data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing,
pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.
75
21. Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in
translation: Contextualized word vectors. In NIPS.
22. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale
Hierarchical Image Database. In CVPR09.
23. Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang
Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation
system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
24. Wilson L Taylor. 1953. Cloze procedure: A new tool for measuring readability. Journalism
Bulletin, 30(4):415–433
25. Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba,
and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching
movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19–
27.
26. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+
questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in
Natural Language Processing, pages 2383–2392.
27. Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale
distantly supervised challenge dataset for reading comprehension. In ACL.
76
28. Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. Swag: A large-scale
adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Processing (EMNLP).
29. Erik F Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the conll-2003
shared task: Language-independent named entity recognition. In CoNLL.
30. http://jalammar.github.io/illustrated-transformer/
77

Contenu connexe

Tendances

NLP State of the Art | BERT
NLP State of the Art | BERTNLP State of the Art | BERT
NLP State of the Art | BERTshaurya uppal
 
NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers Arvind Devaraj
 
BERT - Part 1 Learning Notes of Senthil Kumar
BERT - Part 1 Learning Notes of Senthil KumarBERT - Part 1 Learning Notes of Senthil Kumar
BERT - Part 1 Learning Notes of Senthil KumarSenthil Kumar M
 
An introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERTAn introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERTSuman Debnath
 
1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)WarNik Chow
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Yuriy Guts
 
BERT Finetuning Webinar Presentation
BERT Finetuning Webinar PresentationBERT Finetuning Webinar Presentation
BERT Finetuning Webinar Presentationbhavesh_physics
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer modelsDing Li
 
Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introductionRobert Lujo
 
Fine tune and deploy Hugging Face NLP models
Fine tune and deploy Hugging Face NLP modelsFine tune and deploy Hugging Face NLP models
Fine tune and deploy Hugging Face NLP modelsOVHcloud
 
A Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptxA Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptxSaiPragnaKancheti
 
A Review of Deep Contextualized Word Representations (Peters+, 2018)
A Review of Deep Contextualized Word Representations (Peters+, 2018)A Review of Deep Contextualized Word Representations (Peters+, 2018)
A Review of Deep Contextualized Word Representations (Peters+, 2018)Shuntaro Yada
 
Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Yuta Niki
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language ProcessingPranav Gupta
 
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)Universitat Politècnica de Catalunya
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You NeedDaiki Tanaka
 
INTRODUCTION TO NLP, RNN, LSTM, GRU
INTRODUCTION TO NLP, RNN, LSTM, GRUINTRODUCTION TO NLP, RNN, LSTM, GRU
INTRODUCTION TO NLP, RNN, LSTM, GRUSri Geetha
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingYoung Seok Kim
 

Tendances (20)

NLP State of the Art | BERT
NLP State of the Art | BERTNLP State of the Art | BERT
NLP State of the Art | BERT
 
[Paper review] BERT
[Paper review] BERT[Paper review] BERT
[Paper review] BERT
 
NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers
 
BERT - Part 1 Learning Notes of Senthil Kumar
BERT - Part 1 Learning Notes of Senthil KumarBERT - Part 1 Learning Notes of Senthil Kumar
BERT - Part 1 Learning Notes of Senthil Kumar
 
An introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERTAn introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERT
 
1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
BERT Finetuning Webinar Presentation
BERT Finetuning Webinar PresentationBERT Finetuning Webinar Presentation
BERT Finetuning Webinar Presentation
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer models
 
Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introduction
 
Fine tune and deploy Hugging Face NLP models
Fine tune and deploy Hugging Face NLP modelsFine tune and deploy Hugging Face NLP models
Fine tune and deploy Hugging Face NLP models
 
Word embedding
Word embedding Word embedding
Word embedding
 
A Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptxA Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptx
 
A Review of Deep Contextualized Word Representations (Peters+, 2018)
A Review of Deep Contextualized Word Representations (Peters+, 2018)A Review of Deep Contextualized Word Representations (Peters+, 2018)
A Review of Deep Contextualized Word Representations (Peters+, 2018)
 
Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need
 
INTRODUCTION TO NLP, RNN, LSTM, GRU
INTRODUCTION TO NLP, RNN, LSTM, GRUINTRODUCTION TO NLP, RNN, LSTM, GRU
INTRODUCTION TO NLP, RNN, LSTM, GRU
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
 

Similaire à Bert

QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137Anant Corporation
 
Pay attention to MLPs
Pay attention to MLPsPay attention to MLPs
Pay attention to MLPsSeolhokim
 
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Deep Learning Italia
 
Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machi...
Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machi...Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machi...
Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machi...Vimukthi Wickramasinghe
 
Lexically constrained decoding for sequence generation using grid beam search
Lexically constrained decoding for sequence generation using grid beam searchLexically constrained decoding for sequence generation using grid beam search
Lexically constrained decoding for sequence generation using grid beam searchSatoru Katsumata
 
Paper presentation on LLM compression
Paper presentation on LLM compression Paper presentation on LLM compression
Paper presentation on LLM compression SanjanaRajeshKothari
 
A Study of BFLOAT16 for Deep Learning Training
A Study of BFLOAT16 for Deep Learning TrainingA Study of BFLOAT16 for Deep Learning Training
A Study of BFLOAT16 for Deep Learning TrainingSubhajit Sahu
 
Exploring Strategies for Training Deep Neural Networks paper review
Exploring Strategies for Training Deep Neural Networks paper reviewExploring Strategies for Training Deep Neural Networks paper review
Exploring Strategies for Training Deep Neural Networks paper reviewVimukthi Wickramasinghe
 
EXTRACTIVE SUMMARIZATION WITH VERY DEEP PRETRAINED LANGUAGE MODEL
EXTRACTIVE SUMMARIZATION WITH VERY DEEP PRETRAINED LANGUAGE MODELEXTRACTIVE SUMMARIZATION WITH VERY DEEP PRETRAINED LANGUAGE MODEL
EXTRACTIVE SUMMARIZATION WITH VERY DEEP PRETRAINED LANGUAGE MODELijaia
 
Extractive Summarization with Very Deep Pretrained Language Model
Extractive Summarization with Very Deep Pretrained Language ModelExtractive Summarization with Very Deep Pretrained Language Model
Extractive Summarization with Very Deep Pretrained Language Modelgerogepatton
 
Machine learning testing survey, landscapes and horizons, the Cliff Notes
Machine learning testing  survey, landscapes and horizons, the Cliff NotesMachine learning testing  survey, landscapes and horizons, the Cliff Notes
Machine learning testing survey, landscapes and horizons, the Cliff NotesHeemeng Foo
 
Comparative Analysis of Transformer Based Pre-Trained NLP Models
Comparative Analysis of Transformer Based Pre-Trained NLP ModelsComparative Analysis of Transformer Based Pre-Trained NLP Models
Comparative Analysis of Transformer Based Pre-Trained NLP Modelssaurav singla
 
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFEnd-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFJayavardhan Reddy Peddamail
 
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Eff...
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Eff...Switch Transformers: Scaling to Trillion Parameter Models with Simple and Eff...
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Eff...taeseon ryu
 
Scaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language ModelsScaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language Modelstaeseon ryu
 
Universal job embedding in recommendation (public ver.)
Universal job embedding in recommendation (public ver.)Universal job embedding in recommendation (public ver.)
Universal job embedding in recommendation (public ver.)Marsan Ma
 
Set Transfomer: A Framework for Attention-based Permutaion-Invariant Neural N...
Set Transfomer: A Framework for Attention-based Permutaion-Invariant Neural N...Set Transfomer: A Framework for Attention-based Permutaion-Invariant Neural N...
Set Transfomer: A Framework for Attention-based Permutaion-Invariant Neural N...Thien Q. Tran
 

Similaire à Bert (20)

QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
 
Pay attention to MLPs
Pay attention to MLPsPay attention to MLPs
Pay attention to MLPs
 
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
 
Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machi...
Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machi...Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machi...
Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machi...
 
Lexically constrained decoding for sequence generation using grid beam search
Lexically constrained decoding for sequence generation using grid beam searchLexically constrained decoding for sequence generation using grid beam search
Lexically constrained decoding for sequence generation using grid beam search
 
Paper presentation on LLM compression
Paper presentation on LLM compression Paper presentation on LLM compression
Paper presentation on LLM compression
 
A Study of BFLOAT16 for Deep Learning Training
A Study of BFLOAT16 for Deep Learning TrainingA Study of BFLOAT16 for Deep Learning Training
A Study of BFLOAT16 for Deep Learning Training
 
Study_of_Sequence_labeling_Systems
Study_of_Sequence_labeling_SystemsStudy_of_Sequence_labeling_Systems
Study_of_Sequence_labeling_Systems
 
Exploring Strategies for Training Deep Neural Networks paper review
Exploring Strategies for Training Deep Neural Networks paper reviewExploring Strategies for Training Deep Neural Networks paper review
Exploring Strategies for Training Deep Neural Networks paper review
 
EXTRACTIVE SUMMARIZATION WITH VERY DEEP PRETRAINED LANGUAGE MODEL
EXTRACTIVE SUMMARIZATION WITH VERY DEEP PRETRAINED LANGUAGE MODELEXTRACTIVE SUMMARIZATION WITH VERY DEEP PRETRAINED LANGUAGE MODEL
EXTRACTIVE SUMMARIZATION WITH VERY DEEP PRETRAINED LANGUAGE MODEL
 
Extractive Summarization with Very Deep Pretrained Language Model
Extractive Summarization with Very Deep Pretrained Language ModelExtractive Summarization with Very Deep Pretrained Language Model
Extractive Summarization with Very Deep Pretrained Language Model
 
Machine learning testing survey, landscapes and horizons, the Cliff Notes
Machine learning testing  survey, landscapes and horizons, the Cliff NotesMachine learning testing  survey, landscapes and horizons, the Cliff Notes
Machine learning testing survey, landscapes and horizons, the Cliff Notes
 
Comparative Analysis of Transformer Based Pre-Trained NLP Models
Comparative Analysis of Transformer Based Pre-Trained NLP ModelsComparative Analysis of Transformer Based Pre-Trained NLP Models
Comparative Analysis of Transformer Based Pre-Trained NLP Models
 
CSSC ML Workshop
CSSC ML WorkshopCSSC ML Workshop
CSSC ML Workshop
 
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFEnd-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
 
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Eff...
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Eff...Switch Transformers: Scaling to Trillion Parameter Models with Simple and Eff...
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Eff...
 
Scaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language ModelsScaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language Models
 
Universal job embedding in recommendation (public ver.)
Universal job embedding in recommendation (public ver.)Universal job embedding in recommendation (public ver.)
Universal job embedding in recommendation (public ver.)
 
Set Transfomer: A Framework for Attention-based Permutaion-Invariant Neural N...
Set Transfomer: A Framework for Attention-based Permutaion-Invariant Neural N...Set Transfomer: A Framework for Attention-based Permutaion-Invariant Neural N...
Set Transfomer: A Framework for Attention-based Permutaion-Invariant Neural N...
 
nnUNet
nnUNetnnUNet
nnUNet
 

Dernier

From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 

Dernier (20)

From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 

Bert

  • 1. BERT Bidirectional Encoder Representations from TransformersMultimodal Dialogue Systems Seminar Abdallah Bashir CS UdS 19 Sep 2019 Saarbrucken Germany 1
  • 2. 2 Outline ● Introduction ● Related Works ● BERT | The Model ● Pre-training BERT ● Experiments ● Ablation Studies
  • 3. 3 Keywords! ● Transformers ● Contextual Embeddings ● Bidirectional Training ● Fine-Tuning ● SOTA, alot of SOTA!
  • 4. 4 Outline ● Introduction ● Related Works ● BERT | The Model ● Pre-training BERT ● Experiments ● Ablation Studies
  • 5. Introduction ● 2018 was a turning point for Natural Language Processing ○ BERT ○ OpenAI GPT ○ ELMO ○ ULM fit [1] 5
  • 6. Introduction ● BERT [2] (Bidirectional Encoder Representations from Transformers) is an Open-Source Language Representation Model developed by researchers in Google AI. ● At that time, the paper presented SOTA results in eleven NLP tasks. 6Models that outperformed bert mentioned at the end.
  • 7. Main Contributions The paper summarizes the contributions in main three points ● demonstrate the importance of Bidirectional pre-training and why it is better than unidirectional approach. ● With the usage of fine-tuning, BERT outperforms task- specific architectures. ● Showing the performance achievements where BERT advances the SOTA for eleven NLP tasks 7
  • 8. 8 Outline ● Introduction ● Related Works ● BERT | The Model ● Pre-training BERT ● Experiments ● Ablation Studies
  • 9. Related Work ● The paper briefly go through previous approaches in pre-training general language representations, which are listed under main three points: ○ Unsupervised Feature-based Approaches ○ Unsupervised Fine-tuning Approaches ○ Transfer Learning from Supervised Data 9
  • 10. Unsupervised Feature-based Approaches ● Approaches to Learning representations from unlabeled text, i.e word embeddings included non- neural approaches and neural approaches. ● Using pre-trained word-embeddings instead of training it from scratch have proved significant improvements in performance. 10
  • 11. 11 The GloVe word embedding of the word "stick" - a vector of 200 floats (rounded to two decimals). It goes on for two hundred values. [1] Non Contextual Embeddings
  • 12. ● Unidirectional ○ Feature-Based(ELMo) ○ Fine-tuning(OpenAI GPT). ● Bidirectional ○ BERT Contextual Embeddings 12
  • 13. ELMO, What about sentences’ context 13
  • 14. 14
  • 16. ● When ELMo contextual representations was used in task-specific architecture, ELMo advanced the SOTA benchmarks. 16
  • 18. 18
  • 19. 19
  • 20. 20
  • 21. 21
  • 22. 22
  • 24. Unsupervised Fine-tuning Approaches ● Perks of using such approaches that only few parameters need to be trained from scratch. ● OpenAI GPT [4] previously achieved SOTA on many sentence level tasks in GLUE benchmark [19] 24
  • 25. Transfer Learning from Supervised Data ● Transfer learning is a means to extract knowledge from a source setting and apply it to a different target setting. ● Researchers in the field of Computer Vision have used transfer learning continuously where they fine-tune models pre-trained with ImageNet. 25
  • 26. 26
  • 27. 27 Outlines ● Introduction ● Related Works ● BERT | The Model ● Pre-training BERT ● Experiments ● Ablation Studies
  • 28. ● BERT implementation: ○ Pretraining ○ Fine-tuning BERT | The Model 28
  • 30. ● BERT almost have the same model architecture across different tasks with small changes between the pre-trained architecture and the final downstream architecture 30
  • 31. Model Architecture ● BERT architecture consist of multi-layer bidirectional Transformer encoder [5]. ● BERT model provided in the paper came in two sizes BERTBASE BERTLARGE Layers = 12 Layers = 24 Hidden size = 768 Hidden size = 1024 self-Attention heads = 12 self-Attention heads = 16 Total parameters = 110M Total parameters = 340M 31
  • 32. ● BERTbase was chosen to have the same size as OpenAI GPT for benchmarks purposes. ● The difference between the two models is BERT train the Language Model transformers in both directions while the OpenAI GPT context trains it only left-to-right. Comparisons with OpenAI GPT 32
  • 33. Model Input/Output Representation ● The paper defines a ‘sentence’ as any sequence of text. Like mentioned before, it could be one sentence or two sentences stacked together. ● BERT input can be represented by either a single sentence, or pair of sentences (like Question Answering). ● BERT use WordPiece embeddings [23] with 30,000 token vocabulary. 33
  • 35. ● The first token in each sequence is always a [CLS] token that is used in classification. ● Separating Sentences apart can be done in two steps: ○ Separate between sentences using [SEP] token, ○ Adding a learned embedding to each token in order to state which sentence it belongs to. Model Input/Output Representation 35
  • 36. ● Each token is represented by summing the corresponding token, segment and position embedding as seen in the above figure 36
  • 37. 37 Outline ● Introduction ● Related Works ● BERT | The Model ● Pre-training BERT ● Experiments ● Ablation Studies
  • 38. Pre-training BERT ● In order to pre-train BERT transformers bidirectionally the paper proposed two unsupervised tasks which are Masked LM and Next Sentence Prediction(NSP). 38
  • 39. Pre-training Data ● One of the reasons of BERT success is the amount of data that it got trained on. ● The main two corpuses that were used to pretrain the language models are: ○ the BooksCorpus (800M words) ○ English Wikipedia (2,500M words) 39
  • 40. Masked Language Modeling ● Usually, Language Models can only be trained on one specific direction. ● BERT handled this issue by using a “Masked Language Model” from previous literature (Cloze [24]). ● BERT do this by randomly masking 15% of the wordpiece input tokens. Only the masked tokens get to be predicted later. 40
  • 42. ● [MASK] tokens would appear only in the pretraining and not during the fine-tuning! ● To alleviate the masking, after choosing the randomly 15% tokens, BERT either: ○ Switch the token to [MASK] (80% of the time). ○ Switch the token to a random other token (10% of the time). ○ Leave the token unchanged (10% of the time). ● The original token is then predicted with cross entropy loss. Masked Language Modeling 42
  • 43. Next Sentence Prediction (NSP) ● In the pretraining phase, the model (if provided) receives pairs of sentences and it will be trained to predict if the second sentence is the subsequent of the first. ● In the training data, 50% of the inputs are actual pairs with a label [IsNext], where the other 50% have random sentences from the corpus as the successor for the first sentence [NotNext]. ● NSP is crucial for downstream tasks like Question Answering and Natural Language Inference. 43
  • 45. ● this stage in total is considered to be inexpensive relative to the pretraining phase. ● For classification tasks (e.g sentiment analysis) we add a classification FFN for the [CLS] token input representation on top of the final output. ● For Question Answering alike tasks Bert train two extra vectors that are responsible for marking the beginning and the end of the answer. Fine-tuning BERT 45
  • 47. 47 Outline ● Introduction ● Related Works ● BERT | The Model ● Pre-training BERT ● Experiments ● Ablation Studies
  • 49. 49
  • 50. 50 Experiments ● GLUE ○ General Language Understanding Evaluation benchmark ● SQuAD v1.1 ○ Stanford Question Answering Dataset ● SQuAD v2.0 ○ extends the SQuAD 1.1 ● SWAG ○ Situations With Adversarial Generations (SWAG) dataset
  • 51. 1. GLUE ● “The General Language Understanding Evaluation benchmark (GLUE) [19] is a collection of various natural language understanding tasks.” ● For fine-tuning,the same approach of classification in pre-training is used. 51
  • 52. Table 1: GLUE Test results [2] 52
  • 53. 2. SQuAD v1.1 ● “SQuAD v1.1 [26], The Stanford Question Answering Dataset is a collection of 100k crowdsourced question,answer pairs.” ● Input get represented as one sequence containing the question and the text containing the answer. 53
  • 54. Table 2: SQuAD 1.1 results 2. SQuAD v1.1 54
  • 55. ● The SQuAD 2.0 task extends the SQuAD 1.1 by: ○ making sure that no short answer exists in the provided paragraph. ○ marking questions that do not have an answer with [CLS] token at the start and the end. The TriviaQA data was used for this model. 3. SQuAD v2.0 55
  • 56. Table 3: SQuAD 2.0 results 56
  • 57. ● “The Situations With Adversarial Generations (SWAG) dataset contains 113k sentence-pair completion examples that evaluate grounded common sense inference [28], ● Given a sentence, the task is to choose the most plausible continuation among four choices” ● The paper fine-tune on the SWAG dataset by creating four input sequences, each include the given sentence and concatenated with a possible continuation. 4. SWAG 57
  • 58. Table 4: SWAG Result‫س‬ 58
  • 59. 59 Outline ● Introduction ● Related Works ● BERT | The Model ● Pre-training BERT ● Experiments ● Ablation Studies
  • 61. 1. Effect of Pre-training Tasks ● The next experiments will showcase the importance of the bidirectionality of BERT. ● the same pretraining data, fine-tuning scheme, and hyperparameters as BERTBASE will be used throughout the experiments. 61
  • 62. Experiments Description No NSP bidirectional model trained only using the Masked Language Model (MLM) without the Next Sentence Prediction NSP. LTR & No NSP A standard Left-to-Right (LTR) language model which is trained without the MLM and the NSP. this model can be compared to the OpenAI GPT. 1. Effect of Pre-training Tasks 62
  • 63. Table 5: Ablation over the pre-training tasks using the BERTBASE architecture [2] 63
  • 64. ● The effect of Bert model size on fine-tuning tasks was tested with different number of layers, hidden units, and attention heads while using the same hyperparameters. ● Results from fine-tuning on GLUE are shown in Table 6 which include the average Dev Set accuracy. ● It is clear that the larger the model, the better the accuracy. 2. Effect of Model Size 64
  • 65. Table 6: Ablation over BERT model size [2] 65
  • 66. ● Bert shows for the first time that increasing the model can improve the results even for downstream tasks that its data is very small because it benefit from the larger, more expressive pre-trained representations. 66
  • 67. 3. Feature-based Approach with BERT ● There are other ways to use Bert for downstream tasks other than fine-tuning which is using the contextualized word embeddings that are generated from pre-training BERT, and then use these fixed features in other models. ● Advantages of using such an approach: ○ Some tasks require a task-specific model architecture and can’t be modeled with a Transformer encoder architecture. 67
  • 68. ● The two approaches were compared by applying Bert to the CoNLL-2003 Named Entity Recognition (NER) task. ● These contextual embeddings are used as input to a randomly initialized two-layer 768-dimensional BiLSTM before the classification layer. ● The paper tried different approaches to represent the embeddings (all shown in Table 7.) in the feature-based approach. Concatenate the last four-hidden layers output achieved the best. 3. Feature-based Approach with BERT 68
  • 69. Table 7: CoNLL-2003 Named Entity Recognition results [1] 69
  • 70. In Conclusion ● Bert major contribution was in adding more generalization to existing Transfer Learning methods by using bidirectional architecture. ● Bert model also added more contribution to the field. Fine-tuning Bert model will now tackle a lot of NLP tasks. 70
  • 71. 71 Papers that outperformed Bert ● Roberta, FAIR ● Xlnet, CMU + Google AI ● CTRL, Salesforce Research
  • 72. 72
  • 73. References 1. http://jalammar.github.io/illustrated-bert/ 2. J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: pretraining of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. 3. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In NAACL. 4. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI. 5. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010. 6. Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. 1992. Class-based n-gram models of natural language. Computational linguistics, 18(4):467– 479. 7. Rie Kubota Ando and Tong Zhang. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853. 73
  • 74. 8. John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 conference on empirical methods in natural language processing, pages 120–128. Association for Computational Linguistics. 9. Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005. 10. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532– 1543. 11. Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10, pages 384–394. 12. Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in neural information processing systems, pages 3294–3302. 13. Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In International Conference on Learning Representations. 74
  • 75. 14. Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196. 15. Yacine Jernite, Samuel R. Bowman, and David Sontag. 2017. Discourse-based objectives for fast unsupervised sentence representation learning. CoRR, abs/1705.00557. 16. http://jalammar.github.io/illustrated-bert/ 17. Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079–3087. 18. Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In ACL. Association for Computational Linguistics. 19. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018a. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355. 20. Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics. 75
  • 76. 21. Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In NIPS. 22. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR09. 23. Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. 24. Wilson L Taylor. 1953. Cloze procedure: A new tool for measuring readability. Journalism Bulletin, 30(4):415–433. 25. Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27. 26. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392. 27. Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In ACL. 76
  • 77. 28. Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP). 29. Erik F Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In CoNLL. 30. http://jalammar.github.io/illustrated-transformer/ 77

Editor's notes

  1. Transformers: a novel architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease.
  2. Models that outperformed BERT are mentioned at the end.
  3. ELMo [3] extracts context features by using a bidirectional LSTM language model. Each token is represented by concatenating its left-to-right and right-to-left representations.
  4. The ELMo representation groups together the hidden states (and the initial embedding): concatenation followed by a weighted summation.
  5. Transformers: a novel architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease. Both OpenAI GPT and BERT depend on it.
  6. Looking at the model as a single black box: in a machine translation application, it takes a sentence in one language and outputs its translation in another.
  7. Popping open that Optimus Prime goodness, we see an encoding component, a decoding component, and connections between them.
  8. The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers. The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.
  9. The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence
  10. an encoder receives a list of vectors as input. It processes this list by passing these vectors into a ‘self-attention’ layer, then into a feed-forward neural network, then sends out the output upwards to the next encoder.
  11. Say the following sentence is an input sentence we want to translate: ”The animal didn't cross the street because it was too tired.” What does “it” in this sentence refer to? Is it referring to the street or to the animal? It’s a simple question for a human, but not as simple for an algorithm. When the model is processing the word “it”, self-attention allows it to associate “it” with “animal”. As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word. Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.
  12. Related references: [4] OpenAI GPT; [17] Semi-supervised sequence learning; [18] Universal language model fine-tuning for text classification.
  13. Previous work: computer vision.
  14. input/output, transformers
  15. Pre-training: the model is first trained on large amounts of unlabeled data. Fine-tuning: the model is initialized with the parameters learned in the previous step; afterwards, all the parameters are fine-tuned using the labeled data of the chosen downstream task.
  16. Two sizes (BERTBASE and BERTLARGE).
  17. Mention RoBERTa.
  18. The Next Sentence Prediction (NSP) task in the paper is related to [13] and [15]; the only difference is that [13] and [15] transfer only sentence embeddings to downstream tasks, whereas BERT transfers all of its parameters to the various NLP downstream tasks.
  19. In the next slides BERT fine-tuning results on 11 NLP tasks will be discussed
  20. Fine-tuning: we go through the same steps as earlier; the final hidden vector C, which corresponds to the first input token [CLS], is passed to a feed-forward layer used for classification (a minimal sketch of such a head is given after these notes).
  21. We can see from Table 1 that both BERTBASE and BERTLARGE outperform all previous state-of-the-art systems, with average accuracy improvements of 4.5% and 7.0% respectively. Another finding is that BERTLARGE outperforms BERTBASE across all tasks.
  22. The probability P_i of word i being the start of the answer span is computed as a softmax of the dot product between a start vector S and the token representation T_i, taken over all words in the text sequence: P_i = e^{S \cdot T_i} / \sum_j e^{S \cdot T_j}. The same formula, with an end vector E, is used to predict the word at the end of the span (a small numeric sketch follows after these notes).
  23. BERTLARGE achieved SOTA performance on SQuAD. The model with the highest score was an ensemble of BERTLARGE models, with the training data augmented with TriviaQA [27] before fine-tuning on SQuAD. “The best performing system outperforms the top leaderboard system by +1.5 F1 in ensembling and +1.3 F1 as a single system.”
  24. From Table 3 we can see that BERT achieves a +5.1 F1 improvement over the previous best systems.
  25. “BERTLARGE outperforms the authors’ baseline ESIM+ELMo system by +27.1% and OpenAI GPT by 8.3%.”
  26. The paper performs numerous useful ablation studies over different aspects of the BERT architecture in order to understand why they are important.
  27. But BERT was trained with a larger training dataset, the paper’s input representation, and the paper’s fine-tuning scheme.
  28. In Table 5 we can see that removing NSP reduces performance, especially on QNLI, MNLI, and SQuAD 1.1.
  29. #L = the number of layers; #H = hidden size; #A = number of attention heads. “LM (ppl)” is the masked LM perplexity of held-out training data. Using BERTLARGE improved performance over BERTBASE on the selected GLUE tasks, even though BERTBASE already has a large number of parameters (110M) compared to the largest Transformer tested in the literature (100M).
  30. Computationally efficient: using the already pre-trained contextual embeddings with relatively small models on top is less expensive than fine-tuning the whole model.
  31. BERTLARGE performs competitively with state-of-the-art methods; the best performing approach concatenates the token representations from the top four hidden layers of the pre-trained Transformer, which is only 0.3 F1 behind fine-tuning the entire model. This demonstrates that BERT is effective for both the fine-tuning and feature-based approaches.
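
  As a follow-up to note 20, here is a hedged sketch of the classification head used during fine-tuning: the final hidden vector of the [CLS] token goes through dropout and a single linear layer, and a softmax gives the class probabilities. It assumes PyTorch and is not the paper's code; the hidden size, label count, and dummy encoder output are illustrative.

  import torch
  import torch.nn as nn

  class ClsClassificationHead(nn.Module):
      """Classification head on top of the [CLS] hidden vector (sketch, not the paper's code)."""
      def __init__(self, hidden_size: int = 768, num_labels: int = 2, dropout: float = 0.1):
          super().__init__()
          self.dropout = nn.Dropout(dropout)
          self.classifier = nn.Linear(hidden_size, num_labels)  # the only task-specific parameters

      def forward(self, sequence_output: torch.Tensor) -> torch.Tensor:
          # sequence_output: (batch, seq_len, hidden_size) from the BERT encoder.
          cls_vector = sequence_output[:, 0, :]             # hidden state of the first token, [CLS]
          logits = self.classifier(self.dropout(cls_vector))
          return torch.softmax(logits, dim=-1)              # class probabilities

  # Random tensor standing in for encoder output, just to show the shapes.
  head = ClsClassificationHead(num_labels=3)
  dummy_encoder_output = torch.randn(2, 16, 768)
  print(head(dummy_encoder_output).shape)                   # torch.Size([2, 3])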
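
  And for note 22, a small numeric sketch of the SQuAD span scoring, again assuming PyTorch; the token representations and the start/end vectors are random stand-ins (the real values would come from the fine-tuned model).

  import torch

  hidden_size, seq_len = 768, 32
  T = torch.randn(seq_len, hidden_size)   # token representations T_i (random stand-ins)
  S = torch.randn(hidden_size)            # learned start vector
  E = torch.randn(hidden_size)            # learned end vector

  # P_i^start = exp(S.T_i) / sum_j exp(S.T_j); same form with E for the end position.
  start_probs = torch.softmax(T @ S, dim=0)
  end_probs = torch.softmax(T @ E, dim=0)

  # The predicted span (i, j) with j >= i is the one maximizing S.T_i + E.T_j.
  scores = (T @ S).unsqueeze(1) + (T @ E).unsqueeze(0)            # (seq_len, seq_len)
  valid = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool))
  scores = scores.masked_fill(~valid, float("-inf"))              # keep only end >= start
  start_idx, end_idx = divmod(torch.argmax(scores).item(), seq_len)
  print(start_idx, end_idx, start_probs.sum().item())             # end_idx >= start_idx, probs sum to 1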