SlideShare une entreprise Scribd logo
1  sur  37
Télécharger pour lire hors ligne
Sebastian Ruder

Research Scientist, AYLIEN
PhD Candidate, Insight Centre, Dublin
@seb_ruder |31.05.17 | NLP Copenhagen
Transfer Learning for NLP
Agenda
1. What is Transfer Learning?

2. Why Transfer Learning now?

3. Transfer Learning in practice

4. Transfer Learning for NLP

5. Current research
@seb_ruder |31.05.17 | NLP Copenhagen
What is Transfer Learning?
@seb_ruder |
Model A Model B
Task / domain A
Task / domain B
Traditional ML
31.05.17 | NLP Copenhagen
What is Transfer Learning?
@seb_ruder |
Model A Model B
Task / domain A
Task / domain B
Traditional ML
Training and
evaluation on the same
task or domain.
31.05.17 | NLP Copenhagen
What is Transfer Learning?
@seb_ruder |
Knowledge
Model
Source task /
domain Target task /
domain
Transfer learning
Model
31.05.17 | NLP Copenhagen
What is Transfer Learning?
@seb_ruder |
Knowledge
Model
Source task /
domain Target task /
domain
Transfer learning
Storing knowledge gained solving
one problem and applying it to a
different but related problem.
Model
31.05.17 | NLP Copenhagen
@seb_ruder |
“Transfer learning
will be the next
driver of ML
success.”
Andrew Ng,
NIPS 2016 tutorial
Why Transfer Learning now?
@seb_ruder |
Supervised learning
Transfer learning
Unsupervised learning
Reinforcement learning
2016Time
Commercial
success
Drivers of ML success in industry
- Andrew Ng, NIPS 2016 tutorial
31.05.17 | NLP Copenhagen
Why Transfer Learning now?
@seb_ruder |
1. Learn very accurate input-output mapping
2. Maturity of ML models
- Computer vision (5% error on ImageNet)
-Automatic speech recognition (3x faster than
typing, 20% more accurate1)
3. Large-scale deployment & adoption of ML models
-Google’s NMT System2
1 Ruan, S., Wobbrock, J. O., Liou, K., Ng, A., & Landay, J. (2016). Speech Is 3x Faster than Typing for English
and Mandarin Text Entry on Mobile Devices. arXiv preprint arXiv:1608.07323.
2 Wu, Y., Schuster, M., Chen, Z., Le, Q. V, Norouzi, M., Macherey, W., … Dean, J. (2016). Google’s Neural
Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint
arXiv:1609.08144.
Huge reliance on labeled data 

Novel tasks / domains without (labeled) data
31.05.17 | NLP Copenhagen
Transfer Learning in practice
@seb_ruder |
• Train new model on features
of large model trained on
ImageNet3
• Train model to confuse source
and target domains4
• Train model on domain-
invariant representations5,6
3 Razavian, A. S., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: An astounding baseline for
recognition. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 512–519.
4 Ganin, Y., & Lempitsky, V. (2015). Unsupervised Domain Adaptation by Backpropagation. Proceedings of the 32nd
International Conference on Machine Learning., 37.
5 Bousmalis, K., Trigeorgis, G., Silberman, N., Krishnan, D., & Erhan, D. (2016). Domain Separation Networks. NIPS 2016.
6 Sener, O., Song, H. O., Saxena, A., & Savarese, S. (2016). Learning Transferrable Representations for Unsupervised Domain
Adaptation. NIPS 2016.
Computer vision
31.05.17 | NLP Copenhagen
Transfer Learning for NLP
@seb_ruder |
• Task and domainT D
DS 6= DT TS 6= TT
A more technical definition
• Domain where
- : feature space, e.g. BOW representations
- : e.g. distribution over terms in documents
D = {X, P(X)}
X
P(X)
• Task where
- : label space, e.g. true/false labels
- : learned mapping from samples to labels
T = {Y, P(Y |X)}
Y
P(Y |X)
• Transfer learning:

Learning when or
31.05.17 | NLP Copenhagen
Transfer Learning for NLP
@seb_ruder |
Transfer scenarios
1. : Different topics, text types, etc.

2. : Different languages.

3. : Unbalanced classes.

4. : Different tasks.
P(XS) 6= P(XT )
XS 6= XT
P(YS|XS) 6= P(YT |XT )
YS 6= YT
31.05.17 | NLP Copenhagen
Transfer Learning for NLP
@seb_ruder |
Transfer scenarios
1. : Different topics, text types, etc.

2. : Different languages.

3. : Unbalanced classes.

4. : Different tasks.
P(XS) 6= P(XT )
XS 6= XT
P(YS|XS) 6= P(YT |XT )
YS 6= YT
31.05.17 | NLP Copenhagen
Transfer Learning for NLP
@seb_ruder |
Transfer scenarios
1. : Different topics, text types, etc.

2. : Different languages.

3. : Unbalanced classes.

4. : Different tasks.
P(XS) 6= P(XT )
XS 6= XT
P(YS|XS) 6= P(YT |XT )
YS 6= YT
31.05.17 | NLP Copenhagen
Domain adaptation
Transfer Learning for NLP
@seb_ruder |
Training and test distributions are different.
Different text types. Different accents/ages.
Different topics/categories.
Performance drop or even collapse is inevitable.
31.05.17 | NLP Copenhagen
Transfer Learning for NLP
@seb_ruder |
Current status
• Not as straightforward as in CV
- No universal deep features
• However: “Simple” transfer through word
embeddings is pervasive
• History of research for task-specific transfer, e.g.
sentiment analysis, POS tagging leveraging NLP
phenomena such as structured features, sentiment
words, etc.
• Few research on transfer between tasks
• More recently: research on representations
31.05.17 | NLP Copenhagen
Current research
@seb_ruder |
Transfer learning challenges in real-world scenarios
1. One-to-one adaptation is rare and many source
domains are generally available.
2. Models need to be adapted frequently as
conditions change, new data becomes available, etc.

3. Target domains may be unknown or no target
domain data may be available.
31.05.17 | NLP Copenhagen
Current research
@seb_ruder |
Two recent examples
1. Sebastian Ruder, Barbara Plank. (2017).

Learning to select data for transfer learning with
Bayesian Optimization.

2. Sebastian Ruder, Joachim Bingel, Isabelle
Augenstein, Anders Søgaard. (2017).

Sluice networks: Learning what to share between
loosely related tasks.

arXiv preprint arXiv:1705.08142
31.05.17 | NLP Copenhagen
Current research
@seb_ruder |
Why data selection for transfer learning?
• Knowing what data is relevant directly impacts
downstream

performance.
• E.g. sentiment

analysis: stark

performance

differences

across source

domains.
Learning to select data for transfer learning I
31.05.17 | NLP Copenhagen
Current research
@seb_ruder |
• Existing approaches
i. study domain similarity metrics in isolation;
ii. use only similarity;
iii. focus on a single task.

• Intuition: Different tasks and domains demand a
different notion of similarity.

• Idea: Learn the similarity metric.
Learning to select data for transfer learning II
31.05.17 | NLP Copenhagen
Current research
@seb_ruder |
• Treat similarity metric as black-box function; 

use Bayesian Optimisation (BayesOpt).
• Learn linear data selection metric :
where are data selection
features for each training example, is # of features,
and are weights learned by BayesOpt.

• Use to rank all training examples; select top 

for training.
Learning to select data for transfer learning III
31.05.17 | NLP Copenhagen
S
S = (X) · w|
(X) 2 Rn⇥l
l
w 2 Rl
S n
Current research
@seb_ruder |
Features
• Similarity: Jensen-Shannon divergence, Rényi
divergence, Bhattacharyya distance, cosine
similarity, Euclidean distance, variational distance.

• Representations: Term distributions, topic
distributions, word embeddings.

• Diversity: # of word types, type-token ratio, entropy,
Simpson's index, Rényi entropy, quadratic entropy
Learning to select data for transfer learning IV
31.05.17 | NLP Copenhagen
Current research
@seb_ruder |
• Tasks: sentiment analysis, part-of-speech (POS)
tagging, dependency parsing

• Datasets and models:
i. Sentiment analysis: Amazon reviews dataset (Blitzer
et al., 2006); linear SVM.
ii. POS tagging: SANCL 2012 shared task; Structured
Perceptron, SotA BiLSTM tagger (Plank et al., 2016).
iii. Dependency parsing: SotA BiLSTM parser
(Kiperwasser and Goldberg, 2016).
Learning to select data for transfer learning V
31.05.17 | NLP Copenhagen
Current research
@seb_ruder |
• Learned measures outperform baselines across all
tasks and domains.
• Competitive with SotA domain adaptation.
• Diversity complements similarity.
• Measures transfer across models, domains, tasks.
Learning to select data for transfer learning VI
31.05.17 | NLP Copenhagen
Feature set Book DVD Electronics Kitchen
Random 71.17 70.51 76.75 77.94
JS divergence (examples) 72.51 68.21 76.41 77.47
JS divergence (domain) 75.26 73.74 72.60 80.01
Similarity + diversity (term dists) 76.20 77.60 82.67 84.98
Similarity + diversity (topic dists) 77.16 79.00 81.92 84.92
Current research
@seb_ruder |
What if the target domain is unknown?
• Have a model that works well across most domains
(Ben-David et al., 2007).
• Many ways to do this:
- Data augmentation (mainly in computer vision)
-Regularisation
i. Norms, e.g. (lasso), , group lasso, etc.
ii. Dropout; noise;
iii. Multi-task learning (MTL)
Learning what to share for multi-task learning I
31.05.17 | NLP Copenhagen
`1 `2
Current research
@seb_ruder |
What if the target domain is unknown?
• Have a model that works well across most domains
(Ben-David et al., 2007).
• Many ways to do this:
- Data augmentation (mainly in computer vision)
-Regularisation
i. Norms, e.g. (lasso), , group lasso, etc.
ii. Dropout; noise;
iii. Multi-task learning (MTL)
Learning what to share for multi-task learning I
31.05.17 | NLP Copenhagen
`1 `2
Current research
@seb_ruder |
What if the target domain is unknown?
• Have a model that works well across most domains
(Ben-David et al., 2007).
• Many ways to do this:
- Data augmentation (mainly in computer vision)
-Regularisation
i. Norms, e.g. (lasso), , group lasso, etc.
ii. Dropout; noise;
iii. Multi-task learning (MTL)
Learning what to share for multi-task learning I
31.05.17 | NLP Copenhagen
`1 `2
Helps transfer knowledge across tasks!
Current research
@seb_ruder |
• Existing MTL approaches require amount of sharing
to be manually specified; hard to estimate if sharing
leads to improvements. 

• Propose Sluice Networks, a general MTL framework.
• 3 parts:
i. parameters that control sharing across tasks;
ii. parameters that control sharing across layers;
iii. orthogonality constraint to enforce subspaces.
Learning what to share for multi-task learning II
31.05.17 | NLP Copenhagen
↵
Current research
@seb_ruder |
Learning what to share for multi-task learning III
31.05.17 | NLP Copenhagen
Current research
@seb_ruder |
Experiments
• Data: OntoNotes 5.0; 7 domains.
• Tasks: chunking, NER, and semantic role labeling
(main task) and POS tagging (auxiliary task).
• Baselines: single-task model, hard parameter sharing
(Caruana, 1998), low supervision (Søgaard and
Goldberg, 2016), cross-stitch networks (Misra et al.,
2016).
• Results: Sluice networks outperform baselines on
both in-domain and out-of-domain data.
Learning what to share for multi-task learning IV
31.05.17 | NLP Copenhagen
Current research
@seb_ruder |
Analysis
• Task properties and performance
- Gains are higher with less training data;
- Sluice networks learn to share more with more
variance in data.

• Ablation analysis
- Learnable and values are better.
- Subspaces help for all domains.
- Concatenation of layer outputs also works.
Learning what to share for multi-task learning V
31.05.17 | NLP Copenhagen
↵
Current research
@seb_ruder |
Learning what to share for multi-task learning VI
31.05.17 | NLP Copenhagen
Current research
@seb_ruder |
Learning what to share for multi-task learning VI
31.05.17 | NLP Copenhagen
Conclusion
@seb_ruder |
• Many new models, tasks, and domains.
Lots of cool work to do in Transfer Learning for NLP.
• Lots of fundamental unanswered research questions:
-What is a domain? When are two domains similar?
-When are two tasks related? …
• When should we use Transfer Learning / MTL?
When does it work best?
31.05.17 | NLP Copenhagen
!
References
@seb_ruder |
Image credit
• Google Research blog post11
• Mikolov, T., Joulin, A., & Baroni, M. (2015). A Roadmap towards
Machine Intelligence. arXiv preprint arXiv:1511.08130.
• Google Research blog post12
My papers
• Sebastian Ruder, Barbara Plank. (2017). Learning to select data for
transfer learning with Bayesian Optimization.
• Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, Anders
Søgaard. (2017). Sluice networks: Learning what to share between
loosely related tasks. arXiv preprint arXiv:1705.08142
11 https://research.googleblog.com/2016/10/how-robots-can-acquire-new-skills-from.html
12 https://googleblog.blogspot.ie/2014/04/the-latest-chapter-for-self-driving-car.html
31.05.17 | NLP Copenhagen
@seb_ruder |
Thanks for your attention!
Questions?
31.05.17 | NLP Copenhagen

Contenu connexe

Tendances

Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...Preferred Networks
 
An introduction to Machine Learning
An introduction to Machine LearningAn introduction to Machine Learning
An introduction to Machine Learningbutest
 
An introduction to Deep Learning
An introduction to Deep LearningAn introduction to Deep Learning
An introduction to Deep LearningJulien SIMON
 
Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Yuta Niki
 
Deep learning - A Visual Introduction
Deep learning - A Visual IntroductionDeep learning - A Visual Introduction
Deep learning - A Visual IntroductionLukas Masuch
 
Deep Learning - CNN and RNN
Deep Learning - CNN and RNNDeep Learning - CNN and RNN
Deep Learning - CNN and RNNAshray Bhandare
 
NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers Arvind Devaraj
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature EngineeringHJ van Veen
 
Introduction to Deep learning
Introduction to Deep learningIntroduction to Deep learning
Introduction to Deep learningleopauly
 
Introduction to Knowledge Graphs: Data Summit 2020
Introduction to Knowledge Graphs: Data Summit 2020Introduction to Knowledge Graphs: Data Summit 2020
Introduction to Knowledge Graphs: Data Summit 2020Enterprise Knowledge
 
Machine Learning project presentation
Machine Learning project presentationMachine Learning project presentation
Machine Learning project presentationRamandeep Kaur Bagri
 
Introduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga PetrovaIntroduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga PetrovaAlexey Grigorev
 
Explainable AI
Explainable AIExplainable AI
Explainable AIDinesh V
 
Data science presentation 2nd CI day
Data science presentation 2nd CI dayData science presentation 2nd CI day
Data science presentation 2nd CI dayMohammed Barakat
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningRahul Jain
 
Text Classification/Categorization
Text Classification/CategorizationText Classification/Categorization
Text Classification/CategorizationOswal Abhishek
 

Tendances (20)

Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
 
An introduction to Machine Learning
An introduction to Machine LearningAn introduction to Machine Learning
An introduction to Machine Learning
 
An introduction to Deep Learning
An introduction to Deep LearningAn introduction to Deep Learning
An introduction to Deep Learning
 
Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)
 
Deep learning - A Visual Introduction
Deep learning - A Visual IntroductionDeep learning - A Visual Introduction
Deep learning - A Visual Introduction
 
Deep Learning - CNN and RNN
Deep Learning - CNN and RNNDeep Learning - CNN and RNN
Deep Learning - CNN and RNN
 
Deep learning
Deep learning Deep learning
Deep learning
 
NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
Introduction to Deep learning
Introduction to Deep learningIntroduction to Deep learning
Introduction to Deep learning
 
Introduction to Knowledge Graphs: Data Summit 2020
Introduction to Knowledge Graphs: Data Summit 2020Introduction to Knowledge Graphs: Data Summit 2020
Introduction to Knowledge Graphs: Data Summit 2020
 
Machine Learning project presentation
Machine Learning project presentationMachine Learning project presentation
Machine Learning project presentation
 
Introduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga PetrovaIntroduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga Petrova
 
Explainable AI
Explainable AIExplainable AI
Explainable AI
 
Deep learning ppt
Deep learning pptDeep learning ppt
Deep learning ppt
 
Data science presentation 2nd CI day
Data science presentation 2nd CI dayData science presentation 2nd CI day
Data science presentation 2nd CI day
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Deep learning
Deep learningDeep learning
Deep learning
 
Federated Learning
Federated LearningFederated Learning
Federated Learning
 
Text Classification/Categorization
Text Classification/CategorizationText Classification/Categorization
Text Classification/Categorization
 

Similaire à Transfer Learning for Natural Language Processing

ODSC East: Effective Transfer Learning for NLP
ODSC East: Effective Transfer Learning for NLPODSC East: Effective Transfer Learning for NLP
ODSC East: Effective Transfer Learning for NLPindico data
 
TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxKalpit Desai
 
Orchestration Graphs: Enabling Rich Learning Scenarios at Scale
Orchestration Graphs: Enabling Rich Learning Scenarios at ScaleOrchestration Graphs: Enabling Rich Learning Scenarios at Scale
Orchestration Graphs: Enabling Rich Learning Scenarios at ScaleStian Håklev
 
Could a Data Science Program use Data Science Insights?
Could a Data Science Program use Data Science Insights?Could a Data Science Program use Data Science Insights?
Could a Data Science Program use Data Science Insights?Zachary Thomas
 
StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery ivaderivader
 
Topic modeling of marketing scientific papers: An experimental survey
Topic modeling of marketing scientific papers: An experimental surveyTopic modeling of marketing scientific papers: An experimental survey
Topic modeling of marketing scientific papers: An experimental surveyICDEcCnferenece
 
Action Sequence Mining and Behavior Pattern Analysis for User Modeling
Action Sequence Mining and Behavior Pattern Analysis for User ModelingAction Sequence Mining and Behavior Pattern Analysis for User Modeling
Action Sequence Mining and Behavior Pattern Analysis for User ModelingPeter Brusilovsky
 
Survey of natural language processing(midp2)
Survey of natural language processing(midp2)Survey of natural language processing(midp2)
Survey of natural language processing(midp2)Tariqul islam
 
Successes and Frontiers of Deep Learning
Successes and Frontiers of Deep LearningSuccesses and Frontiers of Deep Learning
Successes and Frontiers of Deep LearningSebastian Ruder
 
2015-SemEval2015_poster
2015-SemEval2015_poster2015-SemEval2015_poster
2015-SemEval2015_posterhpcosta
 
LAK21 Data Driven Redesign of Tutoring Systems (Yun Huang)
LAK21 Data Driven Redesign of Tutoring Systems (Yun Huang)LAK21 Data Driven Redesign of Tutoring Systems (Yun Huang)
LAK21 Data Driven Redesign of Tutoring Systems (Yun Huang)Yun Huang
 
Bridging the gap between AI and UI - DSI Vienna - full version
Bridging the gap between AI and UI - DSI Vienna - full versionBridging the gap between AI and UI - DSI Vienna - full version
Bridging the gap between AI and UI - DSI Vienna - full versionLiad Magen
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesMatthew Lease
 
Distilling Linguistic Context for Language Model Compression
Distilling Linguistic Context for Language Model CompressionDistilling Linguistic Context for Language Model Compression
Distilling Linguistic Context for Language Model CompressionGyeongman Kim
 
Distilling Linguistic Context for Language Model Compression
Distilling Linguistic Context for Language Model CompressionDistilling Linguistic Context for Language Model Compression
Distilling Linguistic Context for Language Model CompressionGeonDoPark1
 
Recent developments in CS education research Jul 18
Recent developments in CS education research Jul 18Recent developments in CS education research Jul 18
Recent developments in CS education research Jul 18Sue Sentance
 
Tomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPTomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPMachine Learning Prague
 

Similaire à Transfer Learning for Natural Language Processing (20)

ODSC East: Effective Transfer Learning for NLP
ODSC East: Effective Transfer Learning for NLPODSC East: Effective Transfer Learning for NLP
ODSC East: Effective Transfer Learning for NLP
 
TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptx
 
Orchestration Graphs: Enabling Rich Learning Scenarios at Scale
Orchestration Graphs: Enabling Rich Learning Scenarios at ScaleOrchestration Graphs: Enabling Rich Learning Scenarios at Scale
Orchestration Graphs: Enabling Rich Learning Scenarios at Scale
 
Could a Data Science Program use Data Science Insights?
Could a Data Science Program use Data Science Insights?Could a Data Science Program use Data Science Insights?
Could a Data Science Program use Data Science Insights?
 
StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
 
Learning how to learn
Learning how to learnLearning how to learn
Learning how to learn
 
Topic modeling of marketing scientific papers: An experimental survey
Topic modeling of marketing scientific papers: An experimental surveyTopic modeling of marketing scientific papers: An experimental survey
Topic modeling of marketing scientific papers: An experimental survey
 
Action Sequence Mining and Behavior Pattern Analysis for User Modeling
Action Sequence Mining and Behavior Pattern Analysis for User ModelingAction Sequence Mining and Behavior Pattern Analysis for User Modeling
Action Sequence Mining and Behavior Pattern Analysis for User Modeling
 
Survey of natural language processing(midp2)
Survey of natural language processing(midp2)Survey of natural language processing(midp2)
Survey of natural language processing(midp2)
 
Successes and Frontiers of Deep Learning
Successes and Frontiers of Deep LearningSuccesses and Frontiers of Deep Learning
Successes and Frontiers of Deep Learning
 
Transfer Learning (D2L4 Insight@DCU Machine Learning Workshop 2017)
Transfer Learning (D2L4 Insight@DCU Machine Learning Workshop 2017)Transfer Learning (D2L4 Insight@DCU Machine Learning Workshop 2017)
Transfer Learning (D2L4 Insight@DCU Machine Learning Workshop 2017)
 
2015-SemEval2015_poster
2015-SemEval2015_poster2015-SemEval2015_poster
2015-SemEval2015_poster
 
LAK21 Data Driven Redesign of Tutoring Systems (Yun Huang)
LAK21 Data Driven Redesign of Tutoring Systems (Yun Huang)LAK21 Data Driven Redesign of Tutoring Systems (Yun Huang)
LAK21 Data Driven Redesign of Tutoring Systems (Yun Huang)
 
Bridging the gap between AI and UI - DSI Vienna - full version
Bridging the gap between AI and UI - DSI Vienna - full versionBridging the gap between AI and UI - DSI Vienna - full version
Bridging the gap between AI and UI - DSI Vienna - full version
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
 
Distilling Linguistic Context for Language Model Compression
Distilling Linguistic Context for Language Model CompressionDistilling Linguistic Context for Language Model Compression
Distilling Linguistic Context for Language Model Compression
 
Distilling Linguistic Context for Language Model Compression
Distilling Linguistic Context for Language Model CompressionDistilling Linguistic Context for Language Model Compression
Distilling Linguistic Context for Language Model Compression
 
Recent developments in CS education research Jul 18
Recent developments in CS education research Jul 18Recent developments in CS education research Jul 18
Recent developments in CS education research Jul 18
 
Tomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPTomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLP
 
Using Knowledge Graph for Promoting Cognitive Computing
Using Knowledge Graph for Promoting Cognitive ComputingUsing Knowledge Graph for Promoting Cognitive Computing
Using Knowledge Graph for Promoting Cognitive Computing
 

Plus de Sebastian Ruder

Frontiers of Natural Language Processing
Frontiers of Natural Language ProcessingFrontiers of Natural Language Processing
Frontiers of Natural Language ProcessingSebastian Ruder
 
Strong Baselines for Neural Semi-supervised Learning under Domain Shift
Strong Baselines for Neural Semi-supervised Learning under Domain ShiftStrong Baselines for Neural Semi-supervised Learning under Domain Shift
Strong Baselines for Neural Semi-supervised Learning under Domain ShiftSebastian Ruder
 
On the Limitations of Unsupervised Bilingual Dictionary Induction
On the Limitations of Unsupervised Bilingual Dictionary InductionOn the Limitations of Unsupervised Bilingual Dictionary Induction
On the Limitations of Unsupervised Bilingual Dictionary InductionSebastian Ruder
 
Neural Semi-supervised Learning under Domain Shift
Neural Semi-supervised Learning under Domain ShiftNeural Semi-supervised Learning under Domain Shift
Neural Semi-supervised Learning under Domain ShiftSebastian Ruder
 
Optimization for Deep Learning
Optimization for Deep LearningOptimization for Deep Learning
Optimization for Deep LearningSebastian Ruder
 
Human Evaluation: Why do we need it? - Dr. Sheila Castilho
Human Evaluation: Why do we need it? - Dr. Sheila CastilhoHuman Evaluation: Why do we need it? - Dr. Sheila Castilho
Human Evaluation: Why do we need it? - Dr. Sheila CastilhoSebastian Ruder
 
Machine intelligence in HR technology: resume analysis at scale - Adrian Mihai
Machine intelligence in HR technology: resume analysis at scale - Adrian MihaiMachine intelligence in HR technology: resume analysis at scale - Adrian Mihai
Machine intelligence in HR technology: resume analysis at scale - Adrian MihaiSebastian Ruder
 
Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim
Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana IfrimHashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim
Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana IfrimSebastian Ruder
 
Making sense of word senses: An introduction to word-sense disambiguation and...
Making sense of word senses: An introduction to word-sense disambiguation and...Making sense of word senses: An introduction to word-sense disambiguation and...
Making sense of word senses: An introduction to word-sense disambiguation and...Sebastian Ruder
 
Spoken Dialogue Systems and Social Talk - Emer Gilmartin
Spoken Dialogue Systems and Social Talk - Emer GilmartinSpoken Dialogue Systems and Social Talk - Emer Gilmartin
Spoken Dialogue Systems and Social Talk - Emer GilmartinSebastian Ruder
 
NIPS 2016 Highlights - Sebastian Ruder
NIPS 2016 Highlights - Sebastian RuderNIPS 2016 Highlights - Sebastian Ruder
NIPS 2016 Highlights - Sebastian RuderSebastian Ruder
 
Modeling documents with Generative Adversarial Networks - John Glover
Modeling documents with Generative Adversarial Networks - John GloverModeling documents with Generative Adversarial Networks - John Glover
Modeling documents with Generative Adversarial Networks - John GloverSebastian Ruder
 
Multi-modal Neural Machine Translation - Iacer Calixto
Multi-modal Neural Machine Translation - Iacer CalixtoMulti-modal Neural Machine Translation - Iacer Calixto
Multi-modal Neural Machine Translation - Iacer CalixtoSebastian Ruder
 
Funded PhD/MSc. Opportunities at AYLIEN
Funded PhD/MSc. Opportunities at AYLIENFunded PhD/MSc. Opportunities at AYLIEN
Funded PhD/MSc. Opportunities at AYLIENSebastian Ruder
 
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...Sebastian Ruder
 
Transformation Functions for Text Classification: A case study with StackOver...
Transformation Functions for Text Classification: A case study with StackOver...Transformation Functions for Text Classification: A case study with StackOver...
Transformation Functions for Text Classification: A case study with StackOver...Sebastian Ruder
 
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)Sebastian Ruder
 
Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...
Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...
Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...Sebastian Ruder
 
A Hierarchical Model of Reviews for Aspect-based Sentiment Analysis
A Hierarchical Model of Reviews for Aspect-based Sentiment AnalysisA Hierarchical Model of Reviews for Aspect-based Sentiment Analysis
A Hierarchical Model of Reviews for Aspect-based Sentiment AnalysisSebastian Ruder
 
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...Sebastian Ruder
 

Plus de Sebastian Ruder (20)

Frontiers of Natural Language Processing
Frontiers of Natural Language ProcessingFrontiers of Natural Language Processing
Frontiers of Natural Language Processing
 
Strong Baselines for Neural Semi-supervised Learning under Domain Shift
Strong Baselines for Neural Semi-supervised Learning under Domain ShiftStrong Baselines for Neural Semi-supervised Learning under Domain Shift
Strong Baselines for Neural Semi-supervised Learning under Domain Shift
 
On the Limitations of Unsupervised Bilingual Dictionary Induction
On the Limitations of Unsupervised Bilingual Dictionary InductionOn the Limitations of Unsupervised Bilingual Dictionary Induction
On the Limitations of Unsupervised Bilingual Dictionary Induction
 
Neural Semi-supervised Learning under Domain Shift
Neural Semi-supervised Learning under Domain ShiftNeural Semi-supervised Learning under Domain Shift
Neural Semi-supervised Learning under Domain Shift
 
Optimization for Deep Learning
Optimization for Deep LearningOptimization for Deep Learning
Optimization for Deep Learning
 
Human Evaluation: Why do we need it? - Dr. Sheila Castilho
Human Evaluation: Why do we need it? - Dr. Sheila CastilhoHuman Evaluation: Why do we need it? - Dr. Sheila Castilho
Human Evaluation: Why do we need it? - Dr. Sheila Castilho
 
Machine intelligence in HR technology: resume analysis at scale - Adrian Mihai
Machine intelligence in HR technology: resume analysis at scale - Adrian MihaiMachine intelligence in HR technology: resume analysis at scale - Adrian Mihai
Machine intelligence in HR technology: resume analysis at scale - Adrian Mihai
 
Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim
Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana IfrimHashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim
Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim
 
Making sense of word senses: An introduction to word-sense disambiguation and...
Making sense of word senses: An introduction to word-sense disambiguation and...Making sense of word senses: An introduction to word-sense disambiguation and...
Making sense of word senses: An introduction to word-sense disambiguation and...
 
Spoken Dialogue Systems and Social Talk - Emer Gilmartin
Spoken Dialogue Systems and Social Talk - Emer GilmartinSpoken Dialogue Systems and Social Talk - Emer Gilmartin
Spoken Dialogue Systems and Social Talk - Emer Gilmartin
 
NIPS 2016 Highlights - Sebastian Ruder
NIPS 2016 Highlights - Sebastian RuderNIPS 2016 Highlights - Sebastian Ruder
NIPS 2016 Highlights - Sebastian Ruder
 
Modeling documents with Generative Adversarial Networks - John Glover
Modeling documents with Generative Adversarial Networks - John GloverModeling documents with Generative Adversarial Networks - John Glover
Modeling documents with Generative Adversarial Networks - John Glover
 
Multi-modal Neural Machine Translation - Iacer Calixto
Multi-modal Neural Machine Translation - Iacer CalixtoMulti-modal Neural Machine Translation - Iacer Calixto
Multi-modal Neural Machine Translation - Iacer Calixto
 
Funded PhD/MSc. Opportunities at AYLIEN
Funded PhD/MSc. Opportunities at AYLIENFunded PhD/MSc. Opportunities at AYLIEN
Funded PhD/MSc. Opportunities at AYLIEN
 
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...
 
Transformation Functions for Text Classification: A case study with StackOver...
Transformation Functions for Text Classification: A case study with StackOver...Transformation Functions for Text Classification: A case study with StackOver...
Transformation Functions for Text Classification: A case study with StackOver...
 
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
 
Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...
Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...
Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...
 
A Hierarchical Model of Reviews for Aspect-based Sentiment Analysis
A Hierarchical Model of Reviews for Aspect-based Sentiment AnalysisA Hierarchical Model of Reviews for Aspect-based Sentiment Analysis
A Hierarchical Model of Reviews for Aspect-based Sentiment Analysis
 
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...
 

Dernier

Pests of Sunflower_Binomics_Identification_Dr.UPR
Pests of Sunflower_Binomics_Identification_Dr.UPRPests of Sunflower_Binomics_Identification_Dr.UPR
Pests of Sunflower_Binomics_Identification_Dr.UPRPirithiRaju
 
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024Jene van der Heide
 
How we decide powerpoint presentation.pptx
How we decide powerpoint presentation.pptxHow we decide powerpoint presentation.pptx
How we decide powerpoint presentation.pptxJosielynTars
 
Replisome-Cohesin Interfacing A Molecular Perspective.pdf
Replisome-Cohesin Interfacing A Molecular Perspective.pdfReplisome-Cohesin Interfacing A Molecular Perspective.pdf
Replisome-Cohesin Interfacing A Molecular Perspective.pdfAtiaGohar1
 
FBI Profiling - Forensic Psychology.pptx
FBI Profiling - Forensic Psychology.pptxFBI Profiling - Forensic Psychology.pptx
FBI Profiling - Forensic Psychology.pptxPayal Shrivastava
 
complex analysis best book for solving questions.pdf
complex analysis best book for solving questions.pdfcomplex analysis best book for solving questions.pdf
complex analysis best book for solving questions.pdfSubhamKumar3239
 
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...Christina Parmionova
 
projectile motion, impulse and moment
projectile  motion, impulse  and  momentprojectile  motion, impulse  and  moment
projectile motion, impulse and momentdonamiaquintan2
 
Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...
Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...
Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...Sérgio Sacani
 
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learning
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep LearningCombining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learning
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learningvschiavoni
 
final waves properties grade 7 - third quarter
final waves properties grade 7 - third quarterfinal waves properties grade 7 - third quarter
final waves properties grade 7 - third quarterHanHyoKim
 
Immunoblott technique for protein detection.ppt
Immunoblott technique for protein detection.pptImmunoblott technique for protein detection.ppt
Immunoblott technique for protein detection.pptAmirRaziq1
 
Gas-ExchangeS-in-Plants-and-Animals.pptx
Gas-ExchangeS-in-Plants-and-Animals.pptxGas-ExchangeS-in-Plants-and-Animals.pptx
Gas-ExchangeS-in-Plants-and-Animals.pptxGiovaniTrinidad
 
well logging & petrophysical analysis.pptx
well logging & petrophysical analysis.pptxwell logging & petrophysical analysis.pptx
well logging & petrophysical analysis.pptxzaydmeerab121
 
whole genome sequencing new and its types including shortgun and clone by clone
whole genome sequencing new  and its types including shortgun and clone by clonewhole genome sequencing new  and its types including shortgun and clone by clone
whole genome sequencing new and its types including shortgun and clone by clonechaudhary charan shingh university
 
Introduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptxIntroduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptxMedical College
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptxpallavirawat456
 
6.2 Pests of Sesame_Identification_Binomics_Dr.UPR
6.2 Pests of Sesame_Identification_Binomics_Dr.UPR6.2 Pests of Sesame_Identification_Binomics_Dr.UPR
6.2 Pests of Sesame_Identification_Binomics_Dr.UPRPirithiRaju
 

Dernier (20)

Pests of Sunflower_Binomics_Identification_Dr.UPR
Pests of Sunflower_Binomics_Identification_Dr.UPRPests of Sunflower_Binomics_Identification_Dr.UPR
Pests of Sunflower_Binomics_Identification_Dr.UPR
 
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
 
How we decide powerpoint presentation.pptx
How we decide powerpoint presentation.pptxHow we decide powerpoint presentation.pptx
How we decide powerpoint presentation.pptx
 
Replisome-Cohesin Interfacing A Molecular Perspective.pdf
Replisome-Cohesin Interfacing A Molecular Perspective.pdfReplisome-Cohesin Interfacing A Molecular Perspective.pdf
Replisome-Cohesin Interfacing A Molecular Perspective.pdf
 
FBI Profiling - Forensic Psychology.pptx
FBI Profiling - Forensic Psychology.pptxFBI Profiling - Forensic Psychology.pptx
FBI Profiling - Forensic Psychology.pptx
 
Let’s Say Someone Did Drop the Bomb. Then What?
Let’s Say Someone Did Drop the Bomb. Then What?Let’s Say Someone Did Drop the Bomb. Then What?
Let’s Say Someone Did Drop the Bomb. Then What?
 
complex analysis best book for solving questions.pdf
complex analysis best book for solving questions.pdfcomplex analysis best book for solving questions.pdf
complex analysis best book for solving questions.pdf
 
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...
 
projectile motion, impulse and moment
projectile  motion, impulse  and  momentprojectile  motion, impulse  and  moment
projectile motion, impulse and moment
 
Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...
Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...
Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...
 
PLASMODIUM. PPTX
PLASMODIUM. PPTXPLASMODIUM. PPTX
PLASMODIUM. PPTX
 
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learning
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep LearningCombining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learning
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learning
 
final waves properties grade 7 - third quarter
final waves properties grade 7 - third quarterfinal waves properties grade 7 - third quarter
final waves properties grade 7 - third quarter
 
Immunoblott technique for protein detection.ppt
Immunoblott technique for protein detection.pptImmunoblott technique for protein detection.ppt
Immunoblott technique for protein detection.ppt
 
Gas-ExchangeS-in-Plants-and-Animals.pptx
Gas-ExchangeS-in-Plants-and-Animals.pptxGas-ExchangeS-in-Plants-and-Animals.pptx
Gas-ExchangeS-in-Plants-and-Animals.pptx
 
well logging & petrophysical analysis.pptx
well logging & petrophysical analysis.pptxwell logging & petrophysical analysis.pptx
well logging & petrophysical analysis.pptx
 
whole genome sequencing new and its types including shortgun and clone by clone
whole genome sequencing new  and its types including shortgun and clone by clonewhole genome sequencing new  and its types including shortgun and clone by clone
whole genome sequencing new and its types including shortgun and clone by clone
 
Introduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptxIntroduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptx
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptx
 
6.2 Pests of Sesame_Identification_Binomics_Dr.UPR
6.2 Pests of Sesame_Identification_Binomics_Dr.UPR6.2 Pests of Sesame_Identification_Binomics_Dr.UPR
6.2 Pests of Sesame_Identification_Binomics_Dr.UPR
 

Transfer Learning for Natural Language Processing

  • 1. Sebastian Ruder
 Research Scientist, AYLIEN PhD Candidate, Insight Centre, Dublin @seb_ruder |31.05.17 | NLP Copenhagen Transfer Learning for NLP
  • 2. Agenda 1. What is Transfer Learning?
 2. Why Transfer Learning now?
 3. Transfer Learning in practice
 4. Transfer Learning for NLP
 5. Current research @seb_ruder |31.05.17 | NLP Copenhagen
  • 3. What is Transfer Learning? @seb_ruder | Model A Model B Task / domain A Task / domain B Traditional ML 31.05.17 | NLP Copenhagen
  • 4. What is Transfer Learning? @seb_ruder | Model A Model B Task / domain A Task / domain B Traditional ML Training and evaluation on the same task or domain. 31.05.17 | NLP Copenhagen
  • 5. What is Transfer Learning? @seb_ruder | Knowledge Model Source task / domain Target task / domain Transfer learning Model 31.05.17 | NLP Copenhagen
  • 6. What is Transfer Learning? @seb_ruder | Knowledge Model Source task / domain Target task / domain Transfer learning Storing knowledge gained solving one problem and applying it to a different but related problem. Model 31.05.17 | NLP Copenhagen
  • 8. “Transfer learning will be the next driver of ML success.” Andrew Ng, NIPS 2016 tutorial
  • 9. Why Transfer Learning now? @seb_ruder | Supervised learning Transfer learning Unsupervised learning Reinforcement learning 2016Time Commercial success Drivers of ML success in industry - Andrew Ng, NIPS 2016 tutorial 31.05.17 | NLP Copenhagen
  • 10. Why Transfer Learning now? @seb_ruder | 1. Learn very accurate input-output mapping 2. Maturity of ML models - Computer vision (5% error on ImageNet) -Automatic speech recognition (3x faster than typing, 20% more accurate1) 3. Large-scale deployment & adoption of ML models -Google’s NMT System2 1 Ruan, S., Wobbrock, J. O., Liou, K., Ng, A., & Landay, J. (2016). Speech Is 3x Faster than Typing for English and Mandarin Text Entry on Mobile Devices. arXiv preprint arXiv:1608.07323. 2 Wu, Y., Schuster, M., Chen, Z., Le, Q. V, Norouzi, M., Macherey, W., … Dean, J. (2016). Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint arXiv:1609.08144. Huge reliance on labeled data 
 Novel tasks / domains without (labeled) data 31.05.17 | NLP Copenhagen
  • 11. Transfer Learning in practice @seb_ruder | • Train new model on features of large model trained on ImageNet3 • Train model to confuse source and target domains4 • Train model on domain- invariant representations5,6 3 Razavian, A. S., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: An astounding baseline for recognition. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 512–519. 4 Ganin, Y., & Lempitsky, V. (2015). Unsupervised Domain Adaptation by Backpropagation. Proceedings of the 32nd International Conference on Machine Learning., 37. 5 Bousmalis, K., Trigeorgis, G., Silberman, N., Krishnan, D., & Erhan, D. (2016). Domain Separation Networks. NIPS 2016. 6 Sener, O., Song, H. O., Saxena, A., & Savarese, S. (2016). Learning Transferrable Representations for Unsupervised Domain Adaptation. NIPS 2016. Computer vision 31.05.17 | NLP Copenhagen
  • 12. Transfer Learning for NLP @seb_ruder | • Task and domainT D DS 6= DT TS 6= TT A more technical definition • Domain where - : feature space, e.g. BOW representations - : e.g. distribution over terms in documents D = {X, P(X)} X P(X) • Task where - : label space, e.g. true/false labels - : learned mapping from samples to labels T = {Y, P(Y |X)} Y P(Y |X) • Transfer learning:
 Learning when or 31.05.17 | NLP Copenhagen
  • 13. Transfer Learning for NLP @seb_ruder | Transfer scenarios 1. : Different topics, text types, etc.
 2. : Different languages.
 3. : Unbalanced classes.
 4. : Different tasks. P(XS) 6= P(XT ) XS 6= XT P(YS|XS) 6= P(YT |XT ) YS 6= YT 31.05.17 | NLP Copenhagen
  • 14. Transfer Learning for NLP @seb_ruder | Transfer scenarios 1. : Different topics, text types, etc.
 2. : Different languages.
 3. : Unbalanced classes.
 4. : Different tasks. P(XS) 6= P(XT ) XS 6= XT P(YS|XS) 6= P(YT |XT ) YS 6= YT 31.05.17 | NLP Copenhagen
  • 15. Transfer Learning for NLP @seb_ruder | Transfer scenarios 1. : Different topics, text types, etc.
 2. : Different languages.
 3. : Unbalanced classes.
 4. : Different tasks. P(XS) 6= P(XT ) XS 6= XT P(YS|XS) 6= P(YT |XT ) YS 6= YT 31.05.17 | NLP Copenhagen Domain adaptation
  • 16. Transfer Learning for NLP @seb_ruder | Training and test distributions are different. Different text types. Different accents/ages. Different topics/categories. Performance drop or even collapse is inevitable. 31.05.17 | NLP Copenhagen
  • 17. Transfer Learning for NLP @seb_ruder | Current status • Not as straightforward as in CV - No universal deep features • However: “Simple” transfer through word embeddings is pervasive • History of research for task-specific transfer, e.g. sentiment analysis, POS tagging leveraging NLP phenomena such as structured features, sentiment words, etc. • Few research on transfer between tasks • More recently: research on representations 31.05.17 | NLP Copenhagen
  • 18. Current research @seb_ruder | Transfer learning challenges in real-world scenarios 1. One-to-one adaptation is rare and many source domains are generally available. 2. Models need to be adapted frequently as conditions change, new data becomes available, etc.
 3. Target domains may be unknown or no target domain data may be available. 31.05.17 | NLP Copenhagen
  • 19. Current research @seb_ruder | Two recent examples 1. Sebastian Ruder, Barbara Plank. (2017).
 Learning to select data for transfer learning with Bayesian Optimization.
 2. Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, Anders Søgaard. (2017).
 Sluice networks: Learning what to share between loosely related tasks.
 arXiv preprint arXiv:1705.08142 31.05.17 | NLP Copenhagen
  • 20. Current research @seb_ruder | Why data selection for transfer learning? • Knowing what data is relevant directly impacts downstream
 performance. • E.g. sentiment
 analysis: stark
 performance
 differences
 across source
 domains. Learning to select data for transfer learning I 31.05.17 | NLP Copenhagen
  • 21. Current research @seb_ruder | • Existing approaches i. study domain similarity metrics in isolation; ii. use only similarity; iii. focus on a single task.
 • Intuition: Different tasks and domains demand a different notion of similarity.
 • Idea: Learn the similarity metric. Learning to select data for transfer learning II 31.05.17 | NLP Copenhagen
  • 22. Current research @seb_ruder | • Treat similarity metric as black-box function; 
 use Bayesian Optimisation (BayesOpt). • Learn linear data selection metric : where are data selection features for each training example, is # of features, and are weights learned by BayesOpt.
 • Use to rank all training examples; select top 
 for training. Learning to select data for transfer learning III 31.05.17 | NLP Copenhagen S S = (X) · w| (X) 2 Rn⇥l l w 2 Rl S n
  • 23. Current research @seb_ruder | Features • Similarity: Jensen-Shannon divergence, Rényi divergence, Bhattacharyya distance, cosine similarity, Euclidean distance, variational distance.
 • Representations: Term distributions, topic distributions, word embeddings.
 • Diversity: # of word types, type-token ratio, entropy, Simpson's index, Rényi entropy, quadratic entropy Learning to select data for transfer learning IV 31.05.17 | NLP Copenhagen
  • 24. Current research @seb_ruder | • Tasks: sentiment analysis, part-of-speech (POS) tagging, dependency parsing
 • Datasets and models: i. Sentiment analysis: Amazon reviews dataset (Blitzer et al., 2006); linear SVM. ii. POS tagging: SANCL 2012 shared task; Structured Perceptron, SotA BiLSTM tagger (Plank et al., 2016). iii. Dependency parsing: SotA BiLSTM parser (Kiperwasser and Goldberg, 2016). Learning to select data for transfer learning V 31.05.17 | NLP Copenhagen
  • 25. Current research @seb_ruder | • Learned measures outperform baselines across all tasks and domains. • Competitive with SotA domain adaptation. • Diversity complements similarity. • Measures transfer across models, domains, tasks. Learning to select data for transfer learning VI 31.05.17 | NLP Copenhagen Feature set Book DVD Electronics Kitchen Random 71.17 70.51 76.75 77.94 JS divergence (examples) 72.51 68.21 76.41 77.47 JS divergence (domain) 75.26 73.74 72.60 80.01 Similarity + diversity (term dists) 76.20 77.60 82.67 84.98 Similarity + diversity (topic dists) 77.16 79.00 81.92 84.92
  • 26. Current research @seb_ruder | What if the target domain is unknown? • Have a model that works well across most domains (Ben-David et al., 2007). • Many ways to do this: - Data augmentation (mainly in computer vision) -Regularisation i. Norms, e.g. (lasso), , group lasso, etc. ii. Dropout; noise; iii. Multi-task learning (MTL) Learning what to share for multi-task learning I 31.05.17 | NLP Copenhagen `1 `2
  • 27. Current research @seb_ruder | What if the target domain is unknown? • Have a model that works well across most domains (Ben-David et al., 2007). • Many ways to do this: - Data augmentation (mainly in computer vision) -Regularisation i. Norms, e.g. (lasso), , group lasso, etc. ii. Dropout; noise; iii. Multi-task learning (MTL) Learning what to share for multi-task learning I 31.05.17 | NLP Copenhagen `1 `2
  • 28. Current research @seb_ruder | What if the target domain is unknown? • Have a model that works well across most domains (Ben-David et al., 2007). • Many ways to do this: - Data augmentation (mainly in computer vision) -Regularisation i. Norms, e.g. (lasso), , group lasso, etc. ii. Dropout; noise; iii. Multi-task learning (MTL) Learning what to share for multi-task learning I 31.05.17 | NLP Copenhagen `1 `2 Helps transfer knowledge across tasks!
  • 29. Current research @seb_ruder | • Existing MTL approaches require amount of sharing to be manually specified; hard to estimate if sharing leads to improvements. 
 • Propose Sluice Networks, a general MTL framework. • 3 parts: i. parameters that control sharing across tasks; ii. parameters that control sharing across layers; iii. orthogonality constraint to enforce subspaces. Learning what to share for multi-task learning II 31.05.17 | NLP Copenhagen ↵
  • 30. Current research @seb_ruder | Learning what to share for multi-task learning III 31.05.17 | NLP Copenhagen
  • 31. Current research @seb_ruder | Experiments • Data: OntoNotes 5.0; 7 domains. • Tasks: chunking, NER, and semantic role labeling (main task) and POS tagging (auxiliary task). • Baselines: single-task model, hard parameter sharing (Caruana, 1998), low supervision (Søgaard and Goldberg, 2016), cross-stitch networks (Misra et al., 2016). • Results: Sluice networks outperform baselines on both in-domain and out-of-domain data. Learning what to share for multi-task learning IV 31.05.17 | NLP Copenhagen
  • 32. Current research @seb_ruder | Analysis • Task properties and performance - Gains are higher with less training data; - Sluice networks learn to share more with more variance in data.
 • Ablation analysis - Learnable and values are better. - Subspaces help for all domains. - Concatenation of layer outputs also works. Learning what to share for multi-task learning V 31.05.17 | NLP Copenhagen ↵
  • 33. Current research @seb_ruder | Learning what to share for multi-task learning VI 31.05.17 | NLP Copenhagen
  • 34. Current research @seb_ruder | Learning what to share for multi-task learning VI 31.05.17 | NLP Copenhagen
  • 35. Conclusion @seb_ruder | • Many new models, tasks, and domains. Lots of cool work to do in Transfer Learning for NLP. • Lots of fundamental unanswered research questions: -What is a domain? When are two domains similar? -When are two tasks related? … • When should we use Transfer Learning / MTL? When does it work best? 31.05.17 | NLP Copenhagen !
  • 36. References @seb_ruder | Image credit • Google Research blog post11 • Mikolov, T., Joulin, A., & Baroni, M. (2015). A Roadmap towards Machine Intelligence. arXiv preprint arXiv:1511.08130. • Google Research blog post12 My papers • Sebastian Ruder, Barbara Plank. (2017). Learning to select data for transfer learning with Bayesian Optimization. • Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, Anders Søgaard. (2017). Sluice networks: Learning what to share between loosely related tasks. arXiv preprint arXiv:1705.08142 11 https://research.googleblog.com/2016/10/how-robots-can-acquire-new-skills-from.html 12 https://googleblog.blogspot.ie/2014/04/the-latest-chapter-for-self-driving-car.html 31.05.17 | NLP Copenhagen
  • 37. @seb_ruder | Thanks for your attention! Questions? 31.05.17 | NLP Copenhagen