1. Deep Learning & NLP
Graphs to the Rescue! (or not yet…)
Roelof Pieters, KTH/CSC, Graph Technologies R&D
roelof@kth.se
www.csc.kth.se/~roelof/
Twitter: @graphific
Stockholm, SICS, October 21, 2014
2. Definitions
Machine Learning
Improving some task T based on experience E with
respect to performance measure P.
- T. Mitchell (1997)
Learning denotes changes in the system that are
adaptive in the sense that they enable the system to
do the same task (or tasks drawn from a population of
similar tasks) more effectively the next time.
- H. Simon (1983)
3. Definitions
Representation learning
Attempts to automatically learn good features or
representations
Deep learning
Attempts to learn multiple levels of representation of
increasing complexity/abstraction
4. Overview
1. From Machine Learning to Deep Learning
2. Natural Language Processing
3. Graph-Based Approaches to DL+NLP
12. • Back Propagation of Error
• Calculate total error at the top
• Calculate contributions to error at each step going
backwards
13. Phase 1: Propagation
Each propagation involves the following steps:
1. Forward propagation of a training pattern's input through the
neural network in order to generate the propagation's output
activations.
2. Backward propagation of the propagation's output activations
through the neural network using the training pattern target in
order to generate the deltas of all output and hidden neurons.
Phase 2: Weight update
For each weight (synapse), follow these steps:
1. Multiply its output delta and input activation to get the
gradient of the weight.
2. Subtract a ratio (percentage) of the gradient from the weight.
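A minimal NumPy sketch of these two phases for a single hidden layer (the layer sizes, sigmoid activation and squared-error objective are illustrative assumptions, not taken from the slides):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # toy network: 3 inputs -> 4 hidden -> 2 outputs (sizes are arbitrary)
    rng = np.random.RandomState(0)
    W1, W2 = rng.randn(4, 3), rng.randn(2, 4)
    x, target = rng.randn(3), np.array([0.0, 1.0])
    lr = 0.1  # the "ratio (percentage)" of the gradient that is subtracted

    # Phase 1a: forward propagation of the training pattern's input
    h = sigmoid(W1 @ x)          # hidden activations
    y = sigmoid(W2 @ h)          # output activations

    # Phase 1b: backward propagation -> deltas of output and hidden neurons
    delta_out = (y - target) * y * (1 - y)          # error at the top
    delta_hid = (W2.T @ delta_out) * h * (1 - h)    # contributions going backwards

    # Phase 2: weight update = output delta x input activation, scaled by lr
    W2 -= lr * np.outer(delta_out, h)
    W1 -= lr * np.outer(delta_hid, x)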
19. “2006”
• Faster machines (GPUs!)
• More data
• New methods for unsupervised pre-training
20. “2006”
• New methods for unsupervised pre-training
• Stacked RBMs (Deep Belief Networks [DBNs])
• Hinton, G. E, Osindero, S., and Teh, Y. W. (2006). A fast learning
algorithm for deep belief nets. Neural Computation,
18:1527-1554.
• Hinton, G. E. and Salakhutdinov, R. R, Reducing the
dimensionality of data with neural networks. Science, Vol. 313.
no. 5786, pp. 504 - 507, 28 July 2006.
• (Stacked) Autoencoders
• Bengio, Y., Lamblin, P., Popovici, P., Larochelle, H. (2007).
Greedy Layer-Wise Training of Deep Networks, Advances in
Neural Information Processing Systems 19
21. Pretraining: Stacked RBM’s
• Iterative pre-training construction of Deep Belief
Network (DBN) (Hinton et al., 2006)
from: Larochelle et al. (2007). An Empirical Evaluation of Deep Architectures on
Problems with Many Factors of Variation.
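A rough sketch of that greedy, layer-by-layer construction using scikit-learn's BernoulliRBM (data, layer sizes and hyperparameters are placeholders; this does not reproduce the exact contrastive-divergence schedule of Hinton et al.):

    import numpy as np
    from sklearn.neural_network import BernoulliRBM

    X = np.random.RandomState(0).rand(500, 784)  # placeholder data in [0, 1]

    # greedily train one RBM per layer; each layer models the hidden
    # activities of the layer below (Hinton et al., 2006)
    layer_sizes = [256, 64]
    layers, inp = [], X
    for n_hidden in layer_sizes:
        rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05,
                           n_iter=10, random_state=0)
        rbm.fit(inp)
        layers.append(rbm)
        inp = rbm.transform(inp)   # hidden representation feeds the next RBM

    # 'inp' now holds the top-level representation of the resulting stack
    print(inp.shape)  # (500, 64)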
23–28. Pretraining: Stacked Denoising Auto-encoder
• (Vincent et al., 2008)
• Good vs Corrupted context
• Pipeline shown step by step across the slides: raw input → corrupted input → hidden code (representation) → reconstruction, trained by minimising KL(reconstruction | raw input)
from: Vincent et al. 2010
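A compact NumPy sketch of the steps labelled in the figure: corrupt the raw input, encode it to a hidden code, reconstruct, and penalise the mismatch between the reconstruction and the uncorrupted input (masking noise, sigmoid units, tied weights and a cross-entropy reconstruction loss are assumptions in the spirit of Vincent et al., not the slide's exact setup):

    import numpy as np

    rng = np.random.RandomState(0)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    x = rng.rand(784)                       # raw input (e.g. pixel intensities in [0, 1])
    W = 0.01 * rng.randn(256, 784)          # tied encoder/decoder weights
    b_h, b_v = np.zeros(256), np.zeros(784)

    # corrupt: masking noise zeroes a random 30% of the input dimensions
    mask = rng.binomial(1, 0.7, size=x.shape)
    x_tilde = x * mask                      # corrupted input

    h = sigmoid(W @ x_tilde + b_h)          # hidden code (representation)
    x_hat = sigmoid(W.T @ h + b_v)          # reconstruction

    # cross-entropy between reconstruction and the *raw* (uncorrupted) input
    loss = -np.sum(x * np.log(x_hat + 1e-9) + (1 - x) * np.log(1 - x_hat + 1e-9))
    print(loss)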
30. Convolutional Neural Networks (CNNs)
• Fukushima 1980; LeCun et al. 1998; Behnke 2003; Simard et al. 2003…
• Hinton et al. 2006; Bengio et al. 2007; Ranzato et al. 2007
• Sparse connectivity
• Shared weights
• MaxPooling
(Figures from http://deeplearning.net/tutorial/lenet.html)
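A tiny NumPy illustration of the three ingredients above: one shared 3×3 filter slid over an image (sparse connectivity plus shared weights), followed by 2×2 max pooling (all values are arbitrary; real CNNs stack many such filter banks):

    import numpy as np

    img = np.random.RandomState(0).rand(8, 8)   # toy grayscale image
    kernel = np.array([[1., 0., -1.],
                       [1., 0., -1.],
                       [1., 0., -1.]])          # one shared 3x3 filter (edge detector)

    # "valid" convolution (cross-correlation, as most DL libraries implement it):
    # every output unit sees only a 3x3 patch (sparse connectivity) and every
    # patch is filtered with the same weights (shared weights)
    H = img.shape[0] - 2
    feat = np.array([[np.sum(img[i:i+3, j:j+3] * kernel) for j in range(H)]
                     for i in range(H)])        # 6x6 feature map

    # 2x2 max pooling: keep the strongest response in each non-overlapping block
    pooled = feat.reshape(3, 2, 3, 2).max(axis=(1, 3))  # 3x3
    print(feat.shape, pooled.shape)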
31. Pretraining
• Why does Pretraining work so well? (Erhan et al. 2010)
• Better Generalisation
Figure panels: without unsupervised pretraining / with unsupervised pretraining
Figures from Erhan et al. 2010
44. So: Why Deep?
Deep Architectures can be representationally efficient
• Fewer computational units for same function
Deep Representations might allow for a hierarchy of representations
• Allows non-local generalisation
• Comprehensibility
Multiple levels of latent variables allow combinatorial sharing
of statistical strength
45. So: Why Deep?
Generalizing better to new tasks & domains
Can learn good intermediate representations shared
across tasks
Distributed representations
Unsupervised Learning
Multiple levels of representation
46. Different Levels of Abstraction
• Hierarchical Learning
• Natural progression from low
level to high level structure
as seen in natural complexity
• Easier to monitor what is
being learnt and to guide the
machine to better subspaces
• A good lower level
representation can be used
for many distinct tasks
51. DL + NLP
• Language Modeling
• Bengio et al. (2000, 2003): via Neural network
• Mnih and Hinton (2007): via RBMs
• POS, Chunking, NER, SRL
• Collobert and Weston 2008
• Socher et al 2011; Socher 2014
52. Language Modeling
• Word Embeddings (Bengio et al, 2001; Bengio et
al, 2003) based on idea of distributed
representations for symbols (Hinton 1986)
• Neural Word embeddings (Turian et al 2010;
Collobert et al. 2011)
53. Word Embeddings
• Collobert & Weston 2008; Collobert et al. 2011
• similar to word vector learning, but uses a Softmax/Maxent classifier instead of a single scalar score
Word embeddings taken from a lookup table. From Collobert et al. 2011
54. Word Embeddings
• Collobert & Weston 2008; Collobert et al. 2011
• similar to word vector learning, but uses a Softmax/Maxent classifier instead of a single scalar score
Figure from Socher et al. Tutorial ACL 2012.
56. • window approach
• sentence approach
source: Collobert & Weston, Deep Learning for Natural Language Processing. 2009 Nips
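A bare-bones sketch of the window approach with a lookup-table layer: the embeddings of the words in a fixed window are concatenated and fed to a softmax over tags (vocabulary size, dimensions and the single linear layer are placeholder assumptions; Collobert et al. insert a hidden layer):

    import numpy as np

    rng = np.random.RandomState(0)
    vocab_size, emb_dim, win, n_tags = 10000, 50, 5, 10

    E = 0.01 * rng.randn(vocab_size, emb_dim)        # lookup table of word embeddings
    W = 0.01 * rng.randn(n_tags, win * emb_dim)      # classifier weights
    b = np.zeros(n_tags)

    window = [42, 7, 1337, 3, 99]                    # word indices around the target word
    x = E[window].ravel()                            # concatenate the looked-up embeddings
    scores = W @ x + b
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                             # softmax over POS/NER/... tags
    print(probs.argmax())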
57. • Multi-task learning
source: Collobert & Weston, Deep Learning for Natural Language Processing. 2009 Nips
58–63. General Deep Architecture for NLP
Basic features
Embeddings
Convolution
Max pooling
“Supervised” learning
source: Collobert & Weston, Deep Learning for Natural Language Processing. 2009 Nips
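The sentence approach in this architecture can be sketched the same way: a convolution over the embedding sequence followed by max pooling over time, so a fixed-size feature vector comes out regardless of sentence length (a bare-bones NumPy illustration; filter width, dimensions and the omitted hidden/output layers are assumptions):

    import numpy as np

    rng = np.random.RandomState(0)
    emb_dim, n_filters, width = 50, 100, 3

    sentence = 0.01 * rng.randn(12, emb_dim)          # 12 words, each an embedding row
    W = 0.01 * rng.randn(n_filters, width * emb_dim)  # convolution filters over 3-word spans

    # convolution over time: apply the same filters at every position
    conv = np.array([W @ sentence[t:t + width].ravel()
                     for t in range(sentence.shape[0] - width + 1)])   # (10, 100)

    # max pooling over time: one value per filter, independent of sentence length
    features = conv.max(axis=0)                       # (100,)
    print(features.shape)  # this vector feeds the "supervised" layers on top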
64. Word Embeddings
• Unsupervised Word Representations (Turian et al
2010)
• evaluates Brown clusters, C&W (Collobert and Weston 2008) embeddings, and HLBL (Mnih & Hinton, 2009) embeddings of words → Brown clusters win by a small margin on both NER and chunking
• more info: http://metaoptimize.com/projects/wordreprs/
65. t-SNE visualizations of word embeddings. Left: Number Region; Right: Jobs Region. From Turian et al. 2010
67. Word Embeddings
• Collobert & Weston 2008; Collobert et al. 2011
• Propose a unified neural network architecture for
many NLP tasks:
• part-of-speech tagging, chunking, named entity
recognition, and semantic role labeling
• no hand-made input features
• learns internal representations on the basis of vast
amounts of mostly unlabeled training data.
68. Word Embeddings
• Recurrent Neural Network (Mikolov et al. 2010;
Mikolov et al. 2013a)
W("woman") − W("man") ≃ W("aunt") − W("uncle")
W("woman") − W("man") ≃ W("queen") − W("king")
Figures from Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic
Regularities in Continuous Space Word Representations
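The regularity on this slide is plain vector arithmetic: compute W("king") − W("man") + W("woman") and look for the nearest embedding by cosine similarity. A toy check with made-up vectors (real experiments of course need trained embeddings, e.g. from word2vec):

    import numpy as np

    # toy 3-d "embeddings"; real ones come from a trained model
    W = {
        "king":  np.array([0.9, 0.8, 0.1]),
        "queen": np.array([0.9, 0.1, 0.8]),
        "man":   np.array([0.1, 0.9, 0.1]),
        "woman": np.array([0.1, 0.1, 0.9]),
        "apple": np.array([0.5, 0.5, 0.5]),
    }

    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    query = W["king"] - W["man"] + W["woman"]
    best = max((w for w in W if w not in ("king", "man", "woman")),
               key=lambda w: cos(W[w], query))
    print(best)  # -> "queen" for these toy vectors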
69. • Mikolov et al. 2013b
Figures from Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013b).
Efficient Estimation of Word Representations in Vector Space
70. Word Embeddings
• Recursive (Tensor) Network (Socher et al. 2011;
Socher 2014)
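In a recursive network the same composition function is applied bottom-up over a parse tree, so each parent vector is built from its two children and phrases of any length land in the same vector space (a minimal sketch of the plain recursive net with a tanh composition; the tensor variant adds a bilinear term, omitted here):

    import numpy as np

    rng = np.random.RandomState(0)
    d = 50
    W = 0.01 * rng.randn(d, 2 * d)     # shared composition weights
    b = np.zeros(d)

    compose = lambda c1, c2: np.tanh(W @ np.concatenate([c1, c2]) + b)

    # toy word vectors for "the movie was great"
    the, movie, was, great = (0.1 * rng.randn(d) for _ in range(4))

    # bottom-up over the (toy) parse  ((the movie) (was great))
    np_phrase = compose(the, movie)
    vp_phrase = compose(was, great)
    sentence = compose(np_phrase, vp_phrase)   # one vector for the whole sentence
    print(sentence.shape)  # (50,)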
93. C. Factor Graph
• Factor graph in which the factors themselves contain a deep neural net.
• Factor graph:
• bipartite graph representing the factorization of a function (Kschischang et al.
2001; Frey 2002)
• can combine Bayesian networks (BNs) and Markov random fields (MRFs).
Figure from Frey 2002
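As a concrete (illustrative) example of such a factorization: a function over three variables that splits into two local factors,

    g(x1, x2, x3) = f1(x1, x2) · f2(x2, x3)

is drawn as a bipartite graph with variable nodes x1, x2, x3 and factor nodes f1, f2; x2 connects to both factors because it appears in both. With "deep factors" (next slide), each fi is itself computed by a neural net.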
94. Factor Graph
• Factor graph with “deep factors” (Mirowski & LeCun 2009)
• Dynamic Time Series modeling
95. Energy-Based Graph
• LeCun et al. 1998, handwriting recognition
system
• “Graph Transformer Networks”
• Instead of a normalised HMM, an energy-based factor graph (without normalization)
• LeCun et al. 2006.
• Energy-Based Learning
97–99. And finally…
What you’ve all been waiting for…
Which Net is currently the Biggest? The Deepest? The most Bad-ass?
100–102. Winners of:
Large Scale Visual Recognition Challenge 2014 (ILSVRC2014)
19 September 2014
GoogLeNet (figure legend: Convolution, Pooling, Softmax, Other)
source: Szegedy et al. Going deeper with convolutions (GoogLeNet), ILSVRC2014, 19 Sep 2014
103. Inception
Width of inception modules ranges from 256 filters (in early modules) to 1024 in the top inception modules (e.g. 256, 480, 480, 512, 512, 512, 832, 832, 1024).
Can remove the fully connected layers on top completely.
Number of parameters is reduced to 5 million.
Computational cost is increased by less than 2× compared to Krizhevsky’s network (<1.5Bn operations/evaluation).
source: Szegedy et al. Going deeper with convolutions (GoogLeNet), ILSVRC2014, 19 Sep 2014
104. Classification results on ImageNet 2012
Team | Year | Place | Error (top-5) | Uses external data
SuperVision | 2012 | - | 16.4% | no
SuperVision | 2012 | 1st | 15.3% | ImageNet 22k
Clarifai | 2013 | - | 11.7% | no
Clarifai | 2013 | 1st | 11.2% | ImageNet 22k
MSRA | 2014 | 3rd | 7.35% | no
VGG | 2014 | 2nd | 7.32% | no
GoogLeNet | 2014 | 1st | 6.67% | no

Detection results
Team | Year | Place | mAP | External data | Ensemble | Contextual model | Approach
UvA-Euvision | 2013 | 1st | 22.6% | none | ? | yes | Fisher vectors
Deep Insight | 2014 | 3rd | 40.5% | ILSVRC12 Classification + Localization | 3 models | yes | ConvNet
CUHK DeepID-Net | 2014 | 2nd | 40.7% | ILSVRC12 Classification + Localization | ? | no | ConvNet
GoogLeNet | 2014 | 1st | 43.9% | ILSVRC12 Classification | 6 models | no | ConvNet

source: Szegedy et al. Going deeper with convolutions (GoogLeNet), ILSVRC2014, 19 Sep 2014
106. Wanna Play?
• Theano - CPU/GPU symbolic expression compiler in Python (from the LISA lab at University of Montreal). http://deeplearning.net/software/theano/
• Pylearn2 - a library designed to make machine learning research easy. http://deeplearning.net/software/pylearn2/
• Torch - provides a Matlab-like environment for state-of-the-art machine learning algorithms in Lua (from Ronan Collobert, Clement Farabet and Koray Kavukcuoglu). http://torch.ch/
• more info: http://deeplearning.net/software_links/
(slide partially stolen from: J. Sullivan, Convolutional Neural Networks & Computer Vision, Machine Learning meetup at Spotify, Stockholm, June 9 2014)
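As a first taste of Theano's symbolic style, a minimal example that builds an expression graph and compiles it into a callable function (API as of the 2014-era releases):

    import numpy as np
    import theano
    import theano.tensor as T

    x = T.dvector('x')                                            # symbolic input vector
    W = theano.shared(np.array([[1., 2.], [3., 4.]]), name='W')   # shared parameter
    y = T.tanh(T.dot(W, x))                                       # symbolic expression graph

    f = theano.function(inputs=[x], outputs=y)                    # compile (CPU or GPU)
    print(f([0.5, -0.5]))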
108. Bibliography: Definitions
• Mitchell, T. M. (1997). Machine Learning (1st ed.). New York, NY,
USA: McGraw-Hill, Inc.
• Simon, H.A. (1983). Why should machines learn? in: Machine
Learning: An Artificial Intelligence Approach, (R. Michalski, J.
Carbonell, T. Mitchell, eds) Tioga Press, 25-38.
109. Bibliography: History
• Rosenblatt, Frank (1957), The Perceptron--a perceiving and recognizing automaton. Report
85-460-1, Cornell Aeronautical Laboratory.
• Minsky & Papert (1969), Perceptrons: an introduction to computational geometry.
• Bryson, A.E.; W.F. Denham; S.E. Dreyfus (1963) Optimal programming problems with inequality
constraints. I: Necessary conditions for extremal solutions. AIAA J. 1, 11 2544-2550.
• Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (1986). "Learning representations
by back-propagating errors". Nature 323 (6088): 533–536.
• Boser, B. E., Guyon, I., and Vapnik, V. (1992). A training algorithm for optimal margin classifiers.
In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–
152. ACM Press.
• Cortes, C. and Vapnik, V. (1995), Support-vector network. Machine Learning, 20:273–297.
• Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007). An Empirical
Evaluation of Deep Architectures on Problems with Many Factors of Variation. In Proceedings of
the 24th International Conference on Machine Learning (pp. 473–480). New York, NY, USA:
ACM.
• Vincent, P., Larochelle, H., & Lajoie, I. (2010), Stacked denoising autoencoders: Learning useful
representations in a deep network with a local denoising criterion. Journal of Machine Learning
Research, 11, 3371–3408.
110. Bibliography: History - CNNs
• Fukushima, Kunihiko (1980). "Neocognitron: A Self-organizing Neural Network Model for a
Mechanism of Pattern Recognition Unaffected by Shift in Position". Biological Cybernetics 36
(4): 193–202. doi:10.1007/BF00344251. PMID 7370364. Retrieved 16 November 2013.
• LeCun, Yann; Léon Bottou; Yoshua Bengio; Patrick Haffner (1998). "Gradient-based learning
applied to document recognition". Proceedings of the IEEE 86 (11): 2278–2324.
• S. Behnke. Hierarchical Neural Networks for Image Interpretation, volume 2766 of Lecture
Notes in Computer Science. Springer, 2003.
• Simard, Patrice, David Steinkraus, and John C. Platt. "Best Practices for Convolutional Neural
Networks Applied to Visual Document Analysis." In ICDAR, vol. 3, pp. 958-962. 2003.
• Hinton, GE; Osindero, S; Teh, YW (Jul 2006). "A fast learning algorithm for deep belief nets.".
Neural computation 18 (7): 1527–54.
• Bengio, Yoshua; Lamblin, Pascal; Popovici, Dan; Larochelle, Hugo (2007). "Greedy Layer-Wise
Training of Deep Networks". Advances in Neural Information Processing Systems: 153–160.
• Ranzato, MarcAurelio; Poultney, Christopher; Chopra, Sumit; LeCun, Yann (2007). "Efficient
Learning of Sparse Representations with an Energy-Based Model". Advances in Neural
Information Processing Systems.
111. Bibliography: DL
• Bengio, Y., Ducharme, R., & Vincent, P. (2001). A Neural Probabilistic Language Model.
In T. K. Leen & T. G. Dietterich (Eds.), Advances in Neural Information Processing
Systems 13 (NIPS’00). MIT Press.
• Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A Neural Probabilistic
Language Model. The Journal of Machine Learning Research, 3, 1137–1155.
• Bengio, Y., Lamblin, P., Popovici, P., Larochelle, H. (2007). Greedy Layer-Wise Training
of Deep Networks, Advances in Neural Information Processing Systems 19
• Hinton, G. E. (1986). Learning distributed representations of concepts. In Proceedings
of the eighth annual conference of the cognitive science society (Vol. 1, p. 12).
• Hinton, G. E. and Salakhutdinov, R. R, (2006) Reducing the dimensionality of data with
neural networks. Science, Vol. 313. no. 5786, pp. 504 - 507, 28 July 2006.
• Hinton, G. E, Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep
belief nets. Neural Computation, 18:1527-1554.
• Erhan, D., Bengio, Y., & Courville, A. (2010). Why does unsupervised pre-training help
deep learning? Journal of Machine Learning Research, 11, 625–660.
112. Bibliography: DL
• Vincent, P., Larochelle, H., Bengio, Y. and Manzagol, P. A. (2008). Extracting and composing robust features with denoising autoencoders. In ICML.
• Vincent, P., Larochelle, H., & Lajoie, I. (2010). Stacked denoising autoencoders:
Learning useful representations in a deep network with a local denoising criterion.
Journal of Machine Learning Research, 11, 3371–3408.
• Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012) Imagenet classification with
deep convolutional neural networks. In NIPS.
• Socher, Richard, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. (2011). Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP).
• Dahl, G. E., Ranzato, M. A., Mohamed, A. and Hinton, G. E. (2010) Phone
recognition with the mean-covariance restricted Boltzmann machine. In NIPS.
• Ciresan, D. C., Meier, U., Gambardella, L. M., & Schmidhuber, J. (2010). Deep Big
Simple Neural Nets Excel on Handwritten Digit Recognition. CoRR.
• Szegedy et al. (2014) Going deeper with convolutions (GoogLeNet ), ILSVRC2014,
19 Sep 2014
113. Bibliography: NLP
• Turian, J., Ratinov, L., & Bengio, Y. (2010). Word Representations: A Simple and
General Method for Semi-supervised Learning. In Proceedings of the 48th Annual
Meeting of the Association for Computational Linguistics (pp. 384–394).
Stroudsburg, PA, USA: Association for Computational Linguistics.
• Collobert, R., & Weston, J. (2008). A unified architecture for natural language
processing: Deep neural networks with multitask learning. Proceedings of the 25th
International Conference ….
• Collobert, R., Weston, J., & Bottou, L. (2011). Natural language processing (almost)
from scratch. The Journal of Machine Learning Research, 12:2493-2537.
• Collobert & Weston, Deep Learning for Natural Language Processing (2009) Nips
Tutorial
• Mikolov, T., Yih, W., & Zweig, G. (2013a). Linguistic Regularities in Continuous
Space Word Representations. HLT-NAACL, (June), 746–751.
• Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013b). Efficient Estimation of Word
Representations in Vector Space, 1–12. Computation and Language.
114. Bibliography: NLP
• Bengio, Y. and Bengio, S. (2000). Modeling high-dimensional discrete data with multi-layer neural networks. In Proceedings of NIPS 12.
• Mnih, A. and Hinton, G. E. (2007) Three New Graphical Models for
Statistical Language Modelling. International Conference on
Machine Learning, Corvallis, Oregon.
• Socher, R., Bengio, Y., & Manning, C. (2012). Deep Learning for
NLP (without Magic). Tutorial Abstracts of ACL 2012.
• Socher, R. (2014). Recursive Deep Learning for Natural Language Processing and Computer Vision. PhD Dissertation.
115. Bibliography: Graph-Based Approaches
• Frey, B. (2002). Extending factor graphs so as to unify directed and
undirected graphical models. Proceedings of the Nineteenth Conference on
Uncertainty in Artificial Intelligence 19 (UAI 03), Morgan Kaufmann, CA,
Acapulco, Mexico, 257–264.
• Kschischang, F. R., Frey, B. J., & Loeliger, H.-A. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.
• Mirowski, P., & LeCun, Y. (2009). Dynamic factor graphs for time series
modeling. Machine Learning and Knowledge Discovery.
• LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning
applied to document recognition. Proceedings of the IEEE November 1998.
• LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M. A., & Huang, F. J. (2006). A Tutorial on Energy-Based Learning. In Predicting Structured Data. MIT Press.
116. Bibliography: Graph-Based Approaches
• Buerli, M., & Obispo, C. (2012). The current state of graph databases.
Department of Computer Science, Cal Poly San Luis Obispo
• Ganesan, K., Zhai, C., & Han, J. (2010). Opinosis: a graph-based approach to
abstractive summarization of highly redundant opinions. Proceedings of the
23rd International Conference on Computational Linguistics (Coling 2010),
(August), 340–348.
• Ganesan, K. (2013). Opinion Driven Decision Support System. PhD
Dissertation, University of Illinois.
• Bastani, K. 2014a, Hierarchical Pattern Recognition, Blog: Meaning Of, June
17, 2014
• Bastani, K. 2014b, Using a Graph Database for Deep Learning Text
Classification, Blog: Meaning Of, August 26, 2014
• Bastani, K. 2014c, Deep Learning Sentiment Analysis for Movie Reviews using
Neo4j, Blog: Meaning Of, September 15, 2014