5. Leaderboard (ImageNet image classification)
Image classification
Year  Codename       Error (percent)  99.9% Conf Int
2014  GoogLeNet       6.66            6.40 - 6.92
2014  VGG             7.32            7.05 - 7.60
2014  MSRA            8.06            7.78 - 8.34
2014  AHoward         8.11            7.83 - 8.39
2014  DeeperVision    9.51            9.21 - 9.82
2013  Clarifai†      11.20           10.87 - 11.53
2014  CASIAWS†       11.36           11.03 - 11.69
2014  Trimps†        11.46           11.13 - 11.80
2014  Adobe†         11.58           11.25 - 11.91
2013  Clarifai       11.74           11.41 - 12.08
2013  NUS            12.95           12.60 - 13.30
2013  ZF             13.51           13.14 - 13.87
2013  AHoward        13.55           13.20 - 13.91
2013  OverFeat       14.18           13.83 - 14.54
2014  Orange†        14.80           14.43 - 15.17
2012  SuperVision†   15.32           14.94 - 15.69
2012  SuperVision    16.42           16.04 - 16.80
2012  ISI            26.17           25.71 - 26.65
2012  VGG            26.98           26.53 - 27.43
2012  XRCE           27.06           26.60 - 27.52
2012  UvA            29.58           29.09 - 30.04
(† entry used additional external training data)
Single-object localization
Year  Codename       Error (percent)  99.9% Conf Int
2014  VGG            25.32           24.87 - 25.78
2014  GoogLeNet      26.44           25.98 - 26.92
Fig. 10 (Russakovsky et al.): For each object class, we consider the best performance of any entry submitted to ILSVRC2012-2014, including entries using additional training data. The figure shows the distribution of these "optimistic" per-class results. Performance is measured as accuracy for image classification (left) and for single-object localization (middle), and as average precision for object detection (right). While the results are very promising in image classification, the ILSVRC datasets are far from saturated: many object classes continue to be challenging for current algorithms.
Each error is reported with a 99.9% confidence interval, obtained by running a large number of bootstrapping rounds (from 20,000 until convergence). The table shows the results of the top entries to each task of ILSVRC2012-2014. The winning methods are statistically significantly different from the other methods, even at the 99.9% level.
Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015, 115(3): 211-252.
Unofficial human error is around 5.1% on a subset.
Why is there still human error? When labeling, human raters only judged whether an image belongs to a given class (a binary decision), whereas the challenge requires choosing among 1000 classes.
2015 ResNet (ILSVRC'15) 3.57
GoogLeNet: a 22-layer network
Microsoft ResNet: a 152-layer network
U. of Toronto SuperVision: a 7-layer network
http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
6. Deep Learning helps Natural Language Processing
https://code.google.com/p/word2vec/
Visualization of Regularities in Word Vector Space
[Figure: 2-D projection of word vectors, showing that related word pairs are offset in similar directions: princess-prince, heroine-hero, actress-actor, landlady-landlord, female-male, cow-bull, hen-cock, queen-king, she-he]
From bag-of-words to distributed representations of words, to capture the semantic meaning of a word.
Compositionality by Vector Addition
Expression Nearest tokens
Czech + currency koruna, Czech crown, Polish zloty, CTK
Vietnam + capital Hanoi, Ho Chi Minh City, Viet Nam, Vietnamese
German + airlines airline Lufthansa, carrier Lufthansa, flag carrier Lufthansa
Russian + river Moscow, Volga River, upriver, Russia
French + actress Juliette Binoche, Vanessa Paradis, Charlotte Gainsbourg
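A toy illustration of this additive compositionality. The 4-dimensional vectors and the tiny vocabulary below are invented for the example; real word2vec embeddings are learned and typically have hundreds of dimensions, and real demos usually exclude the query words from the returned neighbours.

```python
import numpy as np

# Invented toy vectors for illustration only.
vectors = {
    "Czech":    np.array([0.8, 0.1, 0.0, 0.2]),
    "currency": np.array([0.1, 0.9, 0.1, 0.0]),
    "koruna":   np.array([0.85, 0.9, 0.05, 0.15]),
    "zloty":    np.array([0.6, 0.8, 0.1, 0.1]),
    "Hanoi":    np.array([0.0, 0.1, 0.9, 0.8]),
}

def nearest(query, vocab, k=3):
    """Return the k vocabulary words whose vectors have the highest
    cosine similarity to the query vector."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return sorted(vocab, key=lambda w: cos(query, vocab[w]), reverse=True)[:k]

# "Czech" + "currency" should land near "koruna"
query = vectors["Czech"] + vectors["currency"]
print(nearest(query, vectors))
```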
8. ‘Deep Learning’ buzz
'Deep learning' was reborn in 2006 and has recently become a buzzword, along with the explosion of 'big data', 'data science', and their applications in the media.
[Figure: trends in the popularity of the terms 'big data', 'neural networks', 'data science', and 'deep learning']
9. What is Deep Learning?
• “deep learning” has its roots in “neural networks”
• deep learning creates many layers (deep) of neurons, attempting to learn structured representations of big data, layer by layer
[Figure: a computerised artificial neuron vs. a biological neuron]
Le, Quoc V. "Building high-level features using large scale unsupervised learning." Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013.
10. Brief History
• The First wave
– 1943 McCulloch and Pitts proposed the McCulloch-Pitts
neuron model
– 1958 Rosenblatt introduced the simple single layer
networks now called Perceptrons.
– 1969 Minsky and Papert’s book Perceptrons demonstrated
the limitation of single layer perceptrons, and almost the
whole field went into hibernation.
• The Second wave
– 1986 The Back-Propagation learning algorithm for Multi-
Layer Perceptrons was rediscovered and the whole field
took off again.
• The Third wave
– 2006 Deep learning (deep neural networks) regains popularity, and
– 2012 it achieves significant breakthroughs in many applications.
11. Summary
• Neural networks basics
– its backpropagation learning algorithm
• Go deeper in layers
• Convolutional neural networks (CNN)
– address feature dependency
• Recurrent neural networks (RNN)
– address time-dependency
• Deep reinforcement learning
– enables real-time control
14. Early Artificial Neurons (1943-1969)
• McCulloch-Pitts neuron
• McCulloch and Pitts [1943] proposed a simple model of a neuron as a computing machine
• The artificial neuron computes a weighted sum of its inputs from other neurons, and outputs a one or a zero according to whether the sum is above or below a certain threshold:
y = f(Σ_m w_m x_m − b), where f(x) = 1 if x ≥ 0; 0 otherwise
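A minimal sketch of this thresholded weighted sum; the weights and threshold below are chosen by hand so that the neuron behaves as a 2-input AND gate, purely for illustration.

```python
import numpy as np

def mcculloch_pitts(x, w, b):
    """McCulloch-Pitts neuron: outputs 1 if sum_m w_m*x_m - b >= 0, else 0."""
    return 1 if np.dot(w, x) - b >= 0 else 0

# With unit weights and threshold b = 2, the neuron implements AND.
w, b = np.array([1.0, 1.0]), 2.0
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, mcculloch_pitts(np.array(x), w, b))
```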
15. Early Artificial Neurons (1943-1969)
• Rosenblatt's single layer perceptron
• Rosenblatt [1958] further proposed the perceptron as the first model for learning with a teacher (i.e., supervised learning)
• Focused on how to find appropriate weights w_m for a two-class classification task (y = 1: class one; y = -1: class two):
y = φ(Σ_m w_m x_m − b), where φ(x) = 1 if x ≥ 0; −1 otherwise
[Figure 1.1 (Haykin): signal-flow graph of the perceptron. Inputs x1, x2, ..., xm with weights w1, w2, ..., wm and an external bias b feed a summing node; the induced local field v passes through a hard limiter (signum activation function) to produce the output y.]
16. Early Artificial Neurons (1943-1969)
• Rosenblatt's single layer perceptron
• Proved the convergence of a learning algorithm if the two classes are linearly separable (i.e., patterns lie on opposite sides of a hyperplane)
• Many people hoped that such machines could be a basis for artificial intelligence
y = φ(Σ_m w_m x_m + b), where φ(x) = 1 if x ≥ 0; −1 otherwise
1.2 PERCEPTRON
Rosenblatt's perceptron is built around a nonlinear neuron, namely, the McCulloch-Pitts model of a neuron. From the introductory chapter we recall that such a neural model consists of a linear combiner followed by a hard limiter (performing the signum function), as depicted in Fig. 1.1. The summing node of the neural model computes a linear combination of the inputs applied to its synapses, as well as incorporates an externally applied bias. The resulting sum, that is, the induced local field, is applied to a hard limiter.
[Figure 1.1: signal-flow graph of the perceptron]
y(t) = φ(Σ_m w_m(t) x_m(t) + b(t)), t = 1, ..., T
d(t) is the training label and η is a parameter (the learning rate):
Δw_m(t) = η (d(t) − y(t)) x_m(t)
w_m(t+1) = w_m(t) + Δw_m(t)
Haykin S. Neural Networks and Learning Machines. Upper Saddle River: Pearson Education, 2009.
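A small sketch of this update rule on a hand-made linearly separable problem (an AND-style labelling); the data, learning rate and epoch count are illustrative choices, not from the source.

```python
import numpy as np

def sign(x):
    # phi(x) = 1 if x >= 0, -1 otherwise
    return 1.0 if x >= 0 else -1.0

def train_perceptron(X, d, eta=0.1, epochs=20):
    """Rosenblatt's rule: w_m <- w_m + eta*(d - y)*x_m, and likewise for the bias.
    X: (T, m) inputs; d: (T,) labels in {+1, -1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for t in range(len(X)):
            y = sign(np.dot(w, X[t]) + b)
            w += eta * (d[t] - y) * X[t]   # no change when the prediction is correct
            b += eta * (d[t] - y)
    return w, b

# Linearly separable toy problem (AND-like labelling)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
d = np.array([-1., -1., -1., 1.])
w, b = train_perceptron(X, d)
print(w, b, [sign(np.dot(w, x) + b) for x in X])
```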
17. Early Artificial Neurons (1943-1969)
• The XOR problem
• However, Minsky and Papert [1969] showed that some rather elementary computations, such as the XOR problem, could not be done by Rosenblatt's one-layer perceptron
• Rosenblatt believed the limitations could be overcome if more layers of units were added, but no learning algorithm was known yet to obtain the weights
• Due to the lack of learning algorithms, people left the neural network paradigm for almost 20 years

Input x        Output y
X1  X2         X1 XOR X2
0   0          0
0   1          1
1   0          1
1   1          0

[Figure: the four XOR points plotted in the (X1, X2) plane; the two "true" points and the two "false" points lie on opposite diagonals]
XOR is not linearly separable: these two classes (true and false) cannot be separated using a line.
19. Hidden layers and the backpropagation (1986~)
• Adding a hidden layer solves XOR, but the solution is quite often not unique

Input x        Output y
X1  X2         X1 XOR X2
0   0          0
0   1          1
1   0          1
1   1          0

The number in the circle is a threshold; φ(⋅) is the sign function.
[Figure: two different two-layer networks (solution 1 and solution 2), both implementing XOR]
Two lines are necessary to divide the sample space accordingly (see the sketch below).
http://recognize-speech.com/basics/introduction-to-artificial-neural-networks
http://www.cs.stir.ac.uk/research/publications/techreps/pdf/TR148.pdf
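A minimal hand-wired example of such a two-layer solution. The specific weights and thresholds below are one possible choice made for illustration, not necessarily the values shown in either of the slide's solutions.

```python
import numpy as np

def step(x):
    # Hard-threshold activation: 1 if x >= 0, else 0
    return (x >= 0).astype(float)

def xor_net(x1, x2):
    """Hidden unit h1 computes OR, h2 computes AND; output computes h1 AND NOT h2."""
    x = np.array([x1, x2], dtype=float)
    h1 = step(x @ np.array([1.0, 1.0]) - 0.5)   # OR:  fires if x1 + x2 >= 0.5
    h2 = step(x @ np.array([1.0, 1.0]) - 1.5)   # AND: fires if x1 + x2 >= 1.5
    y = step(1.0 * h1 - 1.0 * h2 - 0.5)         # h1 AND (NOT h2)
    return int(y)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_net(a, b))   # prints the XOR truth table: 0, 1, 1, 0
```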
22. Example Application: Face Detection
Figure 2. System for Neural Network based face detection in the compressed domain
Rowley, Henry A., Shumeet Baluja, and Takeo Kanade. "Neural network-based face detection." Pattern Analysis and Machine Intelligence, IEEE
Transactions on 20.1 (1998): 23-38.
Wang J, Kankanhalli M S, Mulhem P, et al. Face detection using DCT coefficients in MPEG video[C]//Proceedings of the International Workshop
on Advanced Imaging Technology (IWAIT 2002). 2002: 66-70.
23. Example Application: Face Detection
Figure 2. System for Neural Network based face detection in the compressed domain
Eleftheriadis [13] tackles this problem by including additional face patterns arising from different block alignment positions as positive training examples. But this method induces too many variations, not necessarily belonging to inter-class variations, to the positive training samples. This in turn makes the high-frequency DCT coefficients unreliable for both face model training and face detection functions.
Fortunately, there exist fast algorithms to calculate reconstructed DCT coefficients for overlapping blocks [16][17][18]. These methods help to calculate DCT coefficients for scan window steps of even up to one pixel. However, a good trade-off between speed and accuracy is obtained by reconstructing the coefficients for scan window steps of two pixels. In order to detect faces of different sizes, compressed ...
Figure 3. Training Data Preparation
Suppose x1, x2, ..., xn are the DCT coefficients retained from the feature extraction stage and x1(j), x2(j), ..., xn(j); j = 1, 2, ..., p are the corresponding DCT coefficients retained from the training examples, where n is the number of DCT coefficient features retained (currently, we use 126 DCT coefficient features extracted from a 4x4 block sized square window) and p is the number of training samples. The upper bounds (Ui) and lower bounds (Li) can be estimated by
Ui = α * max{1, xi(1), ..., xi(p)}, i = 1, 2, ..., n   (1)
Wang J, Kankanhalli M S, Mulhem P, et al. Face detection using DCT coefficients in MPEG video[C]//Proceedings of the International Workshop
on Advanced Imaging Technology (IWAIT 2002). 2002: 66-70.
24. how it works (training)
[Figure: training a neural network for face detection. Training instances with labels "Face" (d1 = 1) and "no face" (d2 = 0) are fed through inputs x1, x2, ..., xm; the network produces outputs y0, y1; the error between outputs and labels is calculated and then backpropagated to update the parameters (weights).]
32. An example for Backprop
Figure 3.4: all the calculations for a reverse pass of Back Propagation.
1. Calculate errors of output neurons:
δα = outα (1 − outα) (Targetα − outα)
δβ = outβ (1 − outβ) (Targetβ − outβ)
[Figure: a network with inputs Ω and λ, hidden-layer neurons A, B, C, and output neurons α and β; input-to-hidden weights WΩA, WλA, WΩB, WλB, WΩC, WλC and hidden-to-output weights WAα, WAβ, WBα, WBβ, WCα, WCβ.]
https://www4.rgu.ac.uk/files/chapter3%20-%20bp.pdf
34. Let us do some calculation
To illustrate this let’s do a worked Example.
Worked example 3.1:
Consider the simple network below:
Assume that the neurons have a Sigmoid activation function and
(i) Perform a forward pass on the network.
(ii) Perform a reverse pass (training) once (target = 0.5).
(iii) Perform a further forward pass and comment on the result.
Answer:
(i)
[Figure: a 2-2-1 network with inputs A = 0.35 and B = 0.9; input-to-hidden weights 0.1 and 0.8 into the top hidden neuron, 0.4 and 0.6 into the bottom hidden neuron; hidden-to-output weights 0.3 (top) and 0.9 (bottom).]
https://www4.rgu.ac.uk/files/chapter3%20-%20bp.pdf
35. Let us do some calculation
Assume that the neurons have a Sigmoid activation function and
(i) Perform a forward pass on the network.
(ii) Perform a reverse pass (training) once (target = 0.5).
(iii) Perform a further forward pass and comment on the result.
Answer:
(i) Forward pass:
Input to top hidden neuron = (0.35 x 0.1) + (0.9 x 0.8) = 0.755. Out = 0.68.
Input to bottom hidden neuron = (0.9 x 0.6) + (0.35 x 0.4) = 0.68. Out = 0.6637.
Input to final neuron = (0.3 x 0.68) + (0.9 x 0.6637) = 0.80133. Out = 0.69.
(ii) Reverse pass:
Output error δ = (t − o)(1 − o)o = (0.5 − 0.69)(1 − 0.69)(0.69) = −0.0406.
New weights for the output layer:
w1+ = w1 + (δ x input) = 0.3 + (−0.0406 x 0.68) = 0.272392.
w2+ = w2 + (δ x input) = 0.9 + (−0.0406 x 0.6637) = 0.87305.
Errors for the hidden layer neurons:
δ1 = δ x w1+ x (1 − o)o = −0.0406 x 0.272392 x (1 − 0.68)(0.68) = −2.406x10^-3
δ2 = δ x w2+ x (1 − o)o = −0.0406 x 0.87305 x (1 − 0.6637)(0.6637) = −7.916x10^-3
New hidden layer weights:
w3+ = 0.1 + (−2.406x10^-3 x 0.35) = 0.09916.
w4+ = 0.8 + (−2.406x10^-3 x 0.9) = 0.7978.
w5+ = 0.4 + (−7.916x10^-3 x 0.35) = 0.3972.
w6+ = 0.6 + (−7.916x10^-3 x 0.9) = 0.5928.
(iii) Old error was −0.19. New error is −0.18205. Therefore the error has reduced.
https://www4.rgu.ac.uk/files/chapter3%20-%20bp.pdf
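The same numbers can be reproduced with a few lines of Python. The names w1..w6 are my own labels for the six connections; the script follows the chapter's update equations, including its use of the already-updated output weights when computing the hidden deltas.

```python
import math

sig = lambda x: 1.0 / (1.0 + math.exp(-x))

A, B = 0.35, 0.9
w3, w4 = 0.1, 0.8   # A, B -> top hidden neuron
w5, w6 = 0.4, 0.6   # A, B -> bottom hidden neuron
w1, w2 = 0.3, 0.9   # top, bottom hidden -> output
target = 0.5

# (i) forward pass
h1 = sig(A * w3 + B * w4)          # ~0.68
h2 = sig(A * w5 + B * w6)          # ~0.6637
out = sig(h1 * w1 + h2 * w2)       # ~0.69

# (ii) reverse pass (one training step)
delta = (target - out) * (1 - out) * out          # ~ -0.0406
w1_new = w1 + delta * h1                          # ~ 0.2724
w2_new = w2 + delta * h2                          # ~ 0.8731
delta1 = delta * w1_new * (1 - h1) * h1           # ~ -2.4e-3
delta2 = delta * w2_new * (1 - h2) * h2           # ~ -7.9e-3
w3_new, w4_new = w3 + delta1 * A, w4 + delta1 * B # ~ 0.0992, 0.7978
w5_new, w6_new = w5 + delta2 * A, w6 + delta2 * B # ~ 0.3972, 0.5928

# (iii) second forward pass: the output error shrinks
h1n = sig(A * w3_new + B * w4_new)
h2n = sig(A * w5_new + B * w6_new)
out_new = sig(h1n * w1_new + h2n * w2_new)
print(target - out, target - out_new)             # ~ -0.19 -> ~ -0.182
```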
37. Activation functions
• Logistic Sigmoid: f_Sigmoid(x) = 1 / (1 + e^−x)
• Output range [0, 1]
• Motivated by biological neurons; can be interpreted as the probability of an artificial neuron "firing" given its inputs
• However, saturated neurons make gradients vanish (why?)
Its derivative: f'_Sigmoid(x) = f_Sigmoid(x)(1 − f_Sigmoid(x))
38. Activation functions
• Tanh function: f_tanh(x) = sinh(x) / cosh(x) = (e^x − e^−x) / (e^x + e^−x)
• Output range [-1, 1]
• Thus strongly negative inputs to the tanh will map to negative outputs
• Only zero-valued inputs are mapped to near-zero outputs
• These properties make the network less likely to get "stuck" during training
Its gradient: f'_tanh(x) = 1 − f_tanh(x)^2
tanh(x):
- Squashes numbers to the range [-1, 1]
- zero centered (nice)
- still kills gradients when saturated :( [LeCun et al., 1991]
https://theclevermachine.wordpress.com/tag/tanh-function/
39. Activation Functions
• ReLU (rectified linear unit): f_ReLU(x) = max(0, x)
• Another version is the noisy ReLU: f_noisyReLU(x) = max(0, x + N(0, σ(x)))
• ReLU can be approximated by the softplus function: f_softplus(x) = log(1 + e^x)
• The ReLU gradient doesn't vanish as we increase x
• It can be used to model positive numbers
• It is fast, as there is no need to compute the exponential function
• It eliminates the necessity to have a "pretraining" phase
The derivative: for x > 0, f' = 1; for x < 0, f' = 0.
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40811.pdf
40. Activation Functions
• ReLU (rectified linear unit): f_ReLU(x) = max(0, x)
• ReLU can be approximated by the softplus function: f_softplus(x) = log(1 + e^x)
• The only non-linearity comes from the path selection, with individual neurons being active or not
• It allows sparse representations: for a given input, only a subset of neurons are active
Sparse propagation of activations and gradients.
Additional activation functions: Leaky ReLU, Exponential LU (ELU), Maxout, etc.
http://www.jmlr.org/proceedings/papers/v15/glorot11a/glorot11a.pdf
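A short sketch comparing the three activations and their derivatives; the sample points are arbitrary and only meant to show where each gradient vanishes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # f'(x) = f(x)(1 - f(x))

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2  # f'(x) = 1 - f(x)^2

def relu(x):
    return np.maximum(0.0, x)

def d_relu(x):
    return (x > 0).astype(float)  # 1 for x > 0, 0 for x < 0

def softplus(x):
    return np.log1p(np.exp(x))    # smooth approximation of ReLU

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(d_sigmoid(x))   # tiny at |x| = 5: the sigmoid saturates, gradients vanish
print(d_tanh(x))      # also saturates, but the outputs are zero-centred
print(d_relu(x))      # stays 1 for all positive x, however large
```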
46. Why goes deep now?
• Big data everywhere
• Demands for:
– Unsupervised learning: most of them are
unlabeled
– End-to-end: trainable feature solution
• time consuming for hand-crafted features
• Improved computational power:
– Widely used GPUs for computing
– Distributed computing (MapReduce & Spark)
47. Deep learning problem
• Difficult to train
– Many many Parameters
• Two solutions:
– Regularised by prior knowledge (since 1989)
• Convolutional Neural Networks (CNN)
– New method of pre-training using unsupervised
methods (since 2006)
• Stacked auto-encoders
• Stacked Restricted Boltzmann Machines
48. New methods for unsupervised pre-training (2006)
• Stacked RBMs (Deep Belief Networks, DBNs)
• Hinton, G. E, Osindero, S., and Teh, Y. W. (2006). A fast learning
algorithm for deep belief nets. Neural Computation,
18:1527-1554.
• Hinton, G. E. and Salakhutdinov, R. R, Reducing the
dimensionality of data with neural networks. Science, Vol. 313.
no. 5786, pp. 504 - 507, 28 July 2006.
• (Stacked) Autoencoders
• Bengio, Y., Lamblin, P., Popovici, P., Larochelle, H. (2007).
Greedy Layer-Wise Training of Deep Networks, Advances in
Neural Information Processing Systems 19
Stacked Restricted Boltzmann Machines
Stacked Autoencoders
49. Restricted Boltzmann Machines (RBM)
• An RBM is a bipartite (two disjoint sets) undirected graph, used as a probabilistic generative model
  – it contains m visible units V = (V1, ..., Vm) to represent observable data, and
  – n hidden units H = (H1, ..., Hn) to capture their dependencies
• Assume the random variables (V, H) take values (v, h) ∈ {0, 1}^(m+n)
• Joint distribution (Gibbs distribution): p(v, h) = (1/Z) e^(−E(v,h)), with the energy function
  E(v, h) = − Σ_{i=1..n} Σ_{j=1..m} w_ij h_i v_j − Σ_{j=1..m} b_j v_j − Σ_{i=1..n} c_i h_i
  where w_ij is a real-valued weight associated with the edge between units V_j and H_i, and b_j and c_i are real-valued bias terms associated with the j-th visible and the i-th hidden variable, respectively
• The partition function is Z = Σ_{v,h} e^(−E(v,h)); the probability the network assigns to a visible vector v is obtained by summing over all hidden vectors: p(v) = (1/Z) Σ_h e^(−E(v,h))
• No connections within a layer (the "restricted" part) makes the conditionals factorize; the RBM can be interpreted as a stochastic neural network in which the conditional probability of a single unit being one is the firing rate of a (stochastic) neuron with sigmoid activation sig(x) = 1/(1 + e^−x):
  p(H_i = 1 | v) = sig(Σ_{j=1..m} w_ij v_j + c_i)
  p(V_j = 1 | h) = sig(Σ_{i=1..n} w_ij h_i + b_j)
• We learn the parameters w_ij by maximising the log-likelihood of the training examples v. The gradient of the log probability of a training vector with respect to a weight is surprisingly simple:
  ∂ log p(v) / ∂w_ij = <h_i v_j>_data − <h_i v_j>_model
  where the angle brackets denote expectations under the distribution given by the subscript: <·>_data is the expectation under the empirical data distribution, <·>_model the expectation under the model. This leads to a very simple learning rule for performing stochastic steepest ascent in the log probability of the training data:
  Δw_ij = ε (<h_i v_j>_data − <h_i v_j>_model), where ε is a learning rate
http://image.diku.dk/igel/paper/AItRBM-proof.pdf
Hinton, Geoffrey. "A practical guide to training restricted Boltzmann machines." Momentum 9.1 (2010): 926.
• The model expectation <h_i v_j>_model can be approximated by Contrastive Divergence
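A compact numpy sketch of one CD-1 update for a small binary RBM. The layer sizes, learning rate and the random "data" are illustrative; a real run would loop this step over mini-batches of actual training data.

```python
import numpy as np

rng = np.random.default_rng(0)
sig = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, eps=0.1):
    """One Contrastive Divergence (CD-1) update for a binary RBM.
    v0: (batch, m) visible vectors; W: (n, m) weights; b: (m,) visible biases;
    c: (n,) hidden biases."""
    # Positive phase: p(H = 1 | v0) and a binary sample h0
    ph0 = sig(v0 @ W.T + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One Gibbs step: reconstruct v1, then p(H = 1 | v1)
    pv1 = sig(h0 @ W + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sig(v1 @ W.T + c)
    # <h_i v_j>_data from the data, <h_i v_j>_model approximated by the reconstruction
    pos, neg = ph0.T @ v0, ph1.T @ v1
    batch = v0.shape[0]
    W += eps * (pos - neg) / batch
    b += eps * (v0 - v1).mean(axis=0)
    c += eps * (ph0 - ph1).mean(axis=0)
    return W, b, c

m, n = 6, 4                                     # visible and hidden units
W = 0.01 * rng.standard_normal((n, m))
b, c = np.zeros(m), np.zeros(n)
v = (rng.random((8, m)) < 0.5).astype(float)    # toy binary "data"
W, b, c = cd1_step(v, W, b, c)
```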
53. Stacked RBM
• The learned feature activations of one RBM are used as the ‘‘data’’ for training the next RBM in the stack.
• After the pretraining, the RBMs are ‘‘unrolled’’ to create a deep autoencoder, which is then fine-tuned using backpropagation
of error derivatives.
Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 313.5786 (2006): 504-507.
54. Stacked RBM
• Unsupervised learning results: codes produced by two-dimensional LSA (Latent Semantic Analysis) vs. codes produced by a 2000-500-250-125-2 stacked-RBM autoencoder.
[Fig. 4 (Hinton & Salakhutdinov, Science 2006): (A) the fraction of retrieved documents in the same class as the query, when a query document from the test set is used to retrieve other test set documents, averaged over all 402,207 possible queries; (B) the codes produced by two-dimensional LSA; (C) the codes produced by a 2000-500-250-125-2 autoencoder.]
Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 313.5786 (2006): 504-507.
55. Auto-encoder
• A multi-layer neural network whose target output is its input:
  – Reconstruction: x_o = decoder(encoder(x_i))
  – Minimise the reconstruction error (x_i − x_o)^2
• Encoder: h = f(W_e^T x_i + b_e); decoder: x_o = f(W_d^T h + b_d), with f the sigmoid
• A common setting uses tied weights: W_e^T = W_d
• The goal is to learn W_e, b_e, b_d such that the average reconstruction error is minimised
• If f is chosen to be linear, the auto-encoder is equivalent to PCA; if non-linear, it captures multi-modal aspects of the input distribution
http://deeplearning.net/tutorial/dA.html
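A minimal numpy sketch of such a tied-weight auto-encoder trained by gradient descent on the squared reconstruction error; the layer sizes, learning rate, iteration count and random binary data are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sig = lambda x: 1.0 / (1.0 + np.exp(-x))

n_in, n_hid = 8, 3
W = 0.1 * rng.standard_normal((n_hid, n_in))     # shared by encoder and decoder
be, bd = np.zeros(n_hid), np.zeros(n_in)
X = (rng.random((32, n_in)) < 0.3).astype(float) # toy binary data
lr = 0.5

for _ in range(2000):
    h = sig(X @ W.T + be)                        # encoder
    xo = sig(h @ W + bd)                         # decoder (tied weights)
    d_out = (xo - X) * xo * (1 - xo)             # gradient at the decoder pre-activation
    d_hid = (d_out @ W.T) * h * (1 - h)          # backpropagated to the encoder
    # Tied weights get gradient contributions from both encoder and decoder paths
    grad_W = d_hid.T @ X + h.T @ d_out
    W -= lr * grad_W / len(X)
    bd -= lr * d_out.mean(axis=0)
    be -= lr * d_hid.mean(axis=0)

# Average reconstruction error after training
print(np.mean((sig(sig(X @ W.T + be) @ W + bd) - X) ** 2))
```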
56. Denoising auto-encoder
• A stochastic version of the auto-encoder
• We train the autoencoder to reconstruct the input from a corrupted version of it, so that we
  – force the hidden layer to discover more robust features, and
  – prevent it from simply learning the identity
• The stochastic corruption process randomly sets some of the inputs (as many as half of them) to zero
• The goal is to predict the missing values from the uncorrupted (i.e., non-missing) values
[Figure: a corrupted input x_i is encoded to y and decoded to the reconstruction x_o]
http://deeplearning.net/tutorial/dA.html
57. Stacked Denoising Auto-encoder
Pretraining: Stacked Denoising Auto-encoder
• Stacking auto-encoders (from: Bengio, ICML 2009)
• The parameter matrix W' is the transpose of W
Unsupervised greedy layer-wise training procedure:
(1) pre-train the first layer
(2) pre-train the second layer
(3) fine-tune the whole network using the labels
Bengio Y, Lamblin P, Popovici D, et al. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems, 2007, 19: 153.
58. Convolutional neural networks:
Receptive field
• Receptive field: Neurons in the
retina respond to light stimulus in
restricted regions of the visual field
• Animal experiments on receptive
fields of two retinal ganglion cells
– Fields are circular areas of the
retina
– The cell (upper part) responds
when the center is illuminated and
the surround is darkened.
– The cell (lower part) responds
when the center is darkened and
the surround is illuminated.
– Both cells give on- and off-
responses when both center and
surround are illuminated, but
neither response is as strong as
when only center or surround is
illuminated
Hubel D.H. : The Visual Cortex of the Brain Sci Amer 209:54-62, 1963
Contributed by Hubel and Wiesel for the studies
of the visual system of a cat
“On” Center Field
“Off” Center Field
Light On
59. Convolutional neural networks
• Sparse connectivity by local correlation
  – Filter: the inputs of a hidden unit in layer m come from a subset of units in layer m−1 that have spatially connected receptive fields
• Shared weights
  – each filter is replicated across the entire visual field; these replicated units share the same weights and form a feature map
http://deeplearning.net/tutorial/lenet.html
Translational invariance with convolutional neural networks: using locality structures significantly reduces the number of connections, and a further reduction can be achieved via "weight sharing," in which some of the parameters in the model are constrained to be equal to each other. For example, in the following 1-D network we have the constraints w1 = w4 = w7, w2 = w5 = w8 and w3 = w6 = w9 (edges that have the same color have the same weight).
[Figure: 1-D and 2-D cases of weight sharing; one filter, replicated over layer m−1, produces a feature map at layer m]
60. LeNet 5
[Figure: LeNet-5 architecture. INPUT 32x32 -> C1: 6 feature maps @ 28x28 (convolutions) -> S2: 6 feature maps @ 14x14 (subsampling) -> C3: 16 feature maps @ 10x10 (convolutions) -> S4: 16 feature maps @ 5x5 (subsampling) -> C5: layer of 120 units (full connection) -> F6: layer of 84 units (full connection) -> OUTPUT: 10 units (Gaussian connections)]
LeCun Y, Boser B, Denker J S, et al. Backpropagation applied to handwritten zip code recognition.
Neural computation, 1989, 1(4): 541-551.
• LeNet-5 consists of 7 layers, all of which contain trainable weights
• End-to-end solution: the learned layers replace hand-crafted feature extraction
• Cx: convolution layers with 5x5 filters
• Sx: subsampling (pooling) layers
• Fx: fully-connected layers
61. LeNet 5: Convolution Layer
• Example: a 10x10 input image convolved with a 3x3 filter results in an 8x8 output image
[Figure: a 10x10 input image, a 3x3 filter, and the activation function f producing an 8x8 feature map]
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998
62. LeNet 5: Convolution Layer
• Example: a 10x10 input image with a 3x3 filter results in an 8x8 output image
• Three different filters (with different weights) lead to three 8x8 output images
[Figure: the 10x10 input image passed through three 3x3 kernels and the activation function f, giving three 8x8 feature maps]
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998
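A short sketch of the "valid" convolution described above (implemented as cross-correlation, as is common in CNN code); the random image, kernel and the tanh activation are illustrative choices.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution: an HxW image and a kxk kernel give a
    (H-k+1)x(W-k+1) output map."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((10, 10))                         # 10x10 input image
kernel = rng.standard_normal((3, 3))                 # one 3x3 filter
feature_map = np.tanh(conv2d_valid(image, kernel))   # apply an activation function
print(feature_map.shape)                             # (8, 8), as stated on the slide
```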
63. LeNet 5: (Sampling) Pooling Layer
• Pooling partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, outputs the maximum or average value
• Max pooling: the maximum within each 2x2 region; average pooling: the average within each 2x2 region
• Max pooling reduces computation and is a way of taking the most responsive node of the given region of interest, but may result in a loss of accurate spatial information
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998
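A minimal sketch of non-overlapping 2x2 pooling; the toy 4x4 feature map is for illustration and the function assumes the input dimensions are divisible by the pooling size.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling over size x size regions (here 2x2)."""
    H, W = x.shape
    x = x.reshape(H // size, size, W // size, size)
    return x.max(axis=(1, 3)) if mode == "max" else x.mean(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)   # a toy 4x4 feature map
print(pool2d(fmap, mode="max"))       # 2x2 output, keeps the strongest response
print(pool2d(fmap, mode="average"))   # 2x2 output, averages each region
```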
64. LeNet 5 results
• MNIST (handwritten digits) Dataset:
– 60k training and 10k test examples
• Test error rate 0.95%
[Figure: training and test error rate (%) versus training set size (x1000), with and without distortions]
[Figure: the 82 test-set errors made by LeNet-5; for each error, the correct answer is shown on the left and the machine's answer on the right]
[Figure: activations of LeNet-5 layers C1, S2, C3, S4, C5, F6 and the output for example digits 3, 4 and 8]
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998
http://yann.lecun.com/exdb/mnist/
65. From MNIST to ImageNet
• ImageNet
– Over 15M labeled high
resolution images
– Roughly 22K categories
– Collected from web and
labeled by Amazon
Mechanical Turk
• The Image/scene
classification challenge
– Image/scene
classification
– Metric: Hit@5 error rate
- make 5 guesses about
the image label
http://cognitiveseo.com/blog/6511/will-google-read-rank-images-near-future/
Fig. 4 (Russakovsky et al.): Random selection of images in the ILSVRC detection validation set. The images in the top 4 rows were taken from the ILSVRC2012 single-object localization validation set, and the images in the bottom 4 rows were collected from Flickr using scene-level queries.
On negative training data for detection: the first source takes advantage of all the positive examples available. The second source (24%) is negative images which were part of the original ImageNet collection process but voted as negative: for example, some of the images were collected from Flickr and search engines for the ImageNet synset "animals" but during the manual verification step did not collect enough votes to be considered as containing an "animal." These images were manually re-verified for the detection task to ensure that they did not in fact contain the target objects. The third source (13%) is images collected from Flickr specifically for the detection task. These images were added for ILSVRC2014 following the same protocol as the second type of images in the validation and test set. This was done to bring the training and testing distributions closer together.
Russakovsky O, Deng J, Su H, et al. Imagenet large scale visual recognition challenge[J]. International Journal of Computer Vision, 2015, 115(3): 211-252.
Hit@5 error rate
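A small sketch of how the Hit@5 (top-5) error can be computed from raw class scores for single-label images; the function name and the random toy scores are illustrative.

```python
import numpy as np

def hit_at_5_error(scores, labels):
    """Fraction of images whose true label is NOT among the model's
    5 highest-scoring guesses.
    scores: (N, num_classes) class scores; labels: (N,) true class ids."""
    top5 = np.argsort(-scores, axis=1)[:, :5]
    hits = np.any(top5 == labels[:, None], axis=1)
    return 1.0 - hits.mean()

rng = np.random.default_rng(0)
scores = rng.random((4, 1000))        # toy scores for 4 images, 1000 classes
labels = np.array([3, 999, 10, 42])
print(hit_at_5_error(scores, labels))
```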
66. Leaderboard (ImageNet image classification)
Image classification
Year  Codename       Error (percent)  99.9% Conf Int
2014  GoogLeNet       6.66            6.40 - 6.92
2014  VGG             7.32            7.05 - 7.60
2014  MSRA            8.06            7.78 - 8.34
2014  AHoward         8.11            7.83 - 8.39
2014  DeeperVision    9.51            9.21 - 9.82
2013  Clarifai†      11.20           10.87 - 11.53
2014  CASIAWS†       11.36           11.03 - 11.69
2014  Trimps†        11.46           11.13 - 11.80
2014  Adobe†         11.58           11.25 - 11.91
2013  Clarifai       11.74           11.41 - 12.08
2013  NUS            12.95           12.60 - 13.30
2013  ZF             13.51           13.14 - 13.87
2013  AHoward        13.55           13.20 - 13.91
2013  OverFeat       14.18           13.83 - 14.54
2014  Orange†        14.80           14.43 - 15.17
2012  SuperVision†   15.32           14.94 - 15.69
2012  SuperVision    16.42           16.04 - 16.80
2012  ISI            26.17           25.71 - 26.65
2012  VGG            26.98           26.53 - 27.43
2012  XRCE           27.06           26.60 - 27.52
2012  UvA            29.58           29.09 - 30.04
(† entry used additional external training data)
Single-object localization
Year  Codename       Error (percent)  99.9% Conf Int
2014  VGG            25.32           24.87 - 25.78
2014  GoogLeNet      26.44           25.98 - 26.92
Fig. 10 (Russakovsky et al.): For each object class, we consider the best performance of any entry submitted to ILSVRC2012-2014, including entries using additional training data. The figure shows the distribution of these "optimistic" per-class results. Performance is measured as accuracy for image classification (left) and for single-object localization (middle), and as average precision for object detection (right). While the results are very promising in image classification, the ILSVRC datasets are far from saturated: many object classes continue to be challenging for current algorithms.
Each error is reported with a 99.9% confidence interval, obtained by running a large number of bootstrapping rounds (from 20,000 until convergence). The table shows the results of the top entries to each task of ILSVRC2012-2014. The winning methods are statistically significantly different from the other methods, even at the 99.9% level.
Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015, 115(3): 211-252.
Unofficial human error is around 5.1% on a subset.
Why is there still human error? When labeling, human raters only judged whether an image belongs to a given class (a binary decision), whereas the challenge requires choosing among 1000 classes.
2015 ResNet (ILSVRC'15) 3.57
GoogLeNet: a 22-layer network
Microsoft ResNet: a 152-layer network
U. of Toronto SuperVision: a 7-layer network
http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
67. SuperVision
Figure 2: An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities between the two GPUs. One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts at the bottom. The GPUs communicate only at certain layers. The network's input is 150,528-dimensional, and the number of neurons in the network's remaining layers is given by 253,440-186,624-64,896-64,896-43,264-4096-4096-1000.
The second convolutional layer takes as input the (response-normalized and pooled) output of the first convolutional layer and filters it with 256 kernels of size 5 x 5 x 48. The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers. The third convolutional layer has 384 kernels of size 3 x 3 x 256.
Krizhevsky A, Sutskever I, Hinton G E. Imagenet classification with deep convolutional neural networks,
Advances in neural information processing systems. 2012: 1097-1105.
• Rooted in LeNet-5
• Rectified Linear Units (ReLU), overlapping pooling, and the dropout trick
• Randomly extracted 224x224 patches from the 256x256 images to obtain more training data
68. dropout
• Dropout randomly 'drops' units from a layer on each training step, creating 'sub-architectures' within the model
• It can be viewed as a type of sampling of a smaller network within a larger network
• It helps prevent neural networks from overfitting
Srivastava, Nitish, et al. "Dropout: A simple way to prevent neural networks from
overfitting." The Journal of Machine Learning Research 15.1 (2014): 1929-1958.
70. ResNet: Deep Residual Learning
• Overfitting prevents a neural network from going deeper
• A thought experiment:
  – take a shallower architecture (e.g. logistic regression)
  – add more layers onto it to construct its deeper counterpart
  – a deeper model should then produce no higher training error than its shallower counterpart, if the added layers are identity mappings and the other layers are copied from the learned shallower model
• However, it is reported that adding more layers to a very deep model leads to higher training error
He, Kaiming, et al. "Deep Residual Learning for Image Recognition." arXiv preprint arXiv:1512.03385 (2015).
https://github.com/KaimingHe/deep-residual-networks
[Figure 1 (He et al.): training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer "plain" networks. The deeper network has higher training error, and thus test error.]
71. ResNet: Deep Residual Learning
• Solution: feedforward neural networks with "shortcut connections"
• This leads to an incredible 152-layer deep convolutional network, driving the classification error down to 3.57%!
[Figure 2 (He et al.): residual learning building block. The input x passes through two weight layers with ReLU to produce the residual F(x); an identity shortcut adds x back, giving F(x) + x.]
In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x) − x. The original mapping is recast into F(x) + x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping.
He, Kaiming, et al. "Deep Residual Learning for Image Recognition." arXiv preprint arXiv:1512.03385 (2015).
https://github.com/KaimingHe/deep-residual-networks
[Figure 3 of He et al. (2015): Example network architectures for ImageNet. Left: the VGG-19 model (19.6 billion FLOPs) as a reference. Middle: a plain network with 34 parameter layers (3.6 billion FLOPs). Right: a residual network with 34 parameter layers (3.6 billion FLOPs); the dotted shortcuts increase dimensions.]
1. Denote the desired underlying mapping as H(x).
2. Instead of fitting H(x) directly, a stack of nonlinear layers fits another mapping F(x) := H(x) − x.
3. The original mapping is recast into F(x) + x.
4. During training, it is easy to push the residual F(x) to zero (creating an identity mapping) with a stack of nonlinear layers; a minimal sketch of such a block follows.
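A minimal numpy sketch of a residual block in fully-connected form; the weight names W1/W2, the toy dimensions, and the ReLU-after-addition choice are illustrative assumptions, not He et al.'s convolutional implementation:

```python
import numpy as np

def residual_block(x, W1, W2):
    """Fully-connected residual block: y = ReLU(F(x) + x), where
    F(x) = W2 · ReLU(W1 · x) is the residual mapping the layers must fit.
    If W1, W2 shrink toward zero, the block degenerates to the identity."""
    relu = lambda z: np.maximum(z, 0.0)
    F = W2 @ relu(W1 @ x)      # the residual F(x) = H(x) - x
    return relu(F + x)         # shortcut (identity) connection adds x back

d = 16
x = np.random.randn(d)
W1, W2 = np.random.randn(d, d) * 0.01, np.random.randn(d, d) * 0.01
y = residual_block(x, W1, W2)  # with tiny weights, y ≈ ReLU(x)
```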
[Figure 6 of He et al. (2015): Training on CIFAR-10. Dashed lines denote training error, bold lines denote test error. Left: plain networks (plain-20/32/44/56); the error of plain-110 is higher than 60% and not displayed. Middle: ResNets (ResNet-20/32/44/56/110). Right: ResNets with 110 and 1202 layers.]
73. Recurrent neural networks
• Why do we need another NN model?
– Sequential prediction
– Time series
• A simple recurrent neural network (RNN)
– Training can be done by Backpropagation Through Time (BPTT)
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
Two-layer feedforward network:
s = f(xU)
o = f(sV)
x: input vector, o: output vector, s: hidden state vector,
U: layer 1 parameter matrix, V: layer 2 parameter matrix,
f: tanh or ReLU
Add time-dependency of the hidden state s (W: state transition parameter matrix):
s_{t+1} = f(x_{t+1}U + s_tW)
o_{t+1} = f(s_{t+1}V)
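A minimal numpy sketch of the forward pass of this recurrence; the toy dimensions and random weights are assumptions, and the BPTT backward pass is not shown:

```python
import numpy as np

def rnn_forward(xs, U, W, V, f=np.tanh):
    """Forward pass of the simple RNN above:
    s_t = f(x_t U + s_{t-1} W),  o_t = f(s_t V).
    xs is a sequence of input row vectors."""
    s = np.zeros(U.shape[1])          # initial hidden state s_0 = 0
    states, outputs = [], []
    for x in xs:
        s = f(x @ U + s @ W)          # hidden state carries information over time
        states.append(s)
        outputs.append(f(s @ V))
    return states, outputs

# toy dimensions: 3-dim input, 5-dim hidden state, 2-dim output
U, W, V = (np.random.randn(3, 5) * 0.1,
           np.random.randn(5, 5) * 0.1,
           np.random.randn(5, 2) * 0.1)
xs = [np.random.randn(3) for _ in range(4)]   # a length-4 input sequence
states, outputs = rnn_forward(xs, U, W, V)
```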
74. LSTMs (Long Short Term Memory networks)
• A standard recurrent network (SRN) has a problem learning long-term dependencies (due to vanishing gradients of saturated nodes)
• An LSTM provides another way to compute a hidden state
Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997): 1735-1780.
http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/
SRN cell: s_t = tanh(x_tU + s_{t−1}W)
LSTM cell: an input gate i, forget gate f, and output gate o (each gated by σ, the sigmoid, a control signal between 0 and 1) decide what enters, stays in, and leaves the cell's internal memory c_t; a “candidate” hidden state is combined with c_{t−1} to form c_t, and the hidden state s_t is read out from c_t (∘ denotes elementwise multiplication).
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
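A minimal numpy sketch of one LSTM step in the standard formulation used by the linked tutorials; the parameter names and toy sizes are assumptions, and variants (peephole connections, biases) are omitted:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, s_prev, c_prev, P):
    """One LSTM step: gates i, f, o control what enters, stays in,
    and leaves the cell's internal memory c."""
    i = sigmoid(x @ P["Ui"] + s_prev @ P["Wi"])   # input gate
    f = sigmoid(x @ P["Uf"] + s_prev @ P["Wf"])   # forget gate
    o = sigmoid(x @ P["Uo"] + s_prev @ P["Wo"])   # output gate
    g = np.tanh(x @ P["Ug"] + s_prev @ P["Wg"])   # "candidate" hidden state
    c = f * c_prev + i * g                        # cell internal memory
    s = o * np.tanh(c)                            # new hidden state
    return s, c

# toy dimensions: 3-dim input, 5-dim hidden/cell state
rng = np.random.default_rng(0)
P = {k: rng.standard_normal((3, 5)) * 0.1 for k in ("Ui", "Uf", "Uo", "Ug")}
P.update({k: rng.standard_normal((5, 5)) * 0.1 for k in ("Wi", "Wf", "Wo", "Wg")})
s, c = lstm_step(rng.standard_normal(3), np.zeros(5), np.zeros(5), P)
```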
75. • A ClockWork RNN (CW-RNN) is similar to a simple RNN with an input, output and hidden layer
• The difference lies in
– The hidden layer is partitioned into g modules, each with its own clock rate
– Neurons in a faster module are connected to neurons in a slower module
RNN applications: time series
Koutnik, Jan, et al. "A clockwork rnn." arXiv preprint arXiv:1402.3511 (2014).
[Figure 1 of Koutnik et al. (2014): The CW-RNN architecture is similar to a simple RNN with an input, output and hidden layer. The hidden layer is partitioned into g modules, each with its own clock rate. Within each module the neurons are fully interconnected. Neurons in a faster module i are connected to neurons in a slower module j only if a clock period Ti < Tj.]
76. Word embedding
• From bag of words to word embedding
– Use a real-valued vector in R^m to represent a word (concept)
v(‘‘cat") = (0.2, -0.4, 0.7, ...)
v(‘‘mat") = (0.0, 0.6, -0.1, ...)
• Continuous bag-of-words (CBOW) model (word2vec)
– Input/output words x/y are one-hot encoded
– Hidden layer is shared for all input words
Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
Continuous bag-of-words (CBOW) model
Rong, Xin. "word2vec parameter learning explained." arXiv preprint arXiv:1411.2738 (2014).
[Figure 2 of Rong (2014): the continuous bag-of-words model. C one-hot input vectors x_1, ..., x_C (each V-dim) share the input→hidden weight matrix W (V×N); the hidden layer h is N-dim (the N-dim vector representation of a word); the hidden→output matrix W' (N×V) produces the V-dim output y.]
Notation: V: vocabulary size; C: number of input (context) words; v_w: row vector of the input matrix W (the input vector of word w); v'_w: the output vector of word w in the output matrix W'.

Hidden nodes: when computing the hidden layer output, instead of directly copying the input vector of a single context word, the CBOW model takes the average of the input vectors of the context words w_1, ..., w_C:
    h = (1/C) · Wᵀ(x_1 + x_2 + ... + x_C)                    (17)
      = (1/C) · (v_{w_1} + v_{w_2} + ... + v_{w_C})           (18)

The cross-entropy loss:
    E = −log p(w_O | w_{I,1}, ..., w_{I,C})                   (19)
      = −u_{j*} + log Σ_{j'=1..V} exp(u_{j'})                 (20)
      = −v'_{w_O}ᵀ·h + log Σ_{j'=1..V} exp(v'_{w_{j'}}ᵀ·h)    (21)
which is the same as the objective of the one-word-context model, except that h is defined by the context average in (18) instead of a single input vector.

The gradient updates: the update equation for the hidden→output weights stays the same as in the one-word-context model,
    v'_{w_j}(new) = v'_{w_j}(old) − η · e_j · h    for j = 1, 2, ..., V,   (22)
and it must be applied to every element of the hidden→output weight matrix for each training instance. The update for the input→hidden weights is applied to every word w_{I,c} in the context:
    v_{w_{I,c}}(new) = v_{w_{I,c}}(old) − (1/C) · η · EH    for c = 1, 2, ..., C,   (23)
where v_{w_{I,c}} is the input vector of the c-th context word, η is a positive learning rate, and EH = ∂E/∂h.

Intuitively, the larger the prediction error, the more significant the effect a word exerts on the movement of the input vector of the context word. As we iteratively update the model parameters by going through context-target word pairs generated from a training corpus, the effects on the vectors accumulate. We can imagine that the output vector of a word w is “dragged” back and forth by the input vectors of w's co-occurring neighbors, as if there were physical strings between the vector of w and the vectors of its neighbors. Similarly, an input vector can be considered as being dragged by many output vectors. This interpretation is reminiscent of gravity, or force-directed graph layout. The equilibrium length of each imaginary string is related to the strength of co-occurrence between the associated pair of words, as well as the learning rate. After many iterations, the relative positions of the input and output vectors eventually stabilize.
Interpreting the hidden→output update
    v'_{w_j}(new) = v'_{w_j}(old) − η · e_j · h    for j = 1, 2, ..., V,
where η > 0 is the learning rate, e_j = y_j − t_j, h_i is the i-th unit in the hidden layer, and v'_{w_j} is the output vector of w_j: this update means we go through every possible word in the vocabulary, check its output probability y_j, and compare y_j with its expected output t_j (either 0 or 1). If y_j > t_j (“overestimating”), we subtract a proportion of the hidden vector h (i.e., v_{w_I}) from v'_{w_O}, moving it farther away from v_{w_I}; if y_j < t_j (“underestimating”), we add some h to v'_{w_O}, moving it closer to v_{w_I}. If y_j is very close to t_j, little change is made to the weights. Note, again, that v_w (input vector) and v'_w (output vector) are two different vector representations of the word w.

Update equation for input→hidden weights: having obtained the update equations for W', we take the derivative of E with respect to the output of the hidden layer, obtaining
    ∂E/∂h_i = Σ_{j=1..V} (∂E/∂u_j) · (∂u_j/∂h_i) = Σ_{j=1..V} e_j · w'_{ij} := EH_i
where h_i is the output of the i-th unit of the hidden layer, u_j is the net input of the j-th unit in the output layer, and e_j = y_j − t_j is the prediction error.
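A minimal numpy sketch of one CBOW SGD step implementing Eqs. (17)-(23) as reconstructed above; the toy vocabulary size and learning rate are assumptions, and the full-softmax update is used rather than word2vec's hierarchical softmax or negative sampling:

```python
import numpy as np

def cbow_step(context_ids, target_id, W, W_out, eta=0.1):
    """One CBOW training step (Rong 2014): average the context input vectors,
    compute the softmax output, then apply the hidden->output update (Eq. 22)
    and the input->hidden update (Eq. 23)."""
    C = len(context_ids)
    h = W[context_ids].mean(axis=0)          # Eq. (18): average of v_{w_c}
    u = h @ W_out                            # scores u_j = v'_{w_j}ᵀ h
    y = np.exp(u - u.max()); y /= y.sum()    # softmax probabilities
    e = y.copy(); e[target_id] -= 1.0        # prediction error e_j = y_j - t_j
    EH = W_out @ e                           # EH_i = sum_j e_j * w'_ij
    W_out -= eta * np.outer(h, e)            # Eq. (22): all output vectors
    W[context_ids] -= (eta / C) * EH         # Eq. (23): each context word
    return -np.log(y[target_id])             # the loss E, for monitoring

V, N = 10, 4                                 # toy vocabulary and embedding size
rng = np.random.default_rng(0)
W, W_out = rng.standard_normal((V, N)) * 0.1, rng.standard_normal((N, V)) * 0.1
loss = cbow_step([1, 3, 5], target_id=2, W=W, W_out=W_out)
```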
77. Remarkable properties from Word embedding
• Simple algebraic operations with the word vectors
v(‘‘woman")−v(‘‘man") ≃ v(‘‘aunt")−v(‘‘uncle")
v(‘‘woman")−v(‘‘man") ≃ v(‘‘queen")−v(‘‘king")
Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." HLT-NAACL. 2013.
[Figure 2 of Mikolov et al. (2013): Left panel shows vector offsets for three word pairs illustrating the gender relation. Right panel shows a different projection, and the singular/plural relation for two words. In high-dimensional space, multiple relations can be embedded for a single word.]
From the paper: “We have explored several related methods and found that the proposed method performs well for both syntactic and semantic relations. We note that this measure is qualitatively similar to the relational similarity model of (Turney, 2012), which predicts similarity between members of the word pairs (xb, xd), (xc, xd) and dis-similarity for (xa, xd).”
Left: vector offsets for the gender relation. Right: the singular/plural relation for two words.
The word relationship is defined by subtracting two word vectors, and the result is added to another word. Thus, for example, Paris − France + Italy = Rome.
Using X = v(”biggest”) − v(”big”) + v(”small”) as a query and searching for the nearest word based on cosine distance results in v(”smallest”).
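A small numpy sketch of the vector-offset analogy search by cosine similarity; the toy vocabulary and 2-D embeddings below are made up purely to show the mechanics:

```python
import numpy as np

def analogy(a, b, c, vocab, E):
    """Return the word whose embedding is nearest (by cosine similarity) to
    v(b) - v(a) + v(c), excluding the three query words themselves."""
    idx = {w: i for i, w in enumerate(vocab)}
    q = E[idx[b]] - E[idx[a]] + E[idx[c]]
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E_norm @ (q / np.linalg.norm(q))
    for w in (a, b, c):
        sims[idx[w]] = -np.inf          # never return a query word
    return vocab[int(np.argmax(sims))]

# toy example: made-up 2-D embeddings, just to exercise the search
vocab = ["king", "queen", "man", "woman"]
E = np.array([[0.9, 0.1], [0.9, 0.8], [0.1, 0.1], [0.1, 0.8]])
print(analogy("man", "woman", "king", vocab, E))   # -> "queen" for this toy data
```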
Single-prototype English embeddings by Huang
et al. (2012) are used to initialize Chinese em-
beddings. The initialization readily provides a set
(Align-Init) of benchmark embeddings in experi-
ments (Section 4), and ensures translation equiva-
lence in the embeddings at start of training.
3.2.2 Bilingual training
Using the alignment counts, we form alignment
matrices Aen→zh and Azh→en. For Aen→zh, each
row corresponds to a Chinese word, and each col-
umn an English word. An element aij is first as-
signed the counts of when the ith Chinese word is
aligned with the jth English word in parallel text.
After assignments, each row is normalized such that
it sums to one. The matrix Azh→en is defined sim-
ilarly. Denote the set of Chinese word embeddings
as Vzh, with each row a word embedding, and the
set of English word embeddings as Ven. With the
two alignment matrices, we define the Translation
Equivalence Objective:
J_TEO-en→zh = ∥V_zh − A_en→zh · V_en∥²   (3)
J_TEO-zh→en = ∥V_en − A_zh→en · V_zh∥²   (4)
We optimize for a combined objective during train-
ing. For the Chinese embeddings we optimize for:
JCO-zh + λJTEO-en→zh (5)
For the English embeddings we optimize for:
JCO-en + λJTEO-zh→en (6)
During bilingual training, we chose the value of λ
such that convergence is achieved for both JCO and
JTEO. A small validation set of word similarities
3.3 Curriculum training
We train 100k-vocabulary word embeddings using
curriculum training (Turian et al., 2010) with Equa-
tion 5. For each curriculum, we sort the vocabu-
lary by frequency and segment the vocabulary by a
band-size taken from {5k, 10k, 25k, 50k}. Separate
bands of the vocabulary are trained in parallel using
minibatch L-BFGS on the Chinese Gigaword cor-
pus 3. We train 100,000 iterations for each curricu-
lum, and the entire 100k vocabulary is trained for
500,000 iterations. The process takes approximately
19 days on an eight-core machine. We show visual-
ization of learned embeddings overlaid with English
in Figure 1. The two-dimensional vectors for this vi-
sualization is obtained with t-SNE (van der Maaten
and Hinton, 2008). To make the figure comprehen-
sible, subsets of Chinese words are provided with
reference translations in boxes with green borders.
Words across the two languages are positioned by
the semantic relationships implied by their embed-
dings.
Figure 1: Overlaid bilingual embeddings: English words are plotted in yellow boxes, and Chinese words in green; reference translations to English are provided in boxes.
Zou, Will Y., et al. "Bilingual Word Embeddings for Phrase-Based Machine Translation." EMNLP. 2013.
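A small numpy sketch of the translation-equivalence penalty of Eq. (3); the function name, toy alignment counts, and matrix shapes are assumptions for illustration, not the authors' code:

```python
import numpy as np

def translation_equivalence_loss(V_zh, V_en, counts_en2zh):
    """Eq. (3)-style penalty: each Chinese embedding is pulled toward the
    alignment-weighted average of the English embeddings it aligns to."""
    A = counts_en2zh / counts_en2zh.sum(axis=1, keepdims=True)  # row-normalize counts
    return np.sum((V_zh - A @ V_en) ** 2)                       # squared Frobenius norm

# toy setup: 3 Chinese words, 4 English words, 5-dim embeddings
rng = np.random.default_rng(0)
V_zh, V_en = rng.standard_normal((3, 5)), rng.standard_normal((4, 5))
counts = rng.integers(1, 10, size=(3, 4)).astype(float)         # alignment counts
J = translation_equivalence_loss(V_zh, V_en, counts)
```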
78. Neural Language models
• n-gram model
– Construct conditional probabilities for the next word, given combinations of the last n−1 words (contexts)
• Neural language model
– associate with each word a distributed word feature vector (word embedding),
– express the joint probability function of word sequences using those vectors, and
– learn simultaneously the word feature vectors and the parameters of that probability function.
From Bengio et al.: A statistical model of language can be represented by the conditional probability of the next word given all the previous ones, since
    P̂(w_1^T) = ∏_{t=1..T} P̂(w_t | w_1^{t−1}),
where w_t is the t-th word and w_i^j = (w_i, w_{i+1}, ..., w_{j−1}, w_j) denotes a sub-sequence. Such statistical language models have already been found useful in many technological applications involving natural language, such as speech recognition, language translation, and information retrieval; improvements in statistical language models could thus have a significant impact on such applications.
When building statistical models of natural language, one considerably reduces the difficulty of the modeling problem by taking advantage of word order, and of the fact that temporally closer words in the word sequence are statistically more dependent. Thus, n-gram models construct tables of conditional probabilities for the next word, for each one of a large number of contexts, i.e., combinations of the last n−1 words:
    P̂(w_t | w_1^{t−1}) ≈ P̂(w_t | w_{t−n+1}^{t−1}).
Only those combinations of successive words that actually occur in the training corpus, or that occur frequently enough, are considered. What happens when a new combination of n words appears that was not seen in the training corpus? We do not want to assign zero probability to such cases, because such new combinations are likely to occur. A simple answer is to look at the probability predicted using a smaller context, as in back-off trigram models (Katz, 1987) or in smoothed (interpolated) trigram models: essentially, a new sequence of words is generated by “gluing” very short pieces of length 1, 2, ... up to n words that have been seen frequently in the training corpus. Typically researchers have used n = 3, i.e., trigrams.
[Figure 1 of Bengio et al.: Neural architecture: f(i, w_{t−1}, ..., w_{t−n+1}) = g(i, C(w_{t−1}), ..., C(w_{t−n+1})), where g is the neural network and C(i) is the i-th word feature vector. The indices of the last n−1 words select rows of a shared look-up table C; most computation happens in the tanh hidden layer and the softmax output, whose i-th output is P(w_t = i | context). The parameters of the mapping C are simply the feature vectors themselves.]
Bengio, Yoshua, et al. "Neural probabilistic language models." Innovations in Machine Learning. Springer Berlin Heidelberg, 2006. 137-186.
parameters of the mapping C are simply the feature vectors themselves, represenBengio, Yoshua, et al. "Neural probabilistic language models." Innovations in Machine Learning. Springer Berlin Heidelberg, 2006. 137-186.