Deep	Learning
Dr.	Jun	Wang
Computer	Science,	UCL
As	part of COMPGI15/M052	Information	Retrieval	and	Data	Mining	 (IRDM)
AlphaGo vs.	the	world’s	‘Go’	champion
Coulom,	Rémi.	"Whole-history	rating:	A	Bayesian	rating	system	for	
players	of	time-varying	strength."	Computers	and	games.	Springer	
Berlin	Heidelberg,	2008.	113-124.
http://www.goratings.org/
Deep	Learning	helps	Speech	Translation
http://www.nytimes.com/2012/11/24/science/scientists-see-advances-in-deep-learning-a-part-of-artificial-intelligence.html
http://research.microsoft.com/en-us/about/speech-to-speech-milestones.aspx
Deep	Learning	helps	Computer	Vision
http://blogs.technet.com/b/inside_microsoft_research/archive/2015/02/10/microsoft-researchers-algorithm-sets-imagenet-challenge-milestone.aspx
Leader table (ImageNet image classification)
Image classification
Year Codename Error (percent) 99.9% Conf Int
2014 GoogLeNet 6.66 6.40 - 6.92
2014 VGG 7.32 7.05 - 7.60
2014 MSRA 8.06 7.78 - 8.34
2014 AHoward 8.11 7.83 - 8.39
2014 DeeperVision 9.51 9.21 - 9.82
2013 Clarifai† 11.20 10.87 - 11.53
2014 CASIAWS† 11.36 11.03 - 11.69
2014 Trimps† 11.46 11.13 - 11.80
2014 Adobe† 11.58 11.25 - 11.91
2013 Clarifai 11.74 11.41 - 12.08
2013 NUS 12.95 12.60 - 13.30
2013 ZF 13.51 13.14 - 13.87
2013 AHoward 13.55 13.20 - 13.91
2013 OverFeat 14.18 13.83 - 14.54
2014 Orange† 14.80 14.43 - 15.17
2012 SuperVision† 15.32 14.94 - 15.69
2012 SuperVision 16.42 16.04 - 16.80
2012 ISI 26.17 25.71 - 26.65
2012 VGG 26.98 26.53 - 27.43
2012 XRCE 27.06 26.60 - 27.52
2012 UvA 29.58 29.09 - 30.04
Single-object localization
Year Codename Error (percent) 99.9% Conf Int
2014 VGG 25.32 24.87 - 25.78
2014 GoogLeNet 26.44 25.98 - 26.92
Fig. 10: for each object class, the best performance of any entry submitted to ILSVRC2012-2014 is considered (excluding entries using additional training data), giving the distribution of these "optimistic" per-class results. Performance is measured as accuracy for image classification (left) and for single-object localization (middle), and as average precision for object detection (right). While the results are very promising in image classification, the ILSVRC datasets are far from saturated: many object classes continue to be challenging for current algorithms. Confidence intervals are obtained with a large number of bootstrapping rounds (from 20,000 until convergence); the winning methods are statistically significantly different from the other entries, even at the 99.9% level.
Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015, 115(3): 211-252.
Unofficial human error is around 5.1% on a subset.
Why is there still human error? When labeling, human raters judged whether an image belongs to a given class (binary classification); the challenge, however, is a 1000-class classification problem.
2015			ResNet (ILSVRC’15)	3.57
GoogLeNet, a 22-layer network
Microsoft ResNet, a 152-layer network
U. of Toronto, SuperVision, a 7-layer network
http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
Deep	Learning	helps	Natural	Language	Processing
https://code.google.com/p/word2vec/
[Figure: visualization of regularities in word vector space — a 2-D projection of word vectors showing paired concepts such as princess/prince, heroine/hero, actress/actor, landlady/landlord, female/male, cow/bull, hen/cock, queen/king, she/he.]
From bag of words to a distributed representation of words, to decipher the semantic meaning of a word
Compositionality by Vector Addition
Expression Nearest tokens
Czech + currency koruna, Czech crown, Polish zloty, CTK
Vietnam + capital Hanoi, Ho Chi Minh City, Viet Nam, Vietnamese
German + airlines airline Lufthansa, carrier Lufthansa, flag carrier Lufthansa
Russian + river Moscow, Volga River, upriver, Russia
French + actress Juliette Binoche, Vanessa Paradis, Charlotte Gainsbourg
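As an illustration of this compositionality, here is a toy sketch with made-up 3-dimensional vectors and a tiny vocabulary (real word2vec embeddings typically have hundreds of dimensions; every vector and word below is hypothetical, chosen only to make the nearest-neighbour lookup readable):

# Toy sketch of compositionality by vector addition (illustrative only:
# the 3-d vectors below are made up, not real word2vec embeddings).
import numpy as np

vocab = {
    "Vietnam": np.array([0.9, 0.1, 0.0]),
    "capital": np.array([0.0, 0.8, 0.3]),
    "Hanoi":   np.array([0.8, 0.7, 0.2]),
    "Paris":   np.array([0.1, 0.7, 0.9]),
    "river":   np.array([0.2, 0.1, 0.8]),
}

def nearest(query, k=2):
    # rank vocabulary words by cosine similarity to the query vector
    sims = {w: query @ v / (np.linalg.norm(query) * np.linalg.norm(v))
            for w, v in vocab.items()}
    return sorted(sims, key=sims.get, reverse=True)[:k]

query = vocab["Vietnam"] + vocab["capital"]   # "Vietnam + capital"
print(nearest(query))                         # with these toy vectors, Hanoi ranks first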
Deep	Learning	helps	Control
http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html#videos
>75%	human	level
‘Deep	Learning’	buzz
‘deep learning’ was reborn in 2006 and has recently become a buzzword, with the explosion of ‘big data’, ‘data science’, and their applications mentioned in the media.
big	data
neural	networks
data	science
deep	learning
What	is	Deep	Learning?
• “deep	learning”	has	its	roots	in	“neural	networks”	
• deep learning creates many layers (hence "deep") of neurons, attempting to learn a structured representation of big data, layer by layer
A	computerised	
artificial neuron
Le,	Quoc V.	"Building	high-level	features	using	
large	scale	unsupervised	 learning."	Acoustics,	
Speech	and	Signal	Processing	(ICASSP),	2013	IEEE	
International	Conference	on.	IEEE,	2013.
A	biological	
neuron
Brief	History
• The First	wave
– 1943	McCulloch	and	Pitts	proposed	the	McCulloch-Pitts	
neuron	model
– 1958	Rosenblatt	introduced	the	simple	single	layer	
networks	now	called	Perceptrons.
– 1969	Minsky and	Papert’s book	Perceptrons demonstrated	
the	limitation	of	single	layer	perceptrons,	and	almost	the	
whole	field	went	into	hibernation.
• The Second	wave
– 1986	The	Back-Propagation	learning	algorithm	for	Multi-
Layer	Perceptrons was	rediscovered	and	the	whole	field	
took	off	again.
• The Third	wave
– 2006 Deep (neural network) learning regains popularity and
– 2012 makes significant breakthroughs in many applications.
Summary
• Neural	networks	basics
– its	backpropagation learning	algorithm
• Go	deeper	in	layers
• Convolutional	neural	networks	(CNN)	
– address	feature	dependency
• Recurrent	neural	networks	(RNN)
– address	time-dependency
• Deep	reinforcement	learning
– have	the	ability	of	real-time	control
Summary
• Neural	networks	basics
• Go	deeper	in	layers
• Convolutional	neural	networks	(CNN)
• Recurrent	neural	networks	(RNN)
• Deep	reinforcement	learning
Early	Artificial	Neurons		(1943-1969)
• A	biological	neuron
• In	the	human	brain,	a	typical	
neuron	collects	signals	from	
others	through	a	host	of	fine	
structures	called	dendrites
• The	neuron	sends	out	spikes	
of	electrical	activity	through	
a long, thin strand known as
an	axon,	which	splits	into	
thousands	of	branches
• At	the	end	of	each	branch,	a	
structure	called	a	synapse
converts	the	activity	from	
the	axon	into	electrical	
effects	that	inhibit	or	excite	
activity	in	the	connected	
neurons.
Early	Artificial	Neurons	(1943-1969)
• McCulloch- Pitts	neuron	
• McCulloch	and	Pitts
[1943]	proposed	a	
simple	model	of	a	
neuron	as	computing	
machine
• The	artificial	neuron	
computes	a	weighted	
sum of its inputs from
other	neurons,	and	
outputs	a	one	or	a	zero	
according	to	whether	
the	sum	is	above	or	
below	a	certain	
threshold
Inputs x1, x2, x3, ...; output y.
y = f( Σ_m w_m x_m − b ),  where  f(x) = 1 if x ≥ 0; 0 otherwise
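A minimal Python sketch of this neuron (the weights and threshold below are illustrative; with weights (1, 1) and threshold 1.5 the unit computes a logical AND):

# McCulloch-Pitts neuron: weighted sum of binary inputs, thresholded at b.
def mp_neuron(x, w, b):
    s = sum(wm * xm for wm, xm in zip(w, x)) - b   # sum_m w_m x_m - b
    return 1 if s >= 0 else 0

print(mp_neuron([1, 1], [1, 1], 1.5))  # 1  (both inputs active)
print(mp_neuron([1, 0], [1, 1], 1.5))  # 0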
Early	Artificial	Neurons	(1943-1969)
• Rosenblatt’s	single	layer	perceptron
• Rosenblatt	[1958] further	
proposed	the	perceptron
as	the	first	model	for	
learning	with	a	teacher	
(i.e.,	supervised	learning)
• Focused	on	how	to	find	
appropriate	weights	wm
for	two-class	classification	
task	(y	=	1:	class	one	and	
y	=	-1:	class	two)
y = φ( Σ_m w_m x_m − b )
Activation function: φ(x) = 1 if x ≥ 0; −1 otherwise
[Figure: signal-flow graph of the perceptron — inputs x1, x2, ..., xm, weights w1, w2, ..., wm, bias b, summing node, hard limiter, output y (Haykin, 2009).]
Early	Artificial	Neurons	(1943-1969)
• Rosenblatt’s	single	layer	perceptron
• Proved the convergence of a learning algorithm if the two classes are linearly separable (i.e., patterns that lie on opposite sides of a hyperplane)
• Many people hoped that such machines could be a basis for artificial intelligence
y = φ( Σ_m w_m x_m + b ),  where  φ(x) = 1 if x ≥ 0; −1 otherwise
1.2 PERCEPTRON
Rosenblatt's perceptron is built around a nonlinear neuron, namely, the McCulloch–Pitts model of a neuron. From the introductory chapter we recall that such a neural model consists of a linear combiner followed by a hard limiter (performing the signum function), as depicted in Fig. 1.1. The summing node of the neural model computes a linear combination of the inputs applied to its synapses, as well as incorporates an externally applied bias. The resulting sum, that is, the induced local field, is applied to a hard limiter.
FIGURE 1.1 Signal-flow graph of the perceptron (inputs x1, ..., xm, weights w1, ..., wm, bias b, hard limiter, output y).
y(t) = φ( Σ_m w_m(t) x_m(t) + b(t) ),  t = 1, ..., T
d(t) is the training label and η is a parameter (the learning rate)
Δw_m(t) = η ( d(t) − y(t) ) x_m(t)
w_m(t+1) = w_m(t) + Δw_m(t)
Haykin, S. Neural Networks and Learning Machines. Upper Saddle River: Pearson Education, 2009.
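A minimal numpy sketch of this learning rule, Δw_m = η (d − y) x_m (the learning rate, number of epochs and the toy linearly separable data below are illustrative):

import numpy as np

def train_perceptron(X, d, eta=0.1, epochs=20):
    # X: (T, m) inputs; d: (T,) labels in {+1, -1}
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_t, d_t in zip(X, d):
            y_t = 1 if w @ x_t + b >= 0 else -1   # phi(sum_m w_m x_m + b)
            w += eta * (d_t - y_t) * x_t          # delta w_m = eta (d - y) x_m
            b += eta * (d_t - y_t)
    return w, b

# toy linearly separable problem (logical OR, labels in {+1, -1})
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
d = np.array([-1, 1, 1, 1])
print(train_perceptron(X, d))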
Early	Artificial	Neurons	(1943-1969)
• The	XOR	problem
• However, Minsky and Papert [1969] showed that some rather elementary computations, such as the XOR problem, could not be done by Rosenblatt's one-layer perceptron
• Rosenblatt believed, however, that the limitations could be overcome if more layers of units were added, but no learning algorithm to obtain the weights was known yet
• Due to the lack of learning algorithms, people left the neural network paradigm for almost 20 years
Input x: X1, X2;  Output y: X1 XOR X2
0 0 → 0
0 1 → 1
1 0 → 1
1 1 → 0
[Plot: the four points in the (X1, X2) plane, with (0,1) and (1,0) labeled true and (0,0) and (1,1) labeled false.]
XOR is not linearly separable: these two classes (true and false) cannot be separated using a line.
Hidden	layers	and	the	backpropagation (1986~)
• Adding hidden layer(s) (an internal representation) allows the network to learn a mapping that is not constrained by linear separability
decision boundary: x1·w1 + x2·w2 + b = 0
[Figure: a single unit with inputs x1, x2, weights w1, w2 and bias b draws one line separating class 1 from class 2; with a hidden layer, several such lines bound a convex region of class 1 surrounded by class 2.]
Each	hidden	node	
realises one	of	the	
lines	bounding	the	
convex	region
Hidden	layers	and	the	backpropagation (1986~)
• But	the	solution	is	quite	often	not	unique		
Input x: X1, X2;  Output y: X1 XOR X2  (0 0 → 0,  0 1 → 1,  1 0 → 1,  1 1 → 0)
The	number	in	the	circle	is	a	threshold
http://recognize-speech.com/basics/introduction-to-artificial-neural-networks
http://www.cs.stir.ac.uk/research/publications/techreps/pdf/TR148.pdf
(solution	 1)
(solution	 2)
ϕ(⋅) is the sign function
two	lines	are	necessary	to	divide	the	sample	
space	accordingly
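A minimal sketch of one possible solution with a single hidden layer of threshold units (these particular weights and thresholds are illustrative and may differ from the values in the slide's figure; the two hidden units realise the two bounding lines, OR and NAND, and the output unit ANDs them):

def step(x):
    return 1 if x >= 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)        # OR:   fires if at least one input is 1
    h2 = step(-x1 - x2 + 1.5)       # NAND: fires unless both inputs are 1
    return step(h1 + h2 - 1.5)      # AND of the two hidden units = XOR

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))  # reproduces the XOR truth table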
Hidden	layers	and	the	backpropagation (1986~)
Two-layer	feedforward neural	network
• Feedforward: messages move forward from the input nodes, through the hidden nodes (if any), and to the output nodes. There are no cycles or loops in the network.
[Figure: two layers of parameters (weights), one for the hidden layer and one for the output layer.]
Hidden	layers	and	the	backpropagation (1986~)
• One of the efficient algorithms for multi-layer neural networks is the backpropagation algorithm
• It was re-introduced in 1986, and neural networks regained their popularity
Note: backpropagation appears to have been first developed by Werbos [1974]; it was then independently rediscovered around 1985 by Rumelhart, Hinton, and Williams [1986] and by Parker [1985]
[Figure: feed-forward pass through both layers of parameters (weights), error calculation at the output, then error backpropagation.]
Example	Application:	Face	Detection
Figure 2. System for Neural Network based face detection in the compressed domain
Rowley,	Henry	A.,	Shumeet Baluja,	and	Takeo	Kanade.	"Neural	network-based	face	detection."	Pattern	Analysis	and	Machine	Intelligence,	IEEE	
Transactions	on	20.1	(1998):	23-38.
Wang	J,	Kankanhalli M	S,	Mulhem P,	et	al.	Face	detection	using	DCT	coefficients	in	MPEG	video[C]//Proceedings	of	the	International	Workshop	
on	Advanced	Imaging	Technology	(IWAIT	2002).	2002:	66-70.
Example	Application:	Face	Detection
Figure 2. System for Neural Network based face detection in the compressed domain
Eleftheriadis[13] tackle this problem by including
additional face patterns arising from different block
alignment positions as positive training examples. But this
method induces too many variations, not necessarily
belonging to inter-class variations, to the positive training
samples. This in turn makes the high frequency DCT
coefficients unreliable for both face model training and
face detection functions.
Fortunately, there exist fast algorithms to calculate
reconstructed DCT coefficients for overlapping blocks
[16][17][18]. These methods help to calculate DCT
coefficients for scan window steps of even up to one pixel.
However a good tradeoff between speed and accuracy is
obtained by reconstructing the coefficients for scan
window steps of two pixels.
In order to detect faces of different sizes, compressed
Figure 3. Training Data Preparation
Suppose x1, x2, …, xn are the DCT coefficients retained from the feature extraction stage and x1(j), x2(j), …, xn(j), j = 1, 2, …, p, are the corresponding DCT coefficients retained from the training examples, where n is the number of DCT coefficient features retained (currently, we use 126 DCT coefficient features extracted from a 4×4-block-sized square window) and p is the number of training samples. The upper bounds (Ui) and lower bounds (Li) can be estimated by
Ui = α · max{1, xi(1), …, xi(p)},  i = 1, 2, …, n   (1)
Wang	J,	Kankanhalli M	S,	Mulhem P,	et	al.	Face	detection	using	DCT	coefficients	in	MPEG	video[C]//Proceedings	of	the	International	Workshop	
on	Advanced	Imaging	Technology	(IWAIT	2002).	2002:	66-70.
How it works (training)
[Figure: a two-layer feedforward network for face detection. Each training instance (label = face or label = no face) provides inputs x1, x2, ..., xm and target outputs d1 = 1, d2 = 0 for the two output units y0, y1; the error is calculated at the output (∑) and backpropagated to update the parameters (weights) of both layers.]
Training instances…
Two-layer feedforward neural network
When making a prediction, feed-forward/prediction:
input x = (x1, ..., xm)
hidden layer:  h_j^(1) = f1( net_j^(1) ) = f1( Σ_m w_{j,m}^(1) x_m ),  where  net_j^(1) ≡ Σ_m w_{j,m}^(1) x_m
output layer:  y_k = f2( net_k^(2) ) = f2( Σ_j w_{k,j}^(2) h_j^(1) ),  where  net_k^(2) ≡ Σ_j w_{k,j}^(2) h_j^(1)
[Figure: inputs x1, x2, ..., xm feed through weights w_{j,m}^(1) and activations f1 into hidden units h_1^(1), h_2^(1), ..., h_j^(1), then through weights w_{k,j}^(2) and activations f2 into outputs y_1, ..., y_k, which are compared with the labels d_1, ..., d_k.]
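A minimal numpy sketch of this feed-forward/prediction step (sigmoid is assumed for both f1 and f2 purely for illustration; the layer sizes and random weights are made up):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W1, W2, f1=sigmoid, f2=sigmoid):
    net1 = W1 @ x          # net_j^(1) = sum_m w_{j,m}^(1) x_m
    h = f1(net1)           # h_j^(1) = f1(net_j^(1))
    net2 = W2 @ h          # net_k^(2) = sum_j w_{k,j}^(2) h_j^(1)
    y = f2(net2)           # y_k = f2(net_k^(2))
    return h, y

rng = np.random.default_rng(0)
x = rng.random(4)                    # m = 4 inputs
W1 = rng.normal(size=(3, 4))         # J = 3 hidden units
W2 = rng.normal(size=(2, 3))         # K = 2 outputs
h, y = forward(x, W1, W2)
print(y)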
When we backprop/learn the parameters (two-layer feedforward neural network)
E(W) = (1/2) Σ_k ( y_k − d_k )²
Notation: net_k^(2) ≡ Σ_j w_{k,j}^(2) h_j^(1),  net_j^(1) ≡ Σ_m w_{j,m}^(1) x_m
Backprop to learn the parameters (η: the given learning rate).
Output neuron:
Δw_{k,j}^(2) ≡ −η ∂E(W)/∂w_{k,j}^(2) = −η ( y_k − d_k ) ( ∂y_k/∂net_k^(2) ) ( ∂net_k^(2)/∂w_{k,j}^(2) ) = η ( d_k − y_k ) f2'( net_k^(2) ) h_j^(1) = η δ_k h_j^(1)
Error: δ_k ≡ ( d_k − y_k ) f2'( net_k^(2) )
Δw_{k,j}^(2) = η Error_k Output_j = η δ_k h_j^(1)
Update: w_{k,j}^(2) ← w_{k,j}^(2) + Δw_{k,j}^(2)
Multi-layer feedforward neural network: when we backprop/learn the parameters (two-layer feedforward neural network)
E(W) = (1/2) Σ_k ( y_k − d_k )²
Notation: net_k^(2) ≡ Σ_j w_{k,j}^(2) h_j^(1),  net_j^(1) ≡ Σ_m w_{j,m}^(1) x_m
Backprop to learn the parameters (η: the given learning rate).
Hidden neuron:
Δw_{j,m}^(1) = −η ∂E(W)/∂w_{j,m}^(1) = −η ( ∂E(W)/∂h_j^(1) ) ( ∂h_j^(1)/∂w_{j,m}^(1) ) = η [ Σ_k ( d_k − y_k ) f2'( net_k^(2) ) w_{k,j}^(2) ] [ x_m f1'( net_j^(1) ) ] = η δ_j x_m
Error: δ_j ≡ f1'( net_j^(1) ) Σ_k ( d_k − y_k ) f2'( net_k^(2) ) w_{k,j}^(2) = f1'( net_j^(1) ) Σ_k δ_k w_{k,j}^(2)
Δw_{j,m}^(1) = η Error_j Output_m = η δ_j x_m
Update: w_{j,m}^(1) ← w_{j,m}^(1) + Δw_{j,m}^(1)
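A minimal numpy sketch of one gradient-descent step using these deltas (squared error with sigmoid f1 and f2 is assumed for illustration, so f'(net) = out (1 − out); the network sizes, data and learning rate are made up):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, d, W1, W2, eta=0.5):
    # forward pass
    h = sigmoid(W1 @ x)
    y = sigmoid(W2 @ h)
    # output-layer error: delta_k = (d_k - y_k) f2'(net_k^(2))
    delta_k = (d - y) * y * (1 - y)
    # hidden-layer error: delta_j = f1'(net_j^(1)) sum_k delta_k w_{k,j}^(2)
    delta_j = h * (1 - h) * (W2.T @ delta_k)
    # weight updates: dW2 = eta delta_k h_j, dW1 = eta delta_j x_m
    W2 += eta * np.outer(delta_k, h)
    W1 += eta * np.outer(delta_j, x)
    return W1, W2, 0.5 * np.sum((y - d) ** 2)

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(2, 3))
x, d = rng.random(4), np.array([1.0, 0.0])
for _ in range(5):
    W1, W2, err = backprop_step(x, d, W1, W2)
    print(err)   # the squared error should typically shrink step by step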
An example for Backprop
Figure 3.4: all the calculations for a reverse pass of Back Propagation.
[Network: inputs λ and Ω; hidden-layer neurons A, B, C (input weights WλA, WΩA, WλB, WΩB, WλC, WΩC); output neurons α and β (weights WAα, WAβ, WBα, WBβ, WCα, WCβ).]
1. Calculate errors of output neurons
δα = outα (1 − outα) (Targetα − outα)
δβ = outβ (1 − outβ) (Targetβ − outβ)
2. Change output layer weights
W+Aα = WAα + η δα outA     W+Aβ = WAβ + η δβ outA
W+Bα = WBα + η δα outB     W+Bβ = WBβ + η δβ outB
W+Cα = WCα + η δα outC     W+Cβ = WCβ + η δβ outC
3. Calculate (back-propagate) hidden layer errors
δA = outA (1 − outA) (δα WAα + δβ WAβ)
δB = outB (1 − outB) (δα WBα + δβ WBβ)
δC = outC (1 − outC) (δα WCα + δβ WCβ)
4. Change hidden layer weights
W+λA = WλA + η δA inλ     W+ΩA = WΩA + η δA inΩ
W+λB = WλB + η δB inλ     W+ΩB = WΩB + η δB inΩ
W+λC = WλC + η δC inλ     W+ΩC = WΩC + η δC inΩ
The constant η (called the learning rate, and nominally equal to one) is put in to speed up or slow down the learning if required.
https://www4.rgu.ac.uk/files/chapter3%20-%20bp.pdf
Consider the sigmoid activation function:
f_Sigmoid(x) = 1 / (1 + e^(−x)),   f'_Sigmoid(x) = f_Sigmoid(x) ( 1 − f_Sigmoid(x) )
so that
δ_k = ( d_k − y_k ) f2'( net_k^(2) ),   Δw_{k,j}^(2) = η Error_k Output_j = η δ_k h_j^(1)
δ_j = f1'( net_j^(1) ) Σ_k δ_k w_{k,j}^(2),   Δw_{j,m}^(1) = η Error_j Output_m = η δ_j x_m
https://www4.rgu.ac.uk/files/chapter3%20-%20bp.pdf
Let us do some calculation
To illustrate this, let's do a worked example.
Worked example 3.1: consider the simple network below:
[Network: inputs A = 0.35 and B = 0.9; hidden-layer weights 0.1 (from A) and 0.8 (from B) for the top neuron, 0.4 (from A) and 0.6 (from B) for the bottom neuron; output-layer weights 0.3 (from top) and 0.9 (from bottom); a single output.]
Assume that the neurons have a Sigmoid activation function and
(i) Perform a forward pass on the network.
(ii) Perform a reverse pass (training) once (target = 0.5).
(iii) Perform a further forward pass and comment on the result.
https://www4.rgu.ac.uk/files/chapter3%20-%20bp.pdf
Let us do some calculation
Answer:
(i) Forward pass:
Input to top neuron = (0.35×0.1)+(0.9×0.8) = 0.755. Out = 0.68.
Input to bottom neuron = (0.9×0.6)+(0.35×0.4) = 0.68. Out = 0.6637.
Input to final neuron = (0.3×0.68)+(0.9×0.6637) = 0.80133. Out = 0.69.
(ii) Reverse pass:
Output error δ = (t − o)(1 − o)o = (0.5 − 0.69)(1 − 0.69)(0.69) = −0.0406.
New weights for the output layer:
w1+ = w1 + (δ × input) = 0.3 + (−0.0406 × 0.68) = 0.272392.
w2+ = w2 + (δ × input) = 0.9 + (−0.0406 × 0.6637) = 0.87305.
Errors for the hidden layer:
δ1 = δ × w1+ × (1 − o)o = −0.0406 × 0.272392 × (1 − 0.68)(0.68) = −2.406×10⁻³
δ2 = δ × w2+ × (1 − o)o = −0.0406 × 0.87305 × (1 − 0.6637)(0.6637) = −7.916×10⁻³
New hidden layer weights:
w3+ = 0.1 + (−2.406×10⁻³ × 0.35) = 0.09916.
w4+ = 0.8 + (−2.406×10⁻³ × 0.9) = 0.7978.
w5+ = 0.4 + (−7.916×10⁻³ × 0.35) = 0.3972.
w6+ = 0.6 + (−7.916×10⁻³ × 0.9) = 0.5928.
(iii) Old error was −0.19. New error is −0.18205. Therefore the error has reduced.
https://www4.rgu.ac.uk/files/chapter3%20-%20bp.pdf
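The same calculation can be checked in a few lines of Python. The sketch below follows the tutorial's conventions (sigmoid units, η = 1, and hidden-layer deltas computed with the already-updated output weights) and reproduces the numbers above up to rounding:

import math

sig = lambda x: 1.0 / (1.0 + math.exp(-x))

A, B, target = 0.35, 0.9, 0.5
w3, w4, w5, w6 = 0.1, 0.8, 0.4, 0.6      # hidden-layer weights
w1, w2 = 0.3, 0.9                        # output-layer weights

# (i) forward pass
o_top = sig(A * w3 + B * w4)             # ~0.68
o_bot = sig(B * w6 + A * w5)             # ~0.6637
o_out = sig(w1 * o_top + w2 * o_bot)     # ~0.69

# (ii) reverse pass
delta = (target - o_out) * (1 - o_out) * o_out        # ~-0.0406
w1n = w1 + delta * o_top                               # ~0.2724
w2n = w2 + delta * o_bot                               # ~0.8731
d_top = delta * w1n * (1 - o_top) * o_top              # ~-2.4e-3
d_bot = delta * w2n * (1 - o_bot) * o_bot              # ~-7.9e-3
w3, w4 = w3 + d_top * A, w4 + d_top * B                # ~0.09916, 0.7978
w5, w6 = w5 + d_bot * A, w6 + d_bot * B                # ~0.3972, 0.5928

# (iii) second forward pass: error shrinks from about -0.19 to about -0.182
o_out2 = sig(w1n * sig(A * w3 + B * w4) + w2n * sig(B * w6 + A * w5))
print(target - o_out, target - o_out2)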
Activation functions
[Plots of f_Sigmoid(x), f_linear(x), f_tanh(x) and their derivatives f'_Sigmoid(x), f'_linear(x), f'_tanh(x).]
https://theclevermachine.wordpress.com/tag/tanh-function/
Activation	functions
• Logistic Sigmoid: f_Sigmoid(x) = 1 / (1 + e^(−x))
• Output range [0, 1]
• Motivated by biological neurons; can be interpreted as the probability of an artificial neuron "firing" given its inputs
• However, saturated neurons make gradients vanish (why?)
Its derivative: f'_Sigmoid(x) = f_Sigmoid(x) ( 1 − f_Sigmoid(x) )
Activation	functions
• Tanh function: f_tanh(x) = sinh(x) / cosh(x) = ( e^x − e^(−x) ) / ( e^x + e^(−x) )
• Output range [−1, 1]
• Thus strongly negative inputs to the tanh will map to negative outputs
• Only zero-valued inputs are mapped to near-zero outputs
• These properties make the network less likely to get "stuck" during training
Its gradient: f'_tanh(x) = 1 − f_tanh(x)²
Activation Functions
tanh(x)
- Squashes numbers to range [-1,1]
- zero centered (nice)
- still kills gradients when saturated :(
[LeCun et al., 199
https://theclevermachine.wordpress.com/tag/tanh-function/
Activation functions
• ReLU (rectified linear unit): f_ReLU(x) = max(0, x)
• Another version is the noisy ReLU: f_noisyReLU(x) = max(0, x + N(0, σ(x)))
• ReLU can be approximated by the softplus function: f_softplus(x) = log(1 + e^x)
• The ReLU gradient doesn't vanish as we increase x
• It can be used to model positive numbers
• It is fast, as there is no need to compute the exponential function
• It eliminates the necessity of a "pretraining" phase
The derivative: for x > 0, f' = 1; for x < 0, f' = 0
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40811.pdf
Activation functions
• ReLU (rectified linear unit): f_ReLU(x) = max(0, x)
ReLU can be approximated by the softplus function: f_softplus(x) = log(1 + e^x)
http://www.jmlr.org/proceedings/papers/v15/glorot11a/glorot11a.pdf
• The	only	non-linearity	comes	from	
the	path	selection	with	individual	
neurons	being	active	or	not	
• It	allows sparse	representations:
• for	a given	input	only	a	subset	
of	neurons	are active
Sparse propagation of activations and gradients. Additional activation functions: Leaky ReLU, Exponential LU (ELU), Maxout, etc.
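A small numpy sketch collecting the activation functions above and their derivatives (for quick reference; plotting is omitted):

import numpy as np

def sigmoid(x):            # range (0, 1); saturates for large |x|
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):          # f'(x) = f(x) (1 - f(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):             # f'(x) = 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

def relu(x):               # max(0, x); gradient does not vanish for x > 0
    return np.maximum(0.0, x)

def d_relu(x):             # 1 for x > 0, 0 for x < 0
    return (x > 0).astype(float)

def softplus(x):           # smooth approximation of ReLU: log(1 + e^x)
    return np.log1p(np.exp(x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), d_sigmoid(x))
print(np.tanh(x), d_tanh(x))
print(relu(x), d_relu(x), softplus(x))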
Error/Loss	function
• Recall	stochastic	gradient	descent
– update	from	a	randomly	picked	example	(but	in	
practice	do	a	batch	update)
• Squared error loss for one binary output:
w ← w − η ∂E(w)/∂w
E(w) = (1/2) ( d_label − f(x; w) )²
[Figure: input x → network f(x; w) → output]
Error/Loss	function
• Softmax (cross-entropy loss) for	multiple	classes:
E(w) = − Σ_k ( d_k log y_k + (1 − d_k) log(1 − y_k) ),   where
y_k = exp( Σ_j w_{k,j}^(2) h_j^(1) ) / Σ_{k'} exp( Σ_j w_{k',j}^(2) h_j^(1) )
[Figure: two-layer network with inputs x1, x2, hidden units h_j^(1), and outputs y_1, ..., y_k compared against one-hot encoded class labels d_1, ..., d_k.]
(Class labels follow a multinomial distribution.)
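A minimal numpy sketch of the softmax output and the loss written above (the hidden activations, output weights and one-hot label below are illustrative values, not from the slides):

import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()                   # y_k = exp(net_k) / sum_k' exp(net_k')

def cross_entropy(y, d):
    # E(w) = -sum_k ( d_k log y_k + (1 - d_k) log(1 - y_k) )
    return -np.sum(d * np.log(y) + (1 - d) * np.log(1 - y))

h = np.array([0.2, 0.7, 0.1])            # hidden activations h_j^(1)
W2 = np.array([[1.0, -0.5, 0.3],         # output weights w_{k,j}^(2), K = 2 classes
               [-0.2, 0.8, 0.1]])
y = softmax(W2 @ h)
d = np.array([1.0, 0.0])                 # one-hot encoded class label
print(y, cross_entropy(y, d))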
What is wrong with backpropagation?
• It requires labeled training data; most data is unlabeled.
• The learning time does not scale well: it is very slow in networks with multiple hidden layers.
• It can get stuck in poor local optima.
Backpropagation:	conclusion
• A multi-layer perceptron can be trained by the backpropagation algorithm to perform any mapping between the input and the output.
Advantages
Slide based on Geoffrey Hinton's
Hornik, Kurt. "Approximation capabilities of multilayer feedforward networks." Neural Networks 4.2 (1991): 251-257.
http://www.infoworld.com/article/3003315/big-data/deep-learning-a-brief-guide-for-practical-problem-solvers.html
Summary
• Neural	networks	basic	and	its	
backpropagation learning	algorithm
• Go	deeper	in	layers
• Convolutional	neural	networks	(CNN)
• Recurrent	neural	networks	(RNN)
• Deep	reinforcement	learning
Why	more	than	one	hidden	layer?
http://www.rsipvision.com/exploring-deep-learning/
Why	goes	deep	now?
• Big	data	everywhere
• Demands	for:	
– Unsupervised	learning:	most	of	them	are	
unlabeled	
– End-to-end: trainable	feature	solution
• time	consuming	for	hand-crafted	features
• Improved	computational	power:
– Widely	used	GPUs	for	computing
– Distributed	computing	(mapreduce &	spark)
Deep	learning	problem
• Difficult	to	train
– Many, many parameters
• Two	solutions:
– Regularised by	prior	knowledge	(since	1989	)
• Convolutional	Neural	Networks	(CNN)
– New	method	of	pre-training	using	unsupervised	
methods	(since	2006)
• Stacked	auto-encoders
• Stacked Restricted	Boltzmann	Machines
New	methods	for	unsupervised	pre-training
“2006”
• New methods for unsupervised pre-training
• Stacked RBM’s (Deep Belief Networks [DBN’s] )
• Hinton, G. E, Osindero, S., and Teh, Y. W. (2006). A fast learning
algorithm for deep belief nets. Neural Computation,
18:1527-1554.
• Hinton, G. E. and Salakhutdinov, R. R, Reducing the
dimensionality of data with neural networks. Science, Vol. 313.
no. 5786, pp. 504 - 507, 28 July 2006.
• (Stacked) Autoencoders
• Bengio, Y., Lamblin, P., Popovici, P., Larochelle, H. (2007).
Greedy Layer-Wise Training of Deep Networks, Advances in
Neural Information Processing Systems 19
Stacked Restricted	Boltzmann	Machines
Stacked Autoencoders
Restricted Boltzmann Machines (RBM)
• An RBM is a bipartite (two disjoint sets), undirected graph used as a probabilistic generative model
– it contains m visible units V = (V1, ..., Vm) to represent observable data and
– n hidden units H = (H1, ..., Hn) to capture their dependencies
• Assume the random variables (V, H) take values (v, h) ∈ {0, 1}^(m+n)
• Joint distribution (Gibbs distribution): p(v, h) = (1/Z) e^(−E(v,h)), where the energy function is
E(v, h) = − Σ_{i=1..n} Σ_{j=1..m} w_ij h_i v_j − Σ_{j=1..m} b_j v_j − Σ_{i=1..n} c_i h_i
For all i ∈ {1, ..., n} and j ∈ {1, ..., m}, w_ij is a real-valued weight associated with the edge between units V_j and H_i, and b_j and c_i are real-valued bias terms associated with the jth visible and the ith hidden variable, respectively. The "partition function" Z = Σ_{v,h} e^(−E(v,h)) sums over all possible pairs of visible and hidden vectors.
• A (marginalised) RBM can be regarded as a product of experts, and any distribution on {0, 1}^m can be modeled arbitrarily well by an RBM with enough hidden units.
• No connections within a layer (restricted): the RBM can be interpreted as a stochastic neural network, and the conditional probability of a single unit being one is the firing rate of a (stochastic) neuron with sigmoid activation sig(x) = 1/(1 + e^(−x)):
p(H_i = 1 | v) = sig( Σ_{j=1..m} w_ij v_j + c_i )
p(V_j = 1 | h) = sig( Σ_{i=1..n} w_ij h_i + b_j )
• The probability that the network assigns to a visible vector v is given by summing over all possible hidden vectors: p(v) = (1/Z) Σ_h e^(−E(v,h))
• We learn the parameters w_ij by maximising the log-likelihood given example v. The derivative of the log probability of a training vector with respect to a weight is surprisingly simple:
∂ log p(v) / ∂w_ij = <v_i h_j>_data − <v_i h_j>_model
where the angle brackets denote expectations under the distribution given by the subscript. This leads to a very simple learning rule for performing stochastic gradient ascent in the log probability of the training data:
Δw_ij = ε ( <v_i h_j>_data − <v_i h_j>_model ),  where ε is a learning rate
<*>_data: expectations under the empirical data distribution
<*>_model: expectations under the model (can be approximated by Contrastive Divergence)
http://image.diku.dk/igel/paper/AItRBM-proof.pdf
Hinton, Geoffrey. "A practical guide to training restricted Boltzmann machines." Momentum 9.1 (2010): 926.
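The <*>_model expectation is intractable in general; a standard approximation is a single step of Gibbs sampling (Contrastive Divergence, CD-1). Below is a minimal numpy sketch of one CD-1 update for a binary RBM following the learning rule above; the layer sizes, learning rate and toy training vector are illustrative, not from the slides:

import numpy as np

rng = np.random.default_rng(0)
sig = lambda x: 1.0 / (1.0 + np.exp(-x))

m, n = 6, 4                              # m visible units, n hidden units
W = 0.01 * rng.normal(size=(n, m))       # weights w_ij
b, c = np.zeros(m), np.zeros(n)          # visible biases b_j, hidden biases c_i

def cd1_update(v0, eps=0.1):
    global W, b, c
    # positive phase: p(H_i = 1 | v) = sig(sum_j w_ij v_j + c_i)
    ph0 = sig(W @ v0 + c)
    h0 = (rng.random(n) < ph0).astype(float)
    # negative phase: reconstruct v from h, then recompute hidden probabilities
    pv1 = sig(W.T @ h0 + b)              # p(V_j = 1 | h)
    v1 = (rng.random(m) < pv1).astype(float)
    ph1 = sig(W @ v1 + c)
    # CD-1 approximation of <v_i h_j>_data - <v_i h_j>_model
    W += eps * (np.outer(ph0, v0) - np.outer(ph1, v1))
    b += eps * (v0 - v1)
    c += eps * (ph0 - ph1)

v = (rng.random(m) < 0.5).astype(float)  # a toy binary training vector
for _ in range(100):
    cd1_update(v)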
Stacked	RBM
• The	learned	feature	 activations	 of	one	RBM	are	used	as	the	‘‘data’’	 for	training	the	next	RBM	in	the	stack.	
• After	the	 pretraining,	the	RBMs	are	‘‘unrolled’’	 to	create	 a	deep	autoencoder,	 which	is	then	fine-tuned	 using	backpropagation
of	error	derivatives.
Hinton,	Geoffrey	E.,	and	Ruslan R.	Salakhutdinov.	"Reducing	the	dimensionality	of	data	with	neural	networks."	Science	313.5786	(2006):	504-507.
• Unsupervised learning results
Fig. 4. (A) The fraction of retrieved documents in the same class as the query when a query document from the test set is used to retrieve other test set documents, averaged over all 402,207 possible queries. (B) The codes produced by two-dimensional LSA. (C) The codes produced by a 2000-500-250-125-2 autoencoder.
28 JULY 2006 VOL 313 SCIENCE www.sciencemag.org
The codes produced by two-dimensional LSA (Latent Semantic Analysis).
The codes produced by a 2000-500-250-125-2 Stacked RBM.
Auto-encoder
• Multi-layer neural network whose target output is its input:
– Reconstruction xo = decoder(encoder(input xi))
– Minimizing the reconstruction error (xi − xo)²
http://deeplearning.net/tutorial/dA.html
h = f( We^T xi + be )   (encoder)
xo = f( Wd^T h + bd )   (decoder)
f: sigmoid.  Setting: tied weights, We^T = Wd
[Figure: xi → encoder → h → decoder → xo]
• The	goal	is	to	learn	We,	be,	bd such	that
the	average	reconstruction	error	is	minimised
• If	f is	chosen	to	be	linear,	auto-encoder	is	
equivalent	to	PCA;	if	non-linear,	it	captures	
multi-modal	aspects	of	the	input	
distribution
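A minimal numpy sketch of this auto-encoder forward pass with tied weights and a sigmoid f (the layer sizes and the shape convention chosen for We are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
sig = lambda x: 1.0 / (1.0 + np.exp(-x))

d_in, d_hid = 8, 3
We = 0.1 * rng.normal(size=(d_hid, d_in))   # encoder weights
be, bd = np.zeros(d_hid), np.zeros(d_in)    # encoder / decoder biases

def reconstruct(xi):
    h = sig(We @ xi + be)        # h = f(We xi + be)
    xo = sig(We.T @ h + bd)      # xo = f(Wd h + bd) with tied weights Wd = We^T
    return xo

xi = rng.random(d_in)
xo = reconstruct(xi)
print(np.sum((xi - xo) ** 2))    # reconstruction error to be minimised in training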
Denoising auto-encoder
• A stochastic	version	of	the	auto-encoder.
• We	train	the	autoencoder to	reconstruct	the	input	from	a	
corrupted version of it, so that
– force	the	hidden	layer	to	discover	more	robust	features	and	
– prevent	it	from	simply	learning	the	identity
http://deeplearning.net/tutorial/dA.html
[Figure: corrupted xi → encoder → y → decoder → xo]
• The	stochastic	corruption	process :	
randomly	sets	some	of	the	inputs	(as	
many	as	half	of	them)	to	zero
• The	goal	is	 to	predict	the	missing	values	
from	the	uncorrupted	 (i.e.,	non-missing)	
values
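A minimal sketch of the stochastic corruption step described above (randomly zeroing roughly half of the inputs; the drop fraction is a free parameter):

import numpy as np

rng = np.random.default_rng(0)

def corrupt(xi, drop_fraction=0.5):
    mask = rng.random(xi.shape) >= drop_fraction   # keep each input with prob 1 - drop_fraction
    return xi * mask

xi = rng.random(8)
print(xi)
print(corrupt(xi))   # roughly half of the entries are set to zero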
Stacked	Denoising Auto-encoder
Pretraining: Stacked Denoising Auto-encoder
• Stacking Auto-Encoders
from: Bengio ICML 2009
Parameter	matrix	W’	is	transpose	of	W
Bengio Y,	Lamblin P,	Popovici D,	et	al.	Greedy	layer-wise	training	of	deep	networks[J].	Advances	in	neural	information	
processing	systems,	2007,	19:	153.
Unsupervised	greedy	layer-wise	training	procedure
(1) The first	layer	
pre-training
(2) The second
layer	pre-training
(3) The whole network is fine-tuned using labels
Convolutional	neural	networks:
Receptive	field
• Receptive	field: Neurons in the
retina respond to light stimulus in
restricted regions of the visual field
• Animal experiments on receptive	
fields	of	two	retinal	ganglion	cells
– Fields	are	circular	areas	of	the	
retina
– The	cell	(upper	 part) responds	
when	the	center	is	illuminated	and	
the	surround	is	darkened.	
– The	cell	(lower	part) responds	
when	the	center	is	darkened	and	
the	surround	is	illuminated.	
– Both	cells	give	on- and	off-
responses	when	both	center	and	
surround	 are	illuminated,	but	
neither	response	is	as	strong	as	
when	only	center	or	surround	 is	
illuminated
Hubel	D.H. :	The	Visual	Cortex	of	the	Brain Sci Amer 209:54-62,	1963
Contributed	by	Hubel	and	Wiesel	for	the	studies	
of	the	visual	system	of	a	cat
“On”	Center	Field
“Off”	Center	Field
Light	On
Convolutional	neural	networks
• Sparse	connectivity	by local	
correlation
– Filter: the inputs of a hidden unit in layer m come from a subset of units in layer m-1 that have spatially connected receptive fields
• Shared	weights
– each	filter	is	replicated	across	the	
entire	visual	field.	These	
replicated	units	share	the	same	
weights	and	form	a	feature	map.
http://deeplearning.net/tutorial/lenet.html
Translational invariance with convolutional neural networks
From the previous section, we see that using locality structures significantly reduces the number of connections. Even further reduction can be achieved via another technique called "weight sharing". In weight sharing, some of the parameters in the model are constrained to be equal to each other. For example, in the following figure of a network, we have the constraints w1 = w4 = w7, w2 = w5 = w8 and w3 = w6 = w9 (edges that have the same color have the same weight).
[Figures: a 1-d case in which each unit in layer m connects to a local window of units in layer m-1 with shared weights, and a 2-d case (subscripts are weights) in which one filter at layer m is replicated across layer m-1.]
LeNet 5
INPUT 32×32 → (convolutions) C1: feature maps 6@28×28 → (subsampling) S2: f. maps 6@14×14 → (convolutions) C3: f. maps 16@10×10 → (subsampling) S4: f. maps 16@5×5 → C5: layer 120 → F6: layer 84 (full connection) → OUTPUT 10 (Gaussian connections)
LeCun Y,	Boser B,	Denker J	S,	et	al.	Backpropagation applied	to	handwritten	zip	code	recognition.	
Neural	computation,	1989,	1(4):	541-551.
• LeNet 5	consists	of	7	layers,	all	of	which	contain	trainable	weights;	
• End-to-end solution: they are used to replace hand-crafted feature extraction
• Cx:	convolution	 layers with	5x5	filter
• Sx:	subsampling (pooling)	 layers
• Fx:	fully-connected	layers
• Example: a 10x10 input image with a 3x3 filter results in an 8x8 output image
LeNet 5: Convolution Layer
[Figure: a 10×10 input image convolved with a 3×3 filter, followed by the activation function f, gives an 8×8 feature map.]
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998
LeNet 5: Convolution Layer
[Figure: a 10×10 input image convolved with a 3×3 kernel, followed by the activation function f, gives an 8×8 feature map (response); three such filters give three feature maps.]
• Example: a 10x10 input image with a 3x3 filter results in an 8x8 output image.
• 3 different filters (with different weights) lead to 3 different 8x8 output images.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998
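A minimal numpy sketch of this "valid" convolution: a 10×10 input and a 3×3 filter give an 8×8 feature map, and three different filters give three feature maps (the random image, the filters and the tanh activation are illustrative; the filter is applied without flipping, as in most CNN implementations):

import numpy as np

def conv2d_valid(image, kernel):
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))       # 10 - 3 + 1 = 8
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((10, 10))
filters = [rng.normal(size=(3, 3)) for _ in range(3)]
feature_maps = [np.tanh(conv2d_valid(image, k)) for k in filters]  # activation f on each map
print([fm.shape for fm in feature_maps])   # [(8, 8), (8, 8), (8, 8)]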
LeNet 5:	(Sampling)	Pooling	Layer
Max	pooling
Max	in	a	2x2	
filter
Average	pooling
Average	in	a	
2x2	filter
• Pooling: partitions	the	input	image	into	a	set	of	non-overlapping	 rectangles	and,	
for	each	such	sub-region,	outputs	 the	maximum or average value.
Max pooling
• reduces computation,
• is a way of taking the most responsive node of the given interest region,
• but may result in loss of accurate spatial information (see the sketch below).
Y.	LeCun,	 L.	Bottou,	 Y.	Bengio,	 and	P.	Haffner.	"Gradient-based	 learning	applied	to	document	 recognition."	 Proceedings	 of	the	IEEE,	86(11):2278-2324,	 November	1998
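A short NumPy sketch of both pooling variants on a toy 8x8 feature map (the values are assumptions); 2x2 non-overlapping pooling halves each spatial dimension:

```python
import numpy as np

rng = np.random.RandomState(0)
fmap = rng.rand(8, 8)                       # toy 8x8 feature map

# Reshape into non-overlapping 2x2 blocks: shape (4, 2, 4, 2)
blocks = fmap.reshape(4, 2, 4, 2)

max_pooled = blocks.max(axis=(1, 3))        # keep the most responsive node per block
avg_pooled = blocks.mean(axis=(1, 3))       # or average each block instead

print(max_pooled.shape, avg_pooled.shape)   # (4, 4) (4, 4)
```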
LeNet 5	results
• MNIST (handwritten	digits) Dataset:
– 60k training and 10k test examples
• Test error rate 0.95%
[Figure: training error (no distortions), test error (no distortions) and test error (with distortions), as error rate (%), plotted against training set size (x1000).]
[Figure: the 82 test patterns misclassified by LeNet-5 (a total of only 82 errors); below each digit the correct answer is shown on the left and the machine's answer on the right, e.g. 4→6, 3→5, 8→2.]
[Figure: layer-by-layer activations (C1, S2, C3, S4, C5, F6, Output) of LeNet-5 for example input digits such as 3, 4 and 8.]
Y.	LeCun,	 L.	Bottou,	 Y.	Bengio,	 and	P.	Haffner.	"Gradient-based	 learning	applied	to	document	 recognition."	 Proceedings	 of	the	IEEE,	86(11):2278-2324,	 November	1998
http://yann.lecun.com/exdb/mnist/
From	MNIST	to	ImageNet
• ImageNet
– Over	15M	labeled	high	
resolution	images
– Roughly	22K	categories
– Collected	from	web	and	
labeled	by	Amazon	
Mechanical	Turk
• The	Image/scene	
classification challenge
– Image/scene	
classification
– Metric: Hit@5 error rate -
make 5 guesses about
the image label (see the sketch below)
http://cognitiveseo.com/blog/6511/will-google-read-rank-images-near-future/
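A small NumPy sketch of the Hit@5 metric (the scores and labels are made-up assumptions): an image counts as correct if its true label is among the model's 5 highest-scoring guesses, and the error rate is the fraction of images where it is not.

```python
import numpy as np

rng = np.random.RandomState(0)
scores = rng.rand(4, 1000)               # toy scores for 4 images over 1000 classes
labels = np.array([3, 500, 42, 999])     # toy ground-truth class indices

top5 = np.argsort(-scores, axis=1)[:, :5]                            # 5 best guesses per image
hits = np.array([labels[i] in top5[i] for i in range(len(labels))])  # true label among the 5?
top5_error = 1.0 - hits.mean()
print("Hit@5 error rate:", top5_error)
```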
Fig. 4 Random selection of images in ILSVRC detection validation set. The images in the top 4 rows were taken from
ILSVRC2012 single-object localization validation set, and the images in the bottom 4 rows were collected from Flickr using
scene-level queries.
...advantage of all the positive examples available. The second
source (24%) is negative images which were part of the
original ImageNet collection process but voted as neg-
ative: for example, some of the images were collected
from Flickr and search engines for the ImageNet synset
“animals” but during the manual verification step did
not collect enough votes to be considered as containing
an “animal.” These images were manually re-verified
for the detection task to ensure that they did not in
fact contain the target objects. The third source (13%)
is images collected from Flickr specifically for the de-
tection task. These images were added for ILSVRC2014
following the same protocol as the second type of images
in the validation and test set. This was done to bring
the training and testing distributions closer together.
Russakovsky O,	Deng	J,	Su	H,	et	al.	Imagenet large	scale	visual	recognition	challenge[J].	International	Journal	of	Computer	Vision,	2015,	115(3):	211-252.
Hit@5	error	rate
Leadertable (ImageNet image classification)

Image classification
Year  Codename       Error (percent)  99.9% Conf Int
2014  GoogLeNet      6.66             6.40 - 6.92
2014  VGG            7.32             7.05 - 7.60
2014  MSRA           8.06             7.78 - 8.34
2014  AHoward        8.11             7.83 - 8.39
2014  DeeperVision   9.51             9.21 - 9.82
2013  Clarifai†      11.20            10.87 - 11.53
2014  CASIAWS†       11.36            11.03 - 11.69
2014  Trimps†        11.46            11.13 - 11.80
2014  Adobe†         11.58            11.25 - 11.91
2013  Clarifai       11.74            11.41 - 12.08
2013  NUS            12.95            12.60 - 13.30
2013  ZF             13.51            13.14 - 13.87
2013  AHoward        13.55            13.20 - 13.91
2013  OverFeat       14.18            13.83 - 14.54
2014  Orange†        14.80            14.43 - 15.17
2012  SuperVision†   15.32            14.94 - 15.69
2012  SuperVision    16.42            16.04 - 16.80
2012  ISI            26.17            25.71 - 26.65
2012  VGG            26.98            26.53 - 27.43
2012  XRCE           27.06            26.60 - 27.52
2012  UvA            29.58            29.09 - 30.04

Single-object localization
Year  Codename       Error (percent)  99.9% Conf Int
2014  VGG            25.32            24.87 - 25.78
2014  GoogLeNet      26.44            25.98 - 26.92
[Fig. 10 (Russakovsky et al.): distribution of the best "optimistic" per-class results of entries submitted to ILSVRC2012-2014, measured as accuracy for image classification and single-object localization, and as average precision for object detection. While image classification looks very promising, the other ILSVRC tasks are far from saturated: many object classes remain challenging for current algorithms. Bootstrapping (20,000+ rounds) shows the winning methods are statistically significantly different from the others, even at the 99.9% level.]
Russakovsky O, Deng J, Su H, et al. Imagenet large scale visual recognition challenge[J]. International Journal of Computer Vision, 2015, 115(3): 211-252.
Unofficial human error is around 5.1% on a subset.
Why is there still human error? When labeling, human raters only judged whether an image belongs to a given class (binary classification); the challenge, however, is a 1000-class classification problem.
2015			ResNet (ILSVRC’15)	3.57
GoogLeNet, a 22-layer network
Microsoft ResNet, a 152-layer network
U. of Toronto, SuperVision, a 7-layer network
http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/)
SuperVision
Figure 2: An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities
between the two GPUs. One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts
at the bottom. The GPUs communicate only at certain layers. The network’s input is 150,528-dimensional, and
the number of neurons in the network’s remaining layers is given by 253,440–186,624–64,896–64,896–43,264–
4096–4096–1000.
neurons in a kernel map). The second convolutional layer takes as input the (response-normalized
and pooled) output of the first convolutional layer and filters it with 256 kernels of size 5 × 5 × 48.
The third, fourth, and fifth convolutional layers are connected to one another without any intervening
pooling or normalization layers. The third convolutional layer has 384 kernels of size 3 × 3 ×
Krizhevsky A,	Sutskever I,	Hinton	G	E.	Imagenet classification	with	deep	convolutional	neural	networks,	
Advances	in	neural	information	processing	systems.	2012:	1097-1105.
• Rooted in LeNet-5
• Rectified Linear Units, overlapping pooling, and the dropout trick
• Randomly extracted 224x224 patches from 256x256 images for more training data
dropout
• Dropout randomly 'drops' units from a layer on each training step, creating 'sub-architectures' within the model.
• It can be viewed as a type of sampling of a smaller network within a larger network (see the sketch below).
• It helps prevent neural networks from overfitting.
Srivastava,	Nitish,	et	al.	"Dropout:	A	simple	way	to	prevent	neural	networks	from	
overfitting."	The	Journal	of	Machine	Learning	Research	15.1	(2014):	1929-1958.
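A minimal NumPy sketch of (inverted) dropout applied to one layer's activations during a training step; the keep probability and the toy activations are assumptions, not the exact AlexNet settings:

```python
import numpy as np

rng = np.random.RandomState(0)
activations = rng.rand(4, 8)           # toy activations of one hidden layer (4 examples, 8 units)
p_keep = 0.5                           # typical keep probability for fully-connected layers

# Training step: randomly 'drop' units, i.e. sample a sub-network.
mask = rng.rand(*activations.shape) < p_keep
dropped = activations * mask / p_keep  # inverted dropout: rescale so the test-time forward pass needs no change

# Test step: use all units (no mask, no rescaling with inverted dropout).
print(dropped)
```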
Go deeper with CNN
• GoogLeNet, a 22-layer deep network
• Increasing the size of an NN results in two difficulties:
– 1) overfitting and 2) high computation cost
• Solution: a sparse deep neural network
[Figure 3: GoogLeNet network "with all the bells and whistles": the input passes through a 7x7 and 3x3 convolution stem (with max pooling and local response normalization), followed by a deep stack of Inception modules; each module applies 1x1, 3x3 and 5x5 convolutions and 3x3 max pooling in parallel and depth-concatenates their outputs, with further max pooling between stages, an average-pooling + fully-connected + softmax head, and two auxiliary softmax classifiers branching off intermediate layers.]
https://github.com/BVLC/caffe/tree/master/models/bvlc_googlenet
Their filter banks are concatenated into a single output vector, forming the input for the next layer.
ResNet:	Deep	Residual	Learning
• Overfitting prevents a neural network from going deeper
• A thought experiment:
– We have a shallower architecture (e.g., logistic regression)
– Add	more	layers	onto	it	to	construct	its	deeper	counterpart
– So	a	deeper	model	should	produce	no	higher	training	error	than	
its	shallower	counterpart,	if	
• the	added	layers	are	identity	mapping,	and	the	other	layers	are	copied	
from	the	learned	shallower	model.	
• However,	it	is	reported	that	adding	more	layers	to	a	very	
deep	model	leads	to	higher	training	error
He,	Kaiming,	et	al.	"Deep	Residual	Learning	for	Image	Recognition."	arXiv preprint	arXiv:1512.03385	(2015).
https://github.com/KaimingHe/deep-residual-networks
[Figure 1 (He et al., "Deep Residual Learning for Image Recognition", Microsoft Research): training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer "plain" networks. The deeper network has higher training error, and thus test error.]
• Solution: feedforward neural networks with "shortcut connections"
• This leads to an incredible 152-layer deep convolutional network, driving the
classification error down to 3.57%!
ResNet:	Deep	Residual	Learning
[Figure 2. Residual learning: a building block. Two weight layers with ReLU compute the residual F(x); an identity shortcut carries x around them, and the block outputs F(x) + x.]
From the paper: "In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x) − x. The original mapping is recast into F(x) + x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping." The resulting residual networks won 1st place in the ILSVRC 2015 classification competition, and the extremely deep representations generalize well to other recognition tasks (ImageNet detection and localization, COCO detection and segmentation).
He,	Kaiming,	et	al.	"Deep	Residual	Learning	for	Image	Recognition."	arXiv preprint	arXiv:1512.03385	(2015).
https://github.com/KaimingHe/deep-residual-networks
[Figure 3. Example network architectures for ImageNet. Left: the VGG-19 model (19.6 billion FLOPs) as a reference. Middle: a plain network with 34 parameter layers (3.6 billion FLOPs). Right: a residual network with 34 parameter layers (3.6 billion FLOPs), i.e. the plain network with shortcut connections inserted; the dotted shortcuts increase dimensions.]
1. Denote the desired underlying mapping as H(x).
2. Instead of fitting H(x), a stack of nonlinear layers fits another mapping F(x) := H(x) − x.
3. The original mapping is recast into F(x) + x.
4. During training, it is easy to push the residual F(x) to zero (creating an identity mapping) with a stack of nonlinear layers (a minimal sketch of the block follows below).
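A minimal NumPy sketch of one residual building block F(x) + x, using two toy fully-connected "weight layers" in place of convolutions (sizes and random values are assumptions):

```python
import numpy as np

rng = np.random.RandomState(0)
d = 16
x = rng.randn(d)                 # input to the block
W1 = rng.randn(d, d) * 0.01      # first "weight layer"
W2 = rng.randn(d, d) * 0.01      # second "weight layer"

def relu(z):
    return np.maximum(z, 0.0)

F = W2.T @ relu(W1.T @ x)        # residual branch F(x): weight layer -> relu -> weight layer
y = relu(F + x)                  # identity shortcut adds x, then the final relu

# If the residual is pushed to zero (W1 = W2 = 0), the block reduces to relu(x),
# i.e. an (almost) identity mapping, which is easy for the optimiser to find.
print(y.shape)
```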
[Figure 6. Training on CIFAR-10. Dashed lines denote training error, bold lines denote testing error. Left: plain networks (plain-20/32/44/56); the error of plain-110 is higher than 60% and not displayed. Middle: ResNets (ResNet-20/32/44/56/110). Right: ResNets with 110 and 1202 layers.]
Summary
• Neural network basics and the backpropagation learning algorithm
• Go	deeper	in	layers
• Convolutional	neural	networks	(CNN)
• Recurrent	neural	networks	(RNN)
• Deep	reinforcement	learning
Recurrent	neural	networks
• Why do we need another NN model?
– Sequential prediction
– Time series
• A	simple	recurrent	neural	network	(RNN)
– Training can be done by BackPropagation Through Time (BPTT)
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
Two-layer feedforward network:
s = f(xU)
o = f(sV)
x: input vector, o: output vector, s: hidden state vector,
U: layer-1 parameter matrix, V: layer-2 parameter matrix, f: tanh or ReLU

Add time-dependency of the hidden state s (W: state transition parameter matrix):
st+1 = f(xt+1 U + st W)
ot+1 = f(st+1 V)
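A small NumPy sketch of the recurrence above unrolled over a toy input sequence (dimensions and random parameters are assumptions); BPTT would simply backpropagate through this unrolled loop:

```python
import numpy as np

rng = np.random.RandomState(0)
d_in, d_hid, d_out, T = 3, 5, 2, 4
U = rng.randn(d_in, d_hid) * 0.1     # input -> hidden
W = rng.randn(d_hid, d_hid) * 0.1    # hidden -> hidden (state transition)
V = rng.randn(d_hid, d_out) * 0.1    # hidden -> output
xs = rng.randn(T, d_in)              # toy input sequence of length T

s = np.zeros(d_hid)                  # initial hidden state
states, outputs = [], []
for x_t in xs:
    s = np.tanh(x_t @ U + s @ W)     # s_{t+1} = f(x_{t+1} U + s_t W)
    o = np.tanh(s @ V)               # o_{t+1} = f(s_{t+1} V)
    states.append(s)
    outputs.append(o)

print(np.stack(outputs).shape)       # (T, d_out)
```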
LSTMs	(Long	Short	Term	Memory	networks)
• A Standard Recurrent Network (SRN) has a problem learning long-term
dependencies (due to vanishing gradients of saturated nodes)
• An LSTM provides another way to compute a hidden state (see the sketch below)
Hochreiter,	Sepp,	and	Jürgen	Schmidhuber.	"Long	short-term	memory."	Neural	computation	9.8	(1997):	1735-1780.
http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/
st−1 st
xt
ct−1 ct
ot
...
f
i o
input	gate
forget	gate
output	gate
“candidate” hidden state:
Cell internal memory
Hidden state
σ: sigmoid (control signal between 0 and 1); ∘: elementwise multiplication
st−1
xt
ot...
st
st = tanh(xtU+ st−1W)
SRN cell
LSTM cell
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
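A minimal NumPy sketch of a single LSTM cell step using the standard gate equations (the slides' figure may parameterise the cell slightly differently; dimensions and random weights are assumptions):

```python
import numpy as np

rng = np.random.RandomState(1)
d_in, d_hid = 3, 5
# one weight matrix per gate / candidate, acting on the concatenation [x_t, s_{t-1}]
Wi, Wf, Wo, Wg = (rng.randn(d_in + d_hid, d_hid) * 0.1 for _ in range(4))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, s_prev, c_prev):
    z = np.concatenate([x_t, s_prev])
    i = sigmoid(z @ Wi)              # input gate
    f = sigmoid(z @ Wf)              # forget gate
    o = sigmoid(z @ Wo)              # output gate
    g = np.tanh(z @ Wg)              # "candidate" hidden state
    c = f * c_prev + i * g           # cell internal memory
    s = o * np.tanh(c)               # hidden state
    return s, c

s, c = np.zeros(d_hid), np.zeros(d_hid)
for x_t in rng.randn(4, d_in):       # toy sequence of length 4
    s, c = lstm_step(x_t, s, c)
print(s)
```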
• ClockWork-RNN	 is	similar	to	a	simple	RNN	with	an	input,	output	and	hidden	layer
• The differences are:
– The hidden layer is partitioned into g modules, each with its own clock rate
– Neurons in a faster module are connected to neurons in a slower module
RNN applications: time	series
Koutnik,	Jan,	et	al.	"A	clockwork	rnn."	arXiv preprint	arXiv:1402.3511	(2014).
Figure 1. CW-RNN architecture is similar to a simple RNN with an input, output and hidden layer. The hidden layer is partitioned into g
modules each with its own clock rate. Within each module the neurons are fully interconnected. Neurons in faster module i are connected
to neurons in a slower module j only if a clock period Ti < Tj.
2. Related Work
Contributions to sequence modeling and recognition that are
relevant to CW-RNNs are presented in this section. The pri-
mary focus is on RNN extensions that deal with the problem
of bridging long time lags.
Hierarchical Recurrent Neural Networks (Hihi & Bengio,
1996) are the most similar to CW-RNNs work, having both
delayed connections and units operating at different time-
scales. Five hand-designed architectures were tested on
have been very successful recently in speech and handwrit-
ing recognition (Graves et al., 2005; 2009; Sak et al., 2014).
Stacking LSTMs (Fernandez et al., 2007; Graves &
Schmidhuber, 2009) into a hierarchical sequence proces-
sor, equipped with Connectionist Temporal Classification
(CTC; Graves et al., 2006), performs simultaneous segmen-
tation and recognition of sequences. Its deep variant cur-
rently holds the state-of-the-art result in phoneme recogni-
tion on the TIMIT database (Graves et al., 2013).
Word	embedding
• From bag-of-words to word embedding
– Use a real-valued vector in Rm to represent a word (concept)
v("cat") = (0.2, -0.4, 0.7, ...)
v("mat") = (0.0, 0.6, -0.1, ...)
• Continuous bag-of-words (CBOW) model (word2vec)
– Input/output words x/y are one-hot encoded
– The hidden layer is shared for all input words
Mikolov,	Tomas,	et	al.	"Efficient	estimation	of	word	representations	in	vector	space."	arXiv preprint	arXiv:1301.3781	 (2013).
Continuous	bag	of	word	(CBOW) model
Rong,	Xin.	"word2vec	parameter	learning	explained."	arXiv preprint	arXiv:1411.2738	(2014).
[Figure 2: Continuous bag-of-word model: an input layer of C one-hot word vectors x1, ..., xC (each V-dim), a shared V×N input weight matrix W, an N-dim hidden layer h, an N×V output weight matrix W', and a V-dim softmax output layer y.]

Hidden nodes: when computing the hidden layer output, instead of directly copying the input vector of a single context word, the CBOW model takes the average of the input vectors of the C context words:

h = (1/C) · W · (x1 + x2 + ... + xC) = (1/C) · (vw1 + vw2 + ... + vwC)

where C is the number of words in the context, w1, ..., wC are the words in the context, and vw is the input vector (a row of W) of a word w.

The cross-entropy loss:

E = −log p(wO | wI,1, ..., wI,C)
  = −uj* + log Σj'=1..V exp(uj')
  = −v'wO^T · h + log Σj'=1..V exp(v'wj'^T · h)

which is the same as the objective of the one-word-context model, except that h is the context average defined above.

The gradient updates:

hidden→output weights:  v'wj(new) = v'wj(old) − η · ej · h,   for j = 1, 2, ..., V
input→hidden weights:   vwI,c(new) = vwI,c(old) − (1/C) · η · EH,   for c = 1, 2, ..., C

where η > 0 is the learning rate, ej = yj − tj is the prediction error of output unit j (tj is 1 for the target word and 0 otherwise), and EHi = Σj=1..V ej · w'ij. The hidden→output update has to be applied to every word in the vocabulary for each training instance.

Intuition: if yj > tj ("overestimating"), a proportion of the hidden vector h is subtracted from v'wO, moving it away from the context representation; if yj < tj ("underestimating"), some of h is added to v'wO, moving it closer; if yj is very close to tj, little change is made. Note that vw (input vector) and v'w (output vector) are two different vector representations of the word w. As the parameters are updated over many context-target pairs generated from a training corpus, the effects accumulate: the output vector of a word w is "dragged" back and forth by the input vectors of w's co-occurring neighbours, as if there were physical strings between them (reminiscent of gravity, or force-directed graph layout). The equilibrium length of each imaginary string is related to the strength of co-occurrence between the associated pair of words, as well as the learning rate; after many iterations the relative positions of the input and output vectors stabilize.

N-dim vector representation of a word.
V: vocabulary size; C: num. input words; v: row vector of input matrix W; v': row vector of output matrix W'.
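A compact NumPy sketch of one CBOW training step implementing the equations above (the vocabulary size, embedding size, and toy context/target indices are assumptions):

```python
import numpy as np

rng = np.random.RandomState(0)
V_size, N = 10, 4                     # vocabulary size V and embedding size N
W_in = rng.randn(V_size, N) * 0.1     # input  matrix W  (rows are the vectors v_w)
W_out = rng.randn(N, V_size) * 0.1    # output matrix W' (columns are the vectors v'_w)
eta = 0.1

context = [2, 5, 7]                   # indices of the C context words
target = 4                            # index of the output word w_O

# h = (1/C) * (v_w1 + ... + v_wC)
h = W_in[context].mean(axis=0)

# softmax over u_j = v'_wj . h, then E = -log p(w_O | context)
u = h @ W_out
y = np.exp(u - u.max())
y /= y.sum()
E = -np.log(y[target])

# gradient updates with e_j = y_j - t_j
e = y.copy()
e[target] -= 1.0
EH = W_out @ e                                # EH_i = sum_j e_j * w'_ij
W_out -= eta * np.outer(h, e)                 # v'_wj(new) = v'_wj(old) - eta * e_j * h
W_in[context] -= eta * EH / len(context)      # v_wI,c(new) = v_wI,c(old) - (1/C) * eta * EH

print("loss:", E)
```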
Remarkable	properties	from	Word	embedding
• Simple	algebraic	operations	with	the	word	vectors
v("woman") − v("man") ≃ v("aunt") − v("uncle")
v("woman") − v("man") ≃ v("queen") − v("king")
Mikolov,	Tomas,	Wen-tau	Yih,	and	Geoffrey	Zweig.	"Linguistic	Regularities	in	Continuous	Space	Word	Representations."	HLT-NAACL.	2013.
Figure 2: Left panel shows vector offsets for three word
pairs illustrating the gender relation. Right panel shows
a different projection, and the singular/plural relation for
two words. In high-dimensional space, multiple relations
can be embedded for a single word.
provided. We have explored several related meth-
ods and found that the proposed method performs
well for both syntactic and semantic relations. We
note that this measure is qualitatively similar to rela-
tional similarity model of (Turney, 2012), which pre-
dicts similarity between members of the word pairs
(xb, xd), (xc, xd) and dis-similarity for (xa, xd).
6 Experimental Results
To evaluate the vector offset method, we used
vectors generated by the RNN toolkit of Mikolov
Vector	offsets	for	
gender	 relation
The	singular/plural	 relation	 for	
two	words
The word relationship is defined by subtracting two word vectors, and the result is added to a third word. Thus, for example, Paris − France + Italy = Rome.
Using X = v("biggest") − v("big") + v("small") as the query and searching for the nearest word based on cosine distance results in v("smallest") (see the sketch below).
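A short NumPy sketch of the vector-offset query; the tiny hand-made embedding table is an assumption purely for illustration (real vectors would come from the word2vec toolkit):

```python
import numpy as np

# toy 3-d embeddings (assumed values, for illustration only)
emb = {
    "king":  np.array([0.8, 0.9, 0.1]),
    "queen": np.array([0.8, 0.9, 0.9]),
    "man":   np.array([0.7, 0.1, 0.1]),
    "woman": np.array([0.7, 0.1, 0.9]),
}

def nearest(x, exclude):
    """Return the vocabulary word whose vector is closest to x by cosine similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(emb[w], x))

# v("king") - v("man") + v("woman") should land near v("queen")
query = emb["king"] - emb["man"] + emb["woman"]
print(nearest(query, exclude={"king", "man", "woman"}))   # -> queen
```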
Single-prototype English embeddings by Huang
et al. (2012) are used to initialize Chinese em-
beddings. The initialization readily provides a set
(Align-Init) of benchmark embeddings in experi-
ments (Section 4), and ensures translation equiva-
lence in the embeddings at start of training.
3.2.2 Bilingual training
Using the alignment counts, we form alignment
matrices Aen→zh and Azh→en. For Aen→zh, each
row corresponds to a Chinese word, and each col-
umn an English word. An element aij is first as-
signed the counts of when the ith Chinese word is
aligned with the jth English word in parallel text.
After assignments, each row is normalized such that
it sums to one. The matrix Azh→en is defined sim-
ilarly. Denote the set of Chinese word embeddings
as Vzh, with each row a word embedding, and the
set of English word embeddings as Ven. With the
two alignment matrices, we define the Translation
Equivalence Objective:
JTEO-en→zh = ∥Vzh − Aen→zhVen∥2
(3)
JTEO-zh→en = ∥Ven − Azh→enVzh∥2
(4)
We optimize for a combined objective during train-
ing. For the Chinese embeddings we optimize for:
JCO-zh + λJTEO-en→zh (5)
For the English embeddings we optimize for:
JCO-en + λJTEO-zh→en (6)
During bilingual training, we chose the value of λ
such that convergence is achieved for both JCO and
JTEO. A small validation set of word similarities
3.3 Curriculum training
We train 100k-vocabulary word embeddings using
curriculum training (Turian et al., 2010) with Equa-
tion 5. For each curriculum, we sort the vocabu-
lary by frequency and segment the vocabulary by a
band-size taken from {5k, 10k, 25k, 50k}. Separate
bands of the vocabulary are trained in parallel using
minibatch L-BFGS on the Chinese Gigaword cor-
pus 3. We train 100,000 iterations for each curricu-
lum, and the entire 100k vocabulary is trained for
500,000 iterations. The process takes approximately
19 days on a eight-core machine. We show visual-
ization of learned embeddings overlaid with English
in Figure 1. The two-dimensional vectors for this vi-
sualization is obtained with t-SNE (van der Maaten
and Hinton, 2008). To make the figure comprehen-
sible, subsets of Chinese words are provided with
reference translations in boxes with green borders.
Words across the two languages are positioned by
the semantic relationships implied by their embed-
dings.
Figure 1: Overlaid bilingual embeddings: English words
are plotted in yellow boxes, and Chinese words in green;
reference translations to English are provided in boxes.
Zou, Will Y., et al. "Bilingual Word Embeddings for Phrase-Based Machine Translation." EMNLP. 2013.
Neural	Language	models
• n-gram	model	
– Construct conditional probabilities for the
next word, given combinations of the last
n−1 words (the contexts)
• Neural	language	model
– associate	with	each	word	a	distributed	word	
feature	vector	for	word	embedding,	
– express	the	joint	probability	function	of	
word	sequences	using	those	vectors,	and	
– learn	simultaneously	the	word	feature	
vectors and	the	parameters of	that	
probability	function.	
A statistical model of language can be represented by the conditional probability of the next word given all the previous ones, since

P̂(w1^T) = ∏t=1..T P̂(wt | w1^(t−1)),

where wt is the t-th word and wi^j denotes the sub-sequence (wi, wi+1, ..., wj−1, wj). Such statistical language models are useful in many applications involving natural language, such as speech recognition, language translation, and information retrieval.

When building statistical models of natural language, the difficulty of the modeling problem is considerably reduced by taking advantage of word order and of the fact that temporally closer words in the word sequence are statistically more dependent. Thus, n-gram models construct tables of conditional probabilities for the next word, for each one of a large number of contexts, i.e. combinations of the last n−1 words:

P̂(wt | w1^(t−1)) ≈ P̂(wt | w(t−n+1)^(t−1)).

Only those combinations of successive words that actually occur in the training corpus (or that occur frequently enough) are considered; when a new combination of n words appears that was not seen in training, the probability is obtained by backing off to a smaller context, as in back-off or smoothed (interpolated) trigram models. Typically n = 3 (trigrams) is used.

[Figure 1 (Bengio et al.): neural architecture with i-th output = P(wt = i | context): f(i, wt−1, ..., wt−n+1) = g(i, C(wt−1), ..., C(wt−n+1)), where g is the neural network (a tanh hidden layer followed by a softmax output, where most of the computation happens) and C(i) is the i-th word feature vector, taken from a look-up table C whose parameters are shared across words; the parameters of the mapping C are simply the feature vectors themselves.]

Bengio, Yoshua, et al. "Neural probabilistic language models." Innovations in Machine Learning. Springer Berlin Heidelberg, 2006. 137-186.
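A minimal NumPy sketch of the forward pass of such a neural n-gram model for n = 3: look up the shared feature vectors C(w) of the last n−1 words, concatenate them, apply a tanh hidden layer, and take a softmax over the vocabulary (vocabulary size, dimensions, and random parameters are assumptions):

```python
import numpy as np

rng = np.random.RandomState(0)
V, m, h, n = 20, 5, 8, 3               # vocab size, feature dim, hidden units, n-gram order
C = rng.randn(V, m) * 0.1              # shared word-feature look-up table C
H = rng.randn((n - 1) * m, h) * 0.1    # input -> hidden weights
U = rng.randn(h, V) * 0.1              # hidden -> output weights

def next_word_probs(context):
    """P(w_t = i | w_{t-n+1}, ..., w_{t-1}) for all words i in the vocabulary."""
    x = np.concatenate([C[w] for w in context])   # concatenated feature vectors C(w)
    a = np.tanh(x @ H)                            # tanh hidden layer
    u = a @ U
    p = np.exp(u - u.max())                       # softmax output
    return p / p.sum()

probs = next_word_probs([7, 2])        # toy context: indices of the last n-1 = 2 words
print(probs.shape, probs.sum())        # (20,) and ~1.0
```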
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 

Dernier (20)

Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 

Deep Learning

  • 5. Leadertable (ImageNet image classification). [Table from Russakovsky et al.: ILSVRC image-classification and single-object-localization results 2012–2014 with 99.9% confidence intervals; top classification entries include GoogLeNet 6.66% (2014), VGG 7.32% (2014), MSRA 8.06% (2014), Clarifai 11.74% (2013), SuperVision 16.42% (2012).] Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015, 115(3): 211-252. Unofficial human error is around 5.1% on a subset. Why is there still a human error? When labeling, human raters judged whether an image belongs to a class (binary classification); the challenge, however, is a 1000-class classification problem. 2015: ResNet (ILSVRC’15) 3.57. GoogLeNet: a 22-layer network. Microsoft ResNet: a 152-layer network. U. of Toronto SuperVision: a 7-layer network. http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
  • 6. Deep Learning helps Natural Language Processing. https://code.google.com/p/word2vec/ [Plot: visualization of regularities in word vector space, showing word pairs such as king–queen, he–she, actor–actress, prince–princess, hero–heroine.] From bag of words to distributed representation of words to decipher the semantic meaning of a word. Compositionality by vector addition (expression → nearest tokens): Czech + currency → koruna, Czech crown, Polish zloty, CTK; Vietnam + capital → Hanoi, Ho Chi Minh City, Viet Nam, Vietnamese; German + airlines → airline Lufthansa, carrier Lufthansa, flag carrier Lufthansa; Russian + river → Moscow, Volga River, upriver, Russia; French + actress → Juliette Binoche, Vanessa Paradis, Charlotte Gainsbourg.
  • 8. ‘Deep Learning’ buzz. ‘Deep learning’ was reborn in 2006 and has recently become a buzzword with the explosion of 'big data', 'data science', and their applications mentioned in the media. [Trend chart: big data, neural networks, data science, deep learning.]
  • 9. What is Deep Learning? • “Deep learning” has its roots in “neural networks” • Deep learning creates many layers (deep) of neurons, attempting to learn a structured representation of big data, layer by layer. [Figure: a computerised artificial neuron vs. a biological neuron.] Le, Quoc V. "Building high-level features using large scale unsupervised learning." Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013.
  • 10. Brief History • The First wave – 1943: McCulloch and Pitts proposed the McCulloch-Pitts neuron model – 1958: Rosenblatt introduced the simple single-layer networks now called Perceptrons – 1969: Minsky and Papert’s book Perceptrons demonstrated the limitation of single-layer perceptrons, and almost the whole field went into hibernation • The Second wave – 1986: the Back-Propagation learning algorithm for Multi-Layer Perceptrons was rediscovered and the whole field took off again • The Third wave – 2006: Deep (neural network) Learning gains popularity and – 2012: makes significant breakthroughs in many applications.
  • 11. Summary • Neural networks basics – the backpropagation learning algorithm • Go deeper in layers • Convolutional neural networks (CNN) – address feature dependency • Recurrent neural networks (RNN) – address time dependency • Deep reinforcement learning – enables real-time control
  • 12. Summary • Neural networks basics • Go deeper in layers • Convolutional neural networks (CNN) • Recurrent neural networks (RNN) • Deep reinforcement learning
  • 13. Early Artificial Neurons (1943-1969) • A biological neuron • In the human brain, a typical neuron collects signals from others through a host of fine structures called dendrites • The neuron sends out spikes of electrical activity through a long, thin strand known as an axon, which splits into thousands of branches • At the end of each branch, a structure called a synapse converts the activity from the axon into electrical effects that inhibit or excite activity in the connected neurons.
  • 14. Early Artificial Neurons (1943-1969) • McCulloch–Pitts neuron • McCulloch and Pitts [1943] proposed a simple model of a neuron as a computing machine • The artificial neuron computes a weighted sum of its inputs from other neurons, and outputs a one or a zero according to whether the sum is above or below a certain threshold: y = f( ∑_m wm xm − b ), where f(x) = 1 if x ≥ 0 and 0 otherwise.
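To make the threshold rule concrete, here is a minimal sketch (not from the slides) of a McCulloch–Pitts-style neuron in Python/NumPy; the weights and threshold are illustrative values chosen so the unit behaves like a logical AND gate.

```python
import numpy as np

def mp_neuron(x, w, b):
    """McCulloch-Pitts-style neuron: weighted sum of inputs followed by a hard threshold."""
    return 1 if np.dot(w, x) - b >= 0 else 0

w = np.array([1.0, 1.0])   # illustrative weights
b = 1.5                    # illustrative threshold
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, mp_neuron(np.array(x), w, b))   # fires (outputs 1) only for (1, 1)
```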
  • 15. Early Artificial Neurons (1943-1969) • Rosenblatt’s single layer perceptron • Rosenblatt [1958] further proposed the perceptron as the first model for learning with a teacher (i.e., supervised learning) • Focused on how to find appropriate weights wm for a two-class classification task (y = 1: class one and y = -1: class two): y = ϕ( ∑_m wm xm − b ), with activation function ϕ(x) = 1 if x ≥ 0 and −1 otherwise. [Figure: signal-flow graph of the perceptron, a linear combiner over the inputs plus an externally applied bias, followed by a hard limiter.]
  • 16. Early Artificial Neurons (1943-1969) • Rosenblatt’s single layer perceptron • Proved the convergence of a learning algorithm if the two classes are linearly separable (i.e., patterns that lie on opposite sides of a hyperplane) • Many people hoped that such machines could be a basis for artificial intelligence. Prediction: y(t) = ϕ( ∑_m wm(t) xm(t) + b(t) ), t = 1, ..., T. Learning rule: Δwm(t) = η (d(t) − y(t)) xm(t) and wm(t+1) = wm(t) + Δwm(t), where d(t) is the training label and η is a parameter (the learning rate). Haykin S. Neural Networks and Learning Machines. Upper Saddle River: Pearson Education, 2009.
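A minimal sketch of the perceptron learning rule above, on an assumed toy dataset (not from the slides): every time the prediction disagrees with the label, the weights move by η(d − y)x.

```python
import numpy as np

def train_perceptron(X, d, eta=0.1, epochs=20):
    """Rosenblatt perceptron: w <- w + eta * (d - y) * x, with the bias absorbed as an extra weight."""
    X = np.hstack([X, np.ones((len(X), 1))])   # append a constant 1 to each input for the bias
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, d):
            y = 1 if w @ x >= 0 else -1        # hard-limiter activation phi
            w += eta * (target - y) * x        # no change when y == target
    return w

# Linearly separable toy problem with labels in {-1, +1} (logical AND)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
d = np.array([-1, -1, -1, 1])
print(train_perceptron(X, d))
```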
  • 17. Early Artificial Neurons (1943-1969) • The XOR problem • Minsky and Papert [1969] showed that some rather elementary computations, such as the XOR problem, could not be done by Rosenblatt’s one-layer perceptron • Rosenblatt believed the limitations could be overcome if more layers of units were added, but no learning algorithm was known to obtain the weights yet • Due to the lack of learning algorithms, people left the neural network paradigm for almost 20 years. Truth table (inputs X1, X2; output X1 XOR X2): 0 0 → 0; 0 1 → 1; 1 0 → 1; 1 1 → 0. XOR is not linearly separable: the two classes (true and false) cannot be separated using a line.
  • 18. Hidden layers and the backpropagation (1986~) • Adding hidden layer(s) (an internal representation) allows the network to learn a mapping that is not constrained by a “linearly separable” decision boundary. [Figure: a single neuron gives one linear boundary x1·w1 + x2·w2 + b = 0 separating class 1 from class 2; with a hidden layer, each hidden node realises one of the lines bounding a convex region containing one class.]
  • 19. Hidden layers and the backpropagation (1986~) • But the solution is quite often not unique Input x Output y X1 X2 X1 XOR X2 0 0 0 0 1 1 1 0 1 1 1 0 The number in the circle is a threshold http://recognize-speech.com/basics/introduction-to-artificial-neural-networks http://www.cs.stir.ac.uk/research/publications/techreps/pdf/TR148.pdf (solution 1) (solution 2) ϕ(⋅) is the sign function two lines are necessary to divide the sample space accordingly
  • 20. Hidden layers and the backpropagation (1986~) • Two-layer feedforward neural network • Feedforward: messages move forward from the input nodes, through the hidden nodes (if any), to the output nodes. There are no cycles or loops in the network. [Figure: two layers of weight parameters.]
  • 21. Hidden layers and the backpropagation (1986~) • One of the efficient algorithms for multi-layer neural networks is the Backpropagation algorithm • It was re-introduced in 1986 and Neural Networks regained their popularity. Note: backpropagation appears to have first been found by Werbos [1974], and was then independently rediscovered around 1985 by Rumelhart, Hinton, and Williams [1986] and by Parker [1985]. [Figure: the error is calculated at the output layer and back-propagated through the two layers of weight parameters.]
  • 22. Example Application: Face Detection Figure 2. System for Neural Network based face detection in the compressed domain Rowley, Henry A., Shumeet Baluja, and Takeo Kanade. "Neural network-based face detection." Pattern Analysis and Machine Intelligence, IEEE Transactions on 20.1 (1998): 23-38. Wang J, Kankanhalli M S, Mulhem P, et al. Face detection using DCT coefficients in MPEG video[C]//Proceedings of the International Workshop on Advanced Imaging Technology (IWAIT 2002). 2002: 66-70.
  • 23. Example Application: Face Detection. Figure 2: system for neural-network-based face detection in the compressed domain; Figure 3: training data preparation. [Excerpt from the paper: DCT coefficients are reconstructed for overlapping blocks, with scan-window steps of two pixels as a speed/accuracy trade-off; 126 DCT coefficient features are retained from a 4×4-block window, and upper/lower bounds Ui and Li for each feature are estimated from the p training examples, e.g. Ui = α·max{1, xi(1), ..., xi(p)}.] Wang J, Kankanhalli M S, Mulhem P, et al. Face detection using DCT coefficients in MPEG video. Proceedings of the International Workshop on Advanced Imaging Technology (IWAIT 2002). 2002: 66-70.
  • 24. How it works (training). [Figure: training instances (x1, x2, ..., xm) are fed through the network; the outputs (y0, y1) are compared with the labels (d1 = 1 for “face”, d2 = 0 for “no face”), the error is calculated, and the error is back-propagated to update the two layers of weight parameters.]
  • 25–27. Two-layer feedforward neural network. Feed-forward / prediction: given input x = (x1, ..., xm), each hidden unit computes hj(1) = f(1)(netj(1)) = f(1)( ∑_m wj,m(1) xm ) and each output unit computes yk = f(2)(netk(2)) = f(2)( ∑_j wk,j(2) hj(1) ), where netj(1) ≡ ∑_m wj,m(1) xm and netk(2) ≡ ∑_j wk,j(2) hj(1); the dk are the labels used during training.
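A minimal NumPy sketch of this forward pass, assuming sigmoid activations for both f(1) and f(2) (the shapes and random weights are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W1, W2):
    """Two-layer feedforward prediction: hidden net/activation, then output net/activation."""
    net1 = W1 @ x          # net_j^(1) = sum_m w_{j,m}^(1) x_m
    h = sigmoid(net1)      # h_j^(1) = f(1)(net_j^(1))
    net2 = W2 @ h          # net_k^(2) = sum_j w_{k,j}^(2) h_j^(1)
    y = sigmoid(net2)      # y_k = f(2)(net_k^(2))
    return h, y

rng = np.random.default_rng(0)
x = rng.normal(size=3)                # m = 3 inputs
W1 = rng.normal(size=(4, 3))          # 4 hidden units
W2 = rng.normal(size=(2, 4))          # 2 output units
h, y = forward(x, W1, W2)
print(y)
```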
  • 28–29. When we backprop/learn the parameters (two-layer feedforward neural network, output neurons). Loss: E(W) = ½ ∑_k (yk − dk)². For an output weight: Δwk,j(2) ≡ −η ∂E(W)/∂wk,j(2) = −η (yk − dk) · ∂yk/∂netk(2) · ∂netk(2)/∂wk,j(2) = η (dk − yk) f(2)'(netk(2)) hj(1) = η δk hj(1), with the error term δk ≡ (dk − yk) f(2)'(netk(2)) and the update wk,j(2) = wk,j(2) + Δwk,j(2). In words: Δwk,j(2) = η × Error_k × Output_j = η δk hj(1). Notation: netj(1) ≡ ∑_m wj,m(1) xm, netk(2) ≡ ∑_j wk,j(2) hj(1); η is the given learning rate.
  • 30–31. When we backprop/learn the parameters (two-layer feedforward neural network, hidden neurons). For a hidden weight: Δwj,m(1) = −η ∂E(W)/∂wj,m(1) = −η · ∂E(W)/∂hj(1) · ∂hj(1)/∂wj,m(1) = η [ ∑_k (dk − yk) f(2)'(netk(2)) wk,j(2) ] · [ xm f(1)'(netj(1)) ] = η δj xm, with the error term δj ≡ f(1)'(netj(1)) ∑_k (dk − yk) f(2)'(netk(2)) wk,j(2) = f(1)'(netj(1)) ∑_k δk wk,j(2), and the update wj,m(1) = wj,m(1) + Δwj,m(1). In words: Δwj,m(1) = η × Error_j × Output_m = η δj xm; η is the given learning rate.
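Putting the output-layer and hidden-layer rules together, a minimal sketch of one stochastic-gradient step for this two-layer network (assuming sigmoid activations and the squared-error loss; all values are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, d, W1, W2, eta=0.5):
    """One forward pass plus one backward pass, updating both weight layers in place."""
    h = sigmoid(W1 @ x)                          # hidden activations h_j^(1)
    y = sigmoid(W2 @ h)                          # outputs y_k

    delta_out = (d - y) * y * (1.0 - y)                 # delta_k = (d_k - y_k) f(2)'(net_k^(2))
    delta_hidden = (W2.T @ delta_out) * h * (1.0 - h)   # delta_j = f(1)'(net_j^(1)) sum_k delta_k w_kj

    W2 += eta * np.outer(delta_out, h)           # Delta w_kj^(2) = eta * delta_k * h_j^(1)
    W1 += eta * np.outer(delta_hidden, x)        # Delta w_jm^(1) = eta * delta_j * x_m
    return y

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x, d = rng.normal(size=3), np.array([1.0, 0.0])
for _ in range(100):
    y = backprop_step(x, d, W1, W2)
print(y)   # the outputs should have moved towards the target d
```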
  • 32–33. An example for Backprop (Figure 3.4: all the calculations for a reverse pass of backpropagation, for a network with inputs λ and Ω, hidden neurons A, B, C and output neurons α, β; sigmoid activation assumed). 1. Calculate the errors of the output neurons: δα = outα (1 − outα)(Targetα − outα); δβ = outβ (1 − outβ)(Targetβ − outβ). 2. Change the output layer weights: W+Aα = WAα + η δα outA; W+Aβ = WAβ + η δβ outA; W+Bα = WBα + η δα outB; W+Bβ = WBβ + η δβ outB; W+Cα = WCα + η δα outC; W+Cβ = WCβ + η δβ outC. 3. Calculate (back-propagate) the hidden layer errors: δA = outA (1 − outA)(δα WAα + δβ WAβ); δB = outB (1 − outB)(δα WBα + δβ WBβ); δC = outC (1 − outC)(δα WCα + δβ WCβ). 4. Change the hidden layer weights: W+λA = WλA + η δA inλ; W+ΩA = WΩA + η δA inΩ; W+λB = WλB + η δB inλ; W+ΩB = WΩB + η δB inΩ; W+λC = WλC + η δC inλ; W+ΩC = WΩC + η δC inΩ. The constant η (the learning rate, nominally equal to one) can be raised or lowered to speed up or slow down learning if required. With the sigmoid activation fSigmoid(x) = 1/(1 + e^−x) and f'Sigmoid(x) = fSigmoid(x)(1 − fSigmoid(x)), these steps are instances of the general rules δk = (dk − yk) f(2)'(netk(2)), Δwk,j(2) = η δk hj(1) and δj = f(1)'(netj(1)) ∑_k δk wk,j(2), Δwj,m(1) = η δj xm. https://www4.rgu.ac.uk/files/chapter3%20-%20bp.pdf
  • 34. Let us do some calculation. To illustrate this, let’s do a worked example (Worked Example 3.1). Consider the simple network below: inputs A = 0.35 and B = 0.9, two hidden neurons and one output neuron, with weights 0.1 (A to top hidden), 0.8 (B to top hidden), 0.4 (A to bottom hidden), 0.6 (B to bottom hidden), 0.3 (top hidden to output) and 0.9 (bottom hidden to output). Assume that the neurons have a sigmoid activation function, and (i) perform a forward pass on the network; (ii) perform a reverse pass (training) once (target = 0.5); (iii) perform a further forward pass and comment on the result. https://www4.rgu.ac.uk/files/chapter3%20-%20bp.pdf
  • 35. Let us do some calculation. Answer: (i) Input to top hidden neuron = (0.35×0.1)+(0.9×0.8) = 0.755, out = 0.68. Input to bottom hidden neuron = (0.9×0.6)+(0.35×0.4) = 0.68, out = 0.6637. Input to final neuron = (0.3×0.68)+(0.9×0.6637) = 0.80133, out = 0.69. (ii) Output error δ = (t−o)(1−o)o = (0.5−0.69)(1−0.69)×0.69 = −0.0406. New weights for the output layer: w1+ = w1 + (δ × input) = 0.3 + (−0.0406×0.68) = 0.272392; w2+ = w2 + (δ × input) = 0.9 + (−0.0406×0.6637) = 0.87305. Errors for the hidden layer: δ1 = δ × w1 × (1−o)o = −0.0406 × 0.272392 × (1−o)o = −2.406×10⁻³; δ2 = δ × w2 × (1−o)o = −0.0406 × 0.87305 × (1−o)o = −7.916×10⁻³. New hidden layer weights: w3+ = 0.1 + (−2.406×10⁻³ × 0.35) = 0.09916; w4+ = 0.8 + (−2.406×10⁻³ × 0.9) = 0.7978; w5+ = 0.4 + (−7.916×10⁻³ × 0.35) = 0.3972; w6+ = 0.6 + (−7.916×10⁻³ × 0.9) = 0.5928. (iii) Old error was −0.19; the new error is −0.18205, so the error has reduced. https://www4.rgu.ac.uk/files/chapter3%20-%20bp.pdf
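For readers who want to check the arithmetic, here is a small sketch that reproduces this forward pass and one training step; it follows the referenced worked example (note that, as in the source, the hidden-layer errors are computed with the already-updated output weights):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

A, B, target, eta = 0.35, 0.9, 0.5, 1.0
w_top = [0.1, 0.8]     # weights A->top hidden, B->top hidden
w_bot = [0.4, 0.6]     # weights A->bottom hidden, B->bottom hidden
w_out = [0.3, 0.9]     # weights top->output, bottom->output

# (i) Forward pass
h1 = sigmoid(A * w_top[0] + B * w_top[1])        # ~0.680
h2 = sigmoid(A * w_bot[0] + B * w_bot[1])        # ~0.6637
y = sigmoid(w_out[0] * h1 + w_out[1] * h2)       # ~0.69

# (ii) Reverse pass
delta = (target - y) * (1 - y) * y               # ~ -0.0406
w_out = [w_out[0] + eta * delta * h1,            # ~0.2724
         w_out[1] + eta * delta * h2]            # ~0.8731
d1 = delta * w_out[0] * (1 - h1) * h1            # ~ -2.4e-3 (uses the updated weight, as in the source)
d2 = delta * w_out[1] * (1 - h2) * h2            # ~ -7.9e-3
w_top = [w_top[0] + eta * d1 * A, w_top[1] + eta * d1 * B]
w_bot = [w_bot[0] + eta * d2 * A, w_bot[1] + eta * d2 * B]

# (iii) Further forward pass: the error (target - output) shrinks from ~ -0.19 to ~ -0.18
h1 = sigmoid(A * w_top[0] + B * w_top[1])
h2 = sigmoid(A * w_bot[0] + B * w_bot[1])
print(target - sigmoid(w_out[0] * h1 + w_out[1] * h2))
```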
  • 37. Activation functions • Logistic sigmoid: fSigmoid(x) = 1/(1 + e^−x) • Output range [0, 1] • Motivated by biological neurons; can be interpreted as the probability of an artificial neuron “firing” given its inputs • However, saturated neurons make gradients vanish (why?) • Its derivative: f'Sigmoid(x) = fSigmoid(x)(1 − fSigmoid(x))
  • 38. Activation functions • Tanh function: ftanh(x) = sinh(x)/cosh(x) = (e^x − e^−x)/(e^x + e^−x) • Output range [−1, 1] • Thus strongly negative inputs to the tanh will map to negative outputs • Only zero-valued inputs are mapped to near-zero outputs • These properties make the network less likely to get “stuck” during training • Its gradient: f'tanh(x) = 1 − ftanh(x)² • Tanh squashes numbers to the range [−1, 1] and is zero-centered (nice), but still kills gradients when saturated [LeCun et al., 199…] https://theclevermachine.wordpress.com/tag/tanh-function/
  • 39. Activation functions • ReLU (rectified linear unit): fReLU(x) = max(0, x) • Another version is the noisy ReLU: fnoisyReLU(x) = max(0, x + N(0, δ(x))) • ReLU can be approximated by the softplus function fsoftplus(x) = log(1 + e^x) • The ReLU gradient does not vanish as we increase x • It can be used to model positive numbers • It is fast, as there is no need to compute the exponential function • It eliminates the necessity of a “pretraining” phase • The derivative: for x > 0, f' = 1; for x < 0, f' = 0. http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40811.pdf
  • 40. Activation functions • ReLU (rectified linear unit): fReLU(x) = max(0, x), which can be approximated by the softplus function fsoftplus(x) = log(1 + e^x) • The only non-linearity comes from the path selection, with individual neurons being active or not • It allows sparse representations: for a given input only a subset of neurons are active (sparse propagation of activations and gradients) • Additional activation functions: Leaky ReLU, Exponential LU, Maxout, etc. http://www.jmlr.org/proceedings/papers/v15/glorot11a/glorot11a.pdf
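A small sketch of these activation functions and their derivatives (the names and the finite-difference sanity check are my own, not from the slides):

```python
import numpy as np

def sigmoid(x):   return 1.0 / (1.0 + np.exp(-x))
def d_sigmoid(x): return sigmoid(x) * (1.0 - sigmoid(x))

def tanh(x):      return np.tanh(x)
def d_tanh(x):    return 1.0 - np.tanh(x) ** 2

def relu(x):      return np.maximum(0.0, x)
def d_relu(x):    return (x > 0).astype(float)    # 0 for x < 0, 1 for x > 0

def softplus(x):  return np.log1p(np.exp(x))      # smooth approximation of ReLU

# Finite-difference check that the analytic derivatives match
x = np.linspace(-3.0, 3.0, 7)
eps = 1e-6
for f, df in [(sigmoid, d_sigmoid), (tanh, d_tanh)]:
    numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
    assert np.allclose(numeric, df(x), atol=1e-5)
```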
  • 41. Error/Loss function • Recall stochastic gradient descent – update from a randomly picked example (but in practice do a batch update) • Squared error loss for one binary output: E(w) = ½ (d_label − f(x; w))², updated by w = w − η ∂E(w)/∂w, where x is the input and f(x; w) is the network output.
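As an illustrative sketch of this update rule, one SGD step for the squared-error loss with, for simplicity, a linear model f(x; w) = w·x standing in for the network (the data below are made up):

```python
import numpy as np

def sgd_step(w, x, d, eta=0.1):
    """One stochastic gradient step for E(w) = 0.5 * (d - f(x; w))^2 with f(x; w) = w . x."""
    y = w @ x                    # model output
    grad = -(d - y) * x          # dE/dw by the chain rule
    return w - eta * grad        # w <- w - eta * dE/dw

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
w = np.zeros(2)
for _ in range(500):
    x = rng.normal(size=2)
    d = w_true @ x               # noiseless target from the "true" weights
    w = sgd_step(w, x, d)
print(w)                         # approaches w_true
```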
  • 42. Error/Loss function • Softmax output with cross-entropy loss for multiple classes: E(w) = − ∑_k ( dk log yk + (1 − dk) log(1 − yk) ), where yk = exp( ∑_j wk,j(2) hj(1) ) / ∑_k' exp( ∑_j wk',j(2) hj(1) ) and the dk are one-hot encoded class labels (class labels follow a multinomial distribution).
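A minimal sketch of the softmax output with the usual multinomial cross-entropy −∑_k dk log yk for one-hot labels (the max-subtraction for numerical stability is my addition, not something on the slide):

```python
import numpy as np

def softmax(net):
    """y_k = exp(net_k) / sum_k' exp(net_k'), shifted by max(net) for numerical stability."""
    z = net - np.max(net)
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y, d):
    """E = -sum_k d_k * log(y_k) for a one-hot encoded label vector d."""
    return -np.sum(d * np.log(y + 1e-12))

net = np.array([2.0, 0.5, -1.0])    # output-layer net inputs (illustrative)
d = np.array([1.0, 0.0, 0.0])       # one-hot encoded class label
y = softmax(net)
print(y, cross_entropy(y, d))
```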
  • 44. Summary • Neural networks basic and its backpropagation learning algorithm • Go deeper in layers • Convolutional neural networks (CNN) • Recurrent neural networks (RNN) • Deep reinforcement learning
  • 46. Why go deep now? • Big data everywhere • Demands for: – unsupervised learning: most of the data are unlabeled – end-to-end trainable feature solutions: hand-crafted features are time-consuming • Improved computational power: – widely used GPUs for computing – distributed computing (MapReduce & Spark)
  • 47. Deep learning problem • Difficult to train – very many parameters • Two solutions: – regularise with prior knowledge (since 1989): • Convolutional Neural Networks (CNN) – new methods of pre-training using unsupervised methods (since 2006): • Stacked auto-encoders • Stacked Restricted Boltzmann Machines
  • 48. New methods for unsupervised pre-training (2006) • Stacked RBMs (Deep Belief Networks, DBNs): Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18:1527-1554; Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, Vol. 313, no. 5786, pp. 504-507, 28 July 2006. • (Stacked) Autoencoders: Bengio, Y., Lamblin, P., Popovici, P., Larochelle, H. (2007). Greedy Layer-Wise Training of Deep Networks, Advances in Neural Information Processing Systems 19.
  • 49. Restricted Boltzmann Machines (RBM) = Z e−E(v,h) with the energy function h) = − n i=1 m j=1 wijhivj − m j=1 bjvj − n i=1 cihi . (20) } and j ∈ {1, ..., m}, wij is a real valued weight associated een units Vj and Hi and bj and ci are real valued bias terms jth visible and the ith hidden variable, respectively. ected graph of an RBM with n hidden and m visible variables RBM has only connections between the layer of hidden and 4 Restricted Boltzmann Machines A RBM (also denoted as Harmonium [34]) is an MRF associated tite undirected graph as shown in Fig. 1. It consists of m visib (V1, ..., Vm) to represent observable data and n hidden units H to capture dependencies between observed variables. In binary RB in this tutorial, the random variables (V , H) take values (v, h) and the joint probability distribution under the model is given distribution p(v, h) = 1 Z e−E(v,h) with the energy function E(v, h) = − n i=1 m j=1 wijhivj − m j=1 bjvj − n i=1 cihi . For all i ∈ {1, ..., n} and j ∈ {1, ..., m}, wij is a real valued wei with the edge between units Vj and Hi and bj and ci are real valu associated with the jth visible and the ith hidden variable, respe http://image.diku.dk/igel/paper/AItRBM-proof.pdf • A RBM is a bipartite (two disjointedset)undirected graph and used as probabilistic generative models – contains m visible units V = (V1, ..., Vm) to represent observable data and – n hidden units H = (H1, ..., Hn) to capture their dependencies • Assume the random variables (V , H) take values (v, h) ∈ {0, 1}m+n i This equation shows why a (marginalized) RBM can be regarded as a product of expert 58], in which a number of “experts” for the individual components of the observations a multiplicatively. Any distribution on {0, 1}m can be modeled arbitrarily well by an RBM with m visib hidden units, where k denotes the cardinality of the support set of the target distribution number of input elements from {0, 1}m that have a non-zero probability of being observe been shown recently that even fewer units can be sufficient, depending on the patterns in set [41]. 4.1 RBMs and neural networks The RBM can be interpreted as a stochastic neural network, where the nodes and edges c neurons and synaptic connections, respectively. The conditional probability of a single v one can be interpreted as the firing rate of a (stochastic) neuron with sigmoid activa sig(x) = 1/(1 + e−x ), because p(Hi = 1 | v) = sig m j=1 wijvj + ci and p(Vj = 1 | h) = sig n i=1 wijhi + bj . To see this, let v−l denote the state of all visible units except the lth one and let us defi αl(h) = − n i=1 wilhi − bl 58], in which a number of “experts” for the individual components of the observations multiplicatively. Any distribution on {0, 1}m can be modeled arbitrarily well by an RBM with m visi hidden units, where k denotes the cardinality of the support set of the target distribution number of input elements from {0, 1}m that have a non-zero probability of being observe been shown recently that even fewer units can be sufficient, depending on the patterns in set [41]. 4.1 RBMs and neural networks The RBM can be interpreted as a stochastic neural network, where the nodes and edges c neurons and synaptic connections, respectively. The conditional probability of a single v one can be interpreted as the firing rate of a (stochastic) neuron with sigmoid activa sig(x) = 1/(1 + e−x ), because p(Hi = 1 | v) = sig m j=1 wijvj + ci and p(Vj = 1 | h) = sig n i=1 wijhi + bj . 
To see this, let v−l denote the state of all visible units except the lth one and let us defi αl(h) = − n i=1 wilhi − bl es is an MRF associated with a bipartite undirected graph as V = (V1, . . . , Vm) representing the observable data, and n e dependencies between the observed variables. In binary variables (V , H) take values (v, h) ∈ {0, 1}m+n and the el is given by the Gibbs distribution p(v, h) = 1 Z e−E(v,h) 1 wijhivj − m j=1 bjvj − n i=1 cihi . (18) is a real valued weight associated with the edge between alued bias terms associated with the jth visible and the Z = v,h e E(v,h) The probability that the network assigns to a visible vector, v, is given by hidden vectors: p(v) = 1 Z X h e E(v,h) The probability that the network assigns to a training image can be rais and biases to lower the energy of that image and to raise the energy of oth that have low energies and therefore make a big contribution to the partitio of the log probability of a training vector with respect to a weight is surp @ log p(v) @wij = hvihjidata hvihjimodel where the angle brackets are used to denote expectations under the dis subscript that follows. This leads to a very simple learning rule for perf ascent in the log probability of the training data: wij = ✏(hvihjidata hvihjimodel) where ✏ is a learning rate. p(v, h) = 1 Z e E(v,h) where the “partition function”, Z, is given by summing over all possible pair vectors: Z = X v,h e E(v,h) The probability that the network assigns to a visible vector, v, is given by su hidden vectors: p(v) = 1 Z X h e E(v,h) The probability that the network assigns to a training image can be raised b and biases to lower the energy of that image and to raise the energy of other i that have low energies and therefore make a big contribution to the partition f of the log probability of a training vector with respect to a weight is surprisin @ log p(v) @wij = hvihjidata hvihjimodel where the angle brackets are used to denote expectations under the distrib subscript that follows. This leads to a very simple learning rule for perform Joint distribution: , where the energy function sig(): sigmoid We learn the parameters wij by maximising a log-likelihood given example v: No connections within a layer (restricted) leads to <*> data : expectations under the data empirical distribution <*> model : expectations under the model Hinton, Geoffrey. "A practical guide to training restricted Boltzmann machines." Momentum 9.1 (2010): 926.
  • 50. Restricted Boltzmann Machines (RBM) = Z e−E(v,h) with the energy function h) = − n i=1 m j=1 wijhivj − m j=1 bjvj − n i=1 cihi . (20) } and j ∈ {1, ..., m}, wij is a real valued weight associated een units Vj and Hi and bj and ci are real valued bias terms jth visible and the ith hidden variable, respectively. ected graph of an RBM with n hidden and m visible variables RBM has only connections between the layer of hidden and 4 Restricted Boltzmann Machines A RBM (also denoted as Harmonium [34]) is an MRF associated tite undirected graph as shown in Fig. 1. It consists of m visib (V1, ..., Vm) to represent observable data and n hidden units H to capture dependencies between observed variables. In binary RB in this tutorial, the random variables (V , H) take values (v, h) and the joint probability distribution under the model is given distribution p(v, h) = 1 Z e−E(v,h) with the energy function E(v, h) = − n i=1 m j=1 wijhivj − m j=1 bjvj − n i=1 cihi . For all i ∈ {1, ..., n} and j ∈ {1, ..., m}, wij is a real valued wei with the edge between units Vj and Hi and bj and ci are real valu associated with the jth visible and the ith hidden variable, respe http://image.diku.dk/igel/paper/AItRBM-proof.pdf • A RBM is a bipartite (two disjointedset)undirected graph and used as probabilistic generative models – contains m visible units V = (V1, ..., Vm) to represent observable data and – n hidden units H = (H1, ..., Hn) to capture their dependencies • Assume the random variables (V , H) take values (v, h) ∈ {0, 1}m+n i This equation shows why a (marginalized) RBM can be regarded as a product of expert 58], in which a number of “experts” for the individual components of the observations a multiplicatively. Any distribution on {0, 1}m can be modeled arbitrarily well by an RBM with m visib hidden units, where k denotes the cardinality of the support set of the target distribution number of input elements from {0, 1}m that have a non-zero probability of being observe been shown recently that even fewer units can be sufficient, depending on the patterns in set [41]. 4.1 RBMs and neural networks The RBM can be interpreted as a stochastic neural network, where the nodes and edges c neurons and synaptic connections, respectively. The conditional probability of a single v one can be interpreted as the firing rate of a (stochastic) neuron with sigmoid activa sig(x) = 1/(1 + e−x ), because p(Hi = 1 | v) = sig m j=1 wijvj + ci and p(Vj = 1 | h) = sig n i=1 wijhi + bj . To see this, let v−l denote the state of all visible units except the lth one and let us defi αl(h) = − n i=1 wilhi − bl 58], in which a number of “experts” for the individual components of the observations multiplicatively. Any distribution on {0, 1}m can be modeled arbitrarily well by an RBM with m visi hidden units, where k denotes the cardinality of the support set of the target distribution number of input elements from {0, 1}m that have a non-zero probability of being observe been shown recently that even fewer units can be sufficient, depending on the patterns in set [41]. 4.1 RBMs and neural networks The RBM can be interpreted as a stochastic neural network, where the nodes and edges c neurons and synaptic connections, respectively. The conditional probability of a single v one can be interpreted as the firing rate of a (stochastic) neuron with sigmoid activa sig(x) = 1/(1 + e−x ), because p(Hi = 1 | v) = sig m j=1 wijvj + ci and p(Vj = 1 | h) = sig n i=1 wijhi + bj . 
To see this, let v−l denote the state of all visible units except the lth one and let us defi αl(h) = − n i=1 wilhi − bl es is an MRF associated with a bipartite undirected graph as V = (V1, . . . , Vm) representing the observable data, and n e dependencies between the observed variables. In binary variables (V , H) take values (v, h) ∈ {0, 1}m+n and the el is given by the Gibbs distribution p(v, h) = 1 Z e−E(v,h) 1 wijhivj − m j=1 bjvj − n i=1 cihi . (18) is a real valued weight associated with the edge between alued bias terms associated with the jth visible and the Z = v,h e E(v,h) The probability that the network assigns to a visible vector, v, is given by hidden vectors: p(v) = 1 Z X h e E(v,h) The probability that the network assigns to a training image can be rais and biases to lower the energy of that image and to raise the energy of oth that have low energies and therefore make a big contribution to the partitio of the log probability of a training vector with respect to a weight is surp @ log p(v) @wij = hvihjidata hvihjimodel where the angle brackets are used to denote expectations under the dis subscript that follows. This leads to a very simple learning rule for perf ascent in the log probability of the training data: wij = ✏(hvihjidata hvihjimodel) where ✏ is a learning rate. p(v, h) = 1 Z e E(v,h) where the “partition function”, Z, is given by summing over all possible pair vectors: Z = X v,h e E(v,h) The probability that the network assigns to a visible vector, v, is given by su hidden vectors: p(v) = 1 Z X h e E(v,h) The probability that the network assigns to a training image can be raised b and biases to lower the energy of that image and to raise the energy of other i that have low energies and therefore make a big contribution to the partition f of the log probability of a training vector with respect to a weight is surprisin @ log p(v) @wij = hvihjidata hvihjimodel where the angle brackets are used to denote expectations under the distrib subscript that follows. This leads to a very simple learning rule for perform Joint distribution: , where the energy function sig(): sigmoid We learn the parameters wij by maximising a log-likelihood given example v: No connections within a layer (restricted) leads to <*> data : expectations under the data empirical distribution <*> model : expectations under the model Hinton, Geoffrey. "A practical guide to training restricted Boltzmann machines." Momentum 9.1 (2010): 926.
  • 51. Restricted Boltzmann Machines (RBM) = Z e−E(v,h) with the energy function h) = − n i=1 m j=1 wijhivj − m j=1 bjvj − n i=1 cihi . (20) } and j ∈ {1, ..., m}, wij is a real valued weight associated een units Vj and Hi and bj and ci are real valued bias terms jth visible and the ith hidden variable, respectively. ected graph of an RBM with n hidden and m visible variables RBM has only connections between the layer of hidden and 4 Restricted Boltzmann Machines A RBM (also denoted as Harmonium [34]) is an MRF associated tite undirected graph as shown in Fig. 1. It consists of m visib (V1, ..., Vm) to represent observable data and n hidden units H to capture dependencies between observed variables. In binary RB in this tutorial, the random variables (V , H) take values (v, h) and the joint probability distribution under the model is given distribution p(v, h) = 1 Z e−E(v,h) with the energy function E(v, h) = − n i=1 m j=1 wijhivj − m j=1 bjvj − n i=1 cihi . For all i ∈ {1, ..., n} and j ∈ {1, ..., m}, wij is a real valued wei with the edge between units Vj and Hi and bj and ci are real valu associated with the jth visible and the ith hidden variable, respe http://image.diku.dk/igel/paper/AItRBM-proof.pdf • A RBM is a bipartite (two disjointedset)undirected graph and used as probabilistic generative models – contains m visible units V = (V1, ..., Vm) to represent observable data and – n hidden units H = (H1, ..., Hn) to capture their dependencies • Assume the random variables (V , H) take values (v, h) ∈ {0, 1}m+n i This equation shows why a (marginalized) RBM can be regarded as a product of expert 58], in which a number of “experts” for the individual components of the observations a multiplicatively. Any distribution on {0, 1}m can be modeled arbitrarily well by an RBM with m visib hidden units, where k denotes the cardinality of the support set of the target distribution number of input elements from {0, 1}m that have a non-zero probability of being observe been shown recently that even fewer units can be sufficient, depending on the patterns in set [41]. 4.1 RBMs and neural networks The RBM can be interpreted as a stochastic neural network, where the nodes and edges c neurons and synaptic connections, respectively. The conditional probability of a single v one can be interpreted as the firing rate of a (stochastic) neuron with sigmoid activa sig(x) = 1/(1 + e−x ), because p(Hi = 1 | v) = sig m j=1 wijvj + ci and p(Vj = 1 | h) = sig n i=1 wijhi + bj . To see this, let v−l denote the state of all visible units except the lth one and let us defi αl(h) = − n i=1 wilhi − bl 58], in which a number of “experts” for the individual components of the observations multiplicatively. Any distribution on {0, 1}m can be modeled arbitrarily well by an RBM with m visi hidden units, where k denotes the cardinality of the support set of the target distribution number of input elements from {0, 1}m that have a non-zero probability of being observe been shown recently that even fewer units can be sufficient, depending on the patterns in set [41]. 4.1 RBMs and neural networks The RBM can be interpreted as a stochastic neural network, where the nodes and edges c neurons and synaptic connections, respectively. The conditional probability of a single v one can be interpreted as the firing rate of a (stochastic) neuron with sigmoid activa sig(x) = 1/(1 + e−x ), because p(Hi = 1 | v) = sig m j=1 wijvj + ci and p(Vj = 1 | h) = sig n i=1 wijhi + bj . 
To see this, let v−l denote the state of all visible units except the lth one and let us defi αl(h) = − n i=1 wilhi − bl es is an MRF associated with a bipartite undirected graph as V = (V1, . . . , Vm) representing the observable data, and n e dependencies between the observed variables. In binary variables (V , H) take values (v, h) ∈ {0, 1}m+n and the el is given by the Gibbs distribution p(v, h) = 1 Z e−E(v,h) 1 wijhivj − m j=1 bjvj − n i=1 cihi . (18) is a real valued weight associated with the edge between alued bias terms associated with the jth visible and the Z = v,h e E(v,h) The probability that the network assigns to a visible vector, v, is given by hidden vectors: p(v) = 1 Z X h e E(v,h) The probability that the network assigns to a training image can be rais and biases to lower the energy of that image and to raise the energy of oth that have low energies and therefore make a big contribution to the partitio of the log probability of a training vector with respect to a weight is surp @ log p(v) @wij = hvihjidata hvihjimodel where the angle brackets are used to denote expectations under the dis subscript that follows. This leads to a very simple learning rule for perf ascent in the log probability of the training data: wij = ✏(hvihjidata hvihjimodel) where ✏ is a learning rate. p(v, h) = 1 Z e E(v,h) where the “partition function”, Z, is given by summing over all possible pair vectors: Z = X v,h e E(v,h) The probability that the network assigns to a visible vector, v, is given by su hidden vectors: p(v) = 1 Z X h e E(v,h) The probability that the network assigns to a training image can be raised b and biases to lower the energy of that image and to raise the energy of other i that have low energies and therefore make a big contribution to the partition f of the log probability of a training vector with respect to a weight is surprisin @ log p(v) @wij = hvihjidata hvihjimodel where the angle brackets are used to denote expectations under the distrib subscript that follows. This leads to a very simple learning rule for perform Joint distribution: , where the energy function sig(): sigmoid We learn the parameters wij by maximising a log-likelihood given example v: No connections within a layer (restricted) leads to <*> data : expectations under the data empirical distribution <*> model : expectations under the model Hinton, Geoffrey. "A practical guide to training restricted Boltzmann machines." Momentum 9.1 (2010): 926.
  • 52. Restricted Boltzmann Machines (RBM) = Z e−E(v,h) with the energy function h) = − n i=1 m j=1 wijhivj − m j=1 bjvj − n i=1 cihi . (20) } and j ∈ {1, ..., m}, wij is a real valued weight associated een units Vj and Hi and bj and ci are real valued bias terms jth visible and the ith hidden variable, respectively. ected graph of an RBM with n hidden and m visible variables RBM has only connections between the layer of hidden and 4 Restricted Boltzmann Machines A RBM (also denoted as Harmonium [34]) is an MRF associated tite undirected graph as shown in Fig. 1. It consists of m visib (V1, ..., Vm) to represent observable data and n hidden units H to capture dependencies between observed variables. In binary RB in this tutorial, the random variables (V , H) take values (v, h) and the joint probability distribution under the model is given distribution p(v, h) = 1 Z e−E(v,h) with the energy function E(v, h) = − n i=1 m j=1 wijhivj − m j=1 bjvj − n i=1 cihi . For all i ∈ {1, ..., n} and j ∈ {1, ..., m}, wij is a real valued wei with the edge between units Vj and Hi and bj and ci are real valu associated with the jth visible and the ith hidden variable, respe http://image.diku.dk/igel/paper/AItRBM-proof.pdf • A RBM is a bipartite (two disjointedset)undirected graph and used as probabilistic generative models – contains m visible units V = (V1, ..., Vm) to represent observable data and – n hidden units H = (H1, ..., Hn) to capture their dependencies • Assume the random variables (V , H) take values (v, h) ∈ {0, 1}m+n i This equation shows why a (marginalized) RBM can be regarded as a product of expert 58], in which a number of “experts” for the individual components of the observations a multiplicatively. Any distribution on {0, 1}m can be modeled arbitrarily well by an RBM with m visib hidden units, where k denotes the cardinality of the support set of the target distribution number of input elements from {0, 1}m that have a non-zero probability of being observe been shown recently that even fewer units can be sufficient, depending on the patterns in set [41]. 4.1 RBMs and neural networks The RBM can be interpreted as a stochastic neural network, where the nodes and edges c neurons and synaptic connections, respectively. The conditional probability of a single v one can be interpreted as the firing rate of a (stochastic) neuron with sigmoid activa sig(x) = 1/(1 + e−x ), because p(Hi = 1 | v) = sig m j=1 wijvj + ci and p(Vj = 1 | h) = sig n i=1 wijhi + bj . To see this, let v−l denote the state of all visible units except the lth one and let us defi αl(h) = − n i=1 wilhi − bl 58], in which a number of “experts” for the individual components of the observations multiplicatively. Any distribution on {0, 1}m can be modeled arbitrarily well by an RBM with m visi hidden units, where k denotes the cardinality of the support set of the target distribution number of input elements from {0, 1}m that have a non-zero probability of being observe been shown recently that even fewer units can be sufficient, depending on the patterns in set [41]. 4.1 RBMs and neural networks The RBM can be interpreted as a stochastic neural network, where the nodes and edges c neurons and synaptic connections, respectively. The conditional probability of a single v one can be interpreted as the firing rate of a (stochastic) neuron with sigmoid activa sig(x) = 1/(1 + e−x ), because p(Hi = 1 | v) = sig m j=1 wijvj + ci and p(Vj = 1 | h) = sig n i=1 wijhi + bj . 
• With no connections within a layer (restricted), maximising the log-likelihood of an example v gives the gradient
∂ log p(v) / ∂w_ij = ⟨h_i v_j⟩_data - ⟨h_i v_j⟩_model and the update Δw_ij = ε (⟨h_i v_j⟩_data - ⟨h_i v_j⟩_model)
• ⟨·⟩_data : expectations under the empirical data distribution; ⟨·⟩_model : expectations under the model
• The model expectation is intractable to compute exactly; it can be approximated by Contrastive Divergence
Hinton, Geoffrey. "A practical guide to training restricted Boltzmann machines." Momentum 9.1 (2010): 926.
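As an illustration of this update, here is a minimal NumPy sketch of CD-1 training for a binary RBM. The toy data, the unit counts (m = 6 visible, n = 3 hidden), the learning rate and the epoch count are illustrative assumptions, not values from the slides; the positive and negative statistics approximate ⟨h_i v_j⟩_data and ⟨h_i v_j⟩_model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy binary data: m = 6 visible units, n = 3 hidden units (illustrative sizes).
V_data = rng.integers(0, 2, size=(100, 6)).astype(float)
m, n = 6, 3
W = 0.01 * rng.standard_normal((n, m))   # weights w_ij (hidden i, visible j)
b = np.zeros(m)                          # visible biases
c = np.zeros(n)                          # hidden biases
lr = 0.1

for epoch in range(20):
    for v0 in V_data:
        # Positive phase: p(h = 1 | v) and a binary sample h0
        ph0 = sigmoid(W @ v0 + c)
        h0 = (rng.random(n) < ph0).astype(float)
        # Negative phase (one Gibbs step): reconstruct v, then recompute p(h = 1 | v)
        pv1 = sigmoid(W.T @ h0 + b)
        v1 = (rng.random(m) < pv1).astype(float)
        ph1 = sigmoid(W @ v1 + c)
        # CD-1 update: <h v>_data approximated by ph0*v0, <h v>_model by ph1*v1
        W += lr * (np.outer(ph0, v0) - np.outer(ph1, v1))
        b += lr * (v0 - v1)
        c += lr * (ph0 - ph1)
```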
  • 53. Stacked RBM • The learned feature activations of one RBM are used as the ‘‘data’’ for training the next RBM in the stack. • After the pretraining, the RBMs are ‘‘unrolled’’ to create a deep autoencoder, which is then fine-tuned using backpropagation of error derivatives. Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 313.5786 (2006): 504-507.
  • 54. Stacked RBM
• The learned feature activations of one RBM are used as the ‘‘data’’ for training the next RBM in the stack.
• After the pretraining, the RBMs are ‘‘unrolled’’ to create a deep autoencoder, which is then fine-tuned using backpropagation of error derivatives.
• Unsupervised learning results (Fig. 4 of the paper): the fraction of retrieved documents in the same class as the query, when a query document from the test set is used to retrieve other test-set documents, averaged over all 402,207 possible queries. (B) The codes produced by two-dimensional LSA (Latent Semantic Analysis). (C) The codes produced by a 2000-500-250-125-2 stacked-RBM autoencoder.
Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 313.5786 (2006): 504-507.
  • 55. Auto-encoder
• Multi-layer neural network whose target output is its input:
– reconstruction x_o = decoder(encoder(input x_i))
– minimise the reconstruction error ||x_i - x_o||^2
• Encoder: h = f(W_e^T x_i + b_e); decoder: x_o = f(W_d^T h + b_d); f: sigmoid
• Setting: tied weights, W_e^T = W_d
• The goal is to learn W_e, b_e, b_d such that the average reconstruction error is minimised
• If f is chosen to be linear, the auto-encoder is equivalent to PCA; if non-linear, it captures multi-modal aspects of the input distribution
http://deeplearning.net/tutorial/dA.html
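A minimal sketch of a tied-weight auto-encoder trained with stochastic gradient descent on the squared reconstruction error, following the equations above. The input dimension, hidden size, learning rate and random data are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes: 20-dimensional inputs, 5 hidden units.
X = rng.random((200, 20))
W = 0.1 * rng.standard_normal((20, 5))   # encoder weights; decoder uses W^T (tied)
be, bd = np.zeros(5), np.zeros(20)
lr = 0.5

for epoch in range(100):
    for xi in X:
        h = sigmoid(xi @ W + be)          # encoder: h = f(W_e^T x_i + b_e)
        xo = sigmoid(h @ W.T + bd)        # decoder: x_o = f(W_d^T h + b_d), W_d = W_e^T
        # Gradients of 0.5 * ||x_i - x_o||^2 through the two sigmoids
        d_xo = (xo - xi) * xo * (1 - xo)
        d_h = (d_xo @ W) * h * (1 - h)
        W -= lr * (np.outer(xi, d_h) + np.outer(d_xo, h))  # tied weights: both paths contribute
        bd -= lr * d_xo
        be -= lr * d_h
```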
  • 56. Denoising auto-encoder
• A stochastic version of the auto-encoder.
• We train the autoencoder to reconstruct the input from a corrupted version of it, which
– forces the hidden layer to discover more robust features and
– prevents it from simply learning the identity
• The stochastic corruption process randomly sets some of the inputs (as many as half of them) to zero
• The goal is to predict the missing values from the uncorrupted (i.e., non-missing) values
http://deeplearning.net/tutorial/dA.html
  • 57. Stacked Denoising Auto-encoder
• Stacking auto-encoders (from Bengio, ICML 2009); the parameter matrix W' is the transpose of W
• Unsupervised greedy layer-wise training procedure (see the sketch after this slide):
(1) pre-train the first layer
(2) pre-train the second layer on the first layer's outputs
(3) fine-tune the whole network using the labels
Bengio Y, Lamblin P, Popovici D, et al. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems, 2007, 19: 153.
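A minimal sketch of greedy layer-wise pretraining with tied-weight denoising auto-encoders: each layer is trained to reconstruct the clean input from a corrupted version, and its hidden activations become the "data" for the next layer. The layer sizes, corruption level, learning rate and random data are illustrative assumptions; supervised fine-tuning is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_denoising_layer(X, n_hidden, corruption=0.5, lr=0.1, epochs=20):
    """Train one tied-weight denoising autoencoder layer; return (W, b_enc)."""
    n_in = X.shape[1]
    W = 0.1 * rng.standard_normal((n_in, n_hidden))
    be, bd = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        for x in X:
            x_tilde = x * (rng.random(n_in) >= corruption)   # randomly zero out inputs
            h = sigmoid(x_tilde @ W + be)
            xo = sigmoid(h @ W.T + bd)
            d_xo = (xo - x) * xo * (1 - xo)                   # reconstruct the *clean* x
            d_h = (d_xo @ W) * h * (1 - h)
            W -= lr * (np.outer(x_tilde, d_h) + np.outer(d_xo, h))
            bd -= lr * d_xo
            be -= lr * d_h
    return W, be

# Greedy layer-wise pretraining: outputs of one layer feed the next (illustrative sizes).
X = rng.random((200, 30))
layers, inp = [], X
for n_hidden in (20, 10):
    W, be = pretrain_denoising_layer(inp, n_hidden)
    layers.append((W, be))
    inp = sigmoid(inp @ W + be)
# After pretraining, the whole stack would be fine-tuned with labels (not shown here).
```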
  • 58. Convolutional neural networks: Receptive field
• Receptive field: neurons in the retina respond to light stimulus in restricted regions of the visual field
• Animal experiments on the receptive fields of two retinal ganglion cells (from Hubel and Wiesel's studies of the visual system of the cat):
– fields are circular areas of the retina
– one cell ("on"-center field) responds when the center is illuminated and the surround is darkened
– the other cell ("off"-center field) responds when the center is darkened and the surround is illuminated
– both cells give on- and off-responses when both center and surround are illuminated, but neither response is as strong as when only the center or the surround is illuminated
Hubel D.H.: The Visual Cortex of the Brain. Sci Amer 209:54-62, 1963
  • 59. Convolutional neural networks
• Sparse connectivity by local correlation
– filter: the inputs of a hidden unit in layer m come from a subset of units in layer m-1 that have spatially connected receptive fields
• Shared weights
– each filter is replicated across the entire visual field; the replicated units share the same weights and form a feature map
• Weight sharing gives a further reduction in parameters on top of the locality structure: some parameters are constrained to be equal, e.g. in the 1-d example of the tutorial w1 = w4 = w7, w2 = w5 = w8 and w3 = w6 = w9 (edges with the same color have the same weight), which yields translational invariance
http://deeplearning.net/tutorial/lenet.html
  • 60. LeNet 5
• Architecture: INPUT 32x32 → C1: feature maps 6@28x28 (convolutions) → S2: f. maps 6@14x14 (subsampling) → C3: f. maps 16@10x10 (convolutions) → S4: f. maps 16@5x5 (subsampling) → C5: layer 120 → F6: layer 84 (full connection) → OUTPUT 10 (Gaussian connections)
• LeNet 5 consists of 7 layers, all of which contain trainable weights
• End-to-end solution: the learned layers replace hand-crafted feature extraction
• Cx: convolution layers with 5x5 filters; Sx: subsampling (pooling) layers; Fx: fully-connected layers
LeCun Y, Boser B, Denker J S, et al. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989, 1(4): 541-551.
  • 61. LeNet 5: Convolution Layer
• Example: convolving a 10x10 input image with a 3x3 filter results in an 8x8 feature map, to which an activation function f is then applied (see the sketch after this slide)
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998
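A minimal sketch of the "valid" convolution in the example above; as in most CNN implementations, the filter is applied without flipping (cross-correlation form). The random image, random filter and the choice of tanh as the activation are illustrative assumptions.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution: an HxW image with a kxk filter gives (H-k+1)x(W-k+1)."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + k, c:c + k] * kernel)
    return out

image = np.random.rand(10, 10)        # 10x10 input image
kernel = np.random.rand(3, 3)         # 3x3 filter (shared weights over the whole image)
feature_map = np.tanh(conv2d_valid(image, kernel))   # activation f applied elementwise
print(feature_map.shape)              # (8, 8)
```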
  • 62. LeNet 5: Convolution Layer
• Example: a 10x10 input image with a 3x3 filter results in an 8x8 output image
• Three different filters (with different weights) produce three 8x8 output feature maps
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998
  • 63. LeNet 5: (Sampling) Pooling Layer
• Pooling partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, outputs the maximum or average value
• Max pooling: the max in a 2x2 filter; average pooling: the average in a 2x2 filter
• Max pooling
– reduces computation and
– is a way of taking the most responsive node of the given interest region,
– but may result in loss of accurate spatial information
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998
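A minimal sketch of non-overlapping 2x2 pooling in both the max and average variants described above; the 8x8 random feature map is an illustrative assumption.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling over size x size regions (max or average)."""
    H, W = feature_map.shape
    out = np.zeros((H // size, W // size))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = feature_map[r * size:(r + 1) * size, c * size:(c + 1) * size]
            out[r, c] = patch.max() if mode == "max" else patch.mean()
    return out

fm = np.random.rand(8, 8)
print(pool2d(fm, 2, "max").shape)   # (4, 4)
print(pool2d(fm, 2, "avg").shape)   # (4, 4)
```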
  • 64. LeNet 5 results
• MNIST (handwritten digits) dataset:
– 60k training and 10k test examples
• Test error rate 0.95%
• (Figures from the paper: training and test error rate vs. training-set size, with and without distortions; and the 82 test errors made by LeNet-5, shown with the correct answer on the left and the machine's answer on the right.)
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998
http://yann.lecun.com/exdb/mnist/
  • 65. From MNIST to ImageNet
• ImageNet
– over 15M labeled high-resolution images
– roughly 22K categories
– collected from the web and labeled by Amazon Mechanical Turk
• The image/scene classification challenge
– image/scene classification
– metric: Hit@5 error rate (make 5 guesses about the image label)
• (Figure from the paper: a random selection of images in the ILSVRC detection validation set; the top rows are taken from the ILSVRC2012 single-object localization validation set, and the bottom rows were collected from Flickr using scene-level queries.)
http://cognitiveseo.com/blog/6511/will-google-read-rank-images-near-future/
Russakovsky O, Deng J, Su H, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 2015, 115(3): 211-252.
  • 66. Leadertable (ImageNet image classification)
• Image classification, error in percent (99.9% confidence intervals are given in the paper; entries marked † used additional training data):
2014 GoogLeNet 6.66 | 2014 VGG 7.32 | 2014 MSRA 8.06 | 2014 AHoward 8.11 | 2014 DeeperVision 9.51 | 2013 Clarifai† 11.20 | 2014 CASIAWS† 11.36 | 2014 Trimps† 11.46 | 2014 Adobe† 11.58 | 2013 Clarifai 11.74 | 2013 NUS 12.95 | 2013 ZF 13.51 | 2013 AHoward 13.55 | 2013 OverFeat 14.18 | 2014 Orange† 14.80 | 2012 SuperVision† 15.32 | 2012 SuperVision 16.42 | 2012 ISI 26.17 | 2012 VGG 26.98 | 2012 XRCE 27.06 | 2012 UvA 29.58
• Single-object localization: 2014 VGG 25.32 | 2014 GoogLeNet 26.44
• 2015: ResNet (ILSVRC'15) 3.57
• GoogLeNet: a 22-layer network; Microsoft ResNet: a 152-layer network; U. of Toronto SuperVision: a 7-layer network
• Unofficial human error is around 5.1% on a subset
• Why is there still a human error? When labeling, human raters judged whether an image belongs to a class (binary classification); the challenge, however, is a 1000-class classification problem.
Russakovsky O, Deng J, Su H, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 2015, 115(3): 211-252.
http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
  • 67. SuperVision
• Rooted in LeNet-5
• Rectified Linear Units (ReLU), overlapping pooling, and the dropout trick
• Randomly extracted 224x224 patches from the 256x256 images to obtain more training data
• (Figure 2 of the paper: an illustration of the architecture of the CNN, showing the delineation of responsibilities between the two GPUs, which communicate only at certain layers; the network's input is 150,528-dimensional, and the number of neurons in the remaining layers is 253,440-186,624-64,896-64,896-43,264-4096-4096-1000.)
Krizhevsky A, Sutskever I, Hinton G E. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 2012: 1097-1105.
  • 68. Dropout
• Dropout randomly 'drops' units from a layer on each training step, creating 'sub-architectures' within the model
• It can be viewed as sampling a smaller network within a larger network
• Prevents neural networks from overfitting
Srivastava, Nitish, et al. "Dropout: A simple way to prevent neural networks from overfitting." The Journal of Machine Learning Research 15.1 (2014): 1929-1958.
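A minimal sketch of dropout applied to a layer of activations, using the common "inverted" variant that rescales at training time so no change is needed at test time (the original paper instead scales the weights at test time). The drop probability, batch shape and random data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.5, training=True):
    """Inverted dropout: randomly zero units and rescale so the expected
    activation is unchanged; at test time the layer is left untouched."""
    if not training:
        return activations
    mask = (rng.random(activations.shape) >= p_drop).astype(activations.dtype)
    return activations * mask / (1.0 - p_drop)

h = rng.random((4, 10))                      # a batch of hidden-layer activations
h_train = dropout(h, p_drop=0.5, training=True)
h_test = dropout(h, training=False)
```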
  • 69. Go deeper with CNN
• GoogLeNet, a 22-layer deep network
• Increasing the size of a NN results in two difficulties:
– 1) overfitting and 2) high computation cost
• Solution: a sparse deep neural network
• The filter banks of each module are concatenated into a single output vector for the next layer
• (Figure 3 of the paper: the GoogLeNet network "with all the bells and whistles", a stack of convolution, max-pooling, local response normalisation, depth-concatenation, average-pooling and softmax layers.)
https://github.com/BVLC/caffe/tree/master/models/bvlc_googlenet
  • 70. ResNet: Deep Residual Learning
• Overfitting prevents a neural network from going deeper
• A thought experiment:
– take a shallower architecture (e.g. logistic regression)
– add more layers onto it to construct its deeper counterpart
– a deeper model should then produce no higher training error than its shallower counterpart, if
• the added layers are identity mappings, and the other layers are copied from the learned shallower model
• However, it is reported that adding more layers to a very deep model leads to higher training error
• (Figure 1 of the paper: training error and test error on CIFAR-10 with 20-layer and 56-layer "plain" networks; the deeper network has higher training error, and thus test error.)
He, Kaiming, et al. "Deep Residual Learning for Image Recognition." arXiv preprint arXiv:1512.03385 (2015).
https://github.com/KaimingHe/deep-residual-networks
  • 71. ResNet: Deep Residual Learning
• Solution: feedforward neural networks with "shortcut connections"
• This leads to an incredible 152-layer deep convolutional network, driving the classification error down to 3.57%!
• (Figure 2 of the paper: residual learning, a building block: two weight layers with ReLUs compute F(x), and the shortcut adds the input x so the block outputs F(x) + x. Figure 3 compares VGG-19, a 34-layer plain network, and a 34-layer residual network; dotted shortcuts increase dimensions.)
He, Kaiming, et al. "Deep Residual Learning for Image Recognition." arXiv preprint arXiv:1512.03385 (2015).
https://github.com/KaimingHe/deep-residual-networks
1. Denote the desired underlying mapping as H(x).
2. Instead of fitting H(x), a stack of nonlinear layers fits another mapping F(x) := H(x) - x.
3. The original mapping is recast into F(x) + x.
4. During training, it is easy to push the residual F(x) to zero (creating an identity mapping) with a stack of nonlinear layers.
(Figure 6 of the paper: training on CIFAR-10 for plain networks and ResNets of increasing depth; dashed lines denote training error, bold lines test error; the error of plain-110 is higher than 60% and not displayed.)
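A minimal sketch of the residual building block y = F(x) + x, using small fully-connected layers as a stand-in for the convolutional layers of the paper. The width, weights and input are illustrative assumptions; the point is only that if F(x) is pushed to zero the block reduces to the identity mapping.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """y = F(x) + x, where F is two weight layers with a ReLU in between.
    If all entries of W1 and W2 were zero, the block would output relu(x),
    i.e. it can easily fall back to an (approximate) identity mapping."""
    f = relu(x @ W1) @ W2          # F(x)
    return relu(f + x)             # add the shortcut, then the second ReLU

d = 16                              # illustrative width; real ResNets use conv layers
x = rng.standard_normal(d)
W1 = 0.1 * rng.standard_normal((d, d))
W2 = 0.1 * rng.standard_normal((d, d))
y = residual_block(x, W1, W2)
```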
  • 72. Summary • Neural networks basic and its backpropagation learning algorithm • Go deeper in layers • Convolutional neural networks (CNN) • Recurrent neural networks (RNN) • Deep reinforcement learning
  • 73. Recurrent neural networks
• Why we need another NN model
– sequential prediction
– time series
• A simple recurrent neural network (RNN)
– training can be done by BackPropagation Through Time (BPTT)
• Two-layer feedforward network: s = f(xU), o = f(sV), where x is the input vector, o the output vector, s the hidden state vector, U the layer-1 parameter matrix, V the layer-2 parameter matrix, and f is tanh or ReLU
• Add time-dependency of the hidden state s via a state-transition parameter matrix W:
s_{t+1} = f(x_{t+1} U + s_t W),  o_{t+1} = f(s_{t+1} V)
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
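A minimal sketch of the forward pass of this simple RNN unrolled over a sequence. The input, hidden and output dimensions, the sequence length and the random weights are illustrative assumptions; BPTT training is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 4-dim inputs, 8 hidden units, 3 outputs.
d_in, d_hid, d_out = 4, 8, 3
U = 0.1 * rng.standard_normal((d_in, d_hid))    # input-to-hidden
W = 0.1 * rng.standard_normal((d_hid, d_hid))   # hidden-to-hidden (state transition)
V = 0.1 * rng.standard_normal((d_hid, d_out))   # hidden-to-output

xs = rng.standard_normal((10, d_in))            # a length-10 input sequence
s = np.zeros(d_hid)                             # initial hidden state
outputs = []
for x_t in xs:
    s = np.tanh(x_t @ U + s @ W)                # s_{t+1} = f(x_{t+1} U + s_t W)
    outputs.append(np.tanh(s @ V))              # o_{t+1} = f(s_{t+1} V)
```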
  • 74. LSTMs (Long Short Term Memory networks)
• A Standard Recurrent Network (SRN) has problems learning long-term dependencies (due to vanishing gradients of saturated nodes)
• SRN cell: s_t = tanh(x_t U + s_{t-1} W)
• An LSTM provides another way to compute the hidden state, via an input gate i, a forget gate f, an output gate o, a "candidate" hidden state g, a cell internal memory c_t and a hidden state s_t; σ is a sigmoid (a control signal between 0 and 1) and ∘ denotes elementwise multiplication. Following the cited tutorial:
i = σ(x_t U^i + s_{t-1} W^i),  f = σ(x_t U^f + s_{t-1} W^f),  o = σ(x_t U^o + s_{t-1} W^o)
g = tanh(x_t U^g + s_{t-1} W^g),  c_t = c_{t-1} ∘ f + g ∘ i,  s_t = tanh(c_t) ∘ o
Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9.8 (1997): 1735-1780.
http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
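A minimal sketch of one LSTM cell step, following the gate equations above (no biases, no training). The dimensions, random weights and sequence are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d_in, d_hid = 4, 8                                   # illustrative sizes

def init_pair():
    return (0.1 * rng.standard_normal((d_in, d_hid)),
            0.1 * rng.standard_normal((d_hid, d_hid)))

Ui, Wi = init_pair()   # input gate
Uf, Wf = init_pair()   # forget gate
Uo, Wo = init_pair()   # output gate
Ug, Wg = init_pair()   # candidate hidden state

def lstm_step(x_t, s_prev, c_prev):
    i = sigmoid(x_t @ Ui + s_prev @ Wi)              # input gate
    f = sigmoid(x_t @ Uf + s_prev @ Wf)              # forget gate
    o = sigmoid(x_t @ Uo + s_prev @ Wo)              # output gate
    g = np.tanh(x_t @ Ug + s_prev @ Wg)              # "candidate" hidden state
    c = c_prev * f + g * i                           # cell internal memory
    s = np.tanh(c) * o                               # hidden state
    return s, c

s, c = np.zeros(d_hid), np.zeros(d_hid)
for x_t in rng.standard_normal((10, d_in)):
    s, c = lstm_step(x_t, s, c)
```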
  • 75. RNN applications: time series
• A ClockWork-RNN (CW-RNN) is similar to a simple RNN with an input, output and hidden layer
• The difference lies in the hidden layer:
– the hidden layer is partitioned into g modules, each with its own clock rate
– within each module the neurons are fully interconnected
– neurons in a faster module i are connected to neurons in a slower module j only if the clock period T_i < T_j
Koutnik, Jan, et al. "A Clockwork RNN." arXiv preprint arXiv:1402.3511 (2014).
  • 76. Word embedding
• From bag-of-words to word embedding
– use a real-valued vector in R^m to represent a word (concept):
v("cat") = (0.2, -0.4, 0.7, ...),  v("mat") = (0.0, 0.6, -0.1, ...)
• Continuous bag-of-words (CBOW) model (word2vec)
– input/output words x/y are one-hot encoded
– the hidden layer is shared for all input words
• Hidden layer (the N-dimensional vector representation of a word):
h = (1/C) W · (x_1 + x_2 + ... + x_C) = (1/C)(v_{w_1} + v_{w_2} + ... + v_{w_C}),
where C is the number of words in the context, w_1, ..., w_C are the words in the context, and v_w is the input vector of word w
• The cross-entropy loss:
E = -log p(w_O | w_{I,1}, ..., w_{I,C}) = -u_{j*} + log Σ_{j'=1..V} exp(u_{j'}) = -v'_{w_O}^T · h + log Σ_{j'=1..V} exp(v'_{w_{j'}}^T · h)
Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
Rong, Xin. "word2vec parameter learning explained." arXiv preprint arXiv:1411.2738 (2014).
• The gradient updates (V: vocabulary size; C: number of input words; v: row vector of the input matrix W; v': row vector of the output matrix W'):
– hidden-to-output weights: v'_{w_j}(new) = v'_{w_j}(old) - η · e_j · h,  for j = 1, 2, ..., V, where η > 0 is the learning rate, e_j = y_j - t_j is the prediction error, and h is the hidden-layer output
– input-to-hidden weights: v_{w_{I,c}}(new) = v_{w_{I,c}}(old) - (1/C) · η · EH,  for c = 1, 2, ..., C, where v_{w_{I,c}} is the input vector of the c-th context word and EH_i = Σ_{j=1..V} e_j w'_{ij}
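A minimal sketch of one CBOW training step implementing the hidden layer, softmax loss and the two gradient updates above. The vocabulary size, embedding dimension, context size, learning rate and the particular word indices are illustrative assumptions; real word2vec additionally uses tricks such as hierarchical softmax or negative sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

V_size, N, C = 50, 10, 4                       # illustrative vocabulary/embedding/context sizes
W_in = 0.1 * rng.standard_normal((V_size, N))  # input vectors v_w (rows)
W_out = 0.1 * rng.standard_normal((N, V_size)) # output vectors v'_w (columns)
eta = 0.05

def cbow_step(context_ids, target_id):
    """One CBOW step: average the context input vectors, softmax over the
    vocabulary, then apply the gradient updates shown above."""
    h = W_in[context_ids].mean(axis=0)                 # h = (1/C)(v_{w1} + ... + v_{wC})
    u = h @ W_out                                      # scores u_j = v'_{wj}^T h
    y = np.exp(u - u.max()); y /= y.sum()              # softmax
    e = y.copy(); e[target_id] -= 1.0                  # e_j = y_j - t_j
    EH = W_out @ e                                     # EH_i = sum_j e_j w'_ij
    W_out -= eta * np.outer(h, e)                      # v'_wj <- v'_wj - eta * e_j * h
    W_in[context_ids] -= (eta / C) * EH                # v_wI,c <- v_wI,c - (1/C) eta EH
    return -np.log(y[target_id])                       # cross-entropy loss E

loss = cbow_step(context_ids=[3, 7, 12, 9], target_id=5)
```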
  • 77. Remarkable properties from word embedding
• Simple algebraic operations with the word vectors:
v("woman") - v("man") ≃ v("aunt") - v("uncle")
v("woman") - v("man") ≃ v("queen") - v("king")
• (Figure 2 of Mikolov et al.: vector offsets for word pairs illustrating the gender relation, and a different projection showing the singular/plural relation; in a high-dimensional space, multiple relations can be embedded for a single word.)
• A word relationship is defined by subtracting two word vectors, and the result is added to another word; for example, Paris - France + Italy = Rome
• Using X = v("biggest") - v("big") + v("small") as a query and searching for the nearest word based on cosine distance results in v("smallest")
Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." HLT-NAACL. 2013.
• Bilingual word embeddings (Zou et al.): English embeddings are used to initialise Chinese embeddings, which are then trained with a translation-equivalence objective so that words across the two languages are positioned by the semantic relationships implied by their embeddings. (Figure 1 of the paper: overlaid bilingual embeddings, with English words plotted alongside Chinese words and reference translations provided in boxes.)
Zou, Will Y., et al. "Bilingual Word Embeddings for Phrase-Based Machine Translation." EMNLP. 2013.
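A minimal sketch of the analogy query by vector offset and cosine distance described above. The embedding table here is random and purely hypothetical; with vectors from a trained word2vec/CBOW model the queries would be expected to return "queen" and "smallest".

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding table: in practice these vectors come from a trained model.
vocab = ["king", "queen", "man", "woman", "big", "biggest", "small", "smallest"]
emb = {w: rng.standard_normal(50) for w in vocab}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c, exclude=()):
    """Return the vocabulary word closest (by cosine similarity) to v(a) - v(b) + v(c)."""
    x = emb[a] - emb[b] + emb[c]
    candidates = [w for w in vocab if w not in exclude]
    return max(candidates, key=lambda w: cosine(x, emb[w]))

print(analogy("king", "man", "woman", exclude=("king", "man", "woman")))
print(analogy("biggest", "big", "small", exclude=("biggest", "big", "small")))
```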
  • 78. Neural Language models
• A statistical language model gives the conditional probability of the next word given all the previous ones: P̂(w_1^T) = Π_{t=1..T} P̂(w_t | w_1^{t-1})
• n-gram model
– constructs conditional probabilities for the next word, given combinations of the last n-1 words (the contexts): P̂(w_t | w_1^{t-1}) ≈ P̂(w_t | w_{t-n+1}^{t-1})
• Neural language model
– associates with each word a distributed word feature vector for word embedding,
– expresses the joint probability function of word sequences using those vectors, and
– learns simultaneously the word feature vectors and the parameters of that probability function
• (Figure 1 of the paper: the neural architecture f(i, w_{t-1}, ..., w_{t-n+1}) = g(i, C(w_{t-1}), ..., C(w_{t-n+1})), where g is the neural network (a tanh hidden layer and a softmax output, with parameters shared across words) and C(i) is the i-th word feature vector stored in a look-up table; the parameters of the mapping C are simply the feature vectors themselves.)
Bengio, Yoshua, et al. "Neural probabilistic language models." Innovations in Machine Learning. Springer Berlin Heidelberg, 2006. 137-186.
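A minimal sketch of the forward pass of a Bengio-style neural language model: look up the feature vectors C(w) of the n-1 context words, concatenate them, apply a tanh hidden layer and a softmax over the vocabulary. The vocabulary size, embedding dimension, context length and hidden size are illustrative assumptions, and the optional direct input-to-output connections of the original model are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

V_size, m, n, d_hid = 50, 16, 3, 32   # illustrative vocab size, embedding dim, n-gram order, hidden size
C_table = 0.1 * rng.standard_normal((V_size, m))          # word feature vectors C(i)
H = 0.1 * rng.standard_normal(((n - 1) * m, d_hid))       # hidden-layer weights
U = 0.1 * rng.standard_normal((d_hid, V_size))            # hidden-to-output weights

def next_word_distribution(context_ids):
    """P(w_t = i | w_{t-n+1}, ..., w_{t-1}): concatenate the context word vectors,
    apply tanh, then a softmax over the vocabulary."""
    x = C_table[context_ids].reshape(-1)                  # concatenated C(w_{t-1}), ..., C(w_{t-n+1})
    a = np.tanh(x @ H)
    scores = a @ U
    p = np.exp(scores - scores.max())
    return p / p.sum()

p = next_word_distribution([7, 21])                       # two preceding words for a trigram model
print(p.shape, p.sum())                                   # (50,) 1.0
```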