3. WHAT IS A GENERATIVE MODEL?
p̂_θ(x) = g_θ(z)
https://blog.openai.com/generative-models/
4. WHY GENERATIVE?
A new way of simulating in applied math/engineering domains
Can be combined with Reinforcement Learning
Good for semi-supervised learning
Can work with multi-modal output
Can generate realistic data
8. Generative model
p̂_θ(x) = g_θ(z)
Let z ∼ N(0, 1)
Let g be a neural network with transposed convolutional layers (so nice!!)
x ∼ X : MNIST dataset
L2 Loss (Mean Squared Error)
(Handwritten note: the feature space is parameterized by θ; assuming p(y∣x) = N(g_θ(x), σ²), maximizing the log-likelihood Σ_i log p(y_i∣x_i) reduces to minimizing the L2 loss.)
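To make this recipe concrete, here is a minimal sketch (assuming PyTorch; the layer sizes and z_dim are illustrative choices, not from the slides) of a generator g_θ built from transposed convolutional layers and trained with the L2 loss on MNIST-shaped data:

```python
# Minimal sketch (PyTorch assumed): a generator g_theta mapping z ~ N(0, I)
# to a 28x28 MNIST-like image, trained with the L2 (MSE) loss from the slide.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            # project z to a 7x7 feature map, then upsample with transposed convs
            nn.Linear(z_dim, 64 * 7 * 7), nn.ReLU(),
            nn.Unflatten(1, (64, 7, 7)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # -> 14x14
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),  # -> 28x28
        )

    def forward(self, z):
        return self.net(z)

g = Generator()
z = torch.randn(8, 16)                    # z ~ N(0, I)
x_hat = g(z)                              # x_hat = g_theta(z), shape (8, 1, 28, 28)
x = torch.rand(8, 1, 28, 28)              # stand-in for an MNIST batch
loss = nn.functional.mse_loss(x_hat, x)   # L2 loss (mean squared error)
loss.backward()
```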
12. Notations
x : Observed data, z : Latent variable
p(x) : Evidence, p(z) : Prior
p(x∣z) : Likelihood, p(z∣x) : Posterior
Probabilistic model defined as the joint distribution of x, z: p(x, z)
(Handwritten note: the evidence follows by marginalizing the joint over z, e.g. p(x) = Σ_i p(x∣z_i)p(z_i) in the discrete case.)
13. Model
p(x, z) = p(x∣z)p(z)
Our interest is the posterior!!
p(z∣x) = p(x∣z)p(z) / p(x) : infer a good value of z given x
p(x) = ∫ p(x, z)dz = ∫ p(x∣z)p(z)dz
p(x) is hard to calculate (INTRACTABLE)
→ Approximate the posterior
(Handwritten note: x is observable, z is the latent variable; the posterior comes from Bayes' rule and the product rule, but computing or sampling p(x) is a hard job.)
14. Variational Inference
Pick a family of distributions over the latent variables with its own variational parameters: q_ϕ(z∣x)
Find ϕ that makes q close to the posterior of interest
(Handwritten note: we never access the true posterior directly; q is parameterized by ϕ, e.g. (µ, σ) for a Gaussian or (x_min, x_max) for a uniform. Variational inference turns a sampling problem into an optimization problem.)
15. KULLBACK-LEIBLER DIVERGENCE
A measure of the non-symmetric difference between two probability distributions P and Q
Defined only if Q(i) = 0 implies P(i) = 0, for all i
KL(P∣∣Q) = ∫ p(x) log (p(x)/q(x)) dx
         = ∫ p(x) log p(x)dx − ∫ p(x) log q(x)dx
(Handwritten note: KL = cross-entropy − entropy, where entropy = uncertainty; it is not symmetric, and it diverges wherever Q(i) → 0 while P(i) > 0.)
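A small numeric sanity check of these formulas (assuming NumPy; the example distributions are made up), showing KL as cross-entropy minus entropy and its asymmetry:

```python
# KL(P||Q) for two discrete distributions, checked against
# cross-entropy minus entropy.
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

kl_pq = np.sum(p * np.log(p / q))                  # KL(P||Q)
cross_entropy = -np.sum(p * np.log(q))             # H(P, Q)
entropy = -np.sum(p * np.log(p))                   # H(P)
assert np.isclose(kl_pq, cross_entropy - entropy)  # KL = H(P,Q) - H(P)
print(kl_pq, np.sum(q * np.log(q / p)))            # note: not symmetric
```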
16. Property
The Kullback-Leibler divergence is always non-negative:
KL(P∣∣Q) ≥ 0
17. Proof
X − 1 ≥ log X ⇒ log(1/X) ≥ 1 − X
Using this with X = q(x)/p(x),
KL(P∣∣Q) = ∫ p(x) log (p(x)/q(x)) dx
         ≥ ∫ p(x) (1 − q(x)/p(x)) dx
         = ∫ {p(x) − q(x)}dx
         = ∫ p(x)dx − ∫ q(x)dx
         = 1 − 1 = 0
19. Maximizing Likelihood is Equivalent to Minimizing KL Divergence
Since KL(P∣∣Q_ϕ) = ∫ p(x) log p(x)dx − ∫ p(x) log q(x; ϕ)dx and the first term does not depend on ϕ,
ϕ* = argmin_ϕ ( − ∫ p(x) log q(x; ϕ)dx )
   = argmax_ϕ ∫ p(x) log q(x; ϕ)dx
   = argmax_ϕ E_{x∼p(x)} [log q(x; ϕ)]
   ≊ argmax_ϕ (1/N) Σ_{i}^{N} log q(x_i; ϕ)
20. JENSEN'S INEQUALITY
For a concave function, f(E[x]) ≥ E[f(x)]
For a convex function, f(E[x]) ≤ E[f(x)]
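A quick numeric illustration of the concave case with f = log (assuming NumPy; the gamma distribution is just an arbitrary positive example):

```python
# Jensen's inequality for the concave function log: f(E[x]) >= E[f(x)].
import numpy as np

x = np.random.gamma(shape=2.0, scale=1.0, size=100_000)  # any positive r.v.
print(np.log(x.mean()))   # f(E[x])
print(np.log(x).mean())   # E[f(x)] -- smaller, since log is concave
```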
21. Evidence Lower Bound (ELBO)
log p(x) = log ∫_z p(x, z)dz
         = log ∫_z p(x, z) (q(z)/q(z)) dz
         = log ∫_z q(z) (p(x, z)/q(z)) dz
         = log E_q [p(x, z)/q(z)]
         ≥ E_q [log p(x, z)] − E_q [log q(z)]    (by Jensen's inequality, since log is concave)
(Handwritten note: q(z) is a well-known probability distribution that we choose ourselves.)
22. Variational Distribution
q_ϕ*(z∣x) = argmin_ϕ KL(q_ϕ(z∣x)∣∣p_θ(z∣x))
Choose a family of variational distributions (q)
Fit the parameter (ϕ) to minimize the distance between the two distributions (KL divergence)
(Handwritten note: this is the reverse KL divergence.)
23. KL Divergence
KL(q_ϕ(z∣x)∣∣p_θ(z∣x)) = E_{q_ϕ} [log (q_ϕ(z∣x)/p_θ(z∣x))]
= E_{q_ϕ} [log q_ϕ(z∣x) − log p_θ(z∣x)]
= E_{q_ϕ} [log q_ϕ(z∣x) − log (p_θ(z∣x) p_θ(x) / p_θ(x))]
= E_{q_ϕ} [log q_ϕ(z∣x) − log p_θ(x, z) + log p_θ(x)]
= E_{q_ϕ} [log q_ϕ(z∣x) − log p_θ(x, z)] + log p_θ(x)
24. Objective
q_ϕ*(z∣x) = argmin_ϕ ( E_{q_ϕ} [log q_ϕ(z∣x) − log p_θ(x, z)] + log p_θ(x) )
The KL divergence is the negative ELBO plus the log marginal probability of x
log p_θ(x) does not depend on q
Minimizing the KL divergence is the same as maximizing the ELBO
q_ϕ*(z∣x) = argmax_ϕ ELBO
25. Variational Lower Bound
For each data point x_i, the marginal likelihood of the individual data point:
log p_θ(x_i) ≥ L(θ, ϕ; x_i)
= E_{q_ϕ(z∣x_i)} [− log q_ϕ(z∣x_i) + log p_θ(x_i, z)]
= E_{q_ϕ(z∣x_i)} [log p_θ(x_i∣z)p_θ(z) − log q_ϕ(z∣x_i)]
= E_{q_ϕ(z∣x_i)} [log p_θ(x_i∣z) − (log q_ϕ(z∣x_i) − log p_θ(z))]
= E_{q_ϕ(z∣x_i)} [log p_θ(x_i∣z)] − E_{q_ϕ(z∣x_i)} [log (q_ϕ(z∣x_i)/p_θ(z))]
= E_{q_ϕ(z∣x_i)} [log p_θ(x_i∣z)] − KL(q_ϕ(z∣x_i)∣∣p_θ(z))
26. ELBO
L(θ, ϕ; x_i) = E_{q_ϕ(z∣x_i)} [log p_θ(x_i∣z)] − KL(q_ϕ(z∣x_i)∣∣p_θ(z))
q_ϕ(z∣x_i) : proposal distribution
p_θ(z) : prior (our belief)
How to choose a good proposal distribution:
Easy to sample
Differentiable (∵ backprop.)
(Handwritten note: the posterior approximation is typically taken to be Gaussian.)
27. Maximizing ELBO - I
L(ϕ; x_i) = E_{q_ϕ(z∣x_i)} [log p(x_i∣z)] − KL(q_ϕ(z∣x_i)∣∣p(z))
ϕ* = argmax_ϕ E_{q_ϕ(z∣x_i)} [log p(x_i∣z)]
E_{q_ϕ(z∣x_i)} [log p(x_i∣z)] : log-likelihood (NOT a loss)
Maximize the likelihood to maximize the ELBO (NOT minimize!!)
28. Log-Likelihood
In case p(x∣z) is a Bernoulli distribution,
E_{q_ϕ(z∣x)} [log p(x∣z)] = (1/n) Σ_{i=1}^{n} [x_i log p(y_i) + (1 − x_i) log(1 − p(y_i))]
To maximize it, minimize the negative log-likelihood!!
Loss = − (1/n) Σ_{i=1}^{n} [x_i log x̂_i + (1 − x_i) log(1 − x̂_i)]
Already known as the sigmoid cross-entropy
x̂_i is the output of the decoder
We call it the reconstruction loss
(Handwritten note: also called binary cross-entropy; when p(x∣z) is a Gaussian distribution, the loss becomes the L2 loss (MSE).)
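As a sketch of the two reconstruction losses just mentioned (assuming PyTorch; x and x̂ are random stand-ins for data and decoder output, not real MNIST):

```python
# Reconstruction term: for Bernoulli p(x|z), maximizing the log-likelihood
# = minimizing binary cross-entropy; for Gaussian p(x|z), it becomes MSE.
import torch
import torch.nn.functional as F

x = torch.rand(8, 784)      # data in [0, 1] (e.g. MNIST pixels)
x_hat = torch.rand(8, 784)  # decoder output after a sigmoid

bernoulli_nll = F.binary_cross_entropy(x_hat, x, reduction='sum')  # "reconstruction loss"
gaussian_nll = F.mse_loss(x_hat, x, reduction='sum')               # Gaussian case ~ L2
```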
29. Maximizing ELBO - II
L(ϕ; x_i) = E_{q_ϕ(z∣x_i)} [log p(x_i∣z)] − KL(q_ϕ(z∣x_i)∣∣p(z))
ϕ* = argmin_ϕ KL(q_ϕ(z∣x_i)∣∣p(z))
Assume that the prior and the posterior approximation are Gaussian (actually it's not a critical issue...)
Then we can compute the KL divergence directly from its definition
Let the prior be N(0, 1)
How about q_ϕ(z∣x_i)?
30. Posterior
The posterior approximation is Gaussian:
q_ϕ(z∣x_i) = N(μ_i, σ_i²)
where (μ_i, σ_i) is the output of the encoder
(Handwritten note: if dim(z) = n, the encoder outputs n pairs of (µ, σ).)
31. Minimizing KL Divergence
KL(q_ϕ(z∣x)∣∣p(z)) = ∫ q_ϕ(z) log q_ϕ(z)dz − ∫ q_ϕ(z) log p(z)dz
∫ q_ϕ(z) log q_ϕ(z∣x)dz = ∫ N(μ_i, σ_i²) log N(μ_i, σ_i²)dz = −(N/2) log 2π − (1/2) Σ_{i}^{N} (1 + log σ_i²)
∫ q_ϕ(z) log p(z)dz = ∫ N(μ_i, σ_i²) log N(0, 1)dz = −(N/2) log 2π − (1/2) Σ_{i}^{N} (μ_i² + σ_i²)
Therefore,
KL(q_ϕ(z∣x)∣∣p(z)) = −(1/2) Σ_{i}^{N} [1 + log σ_i² − μ_i² − σ_i²]
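The closed-form regularizer above in code (assuming PyTorch, and assuming the encoder outputs log σ², a common convention for numerical stability that the slide does not state):

```python
# KL between q_phi(z|x) = N(mu, sigma^2) and the prior N(0, I),
# computed per data point from the encoder outputs.
import torch

mu = torch.randn(8, 16)      # encoder output: means (batch 8, z_dim 16)
logvar = torch.randn(8, 16)  # encoder output: log sigma^2

# KL = -1/2 * sum(1 + log sigma^2 - mu^2 - sigma^2)
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
```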
32. AUTO‐ENCODER
Encoder : MLPs to infer (μ_i, σ_i) for q_ϕ(z∣x_i)
Decoder : MLPs to infer x̂ using latent variables z ∼ N(μ, σ²)
Is it differentiable? (= possible to backprop?)
33. REPARAMETERIZATION TRICK
Tutorial on Variational Autoencoders
(Handwritten note: sampling z directly is not able to backprop; rewriting z = μ + σ·ε with ε sampled from N(0, 1) makes the sampling process independent of the model — ε is not a variable, just a constant.)
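A minimal sketch of the trick (assuming PyTorch; shapes are illustrative): sampling ε outside the model lets gradients flow through μ and σ:

```python
# Reparameterization trick: instead of sampling z ~ N(mu, sigma^2) directly
# (not differentiable w.r.t. mu, sigma), sample eps ~ N(0, I) and set
# z = mu + sigma * eps, so eps is just a constant input.
import torch

mu = torch.randn(8, 16, requires_grad=True)
logvar = torch.randn(8, 16, requires_grad=True)

eps = torch.randn_like(mu)               # sampling moved outside the model
z = mu + torch.exp(0.5 * logvar) * eps   # differentiable w.r.t. mu, logvar
z.sum().backward()                       # backprop works
```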
40. Features
Advantage
Fast and Easy to train
We can check the loss and evaluate
Disadvantage
Low quality
Even when q reaches the optimal point, it can still be quite different from p
Issues
Reconstruction loss (x-entropy, L1, L2, ...)
MLP structure
Regularizer loss (sometimes without the log, sometimes with an exp, ...)
...
45. Value Function
min_G max_D V(D, G) = E_{x∼p_data(x)} [log D(x)] + E_{z∼p_z(z)} [log(1 − D(G(z)))]
For the second term, E_{z∼p_z(z)} [log(1 − D(G(z)))]:
D wants to maximize it → do not get fooled
G wants to minimize it → fool D
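The value function spelled out as code (assuming PyTorch; d_real and d_fake are random stand-ins for D(x) and D(G(z)), not outputs of real networks):

```python
# Both terms of V(D, G) are just log-probability pieces of a binary
# cross-entropy; D ascends V, G descends the second term.
import torch

d_real = torch.rand(8).clamp(1e-6, 1 - 1e-6)  # stand-in for D(x)
d_fake = torch.rand(8).clamp(1e-6, 1 - 1e-6)  # stand-in for D(G(z))

V = torch.log(d_real).mean() + torch.log(1 - d_fake).mean()
# D step: ascend V (push D(x) -> 1, D(G(z)) -> 0)
# G step: descend log(1 - D(G(z))) (push D(G(z)) -> 1, i.e. fool D)
```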
47. Global Optimality of p_g = p_data
For any given generator G, the optimal discriminator is
D_G*(x) = p_data(x) / (p_data(x) + p_g(x))
48. Proof
For G fixed,
V(G, D) = ∫_x p_r(x) log(D(x))dx + ∫_z p_z(z) log(1 − D(G(z)))dz
        = ∫_x [p_r(x) log(D(x)) + p_g(x) log(1 − D(x))]dx
Let X = D(x), a = p_r(x), b = p_g(x). So,
V = a log X + b log(1 − X)
Find the X that maximizes the value function V:
∇_X V
(Handwritten note: if p_r = p_g, then D*(x) = 1/2.)
49. Proof
∇_X V = ∇_X (a log X + b log(1 − X))
      = ∇_X a log X + ∇_X b log(1 − X)
      = a (1/X) + b (−1/(1 − X))
      = (a(1 − X) − bX) / (X(1 − X))
      = (a − aX − bX) / (X(1 − X))
      = (a − (a + b)X) / (X(1 − X))
51. Proof
Find the root of the numerator:
f(X) = a − (a + b)X = 0
(a + b)X = a
X = a / (a + b)
f(X) is monotonically decreasing, so ∇_X V changes sign from + to − at this point.
∴ X = a / (a + b) is the maximum point of V.
52. Theorem
The global minimum of the virtual training criterion L(D, g_θ) is achieved if and only if p_g = p_r.
At that point, L(D, g_θ) achieves the value − log 4.
53. Proof
L(D*, g_θ) = max_D V(G, D)
= E_{x∼p_r} [log D_G*(x)] + E_{z∼p_z} [log(1 − D_G*(G(z)))]
= E_{x∼p_r} [log D_G*(x)] + E_{x∼p_g} [log(1 − D_G*(x))]
= E_{x∼p_r} [log (p_r(x)/(p_r(x) + p_g(x)))] + E_{x∼p_g} [log (p_g(x)/(p_r(x) + p_g(x)))]
= E_{x∼p_r} [log (p_r(x)/(p_r(x) + p_g(x)))] + E_{x∼p_g} [log (p_g(x)/(p_r(x) + p_g(x)))] + log 4 − log 4
= E_{x∼p_r} [log (p_r(x)/(p_r(x) + p_g(x)))] + log 2 + E_{x∼p_g} [log (p_g(x)/(p_r(x) + p_g(x)))] + log 2 − log 4
= E_{x∼p_r} [log (2p_r(x)/(p_r(x) + p_g(x)))] + E_{x∼p_g} [log (2p_g(x)/(p_r(x) + p_g(x)))] − log 4
(Handwritten note: with D fixed at D*, find the optimal G; it is attained when p_real = p_gen.)
54. where JS is the Jensen-Shannon divergence, defined as
JS(P∣∣Q) = (1/2) KL(P∣∣M) + (1/2) KL(Q∣∣M), where M = (P + Q)/2
Continuing from the previous slide,
= E_{x∼p_r} [log (p_r(x) / ((p_r(x) + p_g(x))/2))] + E_{x∼p_g} [log (p_g(x) / ((p_r(x) + p_g(x))/2))] − log 4
= KL[p_r(x) ∣∣ (p_r(x) + p_g(x))/2] + KL[p_g(x) ∣∣ (p_r(x) + p_g(x))/2] − log 4
= − log 4 + 2 JS(p_r(x)∣∣p_g(x))
∵ JS ≥ 0 always, − log 4 is the global minimum
55. Jensen-Shannon Divergence
JS(P∣∣Q) = (1/2) KL(P∣∣M) + (1/2) KL(Q∣∣M), where M = (P + Q)/2
Two flavors of KL divergence:
KL(P∣∣Q) : maximum likelihood. The approximation Q tends to over-generalize P
KL(Q∣∣P) : reverse KL divergence. Tends to favor under-generalization; the optimal Q will typically describe the single largest mode of P well
The JS divergence exhibits behavior that is roughly halfway between the two extremes above
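A small sketch of JS built from this definition (assuming NumPy; the example distributions are made up):

```python
# JS divergence for two discrete distributions; unlike KL it is
# symmetric and always finite.
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])
print(js(p, q), js(q, p))   # symmetric, >= 0
```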
58. Training
Cost function for D:
J^(D) = −(1/2) E_{x∼p_data} [log D(x)] − (1/2) E_z [log(1 − D(G(z)))]
Typical cross-entropy with labels 1, 0 (Bernoulli)
Cost function for G:
J^(G) = −(1/2) E_z [log D(G(z))]
Maximize log D(G(z)) instead of minimizing log(1 − D(G(z))) (which causes vanishing gradients)
Also standard cross-entropy, with label 1
Is this way really that good??
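A tiny numeric demo (assuming PyTorch) of why the modified loss is used: early in training D(G(z)) ≈ 0, where log(1 − D(G(z))) is nearly flat but − log D(G(z)) still gives a strong gradient:

```python
# Compare gradients of the saturating and non-saturating G losses
# at a point where the discriminator easily rejects the fake sample.
import torch

d_fake = torch.tensor([0.01], requires_grad=True)  # D(G(z)) early in training

saturating = torch.log(1 - d_fake)
saturating.backward()
print(d_fake.grad)           # ~ -1.01: tiny gradient signal

d_fake.grad = None
non_saturating = -torch.log(d_fake)
non_saturating.backward()
print(d_fake.grad)           # ~ -100: strong gradient signal
```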
59. Secret of the G Loss
We already know that
E_z [∇_θ log(1 − D*(g_θ(z)))] = ∇_θ 2 JS(P_r∣∣P_g)
Furthermore,
KL(P_g∣∣P_r) = E_{x∼P_g} [log (p_g(x)/p_r(x))]
             = E_{x∼P_g} [log (p_g(x)/(p_r(x) + p_g(x)))] − E_{x∼P_g} [log (p_r(x)/(p_r(x) + p_g(x)))]
             = E_{x∼P_g} [log ((1 − D*(x))/D*(x))]
             = E_z [log ((1 − D*(g_θ(z)))/D*(g_θ(z)))]
(Handwritten note: from Martin Arjovsky.)
60. Taking derivatives in θ at θ_0, we get
∇_θ KL(P_{g_θ}∣∣P_r) = −∇_θ E_z [log (D*(g_θ(z))/(1 − D*(g_θ(z))))]
                     = E_z [−∇_θ log D*(g_θ(z))] + E_z [∇_θ log(1 − D*(g_θ(z)))]
Subtracting this last equation from the result for the JSD,
E_z [−∇_θ log D*(g_θ(z))] = ∇_θ [KL(P_{g_θ}∣∣P_r) − 2 JS(P_{g_θ}∣∣P_r)]
The JS term pushes the distributions to be different, which seems like a fault in the update
The (reverse) KL term assigns an extremely high cost to generating fake-looking samples, and an extremely low cost to mode dropping
78. Normalizing Input
Normalize the images between −1 and 1
Tanh as the last layer of the generator output
A Modified Loss Function
Maximize D(G(z)) instead of minimizing 1 − D(G(z))
Use a spherical Z
Sample from a Gaussian distribution rather than a uniform one
79. XX Norm
One label per mini-batch
Batch norm, layer norm, instance norm, or batch renorm ...
Avoid sparse gradients: ReLU, MaxPool
The stability of the GAN game suffers if you have sparse gradients
LeakyReLU = good (in both G and D)
For downsampling, use: average pooling, strided conv
For upsampling, use: ConvTranspose, PixelShuffle
80. Use Soft and Noisy Labels
real : 1 → 0.7 ~ 1.2
fake : 0 → 0.0 ~ 0.3
Flip labels for the discriminator (occasionally), as in the sketch below
ADAM is good
SGD for D, ADAM for G
If you have labels, use them
Go to the Conditional GAN
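A sketch of the label tricks from this slide (assuming PyTorch; the 5% flip rate is an illustrative choice, not from the slide):

```python
# Soft labels: replace hard 1/0 targets with the ranges above;
# noisy labels: occasionally swap real/fake targets for D.
import torch

batch = 64
real_labels = 0.7 + 0.5 * torch.rand(batch)  # real: 0.7 ~ 1.2
fake_labels = 0.3 * torch.rand(batch)        # fake: 0.0 ~ 0.3

flip = torch.rand(batch) < 0.05              # occasionally flip for D
real_labels[flip], fake_labels[flip] = fake_labels[flip], real_labels[flip]
```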
81. Add Noise to Inputs, Decay Over Time
Add some artificial noise to the inputs of D
Add Gaussian noise to every layer of G
Use dropout in G in both the train and test phases
Provide the noise in the form of dropout
Apply it on several layers of G at both training and test time