Deep Learning seminar presentation, Max Planck Institute for Informatics.
Based on "Memory Networks" (Weston et al., ICLR 2015), "Neural Turing Machines" (Graves et al., 2014), and "End-To-End Memory Networks" (Sukhbaatar et al., 2015).
Memory Networks, Neural Turing Machines, and Question Answering
Akram El-Korashy, Max Planck Institute for Informatics
November 30, 2015
Outline
1 Introduction
Intuition and resemblance to human cognition
What does it look like?
2 QA Experiments, End-to-End
Architecture - MemN2N
Training
Baselines and Results
3 QA Experiments, Strongly Supervised
Architecture - MemNN
Training
Results
4 NTM code induction experiments
Intuition and resemblance to human cognition
Why memory?
Human working memory is a capacity for the short-term storage of information and its rule-based manipulation...
Therefore, an NTM¹ resembles a working memory system, as it is designed to solve tasks that require the application of approximate rules to “rapidly-created variables”.
¹ Neural Turing Machine. I will use it interchangeably with Memory Networks, depending on which paper I am citing.
Why memory? Why not RNNs and LSTMs?
The memory in these models is the state of the network, which is latent (i.e., hidden; no explicit access) and inherently unstable over long timescales. [Sukhbaatar2015]
Unlike a standard network, NTM interacts with a memory matrix
using selective read and write operations that can focus on
(almost) a single memory location. [Graves2014]
Why memory networks? How about attention models with RNN
encoders/decoders?
The memory model is indeed analogous to the attention
mechanisms introduced for machine translation.
Main differences
In a memory network model, the query can be made over
multiple sentences, unlike machine translation.
The memory model makes several hops on the memory
before making an output.
The network architecture of the memory scoring is a
simple linear layer, as opposed to a sophisticated gated
architecture in previous work.
Why memory? What’s the main usage?
Memory as non-compact storage
Memory slots m_i are explicitly updated at test time by making use of a “generalization” component that determines “what” is to be stored from the input x, and “where” to store it (choosing among the memory slots).
Storing stories for Question Answering
Given a story (i.e., a sequence of sentences), training the output component of the memory network can learn scoring functions (i.e., similarity measures) between a query sentence and the memory slots holding previous sentences.
What does it look like?
Overview of a memory model
A memory model that is trained only end-to-end.
Figure: A single layer, and a three-layer memory model
[Sukhbaatar2015]
The trained model takes a set of inputs x1, ..., xn to be stored in the memory and a query q, and outputs an answer a.
Each of the x_i, q, a contains symbols coming from a dictionary with V words.
All x are written to memory up to a fixed buffer size; continuous representations are then computed for the x and q.
The continuous representation is then processed via
multiple hops to output a.
This allows back-propagation of the error signal through
multiple memory accesses back to input during training.
A, B, C are embedding matrices (of size d × V) used to convert the inputs to d-dimensional vectors m_i.
A match is computed between the query embedding u = Bq and each memory m_i by taking the inner product followed by a softmax: p_i = Softmax(uᵀ m_i).
The response vector o from the memory is the weighted sum o = Σ_i p_i c_i, where c_i is the output embedding (via C) of x_i.
The final prediction (answer to the query) is computed with the help of a weight matrix as â = Softmax(W(o + u)).
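To make the single-layer flow concrete, here is a minimal numpy sketch of one memory hop, assuming bag-of-words vectors for the sentences and the query; the shapes, random toy parameters, and helper names are illustrative, not the paper's trained setup:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memn2n_single_hop(x_bow, q_bow, A, B, C, W):
    """One memory hop of MemN2N (sketch).
    x_bow: (n, V) bag-of-words vectors of the n story sentences;
    q_bow: (V,) bag-of-words vector of the query;
    A, B, C: (d, V) embedding matrices; W: (V, d) output matrix."""
    m = x_bow @ A.T                  # (n, d) input memory vectors m_i
    c = x_bow @ C.T                  # (n, d) output memory vectors c_i
    u = B @ q_bow                    # (d,)  query embedding u = B q
    p = softmax(m @ u)               # (n,)  match p_i = Softmax(u^T m_i)
    o = p @ c                        # (d,)  response o = sum_i p_i c_i
    return softmax(W @ (o + u))      # (V,)  a_hat = Softmax(W (o + u))

# toy usage with random, untrained parameters
rng = np.random.default_rng(0)
d, V, n = 20, 177, 5
A, B, C = (rng.normal(0, 0.1, (d, V)) for _ in range(3))
W = rng.normal(0, 0.1, (V, d))
x_bow = rng.integers(0, 2, (n, V)).astype(float)
q_bow = rng.integers(0, 2, V).astype(float)
print(memn2n_single_hop(x_bow, q_bow, A, B, C, W).shape)   # (177,)
```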
QA Experiments, End-to-End
Synthetic QA tasks, supporting subset
There are a total of 20 different types of tasks that test
different forms of reasoning and deduction.
Figure: A given QA task consists of a set of statements, followed by a
question whose answer is typically a single word. [Sukhbaatar2015]
Note that for each question, only some subset of the
statements contain information needed for the answer, and
the others are essentially irrelevant distractors (e.g., the
first sentence in the first example).
In the Memory Networks of Weston et al., this supporting
subset was explicitly indicated to the model during training.
In what is called end-to-end training of memory networks,
this information is no longer provided.
20 QA tasks. A task is a set of example problems. A
problem is a set of I sentences xi where I ≤ 320, a
question q and an answer a.
The vocabulary is small, of size V = 177. Two versions of the data are used: one with 1,000 training problems per task, and one with 10,000 per task.
Figure: A given QA task consists of a set of statements, followed by a
question whose answer is typically a single word. [Sukhbaatar2015]
Architecture - MemN2N
Model Architecture
K = 3 hops were used.
Adjacent weight sharing was used to ease training and reduce
the number of parameters.
Adjacent weight tying
1 The output embedding of one layer is the input embedding of the layer above: A^(k+1) = C^k.
2 The answer prediction matrix equals the final output embedding: W^T = C^K.
3 The question embedding equals the input embedding of the first layer: B = A^1.
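A sketch of how the K = 3 hops could be stacked under this adjacent tying scheme; the list of K+1 embedding matrices and the update u^(k+1) = u^k + o^k follow [Sukhbaatar2015], while the function and variable names are placeholders:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memn2n_k_hops(x_bow, q_bow, embeddings, K=3):
    """K-hop MemN2N forward pass with adjacent weight tying (sketch).
    embeddings: list of K+1 matrices [A^1, A^2, ..., A^(K+1)], each of shape (d, V).
    Adjacent tying: C^k = A^(k+1), B = A^1, and W^T = C^K = A^(K+1)."""
    u = embeddings[0] @ q_bow                 # B = A^1, so u = B q
    for k in range(K):
        m = x_bow @ embeddings[k].T           # memory vectors via A^k
        c = x_bow @ embeddings[k + 1].T       # output vectors via C^k = A^(k+1)
        p = softmax(m @ u)                    # attention over memories
        o = p @ c                             # response of hop k
        u = u + o                             # u^(k+1) = u^k + o^k
    W = embeddings[K].T                       # W^T = C^K, so W has shape (V, d)
    return softmax(W @ u)                     # answer distribution over the vocabulary
```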
Sentence Representation, Temporal Encoding
Two different sentence representations are used: bag-of-words (BoW) and Position Encoding (PE).
BoW embeds each word and sums the resulting vectors: m_i = Σ_j A x_ij.
PE additionally encodes the position of each word with a column vector l_j, where l_kj = (1 − j/J) − (k/d)(1 − 2j/J) and J is the number of words in the sentence; the embedded words are weighted element-wise by l_j before summing.
Temporal Encoding: modify the memory vector with a special matrix that encodes temporal information:² m_i = Σ_j A x_ij + T_A(i), where T_A(i) is the i-th row of a special temporal matrix T_A.
All the T matrices are learned during training; they are subject to the same sharing constraints as those between A and C.
² There isn't enough detail on what constraints this matrix should be subject to, if any.
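A small sketch of the position-encoding weights and the temporal term; the element-wise weighting of the embedded words by l_j follows the paper, while T_A here is just a random placeholder row:

```python
import numpy as np

def position_encoding(J, d):
    """l_kj = (1 - j/J) - (k/d) * (1 - 2j/J), for words j = 1..J and dimensions k = 1..d."""
    j = np.arange(1, J + 1)[None, :]                 # (1, J) word positions
    k = np.arange(1, d + 1)[:, None]                 # (d, 1) embedding dimensions
    return (1 - j / J) - (k / d) * (1 - 2 * j / J)   # (d, J)

def sentence_memory(word_embs, l, T_A_row):
    """m_i = sum_j l_j * (A x_ij) + T_A(i): PE-weighted word sum plus a temporal row.
    word_embs: (d, J) embedded words A x_ij; l: (d, J); T_A_row: (d,)."""
    return (l * word_embs).sum(axis=1) + T_A_row

rng = np.random.default_rng(1)
d, J = 20, 6
m_i = sentence_memory(rng.normal(size=(d, J)), position_encoding(J, d), rng.normal(size=d))
print(m_i.shape)   # (20,)
```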
Training
Loss function and learning parameters
The embedding matrices A, B and C, as well as W, are jointly learned.
The loss function is a standard cross entropy between â and the true label a.
Stochastic gradient descent is used with a learning rate of η = 0.01, with annealing.
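A minimal sketch of this training setup (cross entropy on â, annealed SGD); the halve-every-25-epochs schedule is an assumption made for illustration, and grad() stands in for backpropagation through the memory hops:

```python
import numpy as np

def cross_entropy(a_hat, a_index):
    """Standard cross entropy between the predicted distribution a_hat and the true label a."""
    return -np.log(a_hat[a_index] + 1e-12)

def annealed_learning_rates(eta0=0.01, halve_every=25, n_epochs=100):
    """Assumed schedule: halve the learning rate every `halve_every` epochs."""
    return [eta0 * 0.5 ** (epoch // halve_every) for epoch in range(n_epochs)]

# outer loop sketch; grad() stands in for backpropagation through the memory hops
# for epoch, eta in enumerate(annealed_learning_rates()):
#     for x_bow, q_bow, a_index in training_problems:
#         a_hat = memn2n_single_hop(x_bow, q_bow, A, B, C, W)
#         loss = cross_entropy(a_hat, a_index)
#         for theta, g in zip((A, B, C, W), grad(loss)):
#             theta -= eta * g     # plain SGD step
```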
Parameters and Techniques
RN: learning time invariance by injecting random noise to regularize T_A.
LS: linear start: remove all softmax layers except for the answer prediction layer, and reinsert them once the validation loss stops decreasing (LS learning rate of η = 0.005 instead of 0.01 for normal training).
LW: layer-wise, RNN-like weight tying (otherwise, adjacent weight tying).
BoW or PE: sentence representation.
joint: training on all 20 tasks jointly vs. independently.
[Sukhbaatar2015]
Baselines and Results
Figure: All variants of the end-to-end trained memory model
comfortably beat the weakly supervised baseline methods.
[Sukhbaatar2015]
Take-home message: more memory hops give improved performance.
Take-home message: joint training on the various tasks sometimes helps.
Set of Supporting Facts
Figure: Instances of successful prediction of the supporting
sentences.
QA Experiments, Strongly Supervised
Architecture - MemNN
IGOR
The memory network consists of a memory m and four learned components:
1 I: (input feature map) - converts the incoming input to the
internal feature representation.
2 G: (generalization) - updates old memories given the new
input.
3 O: (output feature map) - produces a new output, given the
new input and the current memory state.
4 R: (response) - converts the output into the desired response format.
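The four components can be pictured with the following skeleton; it is a structural sketch only, and embed, score_O, score_R, and VOCAB are placeholder stand-ins rather than the paper's actual feature maps and scoring functions:

```python
VOCAB = ["kitchen", "garden", "bathroom", "milk"]     # placeholder dictionary

def embed(x):                  # placeholder input feature map (identity, for illustration)
    return x

def score_O(query_feats, m_i):     # placeholder match score s_O: count shared words
    return sum(1 for w in query_feats.split() if w in m_i)

def score_R(context, word):        # placeholder response score s_R
    return sum(1 for part in context if word in part)

class MemNN:
    """I-G-O-R skeleton of a Memory Network (structural sketch)."""
    def __init__(self):
        self.memory = []           # the memory m, a growing list of slots m_i

    def I(self, x):
        """Input feature map: convert the incoming input to internal features."""
        return embed(x)

    def G(self, feats):
        """Generalization: update memories; the simplest form writes to the next free slot."""
        self.memory.append(feats)

    def O(self, feats):
        """Output feature map: find the best supporting memory for the (query) features."""
        return max(self.memory, key=lambda m_i: score_O(feats, m_i))

    def R(self, feats, support):
        """Response: convert the output into the desired format (here, a single word)."""
        return max(VOCAB, key=lambda w: score_R([feats, support], w))
```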
Model Flow
The core of inference lies in the O and R modules. The O
module produces output features by finding k supporting
memories given x.
For k = 1, the highest scoring supporting memory is retrieved: o1 = O1(x, m) = argmax_{i=1,...,N} s_O(x, m_i).
For k = 2, a second supporting memory is additionally computed: o2 = O2(x, m) = argmax_{i=1,...,N} s_O([x, m_o1], m_i).
In the single-word response setting, where W is the set of all words in the dictionary, r = argmax_{w ∈ W} s_R([x, m_o1, m_o2], w).
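In numpy, the k = 2 inference chain might look as follows, assuming feature vectors Φ(·) and the bilinear scoring form s(x, y) = Φ(x)ᵀ Uᵀ U Φ(y) from the paper; combining [x, m_o1] by summing feature vectors is a simplification made for this sketch:

```python
import numpy as np

def score(U, phi_a, phi_b):
    """Bilinear match score s(a, b) = phi(a)^T U^T U phi(b)."""
    return (U @ phi_a) @ (U @ phi_b)

def infer_k2(q_phi, mem_phi, vocab_phi, U_O, U_R):
    """k = 2 inference: pick two supporting memories, then a single-word response.
    q_phi: (D,) query features; mem_phi: (N, D) memory features;
    vocab_phi: (V, D) features of the candidate answer words."""
    # o1 = argmax_i s_O(x, m_i)
    o1 = int(np.argmax([score(U_O, q_phi, m) for m in mem_phi]))
    # o2 = argmax_i s_O([x, m_o1], m_i); [x, m_o1] is combined by summing features here
    ctx = q_phi + mem_phi[o1]
    o2 = int(np.argmax([score(U_O, ctx, m) for m in mem_phi]))
    # r = argmax_w s_R([x, m_o1, m_o2], w)
    ctx = ctx + mem_phi[o2]
    r = int(np.argmax([score(U_R, ctx, w) for w in vocab_phi]))
    return o1, o2, r
```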
Training
Max-margin, SGD
Supporting sentence annotations are available as part of the training data. Thus, the scoring functions are trained by minimizing a margin ranking loss over the model parameters U_O and U_R using SGD.
Figure: For a given question x with true response r and supporting sentences m_o1, m_o2 (i.e., k = 2), a margin ranking loss is minimized over the parameters U_O and U_R, where f̄, f̄′ and r̄ range over all choices other than the correct labels, and γ is the margin.
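A sketch of the margin ranking objective for one training example, reusing the bilinear score from the inference sketch above; enumerating all incorrect candidates and the default value of γ are simplifications (the paper samples the incorrect candidates during SGD):

```python
import numpy as np

def score(U, phi_a, phi_b):
    """Bilinear match score, as in the inference sketch above."""
    return (U @ phi_a) @ (U @ phi_b)

def hinge(pos, neg, gamma):
    """max(0, gamma - s(correct) + s(incorrect))"""
    return max(0.0, gamma - pos + neg)

def margin_loss(q_phi, mem_phi, vocab_phi, o1, o2, r, U_O, U_R, gamma=0.1):
    """Margin ranking loss for one example with true supports o1, o2 and true answer r."""
    ctx1 = q_phi
    ctx2 = q_phi + mem_phi[o1]
    ctx3 = ctx2 + mem_phi[o2]
    pos1 = score(U_O, ctx1, mem_phi[o1])
    pos2 = score(U_O, ctx2, mem_phi[o2])
    pos3 = score(U_R, ctx3, vocab_phi[r])
    loss = 0.0
    for i, m in enumerate(mem_phi):              # wrong first / second supporting facts
        if i != o1:
            loss += hinge(pos1, score(U_O, ctx1, m), gamma)
        if i != o2:
            loss += hinge(pos2, score(U_O, ctx2, m), gamma)
    for w, w_phi in enumerate(vocab_phi):        # wrong single-word responses
        if w != r:
            loss += hinge(pos3, score(U_R, ctx3, w_phi), gamma)
    return loss
```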
Results
large-scale QA
Figure: Results on a QA dataset with 14M statements.
Hashing techniques for efficient memory scoring
Idea: hash the input I(x) into buckets, and score only the memories m_i lying in the same buckets.
word hash: one bucket per dictionary word, containing all sentences that contain this word.
cluster hash: run K-means to cluster the word vectors (U_O)_i, giving K buckets; hash a sentence into all buckets that its words belong to.
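Both hashing schemes can be sketched as follows; word_vectors stands in for the embedding rows (U_O)_i, and the plain Lloyd-iteration K-means is only illustrative:

```python
from collections import defaultdict
import numpy as np

def word_hash(sentences):
    """One bucket per dictionary word; a sentence is hashed into the bucket of every word it contains."""
    buckets = defaultdict(set)
    for idx, s in enumerate(sentences):
        for w in s.lower().split():
            buckets[w].add(idx)
    return buckets

def cluster_hash(sentences, word_vectors, vocab, K=3, iters=10, seed=0):
    """K-means over the word vectors gives K buckets; a sentence is hashed into every
    bucket that one of its words falls in."""
    rng = np.random.default_rng(seed)
    centers = word_vectors[rng.choice(len(word_vectors), K, replace=False)].astype(float)
    for _ in range(iters):                       # plain Lloyd iterations
        assign = np.argmin(((word_vectors[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if (assign == k).any():
                centers[k] = word_vectors[assign == k].mean(axis=0)
    word_to_bucket = {w: int(assign[i]) for i, w in enumerate(vocab)}
    buckets = defaultdict(set)
    for idx, s in enumerate(sentences):
        for w in s.lower().split():
            if w in word_to_bucket:
                buckets[word_to_bucket[w]].add(idx)
    return buckets
```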
simulation QA
Figure: The task is a simple simulation of 4 characters, 3 objects and
5 rooms - with characters moving around, picking up and dropping
objects. (Similar to the 10k dataset of MemN2N)
simulation QA - sample test results
Figure: Sample test set predictions (in red) for the simulation in the
setting of word-based input and where answers are sentences and an
LSTM is used as the R component of the MemNN.
NTM code induction experiments
Architecture
More sophisticated memory “controller”.
Figure: Content-addressing is implemented by learning similarity measures, analogous to MemNN. Additionally, the controller can simulate location-based addressing by applying a rotational shift to a weighting.
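A simplified sketch of the two addressing modes: content addressing as a softmax over (cosine) key similarities, and location-based addressing as a rotational shift of a weighting; the interpolation and sharpening steps of the full NTM addressing pipeline are omitted here:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def content_addressing(M, key, beta=5.0):
    """w_i proportional to exp(beta * cosine(key, M_i)), over the N rows of memory M (N, d)."""
    sims = M @ key / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + 1e-8)
    return softmax(beta * sims)

def rotational_shift(w, shift):
    """Location-based addressing: circularly shift the weighting by a distribution over offsets,
    e.g. shift = {-1: 0.1, 0: 0.8, +1: 0.1}."""
    out = np.zeros_like(w)
    for offset, p in shift.items():
        out += p * np.roll(w, offset)
    return out

M = np.eye(4)                                    # toy memory with 4 slots
w = content_addressing(M, np.array([0.0, 1.0, 0.0, 0.0]))
print(np.round(rotational_shift(w, {-1: 0.1, 0: 0.8, 1: 0.1}), 2))
```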
NTM learns a Copy task
Figure: The networks were trained to copy sequences of eight-bit random vectors, where the sequence lengths were randomized between 1 and 20. An NTM with an LSTM controller was used.
... on which LSTM fails
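A data-generation sketch for this copy task (sequences of eight-bit random vectors, lengths randomized between 1 and 20); the extra delimiter channel is a convention borrowed from common NTM reimplementations and is an assumption here:

```python
import numpy as np

def copy_task_example(rng, max_len=20, bits=8):
    """One training pair for the copy task: the target is the input sequence itself."""
    length = rng.integers(1, max_len + 1)                 # sequence length in 1..20
    seq = rng.integers(0, 2, size=(length, bits)).astype(float)
    # inputs: the sequence, then a delimiter step (extra channel set to 1), then blanks
    inputs = np.zeros((2 * length + 1, bits + 1))
    inputs[:length, :bits] = seq
    inputs[length, bits] = 1.0
    # targets: blanks while the sequence is presented, then the sequence to reproduce
    targets = np.zeros((2 * length + 1, bits))
    targets[length + 1:, :] = seq
    return inputs, targets

rng = np.random.default_rng(0)
x, y = copy_task_example(rng)
print(x.shape, y.shape)     # (2*L + 1, 9) and (2*L + 1, 8)
```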
Summary
Intuition of memory networks vs standard neural network
models.
MemNN is successful through strongly supervised learning on QA tasks.
MemN2N works with more realistic end-to-end training, and remains competitive on the same tasks.
NTMs can learn simple memory copy and recall tasks from input/output training examples.
Thank you!
References
Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus. End-To-End Memory Networks. 2015.
Jason Weston, Sumit Chopra, Antoine Bordes. Memory Networks. ICLR 2015.
Alex Graves, Greg Wayne, Ivo Danihelka. Neural Turing Machines. 2014.
Nando de Freitas. Deep Learning at Oxford, 2015.