Invited Talk for the SIGNLL Conference on Computational Natural Language Learning 2017 (CoNLL 2017) Chris Dyer (DeepMind / CMU) 3 Aug 2017. Vancouver, Canada
1. Should Neural Network Architecture Reflect Linguistic Structure?
CoNLL, August 3, 2017
Adhi Kuncoro (Oxford)
Phil Blunsom (DeepMind)
Dani Yogatama (DeepMind)
Chris Dyer (DeepMind/CMU)
Miguel Ballesteros (IBM)
Wang Ling (DeepMind)
Noah A. Smith (UW)
Ed Grefenstette (DeepMind)
4-8. Modeling language: sentences are hierarchical
(1) a. The talk I gave did not appeal to anybody.
    b. *The talk I gave appealed to anybody.   [anybody is a negative polarity item (NPI)]
Generalization hypothesis: not must come before anybody
(2) *The talk I did not give appealed to anybody.
In (2), not does precede anybody, yet the sentence is ill-formed; linear order is therefore the wrong generalization.
Examples adapted from Everaert et al. (TICS 2015)
9-13. Language is hierarchical
Excerpt (adapted from Everaert et al., TICS 2015): "…containing anybody. (This structural configuration is called c(onstituent)-command in the linguistics literature [31].) When the relationship between not and anybody adheres to this structural configuration, the sentence is well-formed. In sentence (3), by contrast, not sequentially precedes anybody, but the triangle dominating not in Figure 1B fails to also dominate the structure containing anybody. Consequently, the sentence is not well-formed."
[Figure 1: two constituency trees, (A) over "The book that I bought did not appeal to anybody" and (B) over "The book that I did not buy appealed to anybody". The slides show parallel trees for "The talk I gave did not appeal to anybody" and "*The talk I did not give appealed to anybody".]
Figure 1. Negative Polarity. (A) Negative polarity licensed: negative element c-commands negative polarity item. (B) Negative polarity not licensed: negative element does not c-command negative polarity item.
Generalization: not must "structurally precede" anybody
• the psychological reality of structural sensitivity is not empirically controversial
• there are many different theories of the details of structure
• a more controversial hypothesis: kids learn language easily because they don't consider many "obvious" structure-insensitive hypotheses
14-15. Recurrent neural networks: good models of language?
• Recurrent neural networks are incredibly powerful models of sequences (e.g., of words)
• In fact, RNNs are Turing complete! (Siegelmann, 1995)
• But do they make good generalizations from finite samples of data? What inductive biases do they have?
• Part 1: Using syntactic annotations to improve inductive bias.
• Part 2: Inferring better compositional structure without explicit annotation.
16-17. Recurrent neural networks: inductive bias
• Understanding the biases of neural networks is tricky
• We have enough trouble understanding the representations they learn in specific cases, much less general cases!
• But there is lots of evidence that RNNs have a bias for sequential recency
• Evidence 1: Gradients become attenuated across time; functional analysis and experiments with synthetic datasets (yes, LSTMs/GRUs help, but they also forget)
• Evidence 2: Training regimes like reversing sequences in seq2seq learning
• Evidence 3: Modeling enhancements to use attention (direct connections back to remote time steps)
• Evidence 4: Linzen et al. (2017) findings on English subject-verb agreement
18. The keys to the cabinet in the closet is/are on the table.
[AGREE: the verb must agree in number with the head noun "keys", across the intervening prepositional phrases]
Linzen, Dupoux, Goldberg (2017)
19. Recurrent neural networks: inductive bias (slide 17 repeated, with the punchline)
Chomsky (crudely paraphrasing 60 years of work): Sequential recency is not the right bias for modeling human language.
20-22. An alternative: Recurrent Neural Net Grammars (Adhi Kuncoro)
• Generate symbols sequentially using an RNN
• Add some control symbols to rewrite the history occasionally
• Occasionally compress a sequence into a constituent
• The RNN predicts the next terminal/control symbol based on the history of compressed elements and non-compressed terminals
• This is a top-down, left-to-right generation of a tree+sequence
The derivation below ("The hungry cat meows.") walks through the three action types: NT(X), GEN(w), and REDUCE.
30-42. Example derivation: "The hungry cat meows."
stack | action | probability
(empty) | NT(S) | p(nt(S) | top)
(S | NT(NP) | p(nt(NP) | (S)
(S (NP | GEN(The) | p(gen(The) | (S, (NP)
(S (NP The | GEN(hungry) | p(gen(hungry) | (S, (NP, The)
(S (NP The hungry | GEN(cat) | p(gen(cat) | …)
(S (NP The hungry cat | REDUCE | p(reduce | …)
(S (NP The hungry cat) | NT(VP) | p(nt(VP) | (S, (NP The hungry cat))
(S (NP The hungry cat) (VP | GEN(meows) | p(gen(meows) | …)
(S (NP The hungry cat) (VP meows | REDUCE | p(reduce | …)
(S (NP The hungry cat) (VP meows) | GEN(.) | p(gen(.) | …)
(S (NP The hungry cat) (VP meows) . | REDUCE | p(reduce | …)
(S (NP The hungry cat) (VP meows) .) | (derivation complete)
On each REDUCE, the just-completed span (e.g., "The hungry cat") is compressed into a single composite symbol.
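To make the action semantics concrete, here is a minimal, runnable Python sketch that replays the derivation above. It is illustrative only: in the actual model an RNN over the stack assigns a probability to each action, which is replaced here by a fixed action sequence, and all function names are ours, not the authors'.

def is_open_nt(x):
    # an open nonterminal like "(NP"; completed constituents end with ")"
    return x.startswith("(") and not x.endswith(")")

def run_actions(actions):
    stack = []       # open nonterminals, terminals, and completed constituents
    terminals = []   # the surface string generated so far
    for a in actions:
        if a.startswith("NT("):        # NT(X): push an open nonterminal
            stack.append("(" + a[3:-1])
        elif a.startswith("GEN("):     # GEN(w): generate a terminal symbol
            word = a[4:-1]
            stack.append(word)
            terminals.append(word)
        elif a == "REDUCE":            # pop back to the open nonterminal and
            children = []              # compress the span into one composite symbol
            while not is_open_nt(stack[-1]):
                children.append(stack.pop())
            stack.append(stack.pop() + " " + " ".join(reversed(children)) + ")")
    return stack, terminals

# The derivation from the trace above: "The hungry cat meows."
actions = ["NT(S)", "NT(NP)", "GEN(The)", "GEN(hungry)", "GEN(cat)", "REDUCE",
           "NT(VP)", "GEN(meows)", "REDUCE", "GEN(.)", "REDUCE"]
stack, words = run_actions(actions)
print(stack[0])   # (S (NP The hungry cat) (VP meows) .)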
43. Some things you can show
• Valid (tree, string) pairs are in bijection with valid sequences of actions (specifically, the depth-first, left-to-right traversal of the trees)  [prop 1]
• Every stack configuration perfectly encodes the complete history of actions  [prop 2]
• Therefore, the probability decomposition is justified by the chain rule:
  p(x, y) = p(\text{actions}(x, y))                       (prop 1)
          = \prod_i p(a_i \mid a_{<i})                    (chain rule)
          = \prod_i p(a_i \mid \text{stack}(a_{<i}))      (prop 2)
59-60. Syntactic composition: recursion
Need a representation for: (NP The hungry cat)
[Figure: the symbols (NP, The, hungry, cat, and the closing bracket are composed into a single vector v.]
Composition applies recursively, e.g. (NP The (ADJP very hungry) cat).
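The slide leaves the composition function abstract. As a sketch, here is one concrete choice (a bidirectional LSTM over the label embedding and child vectors, as in the RNNG paper); the class, names, and dimensions are illustrative assumptions, not the authors' code.

import torch
import torch.nn as nn

class Composer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.bilstm = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, label_vec, child_vecs):
        # sequence: [label, child_1, ..., child_n], shape (1, n+1, dim)
        seq = torch.stack([label_vec] + child_vecs).unsqueeze(0)
        out, _ = self.bilstm(seq)
        half = out.size(2) // 2
        fwd, bwd = out[0, -1, :half], out[0, 0, half:]   # final state of each direction
        return torch.tanh(self.proj(torch.cat([fwd, bwd])))

dim = 8
composer = Composer(dim)
label = torch.randn(dim)                              # embedding of NP
children = [torch.randn(dim) for _ in "The hungry cat".split()]
v = composer(label, children)                         # one vector for (NP The hungry cat)
print(v.shape)                                        # torch.Size([8])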
61-64. Modeling the next action: p(a_i | (S (NP The hungry cat) (VP meows)
1. Unbounded depth → recurrent neural nets
2. Arbitrarily complex trees → recursive neural nets
3. Limited updates to state → stack RNNs
After a REDUCE, the stack shrinks and the model must next compute p(a_{i+1} | (S (NP The hungry cat) (VP meows)); a stack RNN lets it revert to an earlier state rather than re-encode the whole prefix, as in the sketch after this list.
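A minimal sketch of the stack-RNN idea (illustrative, not the authors' implementation, and assuming an LSTM cell): every computed state is kept, and the stack is a list of pointers into that history, so a pop is a constant-time pointer move back to an earlier state.

import torch
import torch.nn as nn

class StackRNN:
    def __init__(self, dim):
        self.cell = nn.LSTMCell(dim, dim)
        empty = (torch.zeros(1, dim), torch.zeros(1, dim))
        self.history = [empty]   # all states ever computed
        self.ptr = [0]           # stack of indices into history

    def push(self, x):
        h, c = self.history[self.ptr[-1]]
        self.history.append(self.cell(x.unsqueeze(0), (h, c)))
        self.ptr.append(len(self.history) - 1)

    def pop(self):
        self.ptr.pop()           # O(1): revert to the state below the top

    def top(self):
        return self.history[self.ptr[-1]][0]

s = StackRNN(8)
s.push(torch.randn(8))
s.push(torch.randn(8))
s.pop()                          # summary now reflects only the first push
print(s.top().shape)             # torch.Size([1, 8])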
65-69. RNNGs: inductive bias?
• If we accept the following two propositions:
  • Sequential RNNs have a recency bias
  • Syntactic composition learns to represent trees by endocentric heads
• then we can say that RNNGs have a bias for syntactic recency rather than sequential recency
(S (NP The keys (PP to the cabinet (PP in the closet))) (VP is/are
  [the composed NP is summarized by its head: "keys"]
Sequential RNN: The keys to the cabinet in the closet is/are
70. RNNGs: a few empirical results
• Generative models are great! We have defined a joint distribution p(x, y) over strings (x) and parse trees (y)
• We can look at two questions:
  • What is p(x) for a given x? [language modeling]
  • What is max_y p(y | x) for a given x? [parsing]
71-74. English PTB (Parsing)
Model | Type | F1
Petrov and Klein (2007) | Gen | 90.1
Shindo et al. (2012), single model | Gen | 91.1
Vinyals et al. (2015), PTB only | Disc | 90.5
Shindo et al. (2012), ensemble | Gen | 92.4
Vinyals et al. (2015), semisupervised | Disc+SemiSup | 92.8
Discriminative RNNG, PTB only | Disc | 91.7
Generative RNNG, PTB only | Gen | 93.6
Choe and Charniak (2016), semisupervised | Gen+SemiSup | 93.8
Fried et al. (2017) | Gen+Semi+Ensemble | 94.7
76. Summary
• RNNGs are effective for modeling language and parsing
• Finding 1: generative parser > discriminative parser (better sample complexity; no label bias in modeling the action sequence)
• Finding 2: RNNGs > RNNs; PTB phrase structure is a useful inductive bias for making good linguistic generalizations!
• Open question: do RNNGs solve the Linzen, Dupoux, and Goldberg problem?
77. Recurrent neural networks: good models of language? (outline slide repeated)
Part 1 used syntactic annotations to improve inductive bias; part 2 turns to inferring better compositional structure without explicit annotation.
78. Unsupervised structure: do we need syntactic annotation? (Dani Yogatama)
• We have compared two endpoints: a sequence model and a PTB-based syntactic model
• If we search for structure to obtain the best performance on downstream tasks, what will it find?
• In this part of the talk, I focus on representation learning for sentences.
79-80. Background
• Advances in deep learning have led to three predominant approaches for constructing representations of sentences:
• Convolutional neural networks (Kim, 2014; Kalchbrenner et al., 2014; Ma et al., 2015)
• Recurrent neural networks (Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2014)
• Recursive neural networks (Socher et al., 2011; Socher et al., 2013; Tai et al., 2015; Bowman et al., 2016)
81. Recurrent Encoder
[Figure: an RNN reads the word embeddings (input) of "A boy drags his sleds through the snow" and produces an output representation.]
L = log p(y | x)
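A minimal sketch of such an encoder, assuming a PyTorch LSTM whose final state predicts the label; the class name, dimensions, and vocabulary are illustrative.

import torch
import torch.nn as nn

class RecurrentEncoder(nn.Module):
    def __init__(self, vocab, dim, n_labels):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, n_labels)

    def forward(self, word_ids):                 # (batch, seq_len)
        _, (h, _) = self.lstm(self.emb(word_ids))
        return self.out(h[-1])                   # logits over labels y

enc = RecurrentEncoder(vocab=100, dim=16, n_labels=2)
x = torch.randint(0, 100, (1, 8))                # word ids for one sentence
loss = -nn.functional.log_softmax(enc(x), -1)[0, 1]   # -log p(y=1 | x)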
83. Recursive Encoder
• Prior work on tree-structured neural networks assumed that the trees are either provided as input or predicted based on explicit human annotations (Socher et al., 2013; Bowman et al., 2016; Dyer et al., 2016)
• When children learn a new language, they are not given parse trees
• Infer tree structure as a latent variable that is marginalized during learning of a downstream task that depends on the composed representation:
  \mathcal{L} = \log p(y \mid x) = \log \sum_{z \in Z(x)} p(y, z \mid x)
• Hierarchical structures can also be encoded into a word embedding model to improve performance and interpretability (Yogatama et al., 2015)
84-90. Model: shift-reduce parsing (Aho and Ullman, 1972)
Sentence: "A boy drags his sleds". The stack starts empty and the queue holds the sentence; each step either SHIFTs the next word onto the stack or REDUCEs the top two stack elements into one tree node:
stack | queue | action
(empty) | A boy drags his sleds | SHIFT
A | boy drags his sleds | SHIFT
A boy | drags his sleds | REDUCE
91. On REDUCE, compose the top two elements of the stack with a Tree-LSTM (Tai et al., 2015; Zhu et al., 2015):
  i = \sigma(W_I [h_i, h_j] + b_I)
  o = \sigma(W_O [h_i, h_j] + b_O)
  f_L = \sigma(W_{F_L} [h_i, h_j] + b_{F_L})
  f_R = \sigma(W_{F_R} [h_i, h_j] + b_{F_R})
  g = \tanh(W_G [h_i, h_j] + b_G)
  c = f_L \odot c_i + f_R \odot c_j + i \odot g
  h = o \odot c
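The same equations transcribed into a PyTorch module (a sketch, not the authors' code); note the slide writes h = o ⊙ c, whereas Tai et al. (2015) use h = o ⊙ tanh(c).

import torch
import torch.nn as nn

class TreeLSTMCell(nn.Module):
    """Binary Tree-LSTM composing two children (h_i, c_i) and (h_j, c_j)."""
    def __init__(self, dim):
        super().__init__()
        # one linear map per gate, each over the concatenation [h_i, h_j]
        self.W = nn.ModuleDict(
            {g: nn.Linear(2 * dim, dim) for g in ("i", "o", "fl", "fr", "g")})

    def forward(self, hi, ci, hj, cj):
        x = torch.cat([hi, hj], dim=-1)
        i = torch.sigmoid(self.W["i"](x))    # input gate
        o = torch.sigmoid(self.W["o"](x))    # output gate
        fl = torch.sigmoid(self.W["fl"](x))  # left-child forget gate
        fr = torch.sigmoid(self.W["fr"](x))  # right-child forget gate
        g = torch.tanh(self.W["g"](x))       # candidate cell
        c = fl * ci + fr * cj + i * g
        h = o * c                            # as on the slide; Tai et al. use tanh(c)
        return h, c

cell = TreeLSTMCell(8)
h, c = cell(*[torch.randn(1, 8) for _ in range(4)])  # e.g., compose "A" and "boy"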
92-95. The REDUCE result, Tree-LSTM(a, boy), is pushed back onto the stack; then:
stack | queue | action
Tree-LSTM(a, boy) | drags his sleds | SHIFT
Tree-LSTM(a, boy) drags | his sleds | …
Continuing until the queue is empty yields the final sentence representation:
Tree-LSTM(Tree-LSTM(a, boy), Tree-LSTM(drags, Tree-LSTM(his, sleds)))
96. Model: shift-reduce parsing (Aho and Ullman, 1972)
Different shift/reduce sequences lead to different tree structures:
S, S, R, S, S, R, R | S, S, S, R, R, S, R | S, S, R, S, R, S, R | S, S, S, S, R, R, R
[Figure: four distinct binary trees over the same four leaves, one per action sequence.]
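A tiny runnable check of this point: replaying each of the four sequences above over the same four leaves yields a different bracketing (illustrative code; the function name and leaves are ours).

def tree_from_actions(actions, leaves):
    stack, queue = [], list(leaves)
    for a in actions:
        if a == "S":                         # SHIFT: move the next word onto the stack
            stack.append(queue.pop(0))
        else:                                # "R" (REDUCE): merge the top two elements
            right, left = stack.pop(), stack.pop()
            stack.append((left, right))
    return stack[0]

for seq in ("SSRSSRR", "SSSRRSR", "SSRSRSR", "SSSSRRR"):
    print(seq, tree_from_actions(seq, ["a", "boy", "drags", "sleds"]))
# SSRSSRR (('a', 'boy'), ('drags', 'sleds'))
# SSSRRSR (('a', ('boy', 'drags')), 'sleds')
# SSRSRSR ((('a', 'boy'), 'drags'), 'sleds')
# SSSSRRR ('a', ('boy', ('drags', 'sleds')))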
97. Learning
• How do we learn the policy for the shift-reduce sequence?
• Supervised learning! Imitation learning! But we need a treebank.
• What if we don't have labels?
98-99. Reinforcement Learning
• Given a state and a set of possible actions, an agent needs to decide what is the best possible action to take
• There is no supervision, only rewards, which may not be observed until after several actions are taken
• A shift-reduce "agent":
  • State: embeddings of the top two elements of the stack, embedding of the head of the queue
  • Action: shift, reduce
  • Reward: log likelihood on a downstream task given the produced representation
100. Reinforcement Learning
• We use a simple policy gradient method, REINFORCE (Williams, 1992)
• The goal of training is to learn a policy network \pi(a \mid s; W_R) that maximizes rewards:
  R(W) = \mathbb{E}_{\pi(a, s; W_R)}\left[\sum_{t=1}^{T} r_t a_t\right]
• Other techniques for approximating the marginal likelihood are available (Guu et al., 2017)
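A minimal REINFORCE sketch for this setup (illustrative: the state featurization, sizes, and reward are stand-ins, not the authors' implementation). The policy maps a state vector to probabilities over SHIFT/REDUCE, and each sampled action is credited with the total future reward.

import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(3 * 8, 2), nn.Softmax(dim=-1))  # state -> p(shift), p(reduce)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_episode(states, rewards):
    # states: per-step feature vectors (top two stack entries + queue head);
    # rewards: per-step rewards, e.g. zero until the downstream log likelihood
    # arrives at the end of the episode.
    log_probs = []
    for s in states:
        p = policy(s)
        a = torch.multinomial(p, 1)                 # sample SHIFT (0) or REDUCE (1)
        log_probs.append(torch.log(p[a]))
    # return-to-go: credit each action with the total future reward
    returns = torch.tensor(rewards).flip(0).cumsum(0).flip(0)
    loss = -(torch.cat(log_probs) * returns).sum()
    opt.zero_grad(); loss.backward(); opt.step()

# one toy episode: three steps, reward only at the end
reinforce_episode([torch.randn(24) for _ in range(3)], [0.0, 0.0, -1.2])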
101-102. Stanford Sentiment Treebank (Socher et al., 2013)
Method | Accuracy
Naive Bayes (from Socher et al., 2013) | 81.8
SVM (from Socher et al., 2013) | 79.4
Average of word embeddings (from Socher et al., 2013) | 80.1
Bayesian optimization (Yogatama et al., 2015) | 82.4
Weighted average of word embeddings (Arora et al., 2017) | 82.4
Left-to-right LSTM | 84.7
Right-to-left LSTM | 83.9
Bidirectional LSTM | 84.7
Supervised syntax | 85.3
Semi-supervised syntax | 86.1
Latent syntax | 86.5
105. Grammar Induction: summary
• The induced trees look "non-linguistic", but downstream performance is great!
• They ignore the generation of the sentence in favor of optimal structures for interpretation
• Natural languages must balance both.
106. Linguistic Structure: summary
• Do we need better bias in our models? Yes! They are making the wrong generalizations, even from large data.
• Do we have to have the perfect model? No! Small steps in the right direction can pay big dividends.