Should Neural Network Architecture Reflect Linguistic Structure?
CoNLL, August 3, 2017

Adhi Kuncoro (Oxford)
Phil Blunsom (DeepMind)
Dani Yogatama (DeepMind)
Chris Dyer (DeepMind/CMU)
Miguel Ballesteros (IBM)
Wang Ling (DeepMind)
Noah A. Smith (UW)
Ed Grefenstette (DeepMind)

Should Neural Network Architecture Reflect Linguistic Structure?
Yes.

But why? And how?
Modeling language: sentences are hierarchical

(1) a.  The talk I gave did not appeal to anybody.
    b. *The talk I gave appealed to anybody.        (anybody is a negative polarity item, NPI)

Generalization hypothesis: not must come before anybody

(2) *The talk I did not give appealed to anybody.

Examples adapted from Everaert et al. (TICS 2015)
Language is hierarchical

Excerpt and figure adapted from Everaert et al. (TICS 2015):

"… containing anybody. (This structural configuration is called c(onstituent)-command in the linguistics literature [31].) When the relationship between not and anybody adheres to this structural configuration, the sentence is well-formed. In sentence (3), by contrast, not sequentially precedes anybody, but the triangle dominating not in Figure 1B fails to also dominate the structure containing anybody. Consequently, the sentence is not well-formed."

Figure 1. Negative Polarity. (A) Negative polarity licensed: the negative element c-commands the negative polarity item ("The book that I bought did not appeal to anybody"). (B) Negative polarity not licensed: the negative element does not c-command the negative polarity item ("The book that I did not buy appealed to anybody"). [Tree diagrams omitted; the slides show the analogous trees for "The talk I gave did not appeal to anybody" and "*The talk I did not give appealed to anybody".]

Generalization: not must "structurally precede" anybody

• The psychological reality of structural sensitivity is not empirically controversial
• There are many different theories of the details of the structure
• A more controversial hypothesis: kids learn language easily because they don't consider many "obvious" structure-insensitive hypotheses
Recurrent neural networks: good models of language?

• Recurrent neural networks are incredibly powerful models of sequences (e.g., of words)
  • In fact, RNNs are Turing complete! (Siegelmann, 1995)
• But do they make good generalizations from finite samples of data?
  • What inductive biases do they have?
• Part 1: Using syntactic annotations to improve inductive bias
• Part 2: Inferring better compositional structure without explicit annotation
Recurrent neural networks: inductive bias

• Understanding the biases of neural networks is tricky
  • We have enough trouble understanding the representations they learn in specific cases, much less general cases!
• But there is lots of evidence that RNNs have a bias for sequential recency
  • Evidence 1: Gradients become attenuated across time
    • Functional analysis; experiments with synthetic datasets (yes, LSTMs/GRUs help, but they also forget)
  • Evidence 2: Training regimes like reversing sequences in seq2seq learning
  • Evidence 3: Modeling enhancements to use attention (direct connections back to remote time steps)
  • Evidence 4: Linzen et al. (2017) findings on English subject-verb agreement:

    The keys to the cabinet in the closet is/are on the table.
    (AGREE: the verb must agree with "keys", not the nearer nouns)
    Linzen, Dupoux, Goldberg (2017)
Chomsky (crudely paraphrasing 60 years of work): sequential recency is not the right bias for modeling human language.
An alternative: Recurrent Neural Net Grammars
Adhi Kuncoro

• Generate symbols sequentially using an RNN
• Add some control symbols to rewrite the history occasionally
  • Occasionally compress a sequence into a constituent
  • The RNN predicts the next terminal/control symbol based on the history of compressed elements and non-compressed terminals
• This is a top-down, left-to-right generation of a tree + sequence
Example derivation: "The hungry cat meows loudly"

stack                                    action        probability
(empty)                                  NT(S)         p(nt(S) | top)
(S                                       NT(NP)        p(nt(NP) | (S)
(S (NP                                   GEN(The)      p(gen(The) | (S, (NP)
(S (NP The                               GEN(hungry)   p(gen(hungry) | (S, (NP, The)
(S (NP The hungry                        GEN(cat)      p(gen(cat) | . . .)
(S (NP The hungry cat                    REDUCE        p(reduce | . . .)
(S (NP The hungry cat)                   NT(VP)        p(nt(VP) | (S, (NP The hungry cat))
(S (NP The hungry cat) (VP               GEN(meows)    . . .
(S (NP The hungry cat) (VP meows         REDUCE        . . .
(S (NP The hungry cat) (VP meows)        GEN(.)        . . .
(S (NP The hungry cat) (VP meows) .      REDUCE        . . .
(S (NP The hungry cat) (VP meows) .)

REDUCE compresses the just-completed constituent (e.g., "The hungry cat") into a single composite symbol on the stack.
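To make the transition system concrete, here is a minimal Python sketch (not part of the original deck) that replays the action sequence above and prints the stack after each step. The uniform 1/50 probability is a placeholder; in the actual model each action is scored by a neural network over the stack contents, as described in the following slides.

```python
# Minimal sketch of the RNNG transition system: replay the action sequence
# above and print the stack after each step. The per-action probability is a
# uniform placeholder (1/50); in the model each action is scored by an RNN
# over the stack contents (see "Modeling the next action" below).
import math

def is_open_nt(sym):
    return sym.startswith("(") and not sym.endswith(")")

def step(stack, action):
    kind, arg = action
    if kind == "NT":                 # open a new constituent
        return stack + [f"({arg}"]
    if kind == "GEN":                # generate a terminal word
        return stack + [arg]
    if kind == "REDUCE":             # pop back to the open nonterminal and compose
        stack = list(stack)
        children = []
        while not is_open_nt(stack[-1]):
            children.append(stack.pop())
        label = stack.pop()
        return stack + [f"{label} {' '.join(reversed(children))})"]
    raise ValueError(kind)

actions = [("NT", "S"), ("NT", "NP"), ("GEN", "The"), ("GEN", "hungry"),
           ("GEN", "cat"), ("REDUCE", None), ("NT", "VP"), ("GEN", "meows"),
           ("REDUCE", None), ("GEN", "."), ("REDUCE", None)]

stack, logp = [], 0.0
for a in actions:
    logp += math.log(1.0 / 50)       # placeholder for p(a_i | stack(a_<i))
    stack = step(stack, a)
    print(f"{a[0]:<7}-> {' '.join(stack)}")
print("log p(x, y) =", round(logp, 2), " (sum of per-action log probabilities)")
```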
Some things you can show

• Valid (tree, string) pairs are in bijection with valid sequences of actions (specifically, the depth-first, left-to-right traversal of the trees). (prop 1)
• Every stack configuration perfectly encodes the complete history of actions. (prop 2)
• Therefore, the probability decomposition is justified by the chain rule, i.e.

  p(x, y) = p(actions(x, y))                         (prop 1)
  p(actions(x, y)) = ∏_i p(a_i | a_<i)               (chain rule)
                   = ∏_i p(a_i | stack(a_<i))        (prop 2)

Modeling the next action

(S (NP The hungry cat) (VP meowsp(ai | )
Modeling the next action

(S (NP The hungry cat) (VP meowsp(ai | )
1. unbounded depth
Modeling the next action

(S (NP The hungry cat) (VP meowsp(ai | )
1. unbounded depth
1. Unbounded depth → recurrent neural nets
h1 h2 h3 h4
Modeling the next action

(S (NP The hungry cat) (VP meowsp(ai | )
1. Unbounded depth → recurrent neural nets
h1 h2 h3 h4
Modeling the next action

(S (NP The hungry cat) (VP meowsp(ai | )
1. Unbounded depth → recurrent neural nets
h1 h2 h3 h4
2. arbitrarily complex trees
2. Arbitrarily complex trees → recursive neural nets
Syntactic composition

Need a representation for: (NP The hungry cat)

• A composition function reads the constituent's children (The, hungry, cat) together with its nonterminal label NP ("what head type?"), in both directions: (NP The hungry cat ) and the reverse,
• and produces a single vector v that represents the completed constituent on the stack.

Recursion: composed constituents can themselves be children of later compositions, e.g. (NP The (ADJP very hungry) cat), where (ADJP very hungry) is already a single composed vector.
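A minimal sketch of such a composition function, assuming toy 8-dimensional embeddings and a plain tanh RNN reader in place of the bidirectional LSTMs used in the actual RNNG; the parameter names and dimensions are illustrative only.

```python
# Minimal sketch of syntactic composition: read the label and children of a
# completed constituent in both directions and squash them into one vector v.
# Assumptions: toy 8-dimensional embeddings, a plain tanh RNN cell instead of
# the bidirectional LSTM used in the actual RNNG.
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
embed = {w: rng.normal(size=DIM) for w in ["NP", "ADJP", "The", "very", "hungry", "cat"]}

# shared toy parameters for the forward and backward readers
W_h, W_x, W_out = (rng.normal(scale=0.1, size=(DIM, DIM)) for _ in range(3))

def rnn_read(vectors):
    """Run a simple tanh RNN over a list of vectors and return the final state."""
    h = np.zeros(DIM)
    for x in vectors:
        h = np.tanh(W_h @ h + W_x @ x)
    return h

def compose(label, child_vectors):
    """Compose a labelled constituent into a single vector v."""
    seq = [embed[label]] + list(child_vectors)
    fwd = rnn_read(seq)            # reads: NP The hungry cat
    bwd = rnn_read(seq[::-1])      # reads: cat hungry The NP
    return np.tanh(W_out @ (fwd + bwd))

# (NP The hungry cat)
v_np = compose("NP", [embed["The"], embed["hungry"], embed["cat"]])

# Recursion: (NP The (ADJP very hungry) cat); the composed ADJP vector is
# just another child of the outer NP.
v_adjp = compose("ADJP", [embed["very"], embed["hungry"]])
v_np2 = compose("NP", [embed["The"], v_adjp, embed["cat"]])
print(v_np.shape, v_np2.shape)     # both (8,)
```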
Modeling the next action (continued)

p(a_i | (S (NP The hungry cat) (VP meows )

1. Unbounded depth → recurrent neural nets
2. Arbitrarily complex trees → recursive neural nets
   (e.g., a_i ~ REDUCE, after which the next prediction is p(a_{i+1} | (S (NP The hungry cat) (VP meows) ))
3. Limited updates to state → stack RNNs
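Challenge 3 asks for an RNN whose state can be partially undone when a constituent is completed. Below is a minimal sketch of a stack RNN under the same toy assumptions (random parameters, simple tanh cell): pushing adds one RNN step on top of the state below it, and popping simply discards states, so the top of the stack always summarizes exactly the current stack contents.

```python
# Minimal sketch of a stack RNN: hidden states are kept on a stack, so pushing
# adds one RNN step and popping exactly undoes it. Toy tanh cell and random
# parameters, standing in for the stack LSTM used in the actual RNNG.
import numpy as np

rng = np.random.default_rng(1)
DIM = 8
W_h = rng.normal(scale=0.1, size=(DIM, DIM))
W_x = rng.normal(scale=0.1, size=(DIM, DIM))

class StackRNN:
    def __init__(self):
        self.states = [np.zeros(DIM)]   # bottom-of-stack "empty" state

    def push(self, x):
        h = np.tanh(W_h @ self.states[-1] + W_x @ x)
        self.states.append(h)

    def pop(self):
        return self.states.pop()

    def summary(self):
        """State used to predict the next action: the top of the stack."""
        return self.states[-1]

s = StackRNN()
for word in ["The", "hungry", "cat"]:
    s.push(rng.normal(size=DIM))        # push toy embeddings for (NP The hungry cat
print(len(s.states) - 1)                # 3 elements on the stack

# REDUCE: pop the children, then push their single composed vector,
# e.g. the output of compose("NP", ...) from the previous sketch.
children = [s.pop() for _ in range(3)]
s.push(np.tanh(sum(children)))          # placeholder for the composed constituent
print(len(s.states) - 1)                # back to 1 element
```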
RNNGs: inductive bias?

• If we accept the following two propositions…
  • Sequential RNNs have a recency bias
  • Syntactic composition learns to represent trees by endocentric heads
• then we can say that RNNGs have a bias for syntactic recency rather than sequential recency.

RNNG: (S (NP The keys (PP to the cabinet (PP in the closet))) (VP is/are
  (the underbraced subject NP is a single composed symbol whose representation is headed by "keys")

Sequential RNN: The keys to the cabinet in the closet is/are
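A small illustrative sketch of the recency contrast for the agreement example above; the bracket-collapsing helper and the reported counts are only meant to show that, after composition, the subject sits a couple of stack symbols from the verb, whereas a sequential RNN has to reach back across every intervening word.

```python
# Illustrative comparison of sequential vs. syntactic distance to the subject
# at the moment the verb is predicted. The bracketed prefix is the slide's
# example; collapse() composes each closed constituent into one stack symbol.
prefix = "(S (NP The keys (PP to the cabinet (PP in the closet))) (VP"

def is_open(sym):
    return sym.startswith("(") and not sym.endswith(")")

def collapse(tokens):
    stack = []
    for tok in tokens:
        n_close = len(tok) - len(tok.rstrip(")"))
        stack.append(tok.rstrip(")"))
        for _ in range(n_close):              # each ')' completes one constituent
            children = []
            while not is_open(stack[-1]):
                children.append(stack.pop())
            label = stack.pop()
            stack.append(f"{label} {' '.join(reversed(children))})")
    return stack

stack = collapse(prefix.split())
words = [t for t in prefix.split() if not t.startswith("(")]
subject = next(s for s in stack if s.startswith("(NP"))

print(stack)   # ['(S', '(NP The keys (PP to the cabinet (PP in the closet)))', '(VP']
print("sequential RNN: tokens from 'keys' back to the verb slot:",
      len(words) - words.index("keys"))
print("RNNG stack: symbols from the composed subject back to the verb slot:",
      len(stack) - stack.index(subject))
```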
RNNGs: a few empirical results

• Generative models are great! We have defined a joint distribution p(x, y) over strings (x) and parse trees (y)
• We can look at two questions:
  • What is p(x) for a given x? [language modeling]
  • What is arg max_y p(y | x) for a given x? [parsing]
English PTB (Parsing)

Model                                         Type                F1
Petrov and Klein (2007)                       Gen                 90.1
Shindo et al. (2012), single model            Gen                 91.1
Vinyals et al. (2015), PTB only               Disc                90.5
Shindo et al. (2012), ensemble                Gen                 92.4
Vinyals et al. (2015), semisupervised         Disc+SemiSup        92.8
Discriminative RNNG, PTB only                 Disc                91.7
Generative RNNG, PTB only                     Gen                 93.6
Choe and Charniak (2016), semisupervised      Gen+SemiSup         93.8
Fried et al. (2017)                           Gen+Semi+Ensemble   94.7
English PTB (LM)

Model                  Perplexity
5-gram IKN             169.3
LSTM + Dropout         113.4
Generative RNNG (IS)   102.4
Summary

• RNNGs are effective for modeling language and parsing
• Finding 1: Generative parser > discriminative parser
  • Better sample complexity
  • No label bias in modeling action sequence
• Finding 2: RNNGs > RNNs
  • PTB phrase structure is a useful inductive bias for making good linguistic generalizations!
• Open question: Do RNNGs solve the Linzen, Dupoux, and Goldberg problem?
Recurrent neural networks: good models of language? (outline, revisited)

• Part 1 (above): Using syntactic annotations to improve inductive bias
• Part 2: Inferring better compositional structure without explicit annotation
Unsupervised structure: do we need syntactic annotation?
Dani Yogatama

• We have compared two end points: a sequence model and a PTB-based syntactic model
• If we search for structure to obtain the best performance on downstream tasks, what will it find?
• In this part of the talk, I focus on representation learning for sentences.
Background

• Advances in deep learning have led to three predominant approaches for constructing representations of sentences:
  • Convolutional neural networks (Kim, 2014; Kalchbrenner et al., 2014; Ma et al., 2015)
  • Recurrent neural networks (Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2014)
  • Recursive neural networks (Socher et al., 2011; Socher et al., 2013; Tai et al., 2015; Bowman et al., 2016)
Recurrent Encoder

[Diagram: the word embeddings of "A boy drags his sleds through the snow" are read left to right by an RNN; the output representation feeds a downstream objective L = log p(y | x).]

Recursive Encoder

[Diagram: the same sentence is composed bottom-up along a tree over the word embeddings.]

• Prior work on tree-structured neural networks assumed that the trees are either provided as input or predicted based on explicit human annotations (Socher et al., 2013; Bowman et al., 2016; Dyer et al., 2016)
• When children learn a new language, they are not given parse trees
• Infer tree structure as a latent variable that is marginalized during learning of a downstream task that depends on the composed representation.
• Hierarchical structures can also be encoded into a word embedding model to improve performance and interpretability (Yogatama et al., 2015)

  L = log p(y | x) = log Σ_{z ∈ Z(x)} p(y, z | x)
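A tiny sketch of this marginal, enumerating every binary tree over a short sentence and summing their joint scores with log-sum-exp; the tree scores here are random placeholders standing in for p(y, z | x) under the learned composition model.

```python
# Toy illustration of L = log p(y|x) = log sum_{z in Z(x)} p(y, z | x):
# enumerate all binary trees over a 4-word sentence and marginalize over them.
# The per-tree joint log probabilities are random placeholders, not a real model.
import math, random

random.seed(0)
words = ["A", "boy", "drags", "sleds"]

def binary_trees(span):
    """All full binary trees (as nested tuples) over a list of leaves."""
    if len(span) == 1:
        return [span[0]]
    trees = []
    for i in range(1, len(span)):
        for left in binary_trees(span[:i]):
            for right in binary_trees(span[i:]):
                trees.append((left, right))
    return trees

Z_x = binary_trees(words)                                 # 5 trees for 4 words
scores = [random.uniform(-3.0, -1.0) for _ in Z_x]        # placeholder log p(y, z | x)

# log-sum-exp over the latent trees
m = max(scores)
L = m + math.log(sum(math.exp(s - m) for s in scores))
for tree, s in zip(Z_x, scores):
    print(f"{s:6.2f}  {tree}")
print("L = log p(y | x) =", round(L, 3))
```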
Model: shift-reduce parsing (Aho and Ullman, 1972)

Sentence: "A boy drags his sleds"

stack                                  queue                    action
(empty)                                A boy drags his sleds    SHIFT
A                                      boy drags his sleds      SHIFT
A  boy                                 drags his sleds          REDUCE
Tree-LSTM(A, boy)                      drags his sleds          SHIFT
Tree-LSTM(A, boy)  drags               his sleds                . . .
. . .
Tree-LSTM(Tree-LSTM(A, boy), Tree-LSTM(drags, Tree-LSTM(his, sleds)))    (final sentence representation)

• SHIFT moves the next word from the queue onto the stack.
• REDUCE composes the top two elements of the stack with a Tree LSTM (Tai et al., 2015; Zhu et al., 2015) and pushes the result back onto the stack:

  i   = σ(W_I [h_i, h_j] + b_I)
  o   = σ(W_O [h_i, h_j] + b_O)
  f_L = σ(W_{F_L} [h_i, h_j] + b_{F_L})
  f_R = σ(W_{F_R} [h_i, h_j] + b_{F_R})
  g   = tanh(W_G [h_i, h_j] + b_G)
  c   = f_L ⊙ c_i + f_R ⊙ c_j + i ⊙ g
  h   = o ⊙ c
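A minimal numpy sketch of the REDUCE composition above, i.e. the binary Tree-LSTM cell given by the equations, with toy dimensions and random weights standing in for trained parameters.

```python
# Binary Tree-LSTM composition used by REDUCE: combine the (h, c) states of the
# top two stack elements into a single parent (h, c). Toy dimensions and random
# weights; in the model these parameters are trained end to end.
import numpy as np

rng = np.random.default_rng(0)
DIM = 8

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# one (DIM x 2*DIM) weight matrix and bias per gate: i, o, f_L, f_R, g
params = {name: (rng.normal(scale=0.1, size=(DIM, 2 * DIM)), np.zeros(DIM))
          for name in ["I", "O", "FL", "FR", "G"]}

def tree_lstm(left, right):
    """Compose two children, each an (h, c) pair, into a parent (h, c)."""
    (h_i, c_i), (h_j, c_j) = left, right
    x = np.concatenate([h_i, h_j])
    W_I, b_I = params["I"];  W_O, b_O = params["O"]
    W_FL, b_FL = params["FL"];  W_FR, b_FR = params["FR"];  W_G, b_G = params["G"]
    i = sigmoid(W_I @ x + b_I)
    o = sigmoid(W_O @ x + b_O)
    f_L = sigmoid(W_FL @ x + b_FL)
    f_R = sigmoid(W_FR @ x + b_FR)
    g = np.tanh(W_G @ x + b_G)
    c = f_L * c_i + f_R * c_j + i * g
    h = o * c
    return h, c

def leaf(word):
    """Toy leaf state for a word (a trained embedding in the real model)."""
    return rng.normal(size=DIM), np.zeros(DIM)

# Tree-LSTM(Tree-LSTM(A, boy), Tree-LSTM(drags, Tree-LSTM(his, sleds)))
a, boy, drags, his, sleds = (leaf(w) for w in ["A", "boy", "drags", "his", "sleds"])
sentence_h, _ = tree_lstm(tree_lstm(a, boy), tree_lstm(drags, tree_lstm(his, sleds)))
print(sentence_h.shape)   # (8,)
```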
Model: shift-reduce parsing (Aho and Ullman, 1972)

Different SHIFT/REDUCE sequences lead to different tree structures, e.g. over four words:

  S, S, R, S, S, R, R      S, S, S, R, R, S, R      S, S, R, S, R, S, R      S, S, S, S, R, R, R

[Four tree diagrams omitted: each action sequence yields a different binary bracketing of the same four leaves.]
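A short sketch (illustrative only) that converts each of the four action sequences above into the nested bracketing it implies over numbered leaves, making the point that the same tokens receive different structures.

```python
# Turn a SHIFT/REDUCE action sequence into the binary tree (nested tuples) it
# builds over numbered leaves, showing that different sequences give different
# structures over the same four tokens.
def build(actions, n_leaves=4):
    stack, queue = [], list(range(1, n_leaves + 1))
    for a in actions.replace(" ", "").split(","):
        if a == "S":                       # SHIFT: move the next leaf onto the stack
            stack.append(queue.pop(0))
        else:                              # REDUCE: combine the top two stack elements
            right, left = stack.pop(), stack.pop()
            stack.append((left, right))
    return stack[0]

for seq in ["S, S, R, S, S, R, R", "S, S, S, R, R, S, R",
            "S, S, R, S, R, S, R", "S, S, S, S, R, R, R"]:
    print(seq, "->", build(seq))
```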
Learning
• How do we learn the policy for the shift reduce sequence?
• Supervised learning! Imitation learning!
• But we need a treebank.
• What if we don’t have labels?
Reinforcement Learning

• Given a state and a set of possible actions, an agent needs to decide what is the best possible action to take
• There is no supervision, only rewards, which may not be observed until after several actions are taken

A Shift-Reduce "agent":
• State: embeddings of the top two elements of the stack and the embedding of the head of the queue
• Action: SHIFT or REDUCE
• Reward: log likelihood on a downstream task given the produced representation

• We use a simple policy gradient method, REINFORCE (Williams, 1992)
• The goal of training is to train a policy network π(a | s; W_R) to maximize rewards:

  R(W) = E_{π(a,s; W_R)} [ Σ_{t=1}^{T} r_t a_t ]

• Other techniques for approximating the marginal likelihood are available (Guu et al., 2017)
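A minimal sketch of a REINFORCE update for such a SHIFT/REDUCE policy; the linear policy network, the state features, and the random scalar reward (standing in for the downstream log likelihood) are toy placeholders, not the talk's actual model.

```python
# Minimal REINFORCE sketch for a SHIFT/REDUCE policy. The state features, the
# linear policy, and the scalar reward (a stand-in for the downstream log
# likelihood) are all toy placeholders.
import numpy as np

rng = np.random.default_rng(0)
DIM, N_WORDS, LR = 8, 5, 0.1
W = rng.normal(scale=0.1, size=(2, 3 * DIM))        # 2 actions: 0 = SHIFT, 1 = REDUCE

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def state_features(stack, queue):
    """Embeddings of the top two stack elements and the head of the queue."""
    pad = np.zeros(DIM)
    a = stack[-1] if len(stack) > 0 else pad
    b = stack[-2] if len(stack) > 1 else pad
    head = queue[0] if queue else pad
    return np.concatenate([a, b, head])

queue = [rng.normal(size=DIM) for _ in range(N_WORDS)]   # toy word embeddings
stack, trajectory = [], []
while len(queue) + len(stack) > 1:
    s = state_features(stack, queue)
    p = softmax(W @ s)
    if not queue:                 # only REDUCE is legal
        a = 1
    elif len(stack) < 2:          # only SHIFT is legal
        a = 0
    else:
        a = int(rng.choice(2, p=p))
    trajectory.append((s, a))
    if a == 0:
        stack.append(queue.pop(0))                       # SHIFT
    else:
        right, left = stack.pop(), stack.pop()
        stack.append(np.tanh(left + right))              # REDUCE (toy composition)

reward = float(rng.normal())      # placeholder for log p(y | final representation)

# REINFORCE update: W += lr * reward * sum_t d/dW log pi(a_t | s_t)
grad = np.zeros_like(W)
for s, a in trajectory:
    p = softmax(W @ s)
    g = -np.outer(p, s)           # gradient of the -logsumexp term
    g[a] += s                     # gradient of the chosen action's score
    grad += g
W += LR * reward * grad
print("actions taken:", len(trajectory), "| final representation:", stack[0].shape)
```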
Stanford Sentiment Treebank (Socher et al., 2013)

Method                                                      Accuracy
Naive Bayes (from Socher et al., 2013)                      81.8
SVM (from Socher et al., 2013)                              79.4
Average of Word Embeddings (from Socher et al., 2013)       80.1
Bayesian Optimization (Yogatama et al., 2015)               82.4
Weighted Average of Word Embeddings (Arora et al., 2017)    82.4
Left-to-Right LSTM                                          84.7
Right-to-Left LSTM                                          83.9
Bidirectional LSTM                                          84.7
Supervised Syntax                                           85.3
Semi-supervised Syntax                                      86.1
Latent Syntax                                               86.5
Examples of Learned Structures

[Example parse trees omitted: some learned structures are intuitive, others are non-intuitive.]

Grammar Induction

• The learned trees look "non-linguistic"
• But downstream performance is great!
• The learned structures ignore the generation of the sentence in favor of optimal structures for interpretation
• Natural languages must balance both.

Summary: Linguistic Structure

• Do we need better bias in our models?
  • Yes! They are making the wrong generalizations, even from large data.
• Do we have to have the perfect model?
  • No! Small steps in the right direction can pay big dividends.
Thanks!
Questions?

 
5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...CaraSkikne1
 
Easter in the USA presentation by Chloe.
Easter in the USA presentation by Chloe.Easter in the USA presentation by Chloe.
Easter in the USA presentation by Chloe.EnglishCEIPdeSigeiro
 
Ultra structure and life cycle of Plasmodium.pptx
Ultra structure and life cycle of Plasmodium.pptxUltra structure and life cycle of Plasmodium.pptx
Ultra structure and life cycle of Plasmodium.pptxDr. Asif Anas
 
How to Add Existing Field in One2Many Tree View in Odoo 17
How to Add Existing Field in One2Many Tree View in Odoo 17How to Add Existing Field in One2Many Tree View in Odoo 17
How to Add Existing Field in One2Many Tree View in Odoo 17Celine George
 

Dernier (20)

How to Add a New Field in Existing Kanban View in Odoo 17
How to Add a New Field in Existing Kanban View in Odoo 17How to Add a New Field in Existing Kanban View in Odoo 17
How to Add a New Field in Existing Kanban View in Odoo 17
 
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx
 
Practical Research 1 Lesson 9 Scope and delimitation.pptx
Practical Research 1 Lesson 9 Scope and delimitation.pptxPractical Research 1 Lesson 9 Scope and delimitation.pptx
Practical Research 1 Lesson 9 Scope and delimitation.pptx
 
Philosophy of Education and Educational Philosophy
Philosophy of Education  and Educational PhilosophyPhilosophy of Education  and Educational Philosophy
Philosophy of Education and Educational Philosophy
 
Diploma in Nursing Admission Test Question Solution 2023.pdf
Diploma in Nursing Admission Test Question Solution 2023.pdfDiploma in Nursing Admission Test Question Solution 2023.pdf
Diploma in Nursing Admission Test Question Solution 2023.pdf
 
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptx
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptxPISA-VET launch_El Iza Mohamedou_19 March 2024.pptx
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptx
 
CapTechU Doctoral Presentation -March 2024 slides.pptx
CapTechU Doctoral Presentation -March 2024 slides.pptxCapTechU Doctoral Presentation -March 2024 slides.pptx
CapTechU Doctoral Presentation -March 2024 slides.pptx
 
How to Make a Field read-only in Odoo 17
How to Make a Field read-only in Odoo 17How to Make a Field read-only in Odoo 17
How to Make a Field read-only in Odoo 17
 
M-2- General Reactions of amino acids.pptx
M-2- General Reactions of amino acids.pptxM-2- General Reactions of amino acids.pptx
M-2- General Reactions of amino acids.pptx
 
General views of Histopathology and step
General views of Histopathology and stepGeneral views of Histopathology and step
General views of Histopathology and step
 
HED Office Sohayok Exam Question Solution 2023.pdf
HED Office Sohayok Exam Question Solution 2023.pdfHED Office Sohayok Exam Question Solution 2023.pdf
HED Office Sohayok Exam Question Solution 2023.pdf
 
In - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptxIn - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptx
 
Human-AI Co-Creation of Worked Examples for Programming Classes
Human-AI Co-Creation of Worked Examples for Programming ClassesHuman-AI Co-Creation of Worked Examples for Programming Classes
Human-AI Co-Creation of Worked Examples for Programming Classes
 
The Singapore Teaching Practice document
The Singapore Teaching Practice documentThe Singapore Teaching Practice document
The Singapore Teaching Practice document
 
Quality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICEQuality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICE
 
Benefits & Challenges of Inclusive Education
Benefits & Challenges of Inclusive EducationBenefits & Challenges of Inclusive Education
Benefits & Challenges of Inclusive Education
 
5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...
 
Easter in the USA presentation by Chloe.
Easter in the USA presentation by Chloe.Easter in the USA presentation by Chloe.
Easter in the USA presentation by Chloe.
 
Ultra structure and life cycle of Plasmodium.pptx
Ultra structure and life cycle of Plasmodium.pptxUltra structure and life cycle of Plasmodium.pptx
Ultra structure and life cycle of Plasmodium.pptx
 
How to Add Existing Field in One2Many Tree View in Odoo 17
How to Add Existing Field in One2Many Tree View in Odoo 17How to Add Existing Field in One2Many Tree View in Odoo 17
How to Add Existing Field in One2Many Tree View in Odoo 17
 

Chris Dyer - 2017 - CoNLL Invited Talk: Should Neural Network Architecture Reflect Linguistic Structure?

  • 1. Should Neural Network Architecture Reflect Linguistic Structure? CoNLLAugust 3, 2017 Adhi Kuncoro
 (Oxford) Phil Blunsom
 (DeepMind) Dani Yogatama
 (DeepMind) Chris Dyer
 (DeepMind/CMU) Miguel Ballesteros
 (IBM) Wang Ling
 (DeepMind) Noah A. Smith
 (UW) Ed Grefenstette
 (DeepMind)
  • 2. Should Neural Network Architecture Reflect Linguistic Structure? Yes. CoNLLAugust 3, 2017
  • 4. (1) a. The talk I gave did not appeal to anybody. Modeling language
 Sentences are hierarchical Examples adapted from Everaert et al. (TICS 2015)
  • 5. (1) a. The talk I gave did not appeal to anybody. b. *The talk I gave appealed to anybody. Modeling language
 Sentences are hierarchical Examples adapted from Everaert et al. (TICS 2015)
  • 6. (1) a. The talk I gave did not appeal to anybody. b. *The talk I gave appealed to anybody. Modeling language
 Sentences are hierarchical NPI Examples adapted from Everaert et al. (TICS 2015)
  • 7. (1) a. The talk I gave did not appeal to anybody. b. *The talk I gave appealed to anybody. Generalization hypothesis: not must come before anybody Modeling language
 Sentences are hierarchical NPI Examples adapted from Everaert et al. (TICS 2015)
  • 8. (1) a. The talk I gave did not appeal to anybody. b. *The talk I gave appealed to anybody. Generalization hypothesis: not must come before anybody (2) *The talk I did not give appealed to anybody. Modeling language
 Sentences are hierarchical NPI Examples adapted from Everaert et al. (TICS 2015)
  • 9-13. Language is hierarchical. [Slides show the trees for "The talk I gave did not appeal to anybody" vs. "*The talk I did not give appealed to anybody", together with Figure 1 of Everaert et al. (TICS 2015): "(A) Negative polarity licensed: negative element c-commands negative polarity item. (B) Negative polarity not licensed: negative element does not c-command negative polarity item." The excerpt explains that not licenses anybody only when the constituent dominating not also dominates the constituent containing anybody, i.e. c-command.] Generalization: not must "structurally precede" anybody. • The psychological reality of structural sensitivity is not empirically controversial. • There are many different theories of the details of structure. • A more controversial hypothesis: kids learn language easily because they don't consider many "obvious" structure-insensitive hypotheses.
  • 14. Recurrent neural networks: good models of language? • Recurrent neural networks are incredibly powerful models of sequences (e.g., of words). • In fact, RNNs are Turing complete! (Siegelmann, 1995) • But do they make good generalizations from finite samples of data? What inductive biases do they have? • Part 1: using syntactic annotations to improve inductive bias. • Part 2: inferring better compositional structure without explicit annotation.
  • 16-17. Recurrent neural networks: inductive bias. • Understanding the biases of neural networks is tricky: we have enough trouble understanding the representations they learn in specific cases, much less in general! • But there is lots of evidence that RNNs have a bias for sequential recency. • Evidence 1: gradients become attenuated across time (functional analysis; experiments with synthetic datasets; yes, LSTMs/GRUs help, but they also forget). • Evidence 2: training regimes like reversing sequences in seq2seq learning. • Evidence 3: modeling enhancements like attention (direct connections back to remote time steps). • Evidence 4: Linzen et al. (2017) findings on English subject-verb agreement.
  • 18. The keys to the cabinet in the closet is/are on the table. AGREE Linzen, Dupoux, Goldberg (2017)
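To make Evidence 4 concrete, here is a minimal sketch of the agreement probe in the style of Linzen et al.: the model is credited when it assigns higher probability to the verb form that agrees with the head noun than to the form that agrees with an intervening "attractor" noun. The lm_logprob_next interface is a hypothetical stand-in for whatever trained language model is being tested.

# Hypothetical agreement probe: prefix, grammatical verb, ungrammatical verb.
def agreement_accuracy(lm_logprob_next, items):
    correct = 0
    for prefix, good, bad in items:
        if lm_logprob_next(prefix, good) > lm_logprob_next(prefix, bad):
            correct += 1
    return correct / len(items)

items = [
    ("The keys to the cabinet in the closet", "are", "is"),
    ("The key to the cabinets", "is", "are"),
]
# accuracy = agreement_accuracy(my_lm.logprob_next, items)   # my_lm is assumed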

  • 19. Chomsky (crudely paraphrasing 60 years of work): sequential recency is not the right bias for modeling human language.
  • 20-22. An alternative: Recurrent Neural Net Grammars (RNNGs). [Adhi Kuncoro] • Generate symbols sequentially using an RNN. • Add some control symbols to rewrite the history occasionally: a REDUCE action compresses a completed sequence into a single constituent. • The RNN predicts the next terminal/control symbol based on the history of compressed elements and non-compressed terminals. • This is a top-down, left-to-right generation of a tree together with its sequence.
  • 23. The hungry cat meows loudly Example derivation

  • 27-42. The derivation of "(S (NP The hungry cat) (VP meows) .)", one action at a time (stack after the action | action | probability of the action given the stack):
    (S                                     NT(S)        p(nt(S) | top)
    (S (NP                                 NT(NP)       p(nt(NP) | (S)
    (S (NP The                             GEN(The)     p(gen(The) | (S, (NP)
    (S (NP The hungry                      GEN(hungry)  p(gen(hungry) | (S, (NP, The)
    (S (NP The hungry cat                  GEN(cat)     p(gen(cat) | ...)
    (S (NP The hungry cat)                 REDUCE       p(reduce | ...)   <- compress "The hungry cat" into a single composite symbol
    (S (NP The hungry cat) (VP             NT(VP)       p(nt(VP) | (S, (NP The hungry cat))
    (S (NP The hungry cat) (VP meows       GEN(meows)   ...
    (S (NP The hungry cat) (VP meows)      REDUCE       ...
    (S (NP The hungry cat) (VP meows) .    GEN(.)       ...
    (S (NP The hungry cat) (VP meows) .)   REDUCE       ...
  • 43. Some things you can show: • Valid (tree, string) pairs are in bijection with valid sequences of actions, specifically the depth-first, left-to-right traversals of the trees (prop 1). • Every stack configuration perfectly encodes the complete history of actions (prop 2). • Therefore the probability decomposition is justified by the chain rule:
    p(x, y) = p(actions(x, y))                         (prop 1)
    p(actions(x, y)) = ∏_i p(a_i | a_<i)               (chain rule)
                     = ∏_i p(a_i | stack(a_<i))        (prop 2)
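A minimal sketch of the transition system above, replaying the derivation and accumulating log p(x, y) as a sum of per-action log-probabilities. The action_logprob argument is a hypothetical stand-in for the neural model that scores the next action given the stack; here it is just a uniform placeholder.

import math

def replay(actions, action_logprob):
    """Replay NT/GEN/REDUCE actions; return the terminals, the tree,
    and log p(x, y) = sum_i log p(a_i | stack(a_<i))."""
    stack, words, logp = [], [], 0.0
    for act, arg in actions:
        logp += action_logprob(stack, (act, arg))   # p(a_i | stack(a_<i))
        if act == "NT":                             # open a new constituent
            stack.append(("OPEN", arg))
        elif act == "GEN":                          # generate a terminal
            stack.append(arg)
            words.append(arg)
        else:                                       # REDUCE: close the constituent
            children = []
            while not isinstance(stack[-1], tuple):
                children.append(stack.pop())
            _, label = stack.pop()
            stack.append("(" + label + " " + " ".join(reversed(children)) + ")")
    return words, stack[0], logp

derivation = [("NT", "S"), ("NT", "NP"), ("GEN", "The"), ("GEN", "hungry"),
              ("GEN", "cat"), ("REDUCE", None), ("NT", "VP"), ("GEN", "meows"),
              ("REDUCE", None), ("GEN", "."), ("REDUCE", None)]

uniform = lambda stack, a: math.log(1 / 3)          # placeholder action model
words, tree, logp = replay(derivation, uniform)
print(tree)   # (S (NP The hungry cat) (VP meows) .)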

  • 44-48. Modeling the next action: p(a_i | (S (NP The hungry cat) (VP meows). Two challenges so far: 1. Unbounded depth of the history → recurrent neural nets. 2. Arbitrarily complex trees on the stack → recursive neural nets. [Slides show the stack being read by an RNN with states h1 ... h4.]
  • 49-60. Syntactic composition: we need a single vector to represent (NP The hungry cat). • The nonterminal label NP serves as the head type. • Read the label and the child vectors (The, hungry, cat) with an RNN in both directions and compress them into one vector v (a sketch of such a composition function follows). • Recursion: for (NP The (ADJP very hungry) cat), the middle child is itself the composed vector v of (ADJP very hungry).
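A minimal sketch of one such composition function, assuming a bidirectional LSTM that reads the nonterminal embedding followed by the child vectors and projects its two final states back to the model dimension. This follows the recipe described here; the sizes, initialization, and exact wiring are illustrative assumptions, not the authors' settings.

import torch
import torch.nn as nn

class Composer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.birnn = nn.LSTM(dim, dim, bidirectional=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, label_vec, child_vecs):
        # label_vec: (dim,); child_vecs: list of (dim,) tensors, left to right
        seq = torch.stack([label_vec] + child_vecs).unsqueeze(1)   # (len, 1, dim)
        out, _ = self.birnn(seq)                                   # (len, 1, 2*dim)
        half = out.size(-1) // 2
        fwd_last = out[-1, 0, :half]     # forward state after the last child
        bwd_first = out[0, 0, half:]     # backward state after reading right-to-left
        return torch.tanh(self.proj(torch.cat([fwd_last, bwd_first])))

dim = 8
composer = Composer(dim)
emb = nn.Embedding(10, dim)                    # toy vocabulary: labels + words
NP, The, hungry, cat = (emb(torch.tensor(i)) for i in range(4))
v_np = composer(NP, [The, hungry, cat])        # vector standing for (NP The hungry cat)
# recursion: a composed vector can itself be a child, e.g. for (ADJP very hungry)
print(v_np.shape)                              # torch.Size([8])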
  • 61-64. Modeling the next action, continued: after a REDUCE, the conditioning context changes from p(a_i | (S (NP The hungry cat) (VP meows) to p(a_{i+1} | (S (NP The hungry cat) (VP meows)). 3. Each action makes only limited updates to the state → stack RNNs (a sketch follows).
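A minimal stack-RNN sketch, assuming the usual stack-LSTM trick: keep a stack of saved RNN states so that a push runs the RNN one step from the current top and a pop simply reverts to an earlier state; the summary used to predict the next action is always the hidden state at the top. Dimensions and the toy usage are assumptions.

import torch
import torch.nn as nn

class StackRNN:
    def __init__(self, dim):
        self.cell = nn.LSTMCell(dim, dim)
        empty = (torch.zeros(1, dim), torch.zeros(1, dim))
        self.states = [empty]          # bottom-of-stack state
        self.contents = []             # what was pushed (so a pop can return it)

    def push(self, x):                 # x: (1, dim)
        h, c = self.cell(x, self.states[-1])
        self.states.append((h, c))
        self.contents.append(x)

    def pop(self):                     # limited update: just revert to the old state
        self.states.pop()
        return self.contents.pop()

    def summary(self):                 # used to predict the next action
        return self.states[-1][0]

dim = 8
stack = StackRNN(dim)
stack.push(torch.randn(1, dim))        # e.g. embedding of "(NP"
stack.push(torch.randn(1, dim))        # e.g. embedding of "The"
popped = stack.pop()                   # a REDUCE would pop children, compose, re-push
print(stack.summary().shape)           # torch.Size([1, 8])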
  • 65-69. RNNGs: inductive bias? • If we accept two propositions: (i) sequential RNNs have a recency bias, and (ii) syntactic composition learns to represent trees by endocentric heads, • then we can say that RNNGs have a bias for syntactic recency rather than sequential recency. • Example: (S (NP The keys (PP to the cabinet (PP in the closet))) (VP is/are ... : the composed subject NP is summarized by its head "keys", which is syntactically recent when the verb is generated. A sequential RNN instead conditions on The keys to the cabinet in the closet is/are, where the most recent nouns are "cabinet" and "closet".
  • 70. • Generative models are great! We have defined a joint distribution p(x,y) over strings (x) and parse trees (y) • We can look at two questions: • What is p(x) for a given x? [language modeling] • What is max p(y | x) for a given x? [parsing] RNNGs
 A few empirical results
  • 71. English PTB (Parsing), F1:
    Petrov and Klein (2007)                     Gen                 90.1
    Shindo et al. (2012), single model          Gen                 91.1
    Vinyals et al. (2015), PTB only             Disc                90.5
    Shindo et al. (2012), ensemble              Gen                 92.4
    Vinyals et al. (2015), semisupervised       Disc+SemiSup        92.8
    Discriminative (RNNG), PTB only             Disc                91.7
    Generative (RNNG), PTB only                 Gen                 93.6
    Choe and Charniak (2016), semisupervised    Gen+SemiSup         93.8
    Fried et al. (2017)                         Gen+Semi+Ensemble   94.7
  • 75. English PTB (language modeling), perplexity:
    5-gram IKN            169.3
    LSTM + Dropout        113.4
    Generative (IS)       102.4
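Reading "(IS)" as an importance-sampling estimate of the marginal p(x) (an assumption: the generative model defines p(x, y), so language-model evaluation needs a sum over trees), here is a minimal sketch of such an estimator. joint_logprob, proposal_sample, and proposal_logprob are hypothetical stand-ins for the generative model and a discriminative proposal parser.

import math

def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(v - m) for v in xs))

def estimate_log_px(x, joint_logprob, proposal_sample, proposal_logprob, n=100):
    # log p(x) = log E_{y ~ q(.|x)}[ p(x, y) / q(y | x) ], Monte Carlo estimate
    log_ws = []
    for _ in range(n):
        y = proposal_sample(x)                        # sample a tree from q(y | x)
        log_ws.append(joint_logprob(x, y) - proposal_logprob(y, x))
    return log_sum_exp(log_ws) - math.log(n)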
  • 76. • RNNGs are effective for modeling language and parsing • Finding 1: Generative parser > discriminative parser • Better sample complexity • No label bias in modeling action sequence • Finding 2: RNNGs > RNNs • PTB phrase structure is a useful inductive bias for making good linguistic generalizations! • Open question: Do RNNGs solve the Linzen, Dupoux, and Goldberg problem? Summary
  • 78. Unsupervised structure: do we need syntactic annotation? [Dani Yogatama] • So far we have compared two endpoints: a sequence model and a PTB-based syntactic model. • If we instead search for whatever structure gives the best performance on downstream tasks, what will we find? • This part of the talk focuses on representation learning for sentences.
  • 79. Background: advances in deep learning have led to three predominant approaches for constructing representations of sentences. • Convolutional neural networks (Kim, 2014; Kalchbrenner et al., 2014; Ma et al., 2015). • Recurrent neural networks (Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2014). • Recursive neural networks (Socher et al., 2011; Socher et al., 2013; Tai et al., 2015; Bowman et al., 2016).
  • 81. Recurrent Encoder. [Diagram: word embeddings for "A boy drags his sleds through the snow" are read left to right; the final state is the output.] Objective: L = log p(y | x).
  • 82. Recursive Encoder. [Diagram: the same word embeddings are composed bottom-up along a tree; the root vector is the output.]
  • 83. Recursive Encoder. • Prior work on tree-structured neural networks assumed that the trees are either provided as input or predicted based on explicit human annotations (Socher et al., 2013; Bowman et al., 2016; Dyer et al., 2016). • When children learn a new language, they are not given parse trees. • Instead, infer tree structure as a latent variable that is marginalized during learning of a downstream task that depends on the composed representation: L = log p(y | x) = log Σ_{z ∈ Z(x)} p(y, z | x). • Hierarchical structures can also be encoded into a word embedding model to improve performance and interpretability (Yogatama et al., 2015).
  • 84-95. Model: shift-reduce parsing (Aho and Ullman, 1972) over "A boy drags his sleds". • SHIFT moves the next word from the queue onto the stack. • REDUCE pops the top two elements of the stack, composes them with a Tree-LSTM (Tai et al., 2015; Zhu et al., 2015), and pushes the result back onto the stack (a sketch of this composition follows):
    i   = σ(W_I  [h_i, h_j] + b_I)
    o   = σ(W_O  [h_i, h_j] + b_O)
    f_L = σ(W_FL [h_i, h_j] + b_FL)
    f_R = σ(W_FR [h_i, h_j] + b_FR)
    g   = tanh(W_G [h_i, h_j] + b_G)
    c   = f_L ⊙ c_i + f_R ⊙ c_j + i ⊙ g
    h   = o ⊙ c
  • For example, SHIFT, SHIFT, REDUCE leaves Tree-LSTM(A, boy) on the stack, and the complete parse ends with Tree-LSTM(Tree-LSTM(A, boy), Tree-LSTM(drags, Tree-LSTM(his, sleds))).
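A minimal numpy sketch of the binary Tree-LSTM composition applied at each REDUCE, following the equations above; weight shapes, initialization, and the toy inputs are illustrative assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tree_lstm_compose(hi, ci, hj, cj, params):
    """Compose left child (hi, ci) and right child (hj, cj) into one node."""
    x = np.concatenate([hi, hj])
    i  = sigmoid(params["W_I"] @ x + params["b_I"])     # input gate
    o  = sigmoid(params["W_O"] @ x + params["b_O"])     # output gate
    fL = sigmoid(params["W_FL"] @ x + params["b_FL"])   # forget gate, left child
    fR = sigmoid(params["W_FR"] @ x + params["b_FR"])   # forget gate, right child
    g  = np.tanh(params["W_G"] @ x + params["b_G"])     # candidate cell
    c  = fL * ci + fR * cj + i * g
    h  = o * c                                          # as written on the slide
    return h, c

d = 8
rng = np.random.default_rng(0)
params = {f"W_{k}": 0.1 * rng.standard_normal((d, 2 * d)) for k in ["I", "O", "FL", "FR", "G"]}
params.update({f"b_{k}": np.zeros(d) for k in ["I", "O", "FL", "FR", "G"]})

# REDUCE over the top two stack elements, e.g. "A" and "boy":
h_a, c_a = rng.standard_normal(d), np.zeros(d)
h_boy, c_boy = rng.standard_normal(d), np.zeros(d)
h_np, c_np = tree_lstm_compose(h_a, c_a, h_boy, c_boy, params)
print(h_np.shape)   # (8,)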
  • 96. Different SHIFT/REDUCE sequences lead to different tree structures over the same words. [Slide shows four example binary trees with their action sequences: S,S,R,S,S,R,R / S,S,S,R,R,S,R / S,S,R,S,R,S,R / S,S,S,S,R,R,R.]
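A small sketch making the point concrete: the action strings from the slide, applied to the same words, produce different bracketings (helper names are mine, not the paper's).

def actions_to_tree(words, actions):
    """Turn a SHIFT/REDUCE sequence over a word list into a nested bracketing."""
    stack, queue = [], list(words)
    for a in actions:
        if a == "S":
            stack.append(queue.pop(0))
        else:  # "R": pop the top two elements and push their combination
            right, left = stack.pop(), stack.pop()
            stack.append((left, right))
    assert len(stack) == 1 and not queue, "not a valid binary-tree derivation"
    return stack[0]

words = ["A", "boy", "drags", "sleds"]
print(actions_to_tree(words, list("SSRSSRR")))   # (('A', 'boy'), ('drags', 'sleds'))
print(actions_to_tree(words, list("SSSSRRR")))   # ('A', ('boy', ('drags', 'sleds')))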
  • 97. Learning • How do we learn the policy for the shift reduce sequence? • Supervised learning! Imitation learning! • But we need a treebank. • What if we don’t have labels?
  • 98-99. Reinforcement learning: a Shift-Reduce "agent". • Given a state and a set of possible actions, the agent must decide which action to take. • There is no supervision, only rewards, which may not be observed until after several actions have been taken. • State: embeddings of the top two elements of the stack and of the head of the queue. • Actions: shift, reduce. • Reward: log likelihood on a downstream task, given the produced representation.
  • 100. Reinforcement learning: we use a simple policy gradient method, REINFORCE (Williams, 1992). The goal of training is to learn a policy network π(a | s; W_R) that maximizes the expected reward R(W) = E_{π(a,s;W_R)}[ Σ_{t=1..T} r_t a_t ]. Other techniques for approximating the marginal likelihood are available (Guu et al., 2017).
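A minimal numpy sketch of REINFORCE for a two-action (SHIFT/REDUCE) softmax policy with a single terminal reward; the state features and reward function are stubs standing in for the stack/queue embeddings and the downstream log likelihood, and validity constraints on the actions are ignored.

import numpy as np

rng = np.random.default_rng(0)
d_state, n_actions = 16, 2                               # actions: 0 = SHIFT, 1 = REDUCE
W = 0.01 * rng.standard_normal((n_actions, d_state))     # policy parameters W_R

def policy(W, s):                                        # pi(a | s; W_R): softmax over actions
    z = W @ s
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_episode(W, states, reward_fn, lr=0.01):
    """Sample one action per state, observe one terminal reward R, and take a
    REINFORCE step: W += lr * R * grad log pi(a_t | s_t)."""
    actions = [int(rng.choice(n_actions, p=policy(W, s))) for s in states]
    R = reward_fn(actions)                               # e.g. downstream log likelihood (stub)
    grad = np.zeros_like(W)
    for s, a in zip(states, actions):
        p = policy(W, s)
        grad += R * np.outer(np.eye(n_actions)[a] - p, s)   # softmax score function
    return W + lr * grad, R

# Toy usage: random "states" standing in for stack/queue embeddings, dummy reward.
states = [rng.standard_normal(d_state) for _ in range(7)]
W, R = reinforce_episode(W, states, reward_fn=lambda acts: sum(acts) / len(acts))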
  • 101. Stanford Sentiment Treebank (Socher et al., 2013), accuracy:
    Naive Bayes (from Socher et al., 2013)                      81.8
    SVM (from Socher et al., 2013)                              79.4
    Average of Word Embeddings (from Socher et al., 2013)       80.1
    Bayesian Optimization (Yogatama et al., 2015)               82.4
    Weighted Average of Word Embeddings (Arora et al., 2017)    82.4
    Left-to-Right LSTM                                          84.7
    Right-to-Left LSTM                                          83.9
    Bidirectional LSTM                                          84.7
    Supervised Syntax                                           85.3
    Semi-supervised Syntax                                      86.1
    Latent Syntax                                               86.5
  • 103. Examples of learned structures. [Slide shows induced trees; some are intuitive structures.]
  • 105. Grammar induction summary. • The learned trees look "non-linguistic", but downstream performance is great! • These models ignore how the sentence is generated in favor of structures that are optimal for interpretation; natural languages must balance both.
  • 106. • Do we need better bias in our models? • Yes! They are making the wrong generalizations, even from large data. • Do we have to have the perfect model? • No! Small steps in the right direction can pay big dividends. Linguistic Structure Summary