Neural networks have a long and rich history in automatic speech recognition. In this talk, we present a brief primer on the origin of deep learning in spoken language, and then explore today’s world of Alexa. Alexa is the AWS service that understands spoken language and powers Amazon Echo. Alexa relies heavily on machine learning and deep neural networks for speech recognition, text-to-speech, language understanding, and more. We also discuss the Alexa Skills Kit, which lets any developer teach Alexa new skills.
2. Outline
• History of Deep Learning
• Deep Learning in Alexa
• The Alexa Skills Kit
3. History of Deep Learning
[Timeline, 1986–2016: intense academic activity after 1986; the “neural winter” (ca. 1998–2007); then the “GPU era”. 1986: Hinton, Rumelhart and Williams invent backpropagation training. 2014: Amazon Echo launches!]
4. Multilayer perceptron
[Diagram: input layer (input, x) → hidden layer (h1) → hidden layer (h2) → output layer (output, y)]
h1 = sigmoid(A1·x + b1)
h2 = sigmoid(A2·h1 + b2)
y = sigmoid(Ao·h2 + bo)
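The three equations above can be sketched as a forward pass in plain Python. This is a minimal, NumPy-free illustration; the layer sizes (3 → 5 → 5 → 2) and random weights are arbitrary choices, not values from the talk.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(A, x, b):
    # one dense layer: sigmoid(A·x + b), written out element by element
    return [sigmoid(sum(a * xi for a, xi in zip(row, x)) + bi)
            for row, bi in zip(A, b)]

random.seed(0)

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

A1, b1 = rand_matrix(5, 3), [0.0] * 5   # input (3) -> hidden (5)
A2, b2 = rand_matrix(5, 5), [0.0] * 5   # hidden -> hidden
Ao, bo = rand_matrix(2, 5), [0.0] * 2   # hidden -> output (2)

x = [0.5, -1.0, 2.0]          # input, x
h1 = layer(A1, x, b1)         # hidden layer
h2 = layer(A2, h1, b2)        # hidden layer
y = layer(Ao, h2, bo)         # output, y
print(len(y))  # 2
```

Each layer is just a matrix-vector product followed by the sigmoid nonlinearity; stacking them is what makes the network "deep".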
5. Deep Learning milestones
[Timeline, 1986–2016, with the “neural winter” spanning the late 1990s to the mid-2000s:]
• 1986: Hinton, Rumelhart and Williams invent backpropagation training.
• 1997: Hochreiter and Schmidhuber invent the LSTM for recurrent networks with long memory.
• 1998: LeCun, Bottou, Bengio and Haffner publish CNNs for computer vision.
• Salakhutdinov and Hinton discover a method to train very deep neural networks.
• 2009: Mohamed, Dahl and Hinton beat a well-known speech recognition benchmark (TIMIT).
• 2011: Microsoft and Google demonstrate breakthrough results on large-vocabulary speech recognition.
• Krizhevsky, Sutskever, and Hinton win the ImageNet object recognition challenge.
• 2016: AlphaGo beats a Go world champion.
6. Deep Learning in Speech Recognition
[Timeline, 1986–2016, with the “neural winter” in between:]
• 1989: Waibel, Hanazawa, Hinton, Shikano, and Lang publish the time-delay neural network (TDNN).
• 1991: Robinson demonstrates RNNs for ASR and gets the best result on TIMIT so far.
• 1992: Bourlard, Morgan, Wooters and Renals introduce context-dependent MLP models.
• 1996: Strom combines time-delay NNs and RNNs (RTDNN) and introduces speaker vectors for speaker adaptation.
• 2009: Mohamed, Dahl and Hinton beat a well-known speech recognition benchmark (TIMIT).
• 2011: Microsoft and Google demonstrate breakthrough results on large-vocabulary speech recognition.
7. Impact of data corpus size
16 years = 140,160 hours → ≈14,016 hours of speech
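A quick back-of-envelope check of the slide's numbers. Note that the "one hour of heard speech per ten hours lived" reading is our inference from the ratio, not something stated in the talk.

```python
# 16 years expressed in hours (ignoring leap days)
hours_in_16_years = 16 * 365 * 24
print(hours_in_16_years)        # 140160

# The slide's ~14,016 hours of speech is exactly 10% of that total,
# i.e. roughly one hour of heard speech per ten hours lived (our inference).
speech_hours = hours_in_16_years // 10
print(speech_hours)             # 14016
```

The point of the slide: a child's lifetime exposure to speech is on the order of ten thousand hours, which is also the scale of corpus that modern ASR training pipelines consume.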
10. Impact of compute infrastructure
[Timeline, 1986–2016: the “neural winter” (ca. 1998–2007) and the reign of EM, followed by distributed SGD (Dean et al., 2012; Strom, 2015)]
• During the “neural winter,” EM became a dominant distributed computing paradigm for machine learning (ML)
• ML algorithms that use the EM algorithm benefited greatly
• Distributed SGD broke Deep Learning out of the single box
11. Conclusion – how we got here
We are in a period of massive Deep Learning adoption because of:
• Theory and algorithm design in the ’80s and ’90s
• Orders of magnitude more data available
• Orders of magnitude more computational capacity
• A few algorithmic inventions that enabled deep networks
• The rise of distributed SGD training
13. Large-scale distributed training
• Up to 80 EC2 g2.2xlarge GPU instances working in sync to train a model
• Thousands of hours of speech training data stored in Amazon S3
14. Large-scale distributed training
• All nodes must communicate model updates to all other nodes.
• GPUs compute model updates fast – think updates per second.
• A model update is hundreds of MB.
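To see why this combination is hard, here is a back-of-envelope calculation of the network traffic a naive all-to-all exchange would need. The specific numbers (300 MB per update, 2 updates per second) are illustrative assumptions, not figures from the talk.

```python
workers = 80            # GPU instances training in sync
update_mb = 300         # "hundreds of MB" per model update (assumed 300 MB)
updates_per_sec = 2     # GPUs produce updates every fraction of a second (assumed)

# If every node naively sends its full update to every other node,
# the outbound bandwidth required per node is:
per_node_out_mb_s = update_mb * updates_per_sec * (workers - 1)
print(per_node_out_mb_s)  # 47400
```

Tens of GB/s of sustained traffic per node is far beyond commodity networking, which is why gradient compression and quantization schemes (such as the one in the INTERSPEECH 2015 paper cited on the next slide) are essential for this kind of training.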
15. DNN training speed
[Plot: frames per second (0–600,000) vs. number of GPU workers (0–80)]
Strom, Nikko. “Scalable Distributed DNN Training Using Commodity GPU Cloud Computing.” INTERSPEECH, 2015.
20. Intent and entities
“play two steps behind by def leppard”
Intent: PlayMusic
Entities: Song (“two steps behind”), Artist (“def leppard”)
Two problems:
1. Words are symbols – not vectors of numbers
2. Requests are of different lengths
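Both problems have standard fixes, sketched below: word embeddings turn symbols into vectors, and a pooling step collapses a variable-length request into a fixed-size vector a classifier can consume. This is our illustration, not Alexa's actual pipeline; real systems typically use trained embeddings and recurrent networks rather than random vectors and averaging.

```python
import random
random.seed(0)

DIM = 4
embeddings = {}  # toy embedding table, filled lazily with random vectors

def embed(word):
    # problem 1: map each word symbol to a vector of numbers
    if word not in embeddings:
        embeddings[word] = [random.uniform(-1, 1) for _ in range(DIM)]
    return embeddings[word]

def encode(request):
    # problem 2: average over time -> a fixed-length vector
    # regardless of how many words the request contains
    vectors = [embed(w) for w in request.split()]
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(DIM)]

short = encode("play music")
long_ = encode("play two steps behind by def leppard")
print(len(short), len(long_))  # 4 4
```

Both requests end up as 4-dimensional vectors, so a single downstream intent classifier can score them the same way.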
26. Prosody for natural-sounding reading
Bi-directional recurrent network
Inputs:
• Phonetic features
• Linguistic features
• Semantic word vectors
Targets for each segment: pitch, duration, intensity
27. Long-form example
“Over a lunch of diet cokes and lobster salad one
balmy fall day in Boston, Joseph Martin, the
genial, white-haired, former dean of Harvard
medical school, told me how many hours of pain
education Harvard med students get during four
years of medical school.”
[Audio examples: before / after]
33. ASK for Developers
• Define a Voice User Interface
• Provide a finite number of sample utterances
• ASK automatically builds and deploys
machine learning models
36. Model Building
Developer input feeds two kinds of models:
• Finite-state transducers (FSTs) for exact matches
• An ML entity recognizer and an ML intent recognizer for fuzzy matches
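The two-model split can be sketched as an exact-match-first pipeline. This is our illustration of the idea, not the actual ASK implementation: a dict stands in for the compiled FST, and a trivial keyword scorer stands in for the ML intent recognizer.

```python
# "FST": exact matches compiled from the developer's sample utterances
exact_matches = {
    "get me a car": "RequestRide",
}

def ml_intent(utterance):
    # hypothetical fuzzy fallback: fire RequestRide if a ride word appears
    return "RequestRide" if "car" in utterance.split() else "Unknown"

def recognize(utterance):
    if utterance in exact_matches:      # fast path: exact match
        return exact_matches[utterance]
    return ml_intent(utterance)         # fuzzy path: ML model

print(recognize("get me a car"))                       # RequestRide (exact)
print(recognize("hey uhm i need a car to starbucks"))  # RequestRide (fuzzy)
```

The exact-match path is cheap and deterministic for utterances the developer anticipated; the ML path generalizes to the long tail of phrasings they did not.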
37. ASK Machine Learning
Training (from developers): a finite number of sample utterances, e.g.
“get a car to <Destination>”
“get me a car”
…
These TRAIN the ASK machine learning model.
Runtime (from customers): an infinite number of possible utterances, e.g.
“hey uhm i need a car to starbucks”
The model MATCHes these against the trained intents.
38. ASK Machine Learning (contd.)
• Neural Networks (NNs)
• Transfer Learning: use knowledge learned from large related training data
• Example: we’ve seen slots like <Destination> before, no need to learn from scratch.
“get a car to <Destination>”
“get me a car”
…
39. How to Write Great Skills
Slots
• Catalogs: Provide as many values as possible.
Add representative values of different lengths where
appropriate
• Use built-in slots where possible
(e.g., cities, states, first names)
• Do not use too many slots in one utterance
(rather ask for missing slots in a dialog)
• Use context around each slot
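The slot guidance above can be made concrete with a small example. The slot name, catalog values, and utterances below are invented for illustration; they follow the shape of an ASK interaction model but are not a real schema.

```python
# Hypothetical custom slot with representative values of different lengths
coffee_shop_slot = {
    "name": "CoffeeShop",
    "values": [
        "starbucks",                        # short value
        "blue bottle coffee",               # medium value
        "the coffee bean and tea leaf",     # long value
    ],
}

# Sample utterances: one slot each, with words *around* the slot so the
# recognizer has context to locate it
sample_utterances = [
    "get me a coffee from {CoffeeShop}",
    "order my usual at {CoffeeShop} please",
]

for u in sample_utterances:
    assert "{CoffeeShop}" in u
print(len(coffee_shop_slot["values"]))  # 3
```

Varying the catalog value lengths teaches the model how long a slot filler can be, and the surrounding carrier words give it the context cues the bullet points recommend.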
40. How to Write Great Skills
Intents
• Split heterogeneous intents
• Use built-in intents where possible
• Provide as many carrier phrases as possible
• Use a thesaurus or paraphrasing tools; ask your friends or Mechanical Turk workers for utterances
41. Conclusions
• ASK connects developers to customers
• Developers constantly extend Alexa’s capabilities
• We constantly get more data and improve the experience via machine learning
• This makes Alexa more intelligent and powerful, bridging the gap between human and machine
47. Images used
Macaw. Public domain. https://pixabay.com/en/macaw-bird-beak-parrot-650638/
VW. Free for editorial use. http://media.vw.com/images/category/11/
48. Images used
ASCI Red. Public domain. https://commons.wikimedia.org/wiki/File:Asci_red_-_tflop4m.jpeg
8800 GTX. Permission by email by Tri Hyunth at Nvidia.