Squeezing Deep Learning into mobile phones
- A Practitioner's guide
Anirudh Koul
Anirudh Koul, @anirudhkoul, http://koul.ai
Project Lead, Seeing AI (SeeingAI.com)
Applied Researcher, Microsoft AI & Research
Akoul at Microsoft dot com
Currently working on applying artificial intelligence for
Hololens, autonomous robots and accessibility
Along with Eugene Seleznev, Saqib Shaikh, Meher Kasam
Why Deep Learning On Mobile?
Latency Privacy
Response Time Limits – Powers of 10
0.1 second : Reacting instantly
1.0 seconds : User’s flow of thought
10 seconds : Keeping the user’s attention
[Miller 1968; Card et al. 1991; Jakob Nielsen 1993]:
Mobile Deep Learning Recipe
Mobile Inference Engine + Pretrained Model = DL App
(Efficient) (Efficient)
Building a DL App in _ time
Building a DL App in 1 hour
Use Cloud APIs
Microsoft Cognitive Services
Clarifai
Google Cloud Vision
IBM Watson Services
Amazon Rekognition
Tip : Resize to 224x224 at under 50% compression with bilinear interpolation
before network transmission
But don’t resize for Text / OCR projects
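A minimal sketch of this preprocessing step in Python with Pillow; the file name and the exact JPEG quality value are illustrative assumptions:

from io import BytesIO
from PIL import Image

# Downscale to 224x224 with bilinear interpolation, then JPEG-compress
# before uploading to a cloud vision API (skip all of this for OCR).
image = Image.open("photo.jpg").convert("RGB")
small = image.resize((224, 224), Image.BILINEAR)

buffer = BytesIO()
small.save(buffer, format="JPEG", quality=50)   # aggressive compression keeps the payload small
payload = buffer.getvalue()                     # bytes to send over the network
print(len(payload), "bytes to upload")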
Microsoft Cognitive Services
Models won the 2015 ImageNet Large Scale Visual Recognition Challenge
Vision, Face, Emotion, Video and 21 other topics
Custom Vision Service (customvision.ai) – Drag and drop training
Tip : Upload 30 photos per class to build a prototype model
Upload 200 photos per class for a more robust production model
The more distinct the shape/type of the object, the fewer images required.
Custom Vision Service (customvision.ai) – Drag and drop training
Tip : Use the Fatkun Browser Extension to download images from a search engine, or use the Bing Image Search API to programmatically download photos with proper rights
Building a DL App in 1 day
http://deeplearningkit.org/2015/12/28/deeplearningkit-deep-learning-for-ios-tested-on-iphone-6s-tvos-and-os-x-developed-in-metal-and-swift/
[Chart] Energy to train a Convolutional Neural Network vs. energy to use (run inference with) one
Base PreTrained Model
ImageNet – 1000 Object Categorizer
Inception
Resnet
Running pre-trained models on mobile
Core ML
Tensorflow
Caffe2
Snapdragon Neural Processing Engine
MXNet
CNNDroid
DeepLearningKit
Torch
Core ML
From Apple, for iOS 11
Convert Caffe/Tensorflow model to CoreML model in 3 lines:
import coremltools
coreml_model = coremltools.converters.caffe.convert('my_caffe_model.caffemodel')
coreml_model.save('my_model.mlmodel')
Add model to iOS project and call for prediction.
Direct support for Keras, Caffe, scikit-learn, XGBoost, LibSVM
Builds on top of low-level primitives
Accelerate, BNNS, Metal Performance Shaders (MPS)
Noticeable speedup between the MPS (iOS 10) and CoreML (iOS 11) implementations
(same model, same hardware)
Automatically minimizes memory footprint and power consumption
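For the Keras path, a hedged conversion sketch using the coremltools Keras converter (assumes an older coremltools release that ships this converter; the file names, input name and label file are placeholders):

import coremltools

# Convert a trained Keras image classifier to a .mlmodel file.
coreml_model = coremltools.converters.keras.convert(
    'my_keras_model.h5',            # placeholder: your saved Keras model
    input_names='image',
    image_input_names='image',      # treat the input as an image rather than a raw array
    class_labels='labels.txt')      # attach class names so Xcode exposes a classifier API
coreml_model.save('my_model.mlmodel')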
CoreML Benchmark - Pick a DNN for your mobile architecture

Model          Top-1 Accuracy (%)   Size (MB)   iPhone 6 (ms)   iPhone 6S (ms)   iPhone 7 (ms)
VGG 16         71                   553         4556            254              208
Inception v3   78                   95          637             98               90
Resnet 50      75                   103         557             72               64
MobileNet      71                   17          109             52               32
SqueezeNet     57                   5           78              29               24

(iPhone 6, 6S and 7 are 2014, 2015 and 2016 hardware; the huge improvement in hardware came in 2015.)
Putting out more frames than an art gallery
Tensorflow
Easy pipeline to bring Tensorflow models to mobile
Excellent documentation
Optimizations to bring model to mobile
Upcoming : XLA (Accelerated Linear Algebra)
compiler to optimize for hardware
Caffe2
From Facebook
Under 1 MB of binary size
Built for Speed :
For ARM CPU : Uses NEON Kernels, NNPack
For iPhone GPU : Uses Metal Performance Shaders and Metal
For Android GPU : Uses Qualcomm Snapdragon NPE (4-5x speedup)
ONNX format support to import models from CNTK/PyTorch
Snapdragon Neural Processing Engine (NPE) SDK
Like CoreML for Qualcomm Snapdragon chips
Published speedup of 4-5x
On about half the Android phones
Identifies best target core for inference - GPU, DSP or CPU
Customizable to choose between battery power and performance
Supports importing models from Caffe, Caffe2, Tensorflow
MXNET
Amalgamation : Pack all the code in a single source file
Pro:
• Cross Platform (iOS, Android), Easy porting
• Usable in any programming language
Con:
• CPU only, Slow
https://github.com/Leliana/WhatsThis
CNNdroid
GPU accelerated CNNs for Android
Supports Caffe, Torch and Theano models
~30-40x Speedup using mobile GPU vs CPU (AlexNet)
Internally, CNNdroid expresses data parallelism for the different layers, instead of leaving it to the GPU's hardware scheduler
DeepLearningKit
Platform : iOS, OS X and tvOS (Apple TV)
DNN Type : CNNs models trained in Caffe
Runs on mobile GPU, uses Metal
Pro : Fast, directly ingests Caffe models
Con : Unmaintained
Running pre-trained models on mobile

Mobile Library    Platform      GPU   DNN Architectures Supported   Trained Models Supported
CoreML            iOS           Yes   CNN, RNN, SciKit              Keras, Tensorflow, MXNet
Tensorflow        iOS/Android   Yes   CNN, RNN, LSTM, etc.          Tensorflow
Caffe2            iOS/Android   Yes   CNN                           Caffe2, CNTK, PyTorch
Snapdragon NPE    Android       Yes   CNN, RNN, LSTM                Caffe, Caffe2, Tensorflow
CNNDroid          Android       Yes   CNN                           Caffe, Torch, Theano
DeepLearningKit   iOS           Yes   CNN                           Caffe
MXNet             iOS/Android   No    CNN, RNN, LSTM, etc.          MXNet
Torch             iOS/Android   No    CNN, RNN, LSTM, etc.          Torch
Possible Long Term Route for fastest speed on each phone
Train a model using your favorite DNN library
Import it into Tensorflow/Keras
For iOS :
Use Keras and CoreML
For Android :
For Qualcomm chips (~50% of Android phones) :
Use the Snapdragon NPE
For remaining phones :
Use Tensorflow Mobile
[Decision flow] Model (Tensorflow format) -> iOS: Keras + CoreML; Android with Qualcomm chips: Snapdragon NPE; remaining Android phones: Tensorflow Mobile
Building a DL App in 1 week
Learn Playing an Accordion
3 months
Learn Playing an Accordion
3 months
Knows Piano
Fine Tune Skills
1 week
I got a dataset, Now What?
Step 1 : Find a pre-trained model
Step 2 : Fine tune a pre-trained model
Step 3 : Run using existing frameworks
“Don’t Be A Hero”
- Andrej Karpathy
How to find pretrained models for my task?
Search “Model Zoo”
Microsoft Cognitive Toolkit (previously called CNTK) – 50 Models
Caffe Model Zoo
Keras
Tensorflow
MXNet
AlexNet, 2012 (simplified)
[Krizhevsky, Sutskever, Hinton '12]
[Diagram: convolutional layers building up to an n-dimensional feature representation]
Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng, "Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks", 2011
Deciding how to fine tune

Size of New Dataset   Similarity to Original Dataset   What to do?
Large                 High                             Fine tune.
Small                 High                             Don't fine tune, it will overfit. Train a linear classifier on CNN features.
Small                 Low                              Train a classifier from activations in lower layers; higher layers are specific to the older dataset.
Large                 Low                              Train the CNN from scratch.

http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html
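For the "large dataset, high similarity" row above, a minimal Keras fine-tuning sketch; the class count, datasets and hyperparameters are assumptions for illustration:

import tensorflow as tf

# Start from an ImageNet-pretrained base and replace the classifier head.
base = tf.keras.applications.InceptionV3(weights='imagenet', include_top=False, pooling='avg')
base.trainable = False                                    # pass 1: train only the new head

outputs = tf.keras.layers.Dense(5, activation='softmax')(base.output)   # 5 = your class count
model = tf.keras.Model(base.input, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(train_ds, validation_data=val_ds, epochs=5)

# Pass 2 (optional): unfreeze the base and fine-tune with a much smaller learning rate.
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(train_ds, validation_data=val_ds, epochs=5)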
CoreML exporter from customvision.ai – Drag and drop training
5-minute shortcut to training, fine-tuning and getting a model ready in CoreML format
Drag and drop interface
Building a DL Website in 1 week
Less Data + Smaller Networks = Faster browser training
Several JavaScript Libraries
Run large CNNs
• Tensorfire
• WebDNN
• Keras-JS
• MXNetJS
• CaffeJS
Train and Run CNNs
• DeepLearn.js
• ConvNetJS
Train and Run LSTMs
• Brain.js
• Synaptic.js
Train and Run NNs
• Mind.js
• DN2A
ConvNetJS – Train + Infer on CPU
Both Train and Test NNs in browser
Train CNNs in browser
DeepLearn.js – Train + Infer on GPU
Uses WebGL to perform computation on GPU (including backprop)
Immediate execution model for inference (like Numpy)
Delayed execution model for training (like TensorFlow)
Upcoming tools to export weights from Tensorflow checkpoints
Tensorfire – Infer on GPU
Import models from Keras/Tensorflow
Any GPU works (including AMD); runs faster than TensorFlow on a MacBook Pro, in the browser
Supports low-precision math
Transforms NN weights into WebGL textures for speedup
Similar library : WebDNN.js
Keras.js
Run Keras models in browser, with GPU support.
Brain.JS
Train and run NNs in browser
Supports Feedforward, RNN, LSTM, GRU
No CNNs
Demo : http://brainjs.com/
Trained NN to recognize color contrast
MXNetJS
On Firefox and Microsoft Edge, performance is 8x faster than on Chrome, due to differences in ASM.js optimization.
Building a Crowdsourced Data Collector
in 1 month
Barcode recognition from Seeing AI
Aim : Help blind users identify products using barcodes
Issue : Blind users don't know where the barcode is
Live : Guide the user in finding the barcode with audio cues
With Server : Decode the barcode to identify the product
Tech : MPSCNN running on mobile GPU + barcode library
Metrics : 40 FPS (~25 ms) on iPhone 7
Currency recognition from Seeing AI
Aim : Identify currency
Live : Identify the denomination of paper currency instantly
With Server : -
Tech : Task-specific CNN running on mobile GPU
Metrics : 40 FPS (~25 ms) on iPhone 7
Training Data Collection App
Request volunteers to take photos of objects
in non-obvious settings
Sends photos to cloud, trains model nightly
Newsletter shows the best photos from volunteers
Let them compete for fame
Daily challenge - Collected by volunteers
Daily challenge - Collected by volunteers
Challenge: Can you fool a Deep Neural Network?
Challenge users to find flaws in DNN
Helps train a robust classifier with far fewer photos
Building a DL App in 6 months
What you want ($200,000) vs. what you can afford ($2,000)
https://www.flickr.com/photos/kenjonbro/9075514760/ and http://www.newcars.com/land-rover/range-rover-sport/2016
Revolution of Depth
[Architecture diagrams] AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); GoogleNet, 22 layers (ILSVRC 2014)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition", 2015
Revolution of Depth
[Architecture diagrams] AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); ResNet, 152 layers, ultra deep (ILSVRC 2015)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition", 2015
Revolution of Depth vs Classification Accuracy
ImageNet classification top-5 error (%):
ILSVRC'10 (shallow): 28.2
ILSVRC'11 (shallow): 25.8
ILSVRC'12 AlexNet (8 layers): 16.4
ILSVRC'13 (8 layers): 11.7
ILSVRC'14 VGG (19 layers): 7.3
ILSVRC'14 GoogleNet (22 layers): 6.7
ILSVRC'15 ResNet (152 layers): 3.6
ILSVRC'16 Ensemble of ResNet, Inception and Wide Residual Network: 2.9
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition", 2015
Your Budget - Smartphone Floating Point Operations Per Second (2015)
http://pages.experts-exchange.com/processing-power-compared/
iPhone X is more powerful than a Macbook Pro
https://thenextweb.com/apple/2017/09/12/apples-new-iphone-x-already-destroying-android-devices-g/
Accuracy vs Operations Per Image Inference
[Chart: circle size is proportional to the number of parameters; VGG-16 is 552 MB, AlexNet is 240 MB]
Alfredo Canziani, Adam Paszke, Eugenio Culurciello, "An Analysis of Deep Neural Network Models for Practical Applications", 2016
What we want
Accuracy Per Parameter
Alfredo Canziani, Adam Paszke, Eugenio Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications” 2016
Pick your DNN Architecture for your mobile architecture
Resnet Family
Under 64 ms on iPhone 7 using Metal GPU
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, "Deep Residual Learning for Image Recognition”, 2015
CoreML Benchmark - Pick your DNN for your mobile architecture

Model          Top-1 Accuracy (%)   Size (MB)   Million Mult-Adds   iPhone 6 (ms)   iPhone 6S (ms)   iPhone 7 (ms)
VGG 16         71                   553         15300               4556            254              208
Inception v3   78                   95          5000                637             98               90
Resnet 50      75                   103         3900                557             72               64
MobileNet      71                   17          569                 109             52               32
SqueezeNet     57                   5           1700                78              29               24
MobileNet family
Splits each convolution into a 3x3 depthwise convolution and a 1x1 pointwise convolution
Tune with two parameters – width multiplier and resolution multiplier
Andrew G. Howard et al, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications", 2017
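A small tf.keras sketch of that split; the filter counts and input shape are arbitrary illustration values:

import tensorflow as tf

# Standard 3x3 convolution: 3*3*128*256 = 294,912 weights.
standard = tf.keras.layers.Conv2D(256, 3, padding='same')

# MobileNet-style factorization: 3*3*128 + 1*1*128*256 = 33,920 weights (~8.7x fewer).
depthwise = tf.keras.layers.DepthwiseConv2D(3, padding='same')   # one 3x3 filter per input channel
pointwise = tf.keras.layers.Conv2D(256, 1, padding='same')       # 1x1 conv mixes the channels

x = tf.random.normal((1, 32, 32, 128))
print(standard(x).shape)               # (1, 32, 32, 256)
print(pointwise(depthwise(x)).shape)   # (1, 32, 32, 256) - same output shape, far fewer weights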
Comparison for DNN architectures for Object Detection
Jonathan Huang et al, "Speed/accuracy trade-offs for modern convolutional object detectors”, 2017
Strategies to make DNNs even more efficient
Shallow networks
Compressing pre-trained networks
Designing compact layers
Quantizing parameters
Network binarization
Pruning
Aim : Remove all connections with absolute weights below a threshold
Song Han, Jeff Pool, John Tran, William J. Dally, "Learning both Weights and Connections for Efficient Neural Networks", 2015
Observation : Most parameters are in the Fully Connected Layers
AlexNet (240 MB): ~96% of all parameters
VGG-16 (552 MB): ~90% of all parameters
Pruning gives the quickest model compression without accuracy loss (AlexNet: 240 MB, VGG-16: 552 MB)
The first layer, which directly interacts with the image, is sensitive and cannot be pruned much without hurting accuracy
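A minimal numpy sketch of magnitude pruning for a single weight matrix; the 90% pruning ratio is an illustrative assumption:

import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(4096, 4096)).astype(np.float32)   # e.g. one fully connected layer

# Zero out every connection whose absolute weight falls below a threshold;
# here the threshold is chosen so that 90% of the weights are removed.
threshold = np.quantile(np.abs(weights), 0.90)
mask = np.abs(weights) >= threshold
pruned = weights * mask

print("fraction of weights kept:", mask.mean())   # ~0.10
# Stored sparsely, only the surviving weights (plus indices) are kept, and the
# mask is reused during fine-tuning so pruned connections stay at zero.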
Weight Sharing
Idea : Cluster weights with similar values together and store them in a dictionary (codebook)
Related techniques : codebook quantization, Huffman coding, HashedNets
Simplest implementation:
• Round all weights into 256 levels
• A Tensorflow export script reduces the Inception zip file from 87 MB to 26 MB with a ~1% drop in precision
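A numpy sketch of the 256-level variant: quantize one layer's weights to 8-bit indices into a 256-entry codebook (the layer shape is an arbitrary example):

import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.05, size=(1024, 1024)).astype(np.float32)

# Build a 256-entry codebook spanning the weight range; every weight is stored
# as a 1-byte index into it, so many weights share the same stored value.
lo, hi = float(weights.min()), float(weights.max())
codebook = np.linspace(lo, hi, 256, dtype=np.float32)
indices = np.round((weights - lo) / (hi - lo) * 255).astype(np.uint8)
restored = codebook[indices]                       # dequantized weights used at inference

print("max quantization error:", np.abs(weights - restored).max())
print("%d KB -> %d KB" % (weights.nbytes // 1024, indices.nbytes // 1024))   # 4x smaller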
Selective training to keep networks shallow
Idea : Limit data augmentation to how your network will actually be used
Example : If making a selfie app, there is no benefit in rotating training images beyond ±45 degrees; the phone will rotate the image anyway. (Approach also followed by WordLens / Google Translate)
Example : Add blur if analyzing mobile phone frames
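A sketch of the selfie example using the Keras augmentation pipeline, capping rotation at 45 degrees; the directory name and the other augmentation settings are assumptions:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augment only in ways the deployed app will actually see:
# a selfie stays roughly upright, so rotation is capped at +-45 degrees.
datagen = ImageDataGenerator(
    rotation_range=45,          # not 180: the phone auto-rotates beyond this
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
    rescale=1.0 / 255)

# train_gen = datagen.flow_from_directory('selfies/', target_size=(224, 224), batch_size=32)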
Design consideration for custom architectures – Small Filters
Three layers of 3x3 convolutions >> one layer of 7x7 convolution
Replace large 5x5 and 7x7 convolutions with stacks of 3x3 convolutions
Replace NxN convolutions with a stack of 1xN and Nx1 convolutions
Fewer parameters, less compute, more non-linearity
Better, Faster, Stronger
Andrej Karpathy, CS-231n Notes, Lecture 11
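The parameter arithmetic behind the first claim, for C input and C output channels (biases ignored):

# Stacking three 3x3 convolutions covers the same 7x7 receptive field
# with fewer parameters and two extra non-linearities in between.
C = 256
one_7x7 = 7 * 7 * C * C            # 3,211,264 weights
three_3x3 = 3 * (3 * 3 * C * C)    # 1,769,472 weights
print(one_7x7 / three_3x3)         # ~1.8x fewer parameters for the stacked version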
SqueezeNet - AlexNet-level accuracy in 0.5 MB
SqueezeNet base 4.8 MB
SqueezeNet compressed 0.5 MB
80.3% top-5 Accuracy on ImageNet
0.72 GFLOPS/image
Fire Block
Forrest N. Iandola, Song Han et al, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size"
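A hedged tf.keras sketch of a Fire block: a 1x1 "squeeze" convolution feeding parallel 1x1 and 3x3 "expand" convolutions whose outputs are concatenated (the filter counts and input shape are illustrative):

import tensorflow as tf

def fire_block(x, squeeze_filters=16, expand_filters=64):
    # Squeeze: cheap 1x1 conv cuts the channel count before the expensive 3x3.
    s = tf.keras.layers.Conv2D(squeeze_filters, 1, activation='relu', padding='same')(x)
    # Expand: parallel 1x1 and 3x3 convs, concatenated along the channel axis.
    e1 = tf.keras.layers.Conv2D(expand_filters, 1, activation='relu', padding='same')(s)
    e3 = tf.keras.layers.Conv2D(expand_filters, 3, activation='relu', padding='same')(s)
    return tf.keras.layers.Concatenate()([e1, e3])

inputs = tf.keras.Input(shape=(56, 56, 96))
model = tf.keras.Model(inputs, fire_block(inputs))
print(model.count_params())   # 11,920 parameters for this block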
Reduced precision
Reduce precision from 32 bits to 16 bits or fewer
Use stochastic rounding for best results
In practice:
• Ristretto + Caffe: automatic network quantization, finds a balance between compression rate and accuracy
• Apple Metal Performance Shaders automatically quantize to 16 bits
• Tensorflow has 8-bit quantization support
• gemmlowp – a low-precision matrix multiplication library
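The simplest version of this in numpy: cast 32-bit weights down to 16-bit floats and check the storage saving and rounding error (the layer size is illustrative):

import numpy as np

rng = np.random.default_rng(0)
w32 = rng.normal(scale=0.05, size=(1000, 1000)).astype(np.float32)

w16 = w32.astype(np.float16)                          # half the bytes per weight
max_err = np.abs(w32 - w16.astype(np.float32)).max()

print("%.1f MB -> %.1f MB" % (w32.nbytes / 1e6, w16.nbytes / 1e6))   # 4.0 MB -> 2.0 MB
print("max rounding error:", max_err)                 # tiny compared with the weight scale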
Binary weighted Networks
Idea : Reduce the weights to -1, +1
Speedup : The convolution operation can be approximated by only summation and subtraction
Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks"
XNOR-Net
Idea : Reduce both weights and inputs to -1, +1
Speedup : The convolution operation can be approximated by XNOR and bitcount operations
Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks"
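A numpy sketch of the weight-binarization step: approximate each filter by its sign times a per-filter scale alpha = mean(|W|); the filter shape is arbitrary:

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 3, 3, 3)).astype(np.float32)    # 64 filters of shape 3x3x3

alpha = np.abs(W).reshape(64, -1).mean(axis=1)            # one real-valued scale per filter
B = np.sign(W)                                            # weights collapse to -1 / +1
W_approx = alpha[:, None, None, None] * B                 # W is approximated by alpha * B

print("mean approximation error:", np.abs(W - W_approx).mean())
# With B in {-1, +1}, convolution needs only additions and subtractions;
# XNOR-Net additionally binarizes the inputs, reducing it to XNOR + bitcount.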
XNOR-Net on Mobile
Challenges
Off-the-shelf CNNs are not robust for video
Solutions:
• Collective confidence over several frames
• CortexNet
Building a DL App and getting
$10 million in funding
(or a PhD)
Minerva
DeepX Toolkit
Nicholas D. Lane et al, “DXTK : Enabling Resource-efficient Deep Learning on Mobile and Embedded Devices with the DeepX Toolkit",2016
EIE : Efficient Inference Engine on Compressed DNNs
Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark Horowitz, William Dally, "EIE: Efficient Inference Engine on Compressed Deep Neural Network", 2016
189x faster on CPU
13x faster on GPU
One Last Question
How to access the slides in 1 second
Link posted here -> @anirudhkoul
Contenu connexe

Tendances

Introduction to Deep Learning
Introduction to Deep LearningIntroduction to Deep Learning
Introduction to Deep LearningOleg Mygryn
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine LearningYuriy Guts
 
Introduction To TensorFlow
Introduction To TensorFlowIntroduction To TensorFlow
Introduction To TensorFlowSpotle.ai
 
Image classification using CNN
Image classification using CNNImage classification using CNN
Image classification using CNNNoura Hussein
 
Introduction to Deep Learning
Introduction to Deep LearningIntroduction to Deep Learning
Introduction to Deep LearningOswald Campesato
 
Convolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep LearningConvolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep LearningMohamed Loey
 
How to use Apache TVM to optimize your ML models
How to use Apache TVM to optimize your ML modelsHow to use Apache TVM to optimize your ML models
How to use Apache TVM to optimize your ML modelsDatabricks
 
Introduction to Few shot learning
Introduction to Few shot learningIntroduction to Few shot learning
Introduction to Few shot learningRidge-i, Inc.
 
Introduction to Deep learning
Introduction to Deep learningIntroduction to Deep learning
Introduction to Deep learningleopauly
 
Deep learning for medical imaging
Deep learning for medical imagingDeep learning for medical imaging
Deep learning for medical imaginggeetachauhan
 
Deep Learning With Neural Networks
Deep Learning With Neural NetworksDeep Learning With Neural Networks
Deep Learning With Neural NetworksAniket Maurya
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networksSi Haem
 
Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Databricks
 
Deep learning - A Visual Introduction
Deep learning - A Visual IntroductionDeep learning - A Visual Introduction
Deep learning - A Visual IntroductionLukas Masuch
 
Intro to Deep Learning for Medical Image Analysis, with Dan Lee from Dentuit AI
Intro to Deep Learning for Medical Image Analysis, with Dan Lee from Dentuit AIIntro to Deep Learning for Medical Image Analysis, with Dan Lee from Dentuit AI
Intro to Deep Learning for Medical Image Analysis, with Dan Lee from Dentuit AISeth Grimes
 
TensorFlow Lite for mobile & IoT
TensorFlow Lite for mobile & IoT   TensorFlow Lite for mobile & IoT
TensorFlow Lite for mobile & IoT Mia Chang
 
What is Deep Learning | Deep Learning Simplified | Deep Learning Tutorial | E...
What is Deep Learning | Deep Learning Simplified | Deep Learning Tutorial | E...What is Deep Learning | Deep Learning Simplified | Deep Learning Tutorial | E...
What is Deep Learning | Deep Learning Simplified | Deep Learning Tutorial | E...Edureka!
 
TensorFlow and Keras: An Overview
TensorFlow and Keras: An OverviewTensorFlow and Keras: An Overview
TensorFlow and Keras: An OverviewPoo Kuan Hoong
 

Tendances (20)

Introduction to Deep Learning
Introduction to Deep LearningIntroduction to Deep Learning
Introduction to Deep Learning
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine Learning
 
Introduction To TensorFlow
Introduction To TensorFlowIntroduction To TensorFlow
Introduction To TensorFlow
 
Image classification using CNN
Image classification using CNNImage classification using CNN
Image classification using CNN
 
Introduction to Deep Learning
Introduction to Deep LearningIntroduction to Deep Learning
Introduction to Deep Learning
 
Convolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep LearningConvolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep Learning
 
How to use Apache TVM to optimize your ML models
How to use Apache TVM to optimize your ML modelsHow to use Apache TVM to optimize your ML models
How to use Apache TVM to optimize your ML models
 
Introduction to Few shot learning
Introduction to Few shot learningIntroduction to Few shot learning
Introduction to Few shot learning
 
Introduction to Deep learning
Introduction to Deep learningIntroduction to Deep learning
Introduction to Deep learning
 
Deep learning for medical imaging
Deep learning for medical imagingDeep learning for medical imaging
Deep learning for medical imaging
 
Deep Learning With Neural Networks
Deep Learning With Neural NetworksDeep Learning With Neural Networks
Deep Learning With Neural Networks
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networks
 
Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0
 
Keras and TensorFlow
Keras and TensorFlowKeras and TensorFlow
Keras and TensorFlow
 
Deep learning - A Visual Introduction
Deep learning - A Visual IntroductionDeep learning - A Visual Introduction
Deep learning - A Visual Introduction
 
Intro to Deep Learning for Medical Image Analysis, with Dan Lee from Dentuit AI
Intro to Deep Learning for Medical Image Analysis, with Dan Lee from Dentuit AIIntro to Deep Learning for Medical Image Analysis, with Dan Lee from Dentuit AI
Intro to Deep Learning for Medical Image Analysis, with Dan Lee from Dentuit AI
 
TensorFlow Lite for mobile & IoT
TensorFlow Lite for mobile & IoT   TensorFlow Lite for mobile & IoT
TensorFlow Lite for mobile & IoT
 
What is Deep Learning | Deep Learning Simplified | Deep Learning Tutorial | E...
What is Deep Learning | Deep Learning Simplified | Deep Learning Tutorial | E...What is Deep Learning | Deep Learning Simplified | Deep Learning Tutorial | E...
What is Deep Learning | Deep Learning Simplified | Deep Learning Tutorial | E...
 
TensorFlow and Keras: An Overview
TensorFlow and Keras: An OverviewTensorFlow and Keras: An Overview
TensorFlow and Keras: An Overview
 
LSTM Tutorial
LSTM TutorialLSTM Tutorial
LSTM Tutorial
 

Similaire à Squeezing Deep Learning Into Mobile Phones

Deep learning on mobile
Deep learning on mobileDeep learning on mobile
Deep learning on mobileAnirudh Koul
 
1645 goldenberg using our laptop
1645 goldenberg using our laptop1645 goldenberg using our laptop
1645 goldenberg using our laptopRising Media, Inc.
 
Deep Learning on Qubole Data Platform
Deep Learning on Qubole Data PlatformDeep Learning on Qubole Data Platform
Deep Learning on Qubole Data PlatformShivaji Dutta
 
The deep learning tour - Q1 2017
The deep learning tour - Q1 2017 The deep learning tour - Q1 2017
The deep learning tour - Q1 2017 Eran Shlomo
 
Anomaly Detection with Azure and .NET
Anomaly Detection with Azure and .NETAnomaly Detection with Azure and .NET
Anomaly Detection with Azure and .NETMarco Parenzan
 
TECHNICAL OVERVIEW NVIDIA DEEP LEARNING PLATFORM Giant Leaps in Performance ...
TECHNICAL OVERVIEW NVIDIA DEEP  LEARNING PLATFORM Giant Leaps in Performance ...TECHNICAL OVERVIEW NVIDIA DEEP  LEARNING PLATFORM Giant Leaps in Performance ...
TECHNICAL OVERVIEW NVIDIA DEEP LEARNING PLATFORM Giant Leaps in Performance ...Willy Marroquin (WillyDevNET)
 
OS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLOS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLNordic APIs
 
Deep learning for FinTech
Deep learning for FinTechDeep learning for FinTech
Deep learning for FinTechgeetachauhan
 
Microsoft AI Platform Overview
Microsoft AI Platform OverviewMicrosoft AI Platform Overview
Microsoft AI Platform OverviewDavid Chou
 
Anomaly Detection with Azure and .net
Anomaly Detection with Azure and .netAnomaly Detection with Azure and .net
Anomaly Detection with Azure and .netMarco Parenzan
 
Faster deep learning solutions from training to inference - Michele Tameni - ...
Faster deep learning solutions from training to inference - Michele Tameni - ...Faster deep learning solutions from training to inference - Michele Tameni - ...
Faster deep learning solutions from training to inference - Michele Tameni - ...Codemotion
 
Deep Learning at the Edge
Deep Learning at the EdgeDeep Learning at the Edge
Deep Learning at the EdgeJulien SIMON
 
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)Amazon Web Services
 
Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow Jen Aman
 
모두를 위한 MxNET - AWS Summit Seoul 2017
모두를 위한 MxNET - AWS Summit Seoul 2017모두를 위한 MxNET - AWS Summit Seoul 2017
모두를 위한 MxNET - AWS Summit Seoul 2017Amazon Web Services Korea
 
Thinking in parallel ab tuladev
Thinking in parallel ab tuladevThinking in parallel ab tuladev
Thinking in parallel ab tuladevPavel Tsukanov
 
DotNet Conf Madrid 2019 - Whats New in ML.NET
DotNet Conf Madrid 2019 - Whats New in ML.NETDotNet Conf Madrid 2019 - Whats New in ML.NET
DotNet Conf Madrid 2019 - Whats New in ML.NETAlberto Diaz Martin
 

Similaire à Squeezing Deep Learning Into Mobile Phones (20)

Deep learning on mobile
Deep learning on mobileDeep learning on mobile
Deep learning on mobile
 
Amazon Deep Learning
Amazon Deep LearningAmazon Deep Learning
Amazon Deep Learning
 
1645 goldenberg using our laptop
1645 goldenberg using our laptop1645 goldenberg using our laptop
1645 goldenberg using our laptop
 
Deep Learning on Qubole Data Platform
Deep Learning on Qubole Data PlatformDeep Learning on Qubole Data Platform
Deep Learning on Qubole Data Platform
 
The deep learning tour - Q1 2017
The deep learning tour - Q1 2017 The deep learning tour - Q1 2017
The deep learning tour - Q1 2017
 
Anomaly Detection with Azure and .NET
Anomaly Detection with Azure and .NETAnomaly Detection with Azure and .NET
Anomaly Detection with Azure and .NET
 
TECHNICAL OVERVIEW NVIDIA DEEP LEARNING PLATFORM Giant Leaps in Performance ...
TECHNICAL OVERVIEW NVIDIA DEEP  LEARNING PLATFORM Giant Leaps in Performance ...TECHNICAL OVERVIEW NVIDIA DEEP  LEARNING PLATFORM Giant Leaps in Performance ...
TECHNICAL OVERVIEW NVIDIA DEEP LEARNING PLATFORM Giant Leaps in Performance ...
 
OS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLOS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of ML
 
Deep learning for FinTech
Deep learning for FinTechDeep learning for FinTech
Deep learning for FinTech
 
Deep Learning on ECS
Deep Learning on ECSDeep Learning on ECS
Deep Learning on ECS
 
Microsoft AI Platform Overview
Microsoft AI Platform OverviewMicrosoft AI Platform Overview
Microsoft AI Platform Overview
 
Anomaly Detection with Azure and .net
Anomaly Detection with Azure and .netAnomaly Detection with Azure and .net
Anomaly Detection with Azure and .net
 
Faster deep learning solutions from training to inference - Michele Tameni - ...
Faster deep learning solutions from training to inference - Michele Tameni - ...Faster deep learning solutions from training to inference - Michele Tameni - ...
Faster deep learning solutions from training to inference - Michele Tameni - ...
 
Deep Learning at the Edge
Deep Learning at the EdgeDeep Learning at the Edge
Deep Learning at the Edge
 
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)
 
Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow
 
모두를 위한 MxNET - AWS Summit Seoul 2017
모두를 위한 MxNET - AWS Summit Seoul 2017모두를 위한 MxNET - AWS Summit Seoul 2017
모두를 위한 MxNET - AWS Summit Seoul 2017
 
Thinking in parallel ab tuladev
Thinking in parallel ab tuladevThinking in parallel ab tuladev
Thinking in parallel ab tuladev
 
Persian MNIST in 5 Minutes
Persian MNIST in 5 MinutesPersian MNIST in 5 Minutes
Persian MNIST in 5 Minutes
 
DotNet Conf Madrid 2019 - Whats New in ML.NET
DotNet Conf Madrid 2019 - Whats New in ML.NETDotNet Conf Madrid 2019 - Whats New in ML.NET
DotNet Conf Madrid 2019 - Whats New in ML.NET
 

Dernier

Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 

Dernier (20)

Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 

Squeezing Deep Learning Into Mobile Phones

  • 1. Squeezing Deep Learning into mobile phones - A Practitioners guide Anirudh Koul
  • 2. Anirudh Koul , @anirudhkoul , http://koul.ai Project Lead, Seeing AI (SeeingAI.com) Applied Researcher, Microsoft AI & Research Akoul at Microsoft dot com Currently working on applying artificial intelligence for Hololens, autonomous robots and accessibility Along with Eugene Seleznev, Saqib Shaikh, Meher Kasam
  • 3. Why Deep Learning On Mobile? Latency Privacy
  • 4. Response Time Limits – Powers of 10 0.1 second : Reacting instantly 1.0 seconds : User’s flow of thought 10 seconds : Keeping the user’s attention [Miller 1968; Card et al. 1991; Jakob Nielsen 1993]:
  • 5. Mobile Deep Learning Recipe Mobile Inference Engine + Pretrained Model = DL App (Efficient) (Efficient)
  • 6. Building a DL App in _ time
  • 7. Building a DL App in 1 hour
  • 8. Use Cloud APIs Microsoft Cognitive Services Clarifai Google Cloud Vision IBM Watson Services Amazon Rekognition Tip : Resize to 224x224 at under 50% compression with bilinear interpolation before network transmission But don’t resize for Text / OCR projects
  • 9. Microsoft Cognitive Services Models won the 2015 ImageNet Large Scale Visual Recognition Challenge Vision, Face, Emotion, Video and 21 other topics
  • 10. Custom Vision Service (customvision.ai) – Drag and drop training Tip : Upload 30 photos per class for make prototype model Upload 200 photos per class for more robust production model More distinct the shape/type of object, lesser images required.
  • 11. Custom Vision Service (customvision.ai) – Drag and drop training Tip : Use Fatkun Browser Extension to download images from Search Engine, or use Bing Image Search API to programmatically download photos with proper rights
  • 12. Building a DL App in 1 day
  • 14. Base PreTrained Model ImageNet – 1000 Object Categorizer Inception Resnet
  • 15. Running pre-trained models on mobile Core ML Tensorflow Caffe2 Snapdragon Neural Processing Engine MXNet CNNDroid DeepLearningKit Torch
  • 16. Core ML From Apple, for iOS 11 Convert Caffe/Tensorflow model to CoreML model in 3 lines: import coremltools coreml_model = coremltools.converters.caffe.convert('my_caffe_model.caffemodel’) coreml_model.save('my_model.mlmodel’) Add model to iOS project and call for prediction. Direct support for Keras, Caffe, scikit-learn, XGBoost, LibSVM Builds on top of low-level primitives Accelerate, BNNS, Metal Performance Shaders (MPS) Noticable speedup between MPS (iOS 10) vs CoreML implementation (iOS 11) (same model, same hardware) Automatically minimizes memory footprint and power consumption
  • 17. CoreML Benchmark - Pick a DNN for your mobile architecture Model Top-1 Accuracy Size of Model (MB) iPhone 6 Execution Time (ms) iPhone 6S Execution Time (ms) iPhone 7 Execution Time (ms) VGG 16 71 553 4556 254 208 Inception v3 78 95 637 98 90 Resnet 50 75 103 557 72 64 MobileNet 71 17 109 52 32 SqueezeNet 57 5 78 29 24 2014 2015 2016 Huge improvement in hardware in 2015
  • 18. Putting out more frames than an art gallery
  • 19. Tensorflow Easy pipeline to bring Tensorflow models to mobile Excellent documentation Optimizations to bring model to mobile Upcoming : XLA (Accelerated Linear Algebra) compiler to optimize for hardware
  • 20. Caffe2 From Facebook Under 1 MB of binary size Built for Speed : For ARM CPU : Uses NEON Kernels, NNPack For iPhone GPU : Uses Metal Performance Shaders and Metal For Android GPU : Uses Qualcomm Snapdragon NPE (4-5x speedup) ONNX format support to import models from CNTK/PyTorch
  • 21. Snapdragon Neural Processing Engine (NPE) SDK Like CoreML for Qualcomm Snapdragon chips Published speedup of 4-5x On about half the Android phones Identifies best target core for inference - GPU, DSP or CPU Customizable to choose between battery power and performance Supports importing models from Caffe, Caffe2, Tensorflow
  • 22. MXNET Amalgamation : Pack all the code in a single source file Pro: • Cross Platform (iOS, Android), Easy porting • Usable in any programming language Con: • CPU only, Slow https://github.com/Leliana/WhatsThis
  • 23. CNNdroid GPU accelerated CNNs for Android Supports Caffe, Torch and Theano models ~30-40x Speedup using mobile GPU vs CPU (AlexNet) Internally, CNNdroid expresses data parallelism for different layers, instead of leaving to the GPU’s hardware scheduler
  • 24. DeepLearningKit Platform : iOS, OS X and tvOS (Apple TV) DNN Type : CNNs models trained in Caffe Runs on mobile GPU, uses Metal Pro : Fast, directly ingests Caffe models Con : Unmaintained
  • 25. Running pre-trained models on mobile Mobile Library Platform GPU DNN Architecture Supported Trained Models Supported CoreML iOS Yes CNN, RNN, SciKit Keras, Tensorflow, MXNet Tensorflow iOS/Android Yes CNN,RNN,LSTM, etc Tensorflow Caffe2 iOS/Android Yes CNN Caffe2, CNTK, PyTorch Snapdragon NPE Android Yes CNN, RNN, LSTM Caffe, Caffe2, Tensorflow CNNDroid Android Yes CNN Caffe, Torch, Theano DeepLearningKit iOS Yes CNN Caffe MXNet iOS/Android No CNN,RNN,LSTM, etc MXNet Torch iOS/Android No CNN,RNN,LSTM, etc Torch
  • 26. Possible Long Term Route for fastest speed on each phone Train a model using your favorite DNN library Import it to Tensorflow/Keras with Tensorflow For iOS : Use Keras and CoreML For Android : For Qualcomm chips (~50% of android phones) : Using Snapdragon NPE For remaining phones Use Tensorflow Mobile Model (Tensorflow format) Keras + CoreML Snapdragon NPE Tensorflow Mobile iOS Android Qualcomm Chips Remaining
  • 27. Building a DL App in 1 week
  • 28. Learn Playing an Accordion 3 months
  • 29. Learn Playing an Accordion 3 months Knows Piano Fine Tune Skills 1 week
  • 30. I got a dataset, Now What? Step 1 : Find a pre-trained model Step 2 : Fine tune a pre-trained model Step 3 : Run using existing frameworks “Don’t Be A Hero” - Andrej Karpathy
  • 31. How to find pretrained models for my task? Search “Model Zoo” Microsoft Cognitive Toolkit (previously called CNTK) – 50 Models Caffe Model Zoo Keras Tensorflow MXNet
  • 32. AlexNet, 2012 (simplified) [Krizhevsky, Sutskever,Hinton’12] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng, “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”, 11 n-dimension Feature representation
  • 33. Deciding how to fine tune Size of New Dataset Similarity to Original Dataset What to do? Large High Fine tune. Small High Don’t Fine Tune, it will overfit. Train linear classifier on CNN Features Small Low Train a classifier from activations in lower layers. Higher layers are dataset specific to older dataset. Large Low Train CNN from scratch http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html
  • 34. Deciding when to fine tune Size of New Dataset Similarity to Original Dataset What to do? Large High Fine tune. Small High Don’t Fine Tune, it will overfit. Train linear classifier on CNN Features Small Low Train a classifier from activations in lower layers. Higher layers are dataset specific to older dataset. Large Low Train CNN from scratch http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html
  • 35. Deciding when to fine tune Size of New Dataset Similarity to Original Dataset What to do? Large High Fine tune. Small High Don’t Fine Tune, it will overfit. Train linear classifier on CNN Features Small Low Train a classifier from activations in lower layers. Higher layers are dataset specific to older dataset. Large Low Train CNN from scratch http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html
  • 36. Deciding when to fine tune Size of New Dataset Similarity to Original Dataset What to do? Large High Fine tune. Small High Don’t Fine Tune, it will overfit. Train linear classifier on CNN Features Small Low Train a classifier from activations in lower layers. Higher layers are dataset specific to older dataset. Large Low Train CNN from scratch http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html
  • 37. CoreML exporter from customvision.ai – Drag and drop training 5 minute shortcut to finetuning and getting model ready in CoreML format
  • 38. CoreML exporter from customvision.ai – Drag and drop training 5 minute shortcut to training, finetuning and getting model ready in CoreML format Drag and drop interface
  • 39. Building a DL Website in 1 week
  • 40. Less Data + Smaller Networks = Faster browser training
  • 41. Several JavaScript Libraries Run large CNNs • Tensorfire • WebDNN • Keras-JS • MXNetJS • CaffeJS Train and Run CNNs • DeepLearn.js • ConvNetJS Train and Run LSTMs • Brain.js • Synaptic.js Train and Run NNs • Mind.js • DN2A
  • 42. ConvNetJS – Train + Infer on CPU Both Train and Test NNs in browser Train CNNs in browser
  • 43. DeepLearn.js – Train + Infer on GPU a
  • 44. DeepLearn.js – Train + Infer on GPU Uses WebGL to perform computation on GPU (including backprop) Immediate execution model for inference (like Numpy) Delayed execution model for training (like TensorFlow) Upcoming tools to export weights from Tensorflow checkpoints
  • 45. Tensorfire – Infer on GPU Import models from Keras/Tensorflow Any GPU works (including AMD), runs faster than TensorFlow on Macbook Pro in browser Supports low-precision math Transforms NN weights into WebGL textures for speedup Similar library : WebDNN.js
  • 46. Keras.js Run Keras models in browser, with GPU support.
  • 47. Brain.JS Train and run NNs in browser Supports Feedforward, RNN, LSTM, GRU No CNNs Demo : http://brainjs.com/ Trained NN to recognize color contrast
  • 48. MXNetJS On Firefox and Microsoft Edge, performance is 8x faster than Chrome. Optimization difference because of ASM.js.
  • 49. Building a Crowdsourced Data Collector in 1 months
  • 50. Barcode recognition from Seeing AI Live Guide user in finding a barcode with audio cues With Server Decode barcode to identify product Tech MPSCNN running on mobile GPU + barcode library Metrics 40 FPS (~25 ms) on iPhone 7 Aim : Help blind users identify products using barcode Issue : Blind users don’t know where the barcode is
  • 51. Currency recognition from Seeing AI Aim : Identify currency Live Identify denomination of paper currency instantly With Server - Tech Task specific CNN running on mobile GPU Metrics 40 FPS (~25 ms) on iPhone 7
  • 52. Training Data Collection App Request volunteers to take photos of objects in non-obvious settings Sends photos to cloud, trains model nightly Newsletter shows the best photos from volunteers Let them compete for fame
  • 53. Daily challenge - Collected by volunteers
  • 54. Daily challenge - Collected by volunteers
  • 55. Challenge: Can you fool a Deep Neural Network? Challenge users to find flaws in DNN Helps trains a robust classifier with much lesser photos
  • 56. Building a DL App in 6 months
  • 57. What you want ($200,000) vs what you can afford ($2,000) https://www.flickr.com/photos/kenjonbro/9075514760/ and http://www.newcars.com/land-rover/range-rover-sport/2016
  • 58. 11x11 conv, 96, /4, pool/2 5x5 conv, 256, pool/2 3x3 conv, 384 3x3 conv, 384 3x3 conv, 256, pool/2 fc, 4096 fc, 4096 fc, 1000 AlexNet, 8 layers (ILSVRC 2012) Revolution of Depth Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015
  • 59. Revolution of Depth – AlexNet, 8 layers (ILSVRC 2012) vs VGG, 19 layers (ILSVRC 2014) vs GoogleNet, 22 layers (ILSVRC 2014). [Layer-by-layer architecture diagrams omitted] Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015
  • 60. Revolution of Depth – AlexNet, 8 layers (ILSVRC 2012) vs VGG, 19 layers (ILSVRC 2014) vs ResNet, 152 layers (ILSVRC 2015): ultra deep. [Layer-by-layer architecture diagrams omitted] Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015
  • 61. Revolution of Depth – ResNet, 152 layers. [Layer-by-layer architecture diagram omitted] Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015
  • 62. Revolution of Depth vs Classification Accuracy – ImageNet classification top-5 error (%): ILSVRC'10 28.2 (shallow), ILSVRC'11 25.8 (shallow), ILSVRC'12 AlexNet 16.4 (8 layers), ILSVRC'13 11.7, ILSVRC'14 VGG 7.3 (19 layers), ILSVRC'14 GoogleNet 6.7 (22 layers), ILSVRC'15 ResNet 3.6 (152 layers), ILSVRC'16 Ensemble 2.9 (ensemble of Resnet, Inception Resnet, Inception and Wide Residual Network). Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015
  • 63. Your Budget - Smartphone Floating Point Operations Per Second (2015) http://pages.experts-exchange.com/processing-power-compared/
  • 64. iPhone X is more powerful than a Macbook Pro https://thenextweb.com/apple/2017/09/12/apples-new-iphone-x-already-destroying-android-devices-g/
  • 65. Accuracy vs Operations Per Image Inference – marker size is proportional to the number of parameters (e.g. VGG ≈ 552 MB, AlexNet ≈ 240 MB); “what we want” is high accuracy at a low operation count. Alfredo Canziani, Adam Paszke, Eugenio Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications”, 2016
  • 66. Accuracy Per Parameter Alfredo Canziani, Adam Paszke, Eugenio Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications” 2016
  • 67. Pick your DNN Architecture for your mobile architecture Resnet Family Under 64 ms on iPhone 7 using Metal GPU Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, "Deep Residual Learning for Image Recognition”, 2015
  • 68. CoreML Benchmark - Pick your DNN for your mobile architecture
Model | Top-1 Accuracy | Size of Model (MB) | Million Mult-Adds | iPhone 6 Execution Time (ms) | iPhone 6S Execution Time (ms) | iPhone 7 Execution Time (ms)
VGG 16 | 71 | 553 | 15300 | 4556 | 254 | 208
Inception v3 | 78 | 95 | 5000 | 637 | 98 | 90
Resnet 50 | 75 | 103 | 3900 | 557 | 72 | 64
MobileNet | 71 | 17 | 569 | 109 | 52 | 32
SqueezeNet | 57 | 5 | 1700 | 78 | 29 | 24
  • 69. MobileNet family Splits the convolution into a 3x3 depthwise conv and a 1x1 pointwise conv Tune with two parameters – Width Multiplier and resolution multiplier Andrew G. Howard et al, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications”, 2017
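A minimal sketch of one MobileNet-style block, assuming a Keras version that provides DepthwiseConv2D: a 3x3 depthwise convolution followed by a 1x1 pointwise convolution, with the width multiplier alpha thinning the channel count. Details are simplified (plain ReLU instead of ReLU6, illustrative arguments).

from keras.layers import DepthwiseConv2D, Conv2D, BatchNormalization, Activation

def mobilenet_block(x, pointwise_filters, alpha=1.0, stride=1):
    """3x3 depthwise conv + 1x1 pointwise conv; alpha is the width multiplier."""
    x = DepthwiseConv2D((3, 3), strides=stride, padding='same', use_bias=False)(x)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = Conv2D(int(pointwise_filters * alpha), (1, 1), padding='same', use_bias=False)(x)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    return x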
  • 70. Comparison of DNN architectures for object detection Jonathan Huang et al, “Speed/accuracy trade-offs for modern convolutional object detectors”, 2017
  • 71. Strategies to make DNNs even more efficient Shallow networks Compressing pre-trained networks Designing compact layers Quantizing parameters Network binarization
  • 72. Pruning Aim : Remove all connections with absolute weights below a threshold Song Han, Jeff Pool, John Tran, William J. Dally, "Learning both Weights and Connections for Efficient Neural Networks", 2015
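A minimal NumPy sketch of the idea: zero out the smallest-magnitude weights until a target sparsity is reached, and keep the mask so that retraining only updates the surviving connections (real pipelines prune layer by layer and then retrain, as in the paper).

import numpy as np

def prune_by_magnitude(weights, sparsity=0.9):
    """Zero out the smallest-magnitude weights until `sparsity` fraction are zero."""
    threshold = np.percentile(np.abs(weights), sparsity * 100)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

w = np.random.randn(4096, 4096).astype(np.float32)     # e.g. a fully connected layer
pruned, mask = prune_by_magnitude(w, sparsity=0.9)
print('nonzero fraction:', np.count_nonzero(pruned) / float(pruned.size))   # ~0.1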
  • 73. Observation : Most parameters are in the fully connected layers AlexNet (240 MB) : 96% of all parameters in FC layers VGG-16 (552 MB) : 90% of all parameters in FC layers
  • 74. Pruning gives the quickest model compression without accuracy loss (AlexNet 240 MB, VGG-16 552 MB) The first layer, which directly interacts with the image, is sensitive and cannot be pruned too much without hurting accuracy
  • 75. Weight Sharing Idea : Cluster weights with similar values together, and store in a dictionary. Codebook Huffman coding HashedNets Simplest implementation: • Round all weights into 256 levels • Tensorflow export script reduces inception zip file from 87 MB to 26 MB with 1% drop in precision
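The "round all weights into 256 levels" trick can be sketched in NumPy: build a 256-entry codebook over the weight range and store 8-bit indices plus the small codebook instead of 32-bit floats, roughly a 4x size reduction before any further Huffman coding. This is a uniform-codebook sketch, not the exact TensorFlow export script mentioned above.

import numpy as np

def quantize_to_256_levels(weights):
    """Uniform 256-level codebook: store uint8 indices + a small float32 codebook."""
    w_min, w_max = weights.min(), weights.max()
    codebook = np.linspace(w_min, w_max, 256, dtype=np.float32)
    indices = np.round((weights - w_min) / (w_max - w_min) * 255).astype(np.uint8)
    return indices, codebook

def dequantize(indices, codebook):
    return codebook[indices]

w = np.random.randn(1000, 1000).astype(np.float32)
idx, book = quantize_to_256_levels(w)
print('max reconstruction error:', np.abs(dequantize(idx, book) - w).max())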
  • 76. Selective training to keep networks shallow Idea : Limit data augmentation to how your network will actually be used Example : If making a selfie app, there is no benefit in rotating training images beyond +-45 degrees; your phone will auto-rotate anyway. This approach is followed by Word Lens / Google Translate Example : Add blur if analyzing mobile phone frames
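For the selfie-app example, a hedged Keras data-augmentation configuration that stays within the conditions the app will actually see; the specific ranges below are illustrative placeholders, not tuned values.

from keras.preprocessing.image import ImageDataGenerator

selfie_augmenter = ImageDataGenerator(
    rotation_range=45,        # phones auto-rotate, so +-45 degrees is enough
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True)     # selfies are often mirrored

# generator = selfie_augmenter.flow_from_directory('data/train', target_size=(224, 224))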
  • 77. Design consideration for custom architectures – Small Filters Three layers of 3x3 convolutions >> One layer of 7x7 convolution Replace large 5x5, 7x7 convolutions with stacks of 3x3 convolutions Replace NxN convolutions with stack of 1xN and Nx1 Fewer parameters → Less compute → More non-linearity → Better Faster Stronger Andrej Karpathy, CS-231n Notes, Lecture 11
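A quick sanity check of the parameter claim: with C channels in and out, one 7x7 convolution costs 49·C² weights, while a stack of three 3x3 convolutions covers the same 7x7 receptive field with 27·C² weights and two extra non-linearities (C = 256 below is an illustrative value).

C = 256                                   # channels in and out (illustrative)
params_7x7 = 7 * 7 * C * C                # one 7x7 conv layer
params_3x3_stack = 3 * (3 * 3 * C * C)    # three stacked 3x3 conv layers

print(params_7x7, params_3x3_stack)       # 3211264 vs 1769472
print(params_3x3_stack / float(params_7x7))   # ~0.55x the parameters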
  • 78. SqueezeNet - AlexNet-level accuracy in 0.5 MB SqueezeNet base 4.8 MB SqueezeNet compressed 0.5 MB 80.3% top-5 Accuracy on ImageNet 0.72 GFLOPS/image Fire Block Forrest N. Iandola, Song Han et al, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size"
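A sketch of the Fire block in Keras: a 1x1 "squeeze" convolution feeding parallel 1x1 and 3x3 "expand" convolutions whose outputs are concatenated. The filter counts below follow the paper's fire2 module and are otherwise placeholders.

from keras.layers import Conv2D, concatenate

def fire_block(x, squeeze_filters=16, expand_filters=64):
    """SqueezeNet Fire block: 1x1 squeeze, then parallel 1x1 and 3x3 expands."""
    squeezed = Conv2D(squeeze_filters, (1, 1), padding='same', activation='relu')(x)
    expand_1x1 = Conv2D(expand_filters, (1, 1), padding='same', activation='relu')(squeezed)
    expand_3x3 = Conv2D(expand_filters, (3, 3), padding='same', activation='relu')(squeezed)
    return concatenate([expand_1x1, expand_3x3])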
  • 79. Reduced precision Reduce precision from 32 bits to 16 bits or fewer Use stochastic rounding for best results In Practice: • Ristretto + Caffe • Automatic network quantization • Finds a balance between compression rate and accuracy • Apple Metal Performance Shaders automatically quantize to 16 bits • Tensorflow has 8 bit quantization support • Gemmlowp – Low precision matrix multiplication library
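To get a feel for the 16-bit case, a tiny NumPy sketch: casting float32 weights to float16 halves the storage, and for typical weight magnitudes the rounding error is small. 8-bit schemes additionally need per-layer scaling, which tools like Ristretto or TensorFlow's quantization handle for you.

import numpy as np

w = np.random.randn(1024, 1024).astype(np.float32)    # pretend these are trained weights

w_fp16 = w.astype(np.float16)                          # 2 bytes per weight instead of 4
print('size: %.1f MB -> %.1f MB' % (w.nbytes / 1e6, w_fp16.nbytes / 1e6))
print('max abs rounding error:', np.abs(w - w_fp16.astype(np.float32)).max())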
  • 80. Binary weighted Networks Idea : Reduce the weights to -1,+1 Speedup : Convolution operation can be approximated by only summation and subtraction Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
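A NumPy sketch of the binary-weight approximation from the XNOR-Net paper: each filter W is replaced by alpha·sign(W), where alpha is the mean absolute value of the filter, so multiplications by W reduce to additions/subtractions plus a single scale. The filter shape below is illustrative.

import numpy as np

def binarize_weights(W):
    """Approximate W by alpha * sign(W), with alpha = mean(|W|) (per filter in practice)."""
    alpha = np.abs(W).mean()
    B = np.sign(W)
    B[B == 0] = 1                    # map sign(0) to +1 so every weight stays binary
    return alpha, B

W = np.random.randn(3, 3, 64)        # one conv filter (illustrative shape)
alpha, B = binarize_weights(W)
print('mean approximation error:', np.abs(W - alpha * B).mean())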
  • 83. XNOR-Net Idea : Reduce both weights + inputs to -1,+1 Speedup : Convolution operation can be approximated by XNOR and Bitcount operations Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
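A toy NumPy check of why this works: for vectors of -1/+1 values, the dot product equals 2·popcount(XNOR(a, b)) − n, so once both weights and inputs are binarized, convolutions can be computed with bitwise XNOR and bit counting.

import numpy as np

n = 64
a = np.random.choice([-1, 1], size=n)
b = np.random.choice([-1, 1], size=n)

a_bits = (a > 0)                    # encode +1 as True (bit 1), -1 as False (bit 0)
b_bits = (b > 0)

xnor = ~(a_bits ^ b_bits)           # True where the signs agree
popcount = np.count_nonzero(xnor)

print(a.dot(b), 2 * popcount - n)   # both print the same value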
  • 87. Challenges Off-the-shelf CNNs are not robust for video Solutions: • Collective confidence over several frames • CortexNet
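One simple way to implement "collective confidence over several frames": average the per-frame class probabilities over a short rolling window and only surface a prediction once the averaged confidence clears a threshold. The window size and threshold below are illustrative, not the exact scheme used in Seeing AI.

from collections import deque
import numpy as np

class FrameSmoother:
    """Average class probabilities over the last few frames before trusting a prediction."""
    def __init__(self, window=10, threshold=0.6):
        self.buffer = deque(maxlen=window)
        self.threshold = threshold

    def update(self, frame_probs):
        self.buffer.append(np.asarray(frame_probs, dtype=np.float32))
        mean_probs = np.mean(self.buffer, axis=0)
        best = int(np.argmax(mean_probs))
        if mean_probs[best] >= self.threshold:
            return best, float(mean_probs[best])    # stable, confident prediction
        return None, float(mean_probs[best])        # not confident yet, keep accumulating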
  • 88. Building a DL App and getting $10 million in funding (or a PhD)
  • 89.
  • 92. DeepX Toolkit Nicholas D. Lane et al, “DXTK : Enabling Resource-efficient Deep Learning on Mobile and Embedded Devices with the DeepX Toolkit”, 2016
  • 93. EIE : Efficient Inference Engine on Compressed DNNs Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark Horowitz, William Dally, "EIE: Efficient Inference Engine on Compressed Deep Neural Network", 2016 189x faster on CPU 13x faster on GPU
  • 95. How to access the slides in 1 second Link posted here -> @anirudhkoul

Editor's notes

  1. https://hongkongphooey.wordpress.com/2009/02/18/first-look-huawei-android-phone/ https://medium.com/@startuphackers/building-a-deep-learning-neural-network-startup-7032932e09c
  2. [Miller 1968; Card et al. 1991]: 0.1 second is about the limit for having the user feel that the system is reacting instantaneously, meaning that no special feedback is necessary except to display the result. 1.0 second is about the limit for the user's flow of thought to stay uninterrupted, even though the user will notice the delay. Normally, no special feedback is necessary during delays of more than 0.1 but less than 1.0 second, but the user does lose the feeling of operating directly on the data. 10 seconds is about the limit for keeping the user's attention focused on the dialogue. For longer delays, users will want to perform other tasks while waiting for the computer to finish, so they should be given feedback indicating when the computer expects to be done. Feedback during the delay is especially important if the response time is likely to be highly variable, since users will then not know what to expect.
  3. No, don’t do it right now. Do it in the next session.
  4. If you need a hand warmer on a cold day, I suggest you try training on a phone
  5. Core ML supports a variety of machine learning models, including neural networks, tree ensembles, support vector machines, and generalized linear models. Core ML requires the Core ML model format (models with a .mlmodel file extension). Apple provides several popular, open source models that are already in the Core ML model format. You can download these models and start using them in your app. Additionally, various research groups and universities publish their models and training data, which may not be in the Core ML model format. To use these models, you need to convert them, as described in Converting Trained Models to Core ML. Internally, Core ML does context switching between GPU and CPU. It uses Accelerate for the CPU (e.g. sentiment analysis) and the GPU (for image classification)
  6. Speedups : No need to decode JPEGs, directly deal with camera image buffers
  7. “Surprisingly, ARM CPUs outperform the on-board GPUs (our NNPACK ARM CPU implementation outperforms Apple’s MPSCNNConvolution for all devices except the iPhone 7). There are other advantages to offloading compute onto the GPU/DSP, and it’s an active work in progress to expose these in Caffe2.” Built for first-class support on phones from 2015 onwards; it also supports running on models from 2013 onwards. Uses NEON kernels for certain operations like transpose. Heavily uses NNPack, which is extremely fast for convolutions on ARM CPUs; even on phones from 2014 without usable GPUs, the ARM CPU will outperform. NNPack implements Winograd for convolution math, converting convolution to element-wise multiplication and reducing the number of FLOPs by about 2.5x; it also works on any ARM CPU, which doesn't limit you to cell phones. Under 1 MB of compiled binary size
  8. Running on the CPU all the time will cause high heat and battery drain. The DSP is low power, better suited for things that need to be on all the time, like voice commands
  9. Very memory efficient. MXNet can consume as little as 4 GB of memory when serving deep networks with as many as 1000 layers. Deep learning (DL) systems are complex and often have a number of dependencies. It is often painful to port a DL library to different platforms, especially for smart devices. There is one fun way to solve this problem: provide a light interface and put all required code into a single file with minimal dependencies. The idea of amalgamation comes from SQLite and other projects, which pack all code into a single source file. To create the library, you only need to compile that single file. This simplifies porting to various platforms. Thanks to Jack Deng, MXNet provides an amalgamation script that compiles all code needed for prediction based on trained DL models into a single .cc file, which has approximately 30K lines of code. The only dependency is a BLAS library. The compiled library can be used by any other programming language. By using amalgamation, we can easily port the prediction library to mobile devices, with nearly no dependencies. Compiling on a smart platform is no longer a painful task. After compiling the library for smart platforms, the last thing is to call the C API in the target language (Java/Swift). This does not use the GPU. It mentions a dependency on BLAS, because of which it seems it uses the CPU on mobile. BLAS (Basic Linear Algebraic Subprograms) is at the heart of AI computation. Because of the sheer amount of number-crunching involved in these complex models, the math routines must be optimized as much as possible. The computational firepower of GPUs makes them ideal processors for AI models. It appears that MXNet can use Atlas (libblas), OpenBLAS, and MKL. These are CPU-based libraries. Currently the main option for running BLAS on a GPU is CuBLAS, developed specifically for NVIDIA (CUDA) GPUs. Apparently MXNet can use CuBLAS in addition to the CPU libraries. The GPU in many mobile devices is a lower-power chip that works with ARM architectures, which don't have a dedicated BLAS library yet. What are my other options? Just go with the CPU. Since it's the training that's extremely compute-intensive, using the CPU for inference isn't the show-stopper you think it is. In OpenBLAS, the routines are written in assembly and hand-optimized for each CPU it can run on. This includes ARM. Using a C++-based framework like MXNet is probably the best choice if you are trying to go cross-platform.
  10. Different methods are employed in acceleration of different layers in CNNdroid. Convolution and fully connected layers, which are data-parallel and normally more compute intensive, are accelerated on the mobile GPU using RenderScript framework. A considerable portion of these two layers can be expressed as dot products. The dot products are more efficiently calculated on SIMD units of the target mobile GPU. Therefore, we divide the computation in many vector operations and use the pre-defined dot function of the RenderScript framework. In other words, we explicitly express this level of parallelism in software, and as opposed to CUDA-based desktop libraries, do not leave it to GPU’s hardware scheduler. Comparing with convolution and fully connected layers, other layers are relatively less compute intensive and not efficient on mobile GPU. Therefore, they are accelerated on multi-core mobile CPU via multi-threading. Since ReLU layer usually appears after a convolution or fully connected layer, it is embedded into its previous layer in order to increase the performance in cases where multiple images are fed to the CNNdroid engine
  11. You don't need Microsoft's ocean-boiling GPU cluster
  12. Learned hierarchical features from a deep learning algorithm. Each feature can be thought of as a filter, which filters the input image for that feature (a nose). If the feature is found, the responsible units generate large activations, which can be picked up by the later classifier stages as a good indicator that the class is present.
  13. In practice, we don’t usually train an entire DCNN from scratch with random initialization. This is because it is relatively rare to have a dataset of sufficient size that is required for the depth of network required. Instead, it is common to pre-train a DCNN on a very large dataset and then use the trained DCNN weights either as an initialization or a fixed feature extractor for the task of interest. Fine-Tuning: Transfer learning strategies depend on various factors, but the two most important ones are the size of the new dataset, and its similarity to the original dataset. Keeping in mind that DCNN features are more generic in early layers and more dataset-specific in later layers, there are four major scenarios: New dataset is smaller in size and similar in content compared to original dataset: If the data is small, it is not a good idea to fine-tune the DCNN due to overfitting concerns. Since the data is similar to the original data, we expect higher-level features in the DCNN to be relevant to this dataset as well. Hence, the best idea might be to train a linear classifier on the CNN-features. New dataset is relatively large in size and similar in content compared to the original dataset: Since we have more data, we can have more confidence that we would not over fit if we were to try to fine-tune through the full network. New dataset is smaller in size but very different in content compared to the original dataset: Since the data is small, it is likely best to only train a linear classifier. Since the dataset is very different, it might not be best to train the classifier from the top of the network, which contains more dataset-specific features. Instead, it might work better to train a classifier from activations somewhere earlier in the network. New dataset is relatively large in size and very different in content compared to the original dataset: Since the dataset is very large, we may expect that we can afford to train a DCNN from scratch. However, in practice it is very often still beneficial to initialize with weights from a pre-trained model. In this case, we would have enough data and confidence to fine-tune through the entire network.
  17. The data input is so small that most of the time is spent just in conversion between Python and the C++ core, while the JS is just one language. You are using only one core in TensorFlow, while the JS could potentially leverage more than one. The JS library is able to create a highly optimized JIT version of the program.
  18. Synaptic and Mind are often used for running on Node.js servers. The Node.js ones are often used for training on continuous data, e.g. from accelerometers or sales forecasts. Installation-free DNN execution framework
  19. Original Boss This demo treats the pixels of an image as a learning problem: it takes the (x,y) position on a grid and learns to predict the color at that point using regression to (r,g,b). It's a bit like compression, since the image information is encoded in the weights of the network, but almost certainly not of a practical kind
  20. deeplearn.js offers a significant speedup by exploiting WebGL to perform computations on the GPU, along with the ability to do full backpropagation. The API mimics the structure of TensorFlow and NumPy, with a delayed execution model for training (like TensorFlow), and an immediate execution model for inference (like NumPy). Also includes implemented versions of some of the most commonly-used TensorFlow operations. With the release of deeplearn.js, will include tools in future to export weights from TensorFlow checkpoints, which will allow authors to import them into web pages for deeplearn.js inference.
  22. By transforming neural network weights into WebGL textures and implementing common layers as fragment shaders, it uses the graphics capabilities of browsers designed for 3D games to speed up the execution of neural networks. Unlike other WebGL compute frameworks, it supports low-precision quantized tensors.
  23. You can load 50 Layer Resnet, Inception V3, Bidirectional LSTM
  24. On Microsoft Edge and Firefox, performance is at least 8 times better than Google Chrome. We assume it is due to optimization differences in ASM.js.
  25. As we all have painfully experienced, in real life what you really want is not always what you can afford. And that's the same in machine learning. We all know deep learning works if you have large GPU servers, but what about when you want to run it on a tiny little device? What's the number 1 limitation? It turns out to be memory. If you look at ImageNet models over the last couple of years, AlexNet started at 240 megabytes. VGG was over half a gig. So the question we will solve now is how to get these neural networks to do these amazing things yet have a very small memory footprint
  26. Pause after showing this.
  27. Just more layers, nothing special
  28. Here is another view of this model. This is the equivalent of Big data for Powerpoint
  29. Errors are reducing by 40% year on year. Previously, they used to reduce by 5% a year
  30. Apple is increasing the core count from two to six with a new A11 chip. Two of the cores are meant to do the bulk of intensive processing, while the other four are high efficiency cores dedicated to low-power tasks.
  31. Compromise between accuracy, number of parameters,
  32. The iPhone 7 has a considerable mobile GPU. 10 years ago, when CUDA came out, desktop GPUs had similar performance
  33. DNNs often suffer from over-parameterization and large amount of redundancy in their models. This typically results in inefficient computation and memory usage
  34. Pruning redundant, non-informative weights in a previously trained network reduces the size of the network at inference time. Take a network, prune, and then retrain the remaining connections
  35. VGG-16 contains 90% of its weights in the fully connected layers; AlexNet contains 96%. Most computation happens in the convolutional layers
  36. VGG-16 contains 90% of its weights in the fully connected layers; AlexNet contains 96%. Resnet, GoogleNet and Inception are mostly convolutional layers, so they compress less. Caffe2 does this, but uses dense multiplication
  37. Facebook app uses this
  38. Word Lens app uses this
  39. 1x1 bottleneck convolutions are very efficient
  40. SqueezeNet has been recently released. It is a re-hash of many concepts from ResNet and Inception, and shows that, after all, a better architecture design will deliver small network sizes and parameter counts without needing complex compression algorithms. Strategy 1. Replace 3x3 filters with 1x1 filters Strategy 2. Decrease the number of input channels to 3x3 filters Strategy 3. Downsample late in the network so that convolution layers have large activation maps.
  41. Ristretto is an automated CNN-approximation tool which condenses 32-bit floating point networks. Ristretto is an extension of Caffe and allows you to test, train and fine-tune networks with limited numerical precision. Ristretto In a Minute Ristretto Tool: The Ristretto tool performs automatic network quantization and scoring, using different bit-widths for number representation, to find a good balance between compression rate and network accuracy. Ristretto Layers: Ristretto re-implements Caffe layers and simulates reduced word width arithmetic. Testing and Training: Thanks to Ristretto’s smooth integration into Caffe, network description files can be changed to quantize different layers. The bit-width used for different layers as well as other parameters can be set in the network’s prototxt file. This allows you to directly test and train condensed networks, without any need for recompilation.
  42. Reduce weights to binary values, then scale them during training
  45. Now that I see this slide, this should probably have been the title for this session. We would have gotten a lot more people in this room.
  46. Minerva consists of five stages, as shown in Figure 2. Stages 1–2 establish a fair baseline accelerator implementation. Stage 1 generates the baseline DNN: fixing a network topology and a set of trained weights. Stage 2 selects an optimal baseline accelerator implementation. Stages 3– 5 employ novel co-design optimizations to minimize power consumption over the baseline in the following ways: Stage 3 analyzes the dynamic range of all DNN signals and reduces slack in data type precision. Stage 4 exploits observed network sparsity to minimize data accesses and MAC operations. Stage 5 introduces a novel fault mitigation technique, which allows for aggressive SRAM supply voltage reduction. For each of the three optimization stages, the ML level measures the impact on prediction accuracy, the architecture level evaluates hardware resource savings, and the circuit level characterizes the hardware models and validates simulation results.