Large-Scale Training
with GPUs at Facebook
Aapo Kyrola
Distributed AI Team @ Facebook
Contents
1. Quick intro to the Caffe2 framework
2. Parallel Training: Async & Sync
3. Synchronous SGD with Caffe2 and GLOO
4. Case Study: How we trained ResNet-50 for ImageNet in just 1 hour
Deep Learning Frameworks by FB
Caffe2
Caffe2 is...
• A lightweight framework for deep learning / ML / ...
• Primarily designed for production use cases and large-scale training
• Speed and low footprint
• C++ / Python based interfaces
• Supports deployment on multiple platforms
  • Linux, Mac, iOS, Android and Windows
  • IoT devices, Raspberry Pi, Tegra X1, ...
Computational graph
• Describes a model as a DAG of operators and blobs
• Caffe2 runtime does not have any deep learning concepts
  → it just executes a DAG
• DAG also covers loss functions, data reading, metrics, etc.
• Graph
  • construction in Python, incl. auto-gradient (flexibility)
  • description in Protobuf (portability)
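To make the "runtime just executes a DAG" point concrete, here is a minimal Python sketch of a runtime that only knows about operators and named blobs. It is not Caffe2's API: `Op`, `run_net`, and the two toy operators are hypothetical names chosen for illustration.

```python
from collections import namedtuple

# Hypothetical operator record: a function plus the names of its input and output blobs.
Op = namedtuple("Op", ["fn", "inputs", "outputs"])


def run_net(ops, workspace):
    """Run each operator once all of its input blobs exist in the workspace."""
    pending = list(ops)
    while pending:
        ready = [op for op in pending if all(name in workspace for name in op.inputs)]
        if not ready:
            raise ValueError("graph has a cycle or a missing input blob")
        for op in ready:
            results = op.fn(*(workspace[name] for name in op.inputs))
            workspace.update(zip(op.outputs, results))
            pending.remove(op)
    return workspace


# The runtime above has no learning concepts; the "model" is only operators and blobs.
# The operators are listed out of order on purpose: execution order comes from the graph.
net = [
    Op(lambda y, label: ((y - label) ** 2,), ["Y", "Label"], ["Loss"]),
    Op(lambda x, w, b: (w * x + b,), ["X", "W", "b"], ["Y"]),
]
blobs = run_net(net, {"X": 2.0, "W": 3.0, "b": 1.0, "Label": 5.0})
print(blobs["Y"], blobs["Loss"])
```

Loss functions, data readers, and metrics would simply be more entries in `net`, which is the design choice the slide describes.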
Graph Example: training as a directed graph
[Figure: training graph with operators DataReader, FC, CrossEntropy, FCGradient, IterOp, LearningRate and two WeightedSum, over blobs X, Label, W, b, Y, Loss, W_grad, b_grad.]
Parallel Training
Asynchronous and Synchronous SGD
Asynchronous SGD
• Parameters are updated by parallel workers on a “best-effort” basis, “in the background”.
• Various algorithms adjust learning to handle delayed updates, such as EASGD or Block Momentum.
• Parameter servers manage the parameters.
• Can be used for very large models that do not fit on one machine.
Synchronous SGD
• Workers synchronize (“all-reduce”) parameter gradients after each iteration.
• Models are always in sync.
• Mathematically, the number of workers does not matter: the computation is a function of the total batch size only.
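The claim that only the total batch size matters can be checked on a toy problem. The sketch below assumes a one-parameter least-squares model and equal-sized shards; `gradient` and `all_reduce_mean` are illustrative helpers, not part of any library.

```python
import math


def gradient(w, batch):
    """Mean gradient of 0.5 * (w * x - y)**2 over a batch of (x, y) pairs."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker ends up with the same average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


data = [(float(i), 2.0 * i + 1.0) for i in range(8)]  # total mini-batch of 8 samples
w = 0.5

# One worker processing the whole mini-batch.
single = gradient(w, data)

# Four workers, each with an equal shard of 2 samples, followed by an all-reduce.
shards = [data[i:i + 2] for i in range(0, len(data), 2)]
synced = all_reduce_mean([gradient(w, shard) for shard in shards])

print(math.isclose(single, synced[0]))
```

With unequal shard sizes the per-worker gradients would need to be weighted by shard size before averaging.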
Async vs. Sync
+ Async can scale to very large clusters.
- Async requires tuning when runtime characteristics change.
+ Sync result is not affected by execution: it is a function of the total batch size only.
- Sync is harder to scale to large clusters.
GPUs are very fast, so we can use fewer servers for computation
→ Sync SGD can scale sufficiently.
Sync SGD with Caffe2 + GLOO
SyncSGD with Caffe2
• Simple interface: data_parallel_model (DPM) for both multi-GPU and multi-GPU-multi-host models.
• DPM injects AllReduce and Broadcast operators into the graph
  • (Caffe2 runtime does not know about being parallel: it is all based on operators)
• Each worker runs the same code, same DAG, in parallel.
• AllReduce and Broadcast operators act as implicit barriers.
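As a rough picture of "same code on every worker, with the all-reduce acting as an implicit barrier", the sketch below uses Python threads as stand-ins for workers. The `AllReduce` class and `worker` function are hypothetical and are not the data_parallel_model interface; a real setup would run Gloo or NCCL across GPUs and hosts.

```python
import threading

NUM_WORKERS = 4


class AllReduce:
    """Toy sum all-reduce for threads (an assumption for illustration only)."""

    def __init__(self, num_workers):
        self.slots = [0.0] * num_workers
        self.barrier = threading.Barrier(num_workers)

    def __call__(self, rank, value):
        self.slots[rank] = value
        self.barrier.wait()   # nobody proceeds until every worker has contributed
        total = sum(self.slots)
        self.barrier.wait()   # nobody overwrites a slot until every worker has read
        return total


def worker(rank, all_reduce, params, lr=0.1, steps=3):
    w = 1.0                                    # identical initial parameter everywhere
    for _ in range(steps):
        local_grad = (rank + 1) * w            # stand-in for a gradient on this worker's shard
        grad = all_reduce(rank, local_grad) / NUM_WORKERS
        w -= lr * grad                         # same update on every worker keeps replicas in sync
    params[rank] = w


all_reduce = AllReduce(NUM_WORKERS)
params = [None] * NUM_WORKERS
threads = [
    threading.Thread(target=worker, args=(rank, all_reduce, params))
    for rank in range(NUM_WORKERS)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(params)
```

No worker calls an explicit barrier in its training loop; the synchronization comes entirely from the collective operation, which mirrors the operator-based design on the slide.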
[Figure: SyncSGD with Caffe2. Backward pass with ConvGradient and FCGradient operators and blobs fc1_grad, conv1_grad, input_grad; the weight gradients conv1_w_grad and fc1_w_grad each pass through an AllReduce and a ParamUpdate.]
Parameter updates execute in parallel with the backward pass.
GLOO
https://github.com/facebookincubator/gloo
• Library for very fast distributed reductions: AllReduce,
Reduce, Broadcast, Allgather
• External library, with operators for Caffe2 and PyTorch.
• “Mini-MPI”
• Uses NVIDIA’s NCCL for inter-GPU reductions
• TCP/IP and RDMA transports supported
Case Study: ImageNet in 1hr
Goal
• Train ResNet-50 (most popular image detection architecture) on the ImageNet-1K dataset in less than an hour to ~ state-of-the-art accuracy.
• On a single 8-GPU P100 server: ~1.5 days to do 90 epochs.
• Why? (A) Training faster improves development iterations;
• (B) enables training with extremely large datasets in reasonable time.
June, 2017
Challenges
• Accuracy
  • Very large mini-batch sizes are believed to hurt convergence
• Scale efficiently
  • Facebook uses commodity networking (i.e., no InfiniBand)
  • 32 x 8 NVIDIA P100 GPUs, Big Basin architecture
[Figure: “Big Basin” architecture, open-sourced design]
Baseline 8 GPUs
[Plot: training error % vs. epochs. Legend: kn = 256, 𝜂 = 0.1, 23.60% ± 0.12]
• k = #GPUs
• n = per-GPU batch size
• 𝜂 = learning rate
• 256 = 8 x 32
32 x 8 GPUs: same Learning Rate
[Plot: training error % vs. epochs. Legend: kn = 256, 𝜂 = 0.1, 23.60% ± 0.12; kn = 8k, 𝜂 = 0.1, 41.78% ± 0.10]
8192 = 256 x 32 (#GPUs x per-GPU batch size)
32 x 8 GPUs: Sqrt-scaling of LR?
[Plot: training error % vs. epochs. Legend: kn = 256, 𝜂 = 0.1, 23.60% ± 0.12; kn = 8k, 𝜂 = 0.6, 26.28% ± 0.03]
32 x 8 GPUs: Linear Scaling of LR?
Linear Scaling + Constant LR Warmup
Rapid changes at the beginning of training → use a small LR for the first few epochs.
Linear Scaling + Gradual LR Warmup
Start from an LR of 𝜂 and increase it by a constant amount at each iteration so that 𝜂̂ = 𝑘𝜂 after 5 epochs.
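The gradual warmup rule can be written as a small schedule function. This is a sketch under stated assumptions: `warmup_lr` is a hypothetical helper, 𝑘 is read as the factor by which the mini-batch grew over the 256-image baseline (8192 / 256 = 32), and the iterations-per-epoch figure assumes roughly 1.28M ImageNet-1K training images.

```python
import math


def warmup_lr(iteration, base_lr, k, iters_per_epoch, warmup_epochs=5):
    """Linear ramp from base_lr to k * base_lr over the warmup, then constant.

    Any learning-rate decay applied after warmup is intentionally left out.
    """
    warmup_iters = warmup_epochs * iters_per_epoch
    target_lr = k * base_lr
    if iteration >= warmup_iters:
        return target_lr
    step = (target_lr - base_lr) / warmup_iters   # constant increment per iteration
    return base_lr + step * iteration


# Assumed values: eta = 0.1 at kn = 256 (baseline slide), k = 32 for kn = 8k,
# and about 156 iterations per epoch at a mini-batch of 8192.
BASE_LR, K, ITERS_PER_EPOCH = 0.1, 32, 156

for it in (0, 390, 780, 2000):
    print(it, round(warmup_lr(it, BASE_LR, K, ITERS_PER_EPOCH), 4))
```

The constant-warmup variant from the previous slide would instead return a fixed small rate for the first few epochs and then jump to `k * base_lr`.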
Linear LR scaling
In this case, we found an 8K mini-batch to be close to the maximum we could reach with the linear LR scaling technique.
More tricks in the paper.
Scaling Efficiently
Efficient All-Reduce
• ResNet-50: 25 million float parameters (100 MB)
• Each iteration takes about 0.3 s, and the backward pass runs in parallel with the all-reduces → latency is not an issue.
• The halving-doubling algorithm by Thakur et al. provides optimal throughput
3x speedup vs. “ring algorithm” for all-reduce on 32 servers
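For readers unfamiliar with halving-doubling, the following single-process simulation sketches the idea: a reduce-scatter by recursive halving, then an all-gather by recursive doubling. It is not Gloo's implementation; `halving_doubling_all_reduce` is a hypothetical name, and the sketch assumes the worker count is a power of two that divides the vector length.

```python
def halving_doubling_all_reduce(buffers):
    """Sum-all-reduce over equal-length lists, one list per simulated worker."""
    p, n = len(buffers), len(buffers[0])
    assert (p & (p - 1)) == 0 and n % p == 0
    data = [list(b) for b in buffers]
    lo, hi = [0] * p, [n] * p   # segment each worker is currently responsible for

    # Phase 1: reduce-scatter by recursive halving (partner distance p/2, p/4, ..., 1).
    d = p // 2
    while d >= 1:
        snapshot = [list(row) for row in data]   # values exchanged in this round
        for r in range(p):
            partner = r ^ d
            mid = (lo[r] + hi[r]) // 2
            if (r & d) == 0:
                hi[r] = mid   # keep the lower half, give away the upper half
            else:
                lo[r] = mid   # keep the upper half, give away the lower half
            for i in range(lo[r], hi[r]):
                data[r][i] = snapshot[r][i] + snapshot[partner][i]
        d //= 2

    # Phase 2: all-gather by recursive doubling (partner distance 1, 2, ..., p/2).
    d = 1
    while d < p:
        snapshot = [list(row) for row in data]
        old = list(zip(lo, hi))
        for r in range(p):
            plo, phi = old[r ^ d]
            data[r][plo:phi] = snapshot[r ^ d][plo:phi]   # copy the partner's reduced segment
            lo[r], hi[r] = min(lo[r], plo), max(hi[r], phi)
        d *= 2
    return data


workers = [[float(r * 10 + i) for i in range(8)] for r in range(4)]
result = halving_doubling_all_reduce(workers)
expected = [sum(col) for col in zip(*workers)]
print(all(row == expected for row in result))
```

Each phase takes log2(p) rounds, and the amount of data exchanged halves (or doubles) from round to round, so the total traffic per worker stays close to that of a ring all-reduce while needing far fewer communication steps.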
Using 100% commodity hardware and an open-source software stack: 90% scaling efficiency
Follow-up Work by Others
• Already several follow-up papers to reproduce &
improve on our results
• For example: You, Gitman, Ginsburg demonstrate using
batch size up to 32K (using layer-wise adaptive learning
rate)
• Alternatives to GPUs, such as Intel Xeon Phis
On-going work
• Elasticity: survive crashes; incrementally add nodes to the cluster when they become available
• Data input is becoming a bottleneck
• Fp16 for training
• Implement & Experiment with asynchronous algorithms
Lessons Learned
• SyncSGD can go a long way and has fewer tunable
parameters than asynchronous SGD.
• Learning Rate is the fundamental parameter when
increasing mini-batch size.
• Utilize the inherent parallelism in training to hide
latency.
• Commodity hardware can go a long way
Thank You!
Caffe2.ai

Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 

Large-Scale GPU Training at Facebook in Just 1 Hour

  • 1. Large-Scale Training with GPUs at Facebook. Aapo Kyrola, Distributed AI Team @ Facebook
  • 2. Contents: 1. Quick intro to Caffe2 framework 2. Parallel Training: Async & Sync 3. Synchronous SGD with Caffe2 and GLOO 4. Case Study: How we trained Resnet-50 for Imagenet in just 1 hour
  • 5. Caffe2 is... • A lightweight framework for deep learning / ML / ... • Primarily designed for production use cases and large-scale training • Speed and low footprint • C++ / Python based interfaces • Supports deployment on multiple platforms: Linux, Mac, iOS, Android and Windows; IoT devices, Raspberry Pi, Tegra X1, ...
  • 6. Computational graph • Describes the model as a DAG of operators and blobs • The Caffe2 runtime does not have any deep learning concepts → it just executes a DAG • The DAG also covers loss functions, data reading, metrics, etc. • Graph construction in Python, incl. auto-gradient (flexibility); graph description in Protobuf (portability)
  • 7. Training as a directed graph (graph example). [Figure: operators DataReader, FC, CrossEntropy, FCGradient, IterOp, LearningRate, WeightedSum (x2); blobs X, Label, W, b, Y, Loss, W_grad, b_grad.]
  • 9. Asynchronous and Synchronous SGD. Asynchronous SGD • Parameters are updated by parallel workers on a “best-effort” basis, “in the background”. • Various algorithms adjust learning to handle delayed updates, such as EASGD or Block Momentum • Parameter servers manage parameters • Can be used for very large models that do not fit in one machine. Synchronous SGD • Workers synchronize (“all-reduce”) parameter gradients after each iteration • Models are always in sync. • Mathematically the number of workers does not matter: the computation is a function of the total batch size only.
  • 10. Async vs. Sync + Async can scale to very large clusters. - Async requires tuning when runtime characteristics change + Sync result is not affected by execution: it is a function of the total batch size only - Sync is harder to scale to large clusters. GPUs are very fast, so we can use fewer servers for computation → Sync SGD can scale sufficiently.
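
The claim on slides 9 and 10 that a synchronous result depends only on the total batch size can be checked with a small single-process sketch. It assumes a toy linear model with a mean-squared-error loss and equal-sized shards per worker; the function names are hypothetical, and `allreduce_mean` is only a stand-in for a real all-reduce, not Caffe2 or Gloo code.

```python
import numpy as np

def gradient(w, X, y):
    """MSE gradient of a linear model, averaged over the given batch."""
    residual = X @ w - y
    return 2.0 * X.T @ residual / len(y)

def allreduce_mean(worker_grads):
    """Stand-in for an all-reduce: every worker would receive this mean."""
    return np.mean(worker_grads, axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))   # total mini-batch of 256 examples
y = rng.normal(size=256)
w = rng.normal(size=4)

# Reference: one worker processes the whole mini-batch.
full_batch_grad = gradient(w, X, y)

def sync_grad(num_workers):
    """Split the mini-batch into equal shards and average the shard gradients."""
    X_shards = np.split(X, num_workers)
    y_shards = np.split(y, num_workers)
    grads = [gradient(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]
    return allreduce_mean(grads)

grad_8 = sync_grad(8)     # 8 workers x 32 examples each
grad_32 = sync_grad(32)   # 32 workers x 8 examples each
```

With equal shard sizes, both `grad_8` and `grad_32` match `full_batch_grad` up to floating-point rounding, so the weight update is the same however many workers share the batch.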
  • 11. Sync SGD with Caffe2 + GLOO
  • 12. SyncSGD with Caffe2 • Simple interface: data_parallel_model (DPM) for both multi-GPU and multi-GPU, multi-host models. • DPM injects AllReduce and Broadcast operators into the graph • (The Caffe2 runtime does not know about being parallel: it is all based on operators) • Each worker runs the same code, same DAG, in parallel. • AllReduce & Broadcast operators act as implicit barriers.
  • 14. GLOO https://github.com/facebookincubator/gloo • Library for very fast distributed reductions: AllReduce, Reduce, Broadcast, Allgather • External library. Operators for Caffe2 and PyTorch. • “Mini-MPI” • Uses NVIDIA’s NCCL for inter-GPU reductions • TCP/IP and RDMA transports supported
  • 15. Case Study: ImageNet in 1hr
  • 16. Goal (June 2017) • Train Resnet-50 (most popular image detection architecture) on the Imagenet-1K dataset in less than an hour to ~state-of-the-art accuracy. • On a single 8-GPU P100 server: ~1.5 days to do 90 epochs. • Why? (A) training faster improves development iterations; (B) enables training with extremely large datasets in reasonable time
  • 17. Challenges • Accuracy: very large mini-batch sizes are believed to hurt convergence • Scaling efficiently: Facebook uses commodity networking (i.e., no InfiniBand) • 32 x 8 NVIDIA P100 GPUs, “Big Basin” architecture (open-sourced design)
  • 18. Baseline 8 GPUs. [Plot: training error (%) vs. epochs.] kn = 256, 𝜂 = 0.1: 23.60% ± 0.12. • k = #GPUs • n = per-GPU batch size • 𝜂 = learning rate • 256 = 8 x 32
  • 19. 32 x 8 GPUs: same Learning Rate. [Plot: training error (%) vs. epochs.] kn = 256, 𝜂 = 0.1: 23.60% ± 0.12; kn = 8k, 𝜂 = 0.1: 41.78% ± 0.10. 8192 = 256 x 32 (#GPUs x per-GPU batch)
  • 20. 32 x 8 GPUs: Sqrt-scaling of LR? [Plot: training error (%) vs. epochs.] kn = 256, 𝜂 = 0.1: 23.60% ± 0.12; kn = 8k, 𝜂 = 0.6: 26.28% ± 0.03
  • 21. 32 x 8 GPUs: Linear Scaling of LR?
  • 22. Linear Scaling + Constant LR Warmup: rapid changes in the beginning of training → use a small LR for the first few epochs
  • 23. Linear Scaling + Gradual LR Warmup: start from an LR of 𝜂 and increase it by a constant amount at each iteration so that 𝜂̂ = 𝑘𝜂 after 5 epochs
  • 24. Linear LR scaling: in this case, we found an 8K mini-batch to be close to the maximum we could reach with the linear LR scaling technique. More tricks in the paper.
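
The gradual warmup from slide 23 can be sketched as a small schedule function. This is an illustration under stated assumptions, not the code used in the talk: the function and parameter names are hypothetical, `scale` is taken as 32 (the 8192 mini-batch relative to the 256 baseline), and `iters_per_epoch=156` assumes roughly 1.28 million training images divided by 8192. Any learning-rate decay after warmup is left out because the slides do not describe it.

```python
def warmup_lr(iteration, base_lr=0.1, scale=32, warmup_epochs=5, iters_per_epoch=156):
    """Learning rate at a given iteration: linear ramp from base_lr to scale * base_lr."""
    target_lr = scale * base_lr                 # linear scaling rule: eta_hat = k * eta
    warmup_iters = warmup_epochs * iters_per_epoch
    if iteration >= warmup_iters:
        return target_lr                        # warmup finished, hold the scaled rate
    # Constant increment per iteration so the target is reached after warmup_epochs.
    step = (target_lr - base_lr) / warmup_iters
    return base_lr + step * iteration

lr_start = warmup_lr(0)               # first iteration uses the baseline rate
lr_after_warmup = warmup_lr(5 * 156)  # first iteration after the 5 warmup epochs
```

Starting at the single-server rate keeps early updates small while the weights change rapidly, and the per-iteration increment avoids the sudden jump that a constant-then-switch warmup would introduce.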
  • 26. Efficient All-Reduce • Resnet-50: 25 million float parameters (100 MB) • Each iteration takes about 0.3 s, and the backward pass runs in parallel with the all-reduces → latency is not an issue. • The halving-doubling algorithm by Thakur et al. provides optimal throughput
  • 27. 3x speedup vs. the “ring algorithm” for all-reduce on 32 servers
  • 28. Using 100% commodity hardware and an open-source software stack: 90% scaling efficiency
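
To make the halving-doubling idea from slide 26 concrete, here is a single-process simulation: a recursive-halving reduce-scatter followed by a recursive-doubling all-gather. It is a sketch, not Gloo's implementation. It assumes a power-of-two number of ranks and a vector length divisible by that number, and all names are hypothetical.

```python
import numpy as np

def halving_doubling_allreduce(vectors):
    """Simulate a sum all-reduce; returns (per-rank buffers, elements sent per rank)."""
    p, n = len(vectors), len(vectors[0])
    assert p > 0 and (p & (p - 1)) == 0, "sketch assumes a power-of-two rank count"
    assert n % p == 0, "sketch assumes the vector splits evenly across ranks"
    bufs = [np.array(v, dtype=float) for v in vectors]
    lo, hi = [0] * p, [n] * p   # range [lo, hi) each rank currently holds
    sent = 0

    # Phase 1, recursive halving: each rank keeps half of its range per step
    # and adds the partner's copy of that half (a reduce-scatter).
    dist = p // 2
    while dist >= 1:
        new_bufs = [b.copy() for b in bufs]
        for rank in range(p):
            peer = rank ^ dist
            mid = (lo[rank] + hi[rank]) // 2
            s, e = (lo[rank], mid) if (rank & dist) == 0 else (mid, hi[rank])
            new_bufs[rank][s:e] = bufs[rank][s:e] + bufs[peer][s:e]
            lo[rank], hi[rank] = s, e
        sent += e - s            # the half given away is as large as the half kept
        bufs = new_bufs
        dist //= 2

    # Phase 2, recursive doubling: partners swap their reduced ranges until
    # every rank holds the full summed vector (an all-gather).
    dist = 1
    while dist < p:
        new_bufs = [b.copy() for b in bufs]
        new_lo, new_hi = lo[:], hi[:]
        for rank in range(p):
            peer = rank ^ dist
            s, e = lo[peer], hi[peer]
            new_bufs[rank][s:e] = bufs[peer][s:e]
            new_lo[rank], new_hi[rank] = min(lo[rank], s), max(hi[rank], e)
        sent += hi[0] - lo[0]    # each rank sends the range it currently owns
        bufs, lo, hi = new_bufs, new_lo, new_hi
        dist *= 2
    return bufs, sent

rng = np.random.default_rng(0)
grads = [rng.normal(size=16) for _ in range(8)]   # 8 ranks, 16 values each
reduced, sent_per_rank = halving_doubling_allreduce(grads)
```

In this simulation each rank sends 2n(p-1)/p elements in 2·log2(p) communication steps. A ring all-reduce moves the same volume per rank but needs 2(p-1) steps, which is one plausible reason the step count matters as the number of servers grows.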
  • 29. Follow-up Work by Others • Already several follow-up papers to reproduce & improve on our results • For example: You, Gitman, and Ginsburg demonstrate batch sizes up to 32K (using a layer-wise adaptive learning rate) • Alternatives to GPUs, such as Intel Xeon Phis
  • 30. On-going work • Elasticity: survive crashes; incrementally add nodes to the cluster when they become available • Data input is becoming a bottleneck • FP16 for training • Implement & experiment with asynchronous algorithms
  • 31. Lessons Learned • SyncSGD can go a long way and has fewer tunable parameters than asynchronous SGD. • Learning Rate is the fundamental parameter when increasing mini-batch size. • Utilize the inherent parallelism in training to hide latency. • Commodity hardware can go a long way