This document summarizes a talk on large-scale distributed training with GPUs at Facebook using the Caffe2 framework. It describes how Facebook trained the ResNet-50 model on the ImageNet dataset in just 1 hour using 256 GPUs (32 servers with 8 GPUs each). It explains how synchronous SGD was implemented in Caffe2 using Gloo for efficient all-reduce operations. Linear scaling of the learning rate with increased batch size was found to work best when the learning rate is gradually warmed up over the first few epochs. Nearly linear speedup was achieved with this approach on commodity hardware.
Contents
1. Quick intro to the Caffe2 framework
2. Parallel Training: Async & Sync
3. Synchronous SGD with Caffe2 and Gloo
4. Case Study: How we trained ResNet-50 for ImageNet in just 1 hour
Caffe2 is...
• A lightweight framework for deep learning / ML / ...
• Primarily designed for production use cases and large-scale training
• Speed and low footprint
• C++ / Python based interfaces
• Supports deployment on multiple platforms
  • Linux, Mac, iOS, Android and Windows
  • IoT devices, Raspberry Pi, Tegra X1, ...
Computational graph
• Describes the model as a DAG of operators and blobs
• The Caffe2 runtime has no deep learning concepts → it just executes a DAG
• The DAG also covers loss functions, data reading, metrics, etc.
• Graph construction in Python, incl. auto-gradient (flexibility); graph description in Protobuf (portability)
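As a minimal sketch of this graph-construction flow, patterned on Caffe2's public MNIST tutorial (blob names and dimensions here are illustrative, and the exact API varies by Caffe2 version):

```python
import numpy as np
from caffe2.python import brew, model_helper, workspace

# Build a tiny DAG: FC -> SoftmaxWithLoss, plus auto-generated gradient ops.
model = model_helper.ModelHelper(name="toy")
fc = brew.fc(model, "X", "Y", dim_in=4, dim_out=2)
softmax, loss = model.net.SoftmaxWithLoss([fc, "label"], ["softmax", "loss"])
model.AddGradientOperators([loss])       # auto-gradient extends the DAG

# The graph is plain data: a Protobuf NetDef that any runtime can execute.
print(model.net.Proto())

# Feed inputs, run the initializer net once, then execute the DAG.
workspace.FeedBlob("X", np.random.rand(8, 4).astype(np.float32))
workspace.FeedBlob("label", np.random.randint(2, size=8).astype(np.int32))
workspace.RunNetOnce(model.param_init_net)
workspace.CreateNet(model.net)
workspace.RunNet(model.net)
print("loss:", workspace.FetchBlob("loss"))
```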
Graph Example: training as a directed graph
[Diagram: a training DAG with operators DataReader, FC, CrossEntropy, FCGradient, IterOp, LearningRate and two WeightedSum parameter updates, connecting blobs X, Label, W, b, Y, Loss, W_grad and b_grad.]
Asynchronous and Synchronous SGD

Asynchronous SGD
• Parameters are updated by parallel workers on a "best-effort" basis, "in the background".
• Various algorithms adjust the learning to handle delayed updates, e.g. EASGD or Block Momentum.
• Parameter servers manage the parameters.
• Can be used for very large models that do not fit on one machine.

Synchronous SGD
• Workers synchronize ("all-reduce") parameter gradients after each iteration.
• Models are always in sync.
• Mathematically the number of workers does not matter: the computation is a function of the total batch size only (a quick numeric check follows below).
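To make that last point concrete, here is an illustrative check, under the assumption that the loss is a mean over examples and the batch is split into equal shards (numbers are arbitrary):

```python
import numpy as np

# For a loss defined as a mean over examples, averaging the per-worker
# gradients of k equal shards equals the gradient over the whole batch,
# so the SyncSGD step depends only on the total batch size.
rng = np.random.default_rng(0)
w = rng.normal(size=5)
X, y = rng.normal(size=(64, 5)), rng.normal(size=64)

def grad(Xb, yb, w):
    # Gradient of mean squared error (1/n) * ||Xb @ w - yb||^2
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

full_batch = grad(X, y, w)
per_worker = [grad(Xs, ys, w)
              for Xs, ys in zip(np.split(X, 4), np.split(y, 4))]
assert np.allclose(full_batch, np.mean(per_worker, axis=0))
```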
Async vs. Sync
+ Async can scale to very large clusters.
- Async requires tuning when runtime characteristics change.
+ Sync result is not affected by execution: it is a function of the total batch size only.
- Sync is harder to scale to large clusters.

GPUs are very fast, so we can use fewer servers for computation → Sync SGD can scale sufficiently.
SyncSGD with Caffe2
• Simple interface: data_parallel_model (DPM) for both multi-GPU and multi-GPU-multi-host models, as sketched below.
• DPM injects AllReduce and Broadcast operators into the graph.
  • (The Caffe2 runtime does not know about being parallel – it is all based on operators.)
• Each worker runs the same code, the same DAG, in parallel.
• AllReduce & Broadcast act as implicit barriers.
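A condensed sketch of how this looks in code, patterned on Caffe2's public resnet50_trainer example; the model body here is a stand-in single FC layer, and exact function names and signatures vary across Caffe2 versions:

```python
from caffe2.python import brew, core, data_parallel_model, model_helper

model = model_helper.ModelHelper(name="sync_sgd_train")

def input_fn(model):
    # Stand-in for a real per-GPU data reader: constant blobs shaped like a batch.
    model.param_init_net.GaussianFill([], "data", shape=[32, 2048])
    model.param_init_net.ConstantFill([], "label", shape=[32], value=0,
                                      dtype=core.DataType.INT32)

def forward_fn(model, loss_scale):
    # Build the per-device forward graph; return the (scaled) losses.
    pred = brew.fc(model, "data", "pred", dim_in=2048, dim_out=1000)
    softmax, loss = model.net.SoftmaxWithLoss([pred, "label"],
                                              ["softmax", "loss"])
    return [model.Scale(loss, "loss_scaled", scale=loss_scale)]

def update_fn(model):
    # Plain SGD via WeightedSum; DPM has already injected AllReduce ops on
    # the gradients, so each replica applies the same averaged update.
    iteration = brew.iter(model, "iter")
    lr = model.net.LearningRate(iteration, "lr", base_lr=-0.1, policy="fixed")
    one = model.param_init_net.ConstantFill([], "one", shape=[1], value=1.0)
    for param in model.GetParams():
        model.net.WeightedSum(
            [param, one, model.param_to_grad[param], lr], param)

# Replicates the DAG per device and injects AllReduce/Broadcast operators.
data_parallel_model.Parallelize_GPU(
    model,
    input_builder_fun=input_fn,
    forward_pass_builder_fun=forward_fn,
    param_update_builder_fun=update_fn,
    devices=[0, 1, 2, 3],
)
```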
Goal (June 2017)
• Train ResNet-50 (the most popular image classification architecture) on the ImageNet-1K dataset in less than an hour to ~state-of-the-art accuracy.
• On a single 8-GPU P100 server: ~1.5 days for 90 epochs.
• Why? (A) Training faster improves development iterations; (B) it enables training on extremely large datasets in reasonable time.
Challenges
• Accuracy
  • Very large mini-batch sizes were believed to hurt convergence.
• Scale efficiently
  • Facebook uses commodity networking (i.e. no InfiniBand).
  • 32 x 8 P100 NVIDIA GPUs on the "Big Basin" architecture (an open-sourced design).
Linear Scaling + Constant LR Warmup
Rapid changes at the beginning of training → use a small LR for the first few epochs.
Linear Scaling + Gradual LR Warmup
Start from an LR of η and increase it by a constant amount at each iteration so that η̂ = kη after 5 epochs.
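As a sketch, the gradual warmup can be expressed as a per-iteration schedule like the following (the values of η, k and the epoch length are illustrative placeholders):

```python
def warmup_lr(iteration, eta=0.1, k=32, iters_per_epoch=1000,
              warmup_epochs=5):
    """Linear-scaling LR with gradual warmup: ramp from eta to k*eta by a
    constant increment per iteration over the first warmup_epochs epochs."""
    warmup_iters = warmup_epochs * iters_per_epoch
    if iteration < warmup_iters:
        return eta + (k * eta - eta) * iteration / warmup_iters
    return k * eta  # post-warmup: the linearly scaled rate (decay omitted)
```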
Linear LR scaling
In this case, we found an 8K mini-batch to be close to the maximum we could reach with the linear LR scaling technique. More tricks in the paper.
Efficient All-Reduce
• ResNet-50: 25 million float parameters (100 MB).
• Each iteration takes about ~0.3 s, and the backward pass runs in parallel with the all-reduces → latency is not an issue.
• The halving-doubling algorithm by Thakur et al. provides optimal throughput.
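To illustrate the idea, here is a single-process simulation of recursive halving-doubling all-reduce (reduce-scatter by recursive halving, then all-gather by recursive doubling); this is a sketch of the textbook algorithm from Thakur et al., not Gloo's implementation:

```python
import numpy as np

def allreduce_halving_doubling(vectors):
    """Sum-allreduce over a list of equal-length vectors, one per 'rank'.
    Rank count must be a power of two and divide the vector length."""
    p, n = len(vectors), vectors[0].size
    assert p & (p - 1) == 0 and n % p == 0
    data = [v.astype(np.float64).copy() for v in vectors]
    lo, hi = [0] * p, [n] * p          # active segment [lo, hi) per rank

    # Phase 1, reduce-scatter: pairs at halving distance exchange and reduce
    # half of their active segment; each rank ends owning n/p reduced entries.
    dist = p // 2
    while dist >= 1:
        for r in range(p):
            q = r ^ dist
            if r > q:
                continue               # handle each pair once
            mid = (lo[r] + hi[r]) // 2
            data[r][lo[r]:mid] += data[q][lo[r]:mid]   # r keeps lower half
            data[q][mid:hi[r]] += data[r][mid:hi[r]]   # q keeps upper half
            hi[r], lo[q] = mid, mid
        dist //= 2

    # Phase 2, all-gather: reverse the pattern with doubling distance, each
    # pair exchanging its owned segment until every rank holds the full sum.
    dist = 1
    while dist < p:
        for r in range(p):
            q = r ^ dist
            if r > q:
                continue
            data[r][lo[q]:hi[q]] = data[q][lo[q]:hi[q]]
            data[q][lo[r]:hi[r]] = data[r][lo[r]:hi[r]]
            lo[r] = lo[q] = min(lo[r], lo[q])
            hi[r] = hi[q] = max(hi[r], hi[q])
        dist *= 2
    return data

# Quick check: every rank ends with the elementwise sum of all inputs.
grads = [np.random.rand(16) for _ in range(8)]
out = allreduce_halving_doubling(grads)
assert all(np.allclose(o, np.sum(grads, axis=0)) for o in out)
```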
3x speedup vs. the "ring algorithm" for all-reduce on 32 servers.
Using 100% commodity hardware and an open-source software stack: 90% scaling efficiency.
Follow-up Work by Others
• Already several follow-up papers reproduce & improve on our results.
• For example: You, Gitman and Ginsburg demonstrate batch sizes up to 32K using a layer-wise adaptive learning rate (see the sketch after this list).
• Alternatives to GPUs, such as Intel Xeon Phis.
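As a rough sketch of the layer-wise adaptive idea (in the spirit of LARS by You, Gitman and Ginsburg; the trust coefficient and exact formulation here are illustrative, not their precise rule):

```python
import numpy as np

def lars_scaled_lr(w, g, global_lr, trust=0.001, eps=1e-9):
    # Scale the global LR per layer by the ratio of weight norm to gradient
    # norm, keeping each layer's update proportional to its own weights.
    return global_lr * trust * np.linalg.norm(w) / (np.linalg.norm(g) + eps)

w, g = np.random.rand(256), np.random.rand(256)  # one layer's weights/grads
w -= lars_scaled_lr(w, g, global_lr=3.2) * g
```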
On-going work
• Elasticity: survive crashes; incrementally add nodes to the cluster when they become available.
• Data input is becoming a bottleneck.
• FP16 for training.
• Implement & experiment with asynchronous algorithms.
Lessons Learned
• SyncSGD can go a long way and has fewer tunable parameters than asynchronous SGD.
• The learning rate is the fundamental parameter when increasing the mini-batch size.
• Utilize the inherent parallelism in training to hide latency.
• Commodity hardware can go a long way.