Horovod
Distributed TensorFlow Made Easy
Alex Sergeev, Machine Learning Platform, Uber Engineering
Deep Learning @ Uber
● Self-Driving Vehicles
● Trip Forecasting
● Fraud Detection
● … and many more!
TensorFlow
● Most popular open source framework for Deep Learning
● Combines high performance with the ability to tinker with low-level model details
● Has end-to-end support from research to production
Going Distributed
● Speed up model training
● Train very large models
● Vast majority of use cases are data-parallel
● Facebook demonstrated training ResNet on ImageNet in 1 hour
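To make the data-parallel case concrete: each worker keeps a full copy of the model, computes a gradient on its own shard of the batch, and the gradients are averaged before every update. The sketch below is illustrative only; the one-parameter model, the data, and the function names are hypothetical and not taken from the talk.

```python
def local_gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w, shards, lr):
    """One synchronous step: each worker computes a gradient, then all are averaged."""
    grads = [local_gradient(w, shard) for shard in shards]
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad


# Hypothetical data drawn from y = 3x, split evenly across four workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, lr=0.01)
```

With equal-sized shards, the averaged gradient equals the gradient over the full batch, which is why synchronous data-parallel training can match single-worker results.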
Parameter Server Technique
tf.Server()
tf.ClusterSpec()
tf.train.replica_device_setter()
tf.train.SyncReplicasOptimizer()
Parameter Server
Worker GPU Towers
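The parameter server idea can be sketched without TensorFlow: workers push gradients to a central server, the server averages them and updates the parameters, and workers pull the new values. The class below is a hypothetical toy, not the TensorFlow API listed above; it also counts the values the server receives, to show that server traffic grows with the number of workers.

```python
class ParameterServer:
    """Toy in-process parameter server (illustrative, not a TensorFlow API)."""

    def __init__(self, params, lr):
        self.params = list(params)
        self.lr = lr
        self.values_received = 0  # crude proxy for inbound network traffic

    def push_and_update(self, worker_grads):
        """Average one gradient list per worker and apply an SGD update."""
        n = len(worker_grads)
        for grads in worker_grads:
            self.values_received += len(grads)
        for i in range(len(self.params)):
            avg = sum(g[i] for g in worker_grads) / n
            self.params[i] -= self.lr * avg

    def pull(self):
        """Return a copy of the current parameters for a worker."""
        return list(self.params)


ps = ParameterServer([1.0, 2.0], lr=0.5)
ps.push_and_update([[1.0, 0.0], [3.0, 2.0]])  # two workers, two parameters
new_params = ps.pull()
```

Every worker sends its full gradient to the same place, so the server's inbound traffic is proportional to workers times parameters.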
Parameter Server Technique - Example Script
Image Source: TensorFlow -- https://www.tensorflow.org/deploy/distributed
Parameter Server Technique - Performance
On the ImageNet dataset of 1.3M images, this setup trains ResNet-101 for one
epoch in 3.5 minutes. However, scaling efficiency on 128 GPUs is only 42%.
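To put the 42% figure in perspective, the sketch below assumes the usual definition of scaling efficiency: measured throughput on N GPUs divided by N times the single-GPU throughput. The slide does not spell out the definition, so treat it as an assumption; the function names are hypothetical.

```python
def scaling_efficiency(throughput_n, throughput_1, n_gpus):
    """Measured N-GPU throughput relative to ideal linear scaling."""
    return throughput_n / (n_gpus * throughput_1)


def effective_gpus(efficiency, n_gpus):
    """Number of ideally-scaled GPUs that would deliver the same throughput."""
    return efficiency * n_gpus


# Figures from the slide: 128 GPUs at 42% efficiency.
useful = effective_gpus(0.42, 128)

# Cluster throughput implied by 1.3M images per 3.5-minute epoch.
images_per_second = 1.3e6 / (3.5 * 60)
```

Under that definition, 128 GPUs at 42% efficiency do the work of roughly 54 perfectly scaled GPUs; the rest of the capacity is lost to communication overhead.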
How Can We Do Better?
● Re-think necessary complexity for data-parallel case
● Improve communication algorithm
● Use RDMA-capable networking (RoCE, InfiniBand)
Meet Horovod
● Distributed training framework for TensorFlow
● Inspired by work of Baidu, Facebook, et al.
● Uses bandwidth-optimal communication protocols
○ Makes use of RDMA (RoCE, InfiniBand) if available
● Seamlessly installs on top of TensorFlow via pip install horovod
● Named after a traditional Russian folk dance in which participants dance in a circle with linked hands
Horovod Technique
Patarasuk, P., & Yuan, X. (2009). Bandwidth optimal all-reduce algorithms for clusters of workstations.
Journal of Parallel and Distributed Computing, 69(2), 117-124. doi:10.1016/j.jpdc.2008.09.002
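The cited paper describes ring-based all-reduce. The simulation below is a minimal pure-Python sketch of that idea, not Horovod's implementation: each worker's vector is split into as many chunks as there are workers, a scatter-reduce phase passes partial sums around the ring, and an allgather phase circulates the finished chunks. The function name and the requirement that the vector length divide evenly are assumptions made for brevity.

```python
def ring_allreduce(worker_data):
    """Sum equal-length lists across workers over a simulated ring."""
    n = len(worker_data)
    length = len(worker_data[0])
    assert length % n == 0, "sketch assumes length divisible by worker count"
    size = length // n
    # chunks[r][c] is worker r's current copy of chunk c.
    chunks = [[list(d[c * size:(c + 1) * size]) for c in range(n)]
              for d in worker_data]

    # Phase 1, scatter-reduce: in step s, worker r sends chunk (r - s) % n
    # to its right neighbour, which adds it to its own copy.
    for step in range(n - 1):
        # Snapshot payloads first so all sends in a step are simultaneous.
        sends = [(r, (r - step) % n, list(chunks[r][(r - step) % n]))
                 for r in range(n)]
        for r, c, payload in sends:
            dst = (r + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]

    # Now worker r holds the fully reduced chunk (r + 1) % n.
    # Phase 2, allgather: pass finished chunks around the ring, overwriting.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, list(chunks[r][(r + 1 - step) % n]))
                 for r in range(n)]
        for r, c, payload in sends:
            chunks[(r + 1) % n][c] = payload

    return [[x for chunk in worker for x in chunk] for worker in chunks]


inputs = [[1, 2, 3, 4, 5, 6],
          [10, 20, 30, 40, 50, 60],
          [100, 200, 300, 400, 500, 600]]
outputs = ring_allreduce(inputs)
```

Each worker sends 2(N-1) chunks of size 1/N of the vector, so per-worker traffic stays close to twice the vector size regardless of the number of workers.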
Horovod Stack
● Plugs into TensorFlow via custom op mechanism
● Uses MPI for worker discovery and reduction coordination
● Uses NVIDIA NCCL for actual reduction on the server and across servers
Horovod Example
import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
loss = ...
opt = tf.train.AdagradOptimizer(0.01)

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to broadcast variables from rank 0 to all other processes during initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir="/tmp/train_logs",
                                       config=config, hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)
Horovod Example Cont.
● Run on a 4 GPU machine:
○ $ mpirun -np 4 python train.py
● Run on 4 machines with 4 GPUs each using Open MPI:
○ $ mpirun -np 16 -x LD_LIBRARY_PATH -H server1:4,server2:4,server3:4,server4:4 python train.py
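The host list passed with -H determines how many processes land on each machine, and the per-machine index is what hvd.local_rank() uses to pin a GPU. The sketch below illustrates that mapping under the assumption that processes are placed host by host, filling each host's slots in order; the helper names are hypothetical, and actual placement depends on the MPI implementation and its mapping options.

```python
def parse_hosts(spec):
    """Parse a 'host:slots,host:slots' string into (host, slots) pairs."""
    hosts = []
    for entry in spec.split(","):
        name, slots = entry.split(":")
        hosts.append((name, int(slots)))
    return hosts


def assign_ranks(hosts):
    """Return (rank, host, local_rank) tuples, filling each host in turn."""
    placements = []
    rank = 0
    for host, slots in hosts:
        for local_rank in range(slots):
            placements.append((rank, host, local_rank))
            rank += 1
    return placements


placements = assign_ranks(
    parse_hosts("server1:4,server2:4,server3:4,server4:4"))
```

Under this assumption, global ranks run from 0 to 15 while local ranks run from 0 to 3 on every machine, so each process on a machine selects a different GPU.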
Debugging - Horovod Timeline
● Discovered that ResNet-152 has a lot of tiny tensors
● Added Tensor Fusion - smart batching that gives large gains (bigger gain on less optimized networks)
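The batching idea behind Tensor Fusion can be sketched as packing many small tensors into one buffer, reducing the buffer once, and splitting the result back out. The code below is an illustrative sketch only: the greedy packing policy, the capacity parameter, and all function names are assumptions, and the slides do not describe how Horovod implements this internally.

```python
def fuse(tensors, buffer_capacity):
    """Greedily group consecutive (name, values) tensors into batches of at
    most buffer_capacity elements; an oversized tensor gets its own batch."""
    batches, current, used = [], [], 0
    for name, values in tensors:
        if current and used + len(values) > buffer_capacity:
            batches.append(current)
            current, used = [], 0
        current.append((name, values))
        used += len(values)
    if current:
        batches.append(current)
    return batches


def pack(batch):
    """Flatten a batch into one buffer plus the layout needed to undo it."""
    flat = [v for _, values in batch for v in values]
    layout = [(name, len(values)) for name, values in batch]
    return flat, layout


def unpack(flat, layout):
    """Split a reduced buffer back into named tensors."""
    out, offset = {}, 0
    for name, n in layout:
        out[name] = flat[offset:offset + n]
        offset += n
    return out


tensors = [("a", [1.0]), ("b", [2.0, 3.0]), ("c", [4.0] * 5), ("d", [5.0])]
batches = fuse(tensors, buffer_capacity=4)
flat, layout = pack(batches[0])
restored = unpack(flat, layout)
```

Here four tensors become three reduction calls instead of four; in a network with many tiny tensors the savings in per-call overhead would be much larger.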
Horovod Performance
With Horovod, the same ResNet-101 can be trained for one epoch on ImageNet in 1.5 minutes.
Scaling efficiency improves to 88%, making it about twice as efficient as standard distributed TensorFlow.
Horovod Performance Cont.
RDMA further helps to improve efficiency - by 30% for VGG-16.
Practical Results
● Used learning rate adjustment technique described in the Facebook paper “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”
● Trained convolutional networks and LSTMs in hours instead of days or weeks with the same final accuracy
● You can do that, too!
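The cited Facebook paper proposes a linear scaling rule (multiply the learning rate by the factor by which the minibatch grows) combined with a gradual warmup. The sketch below shows one way to express that schedule; the function name and the default of five warmup epochs are assumptions for illustration, and the slides do not say which exact settings were used at Uber.

```python
def scaled_learning_rate(base_lr, num_workers, epoch, warmup_epochs=5):
    """Linear scaling with gradual warmup.

    The target rate is base_lr * num_workers. During warmup the rate ramps
    linearly from base_lr to the target; epoch may be fractional.
    """
    target = base_lr * num_workers
    if epoch >= warmup_epochs:
        return target
    return base_lr + (target - base_lr) * epoch / warmup_epochs


start = scaled_learning_rate(0.1, 8, epoch=0)
midway = scaled_learning_rate(0.1, 8, epoch=2.5)
after = scaled_learning_rate(0.1, 8, epoch=5)
```

With Horovod, num_workers would typically be the total number of processes, since each process contributes its own minibatch to every step.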
Giving Back
Horovod is available on GitHub today
https://github.com/uber/horovod
Thank you!
Learn more about Horovod on our Eng Blog: https://eng.uber.com/horovod
Learn more about ML at Uber on YouTube: http://t.uber.com/ml-meetup
Proprietary and confidential © 2017 Uber Technologies, Inc. All rights reserved. No part of this
document may be reproduced or utilized in any form or by any means, electronic or mechanical,
including photocopying, recording, or by any information storage or retrieval systems, without
permission in writing from Uber. This document is intended only for the use of the individual or entity
to whom it is addressed and contains information that is privileged, confidential or otherwise exempt
from disclosure under applicable law. All recipients of this document are notified that the information
contained herein includes proprietary and confidential information of Uber, and recipient may not
make use of, disseminate, or in any way disclose this document or any of the enclosed information to
any person other than employees of addressee to the extent necessary for consultations with
authorized personnel of Uber.
