The Bitter Lesson of ML Pipelines
1. The Bitter Lesson of ML Pipelines
jim_dowling
CEO @ Logical Clocks
Assoc Prof @ KTH
Senior Researcher @ RISE
WASP4ALL – Future Computing Platforms for X
Nov 2019
2. “Methods that scale with computation are the future of AI”*
Rich Sutton (Founding Father of Reinforcement Learning), May 2018
* https://www.youtube.com/watch?v=EeMCEQa85tw
3. Massive Increase in Compute for AI*
[Chart: compute used in the largest AI training runs, showing a 3.5-month doubling time]
*https://blog.openai.com/ai-and-compute
4. Distributed Systems are important for Deep Learning
[Diagram: Distributed Deep Learning at the center, connected to:]
● Distributed Training
● Hyperparameter Optimization
● Parallel Experiments
● AutoML
● Larger Training Datasets
● Elastic Model Serving
● (Commodity) GPU Clusters
5. The Bitter Lesson
“The biggest lesson … is that general methods that leverage computation are ultimately the most effective, and by a large margin…
The two (general purpose) methods that seem to scale … are search and learning.”
Rich Sutton, March 2019
http://www.incompleteideas.net/IncIdeas/BitterLesson.html
7. Learning needs structure
● In learning theory, the No Free Lunch theorem* tells us that without structure (innate priors), it is very difficult to learn anything.
● Warning! Structure is not free - it adds assumptions about the data that may not hold for all of your data.
*Free lunch today for all WASP4ALL attendees
8. What do you mean by Structure?
● By structure, we mean prior knowledge
○ Not just a prior probability
1. Innate priors
○ A linear model assumes the data is linear.
○ The convolution/pooling assumption in Convolutional Neural Nets (see the parameter-count sketch after this list).
2. Some structure can be computed dynamically
○ Semi-/self-supervised learning
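To make the innate-prior idea concrete, here is a back-of-the-envelope sketch (plain Python, illustrative layer sizes, not from the talk): the convolution/pooling assumption of locality and weight sharing cuts a layer's parameter count by roughly two orders of magnitude relative to a fully connected layer on the same input.

```python
# Illustrative parameter counts for a 32x32x3 input (e.g. CIFAR-10).
H, W, C = 32, 32, 3   # input height, width, channels
units = 64            # output units / feature maps
k = 3                 # convolution kernel size

# Fully connected layer: every input value connects to every output unit.
dense_params = H * W * C * units        # 196,608 weights
# Convolutional layer: one small k x k kernel shared across all positions.
conv_params = k * k * C * units         # 1,728 weights

print(dense_params, conv_params)        # 196608 vs 1728
```

The smaller hypothesis space is exactly the structure being traded: the layer learns from far less data, but only where the locality and translation-invariance assumptions actually hold.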
9. The Trend: Less Structure and More Data/Compute
● There is a trade off between the
amount of structure you need to put in
your learning systems and the amount
of training data and compute available.
● Recent increases in the amount of
available training data for supervised
ML and decreasing sample complexity
for some reinforcement learning
domains means you need less
structure.
[Figure: the trade-off between Structure and Data/Compute]
10. Self-supervised is SoTA in Image Classification
Pre-trained with 3.5B weakly labeled Instagram images, using 256 TPU v3s for 3.5 days.
13. Not all Structure can be learned…
● We need meta-methods that can find and capture complexity
● For Deep Learning, these meta-methods must scale on GPUs
○ Convolutional Neural Network
○ Transformer
14. Structure that doesn’t scale (yet): Capsule Networks*
Algorithmic bottlenecks for GPUs*:
“votes are ‘routed’ using the Expectation-Maximization algorithm”
ML Framework limitations*:
“[ml frameworks] are structured around calls to large monolithic kernels”
*Machine Learning Systems are Stuck in a Rut, Barham P. and Isard M, HotOS’19
[Diagram: the TensorFlow/CUDA stack — user programs on top of TensorFlow and XLA, compiled down to cuDNN kernels, warp threads, and SIMD lanes (SMs). ConvNets map cleanly onto this stack; CapsuleNets do not.]
16. Searching for Structure
● We can also search for better hyperparameters with genetic algorithms, reinforcement learning, etc. (a toy version is sketched below)
ImageNet SoTA, March 2018 (Quoc Le et al.)
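A minimal sketch of such a search loop (a toy genetic algorithm; `train_and_score` is a hypothetical stand-in for a real training run, not the neural architecture search from the cited work):

```python
import random

def train_and_score(lr, dropout):
    # Placeholder objective with a peak near lr=0.01, dropout=0.3;
    # in practice this would be a full training run.
    return -1e4 * (lr - 0.01) ** 2 - (dropout - 0.3) ** 2

def mutate(parent):
    # Perturb a parent's hyperparameters within sane bounds.
    return {"lr": max(1e-5, parent["lr"] * random.uniform(0.5, 2.0)),
            "dropout": min(0.9, max(0.0, parent["dropout"] + random.uniform(-0.1, 0.1)))}

population = [{"lr": 10 ** random.uniform(-5, -1),
               "dropout": random.uniform(0.0, 0.9)} for _ in range(8)]

for generation in range(10):
    # Keep the fittest half as parents, refill with mutated offspring.
    parents = sorted(population, key=lambda p: train_and_score(**p), reverse=True)[:4]
    population = parents + [mutate(random.choice(parents)) for _ in range(4)]

best = max(population, key=lambda p: train_and_score(**p))
print(best)
```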
17. The Bitter Lesson as a Research Roadmap
1. Scale out data and computation to reduce the amount of structure.
○ Learn as much structure as possible.
2. Structure we introduce should be minimal meta-methods that scale out on both accelerators and distributed systems.
18. Distributed Systems Research on ML at KTH/RISE
● Continuous Deep Analytics
○ ARCON (RISE, KTH – P. Carbone, S. Haridi)
● Distributed Deep Learning
○ Hopsworks (Logical Clocks AB, KTH – J. Dowling, V. Vlassov, A. Payberah)
● Scalable Data Management for ML
○ HopsFS and the Feature Store (Logical Clocks AB)
https://dcatkth.github.io/
26. Problem: PySpark is inefficient with Early Stopping
● PySpark’s bulk-synchronous execution model prevents efficient use of early stopping for hyperparameter optimization.
New Framework? Fix PySpark?
27. Solution: Long-Running Tasks and an RPC Framework
[Diagram: the Driver runs the Optimizer; long-running tasks execute Trial 1 … Trial N behind a Barrier, streaming Metrics back to the Driver and receiving a New Trial as soon as one finishes or is early-stopped]
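The pattern, sketched with plain Python threads standing in for long-running Spark executors (all names here, e.g. `Optimizer` and `should_stop`, are illustrative, not the Maggy API):

```python
import threading

class Optimizer:
    """Driver-side optimizer: hands out trials and early-stops stragglers."""
    def __init__(self, trials):
        self._lock = threading.Lock()
        self._pending = list(trials)
        self.results = []

    def next_trial(self):
        with self._lock:
            return self._pending.pop() if self._pending else None

    def report(self, metric):
        with self._lock:
            self.results.append(metric)

    def should_stop(self, metric):
        # Early-stop any trial that falls well behind the best seen so far.
        with self._lock:
            return bool(self.results) and metric < max(self.results) - 0.5

def worker(opt):
    # Long-running task: loops over trials instead of running exactly one,
    # so capacity freed by early stopping is reused immediately.
    while (trial := opt.next_trial()) is not None:
        metric = 0.0
        for epoch in range(1, 11):
            metric = 1.0 - 1.0 / (100 * trial["lr"] * epoch)  # fake learning curve
            if opt.should_stop(metric):    # driver-side decision, per epoch
                break
        opt.report(metric)

opt = Optimizer([{"lr": lr} for lr in (0.1, 0.05, 0.01, 0.005, 0.001)])
workers = [threading.Thread(target=worker, args=(opt,)) for _ in range(2)]
for w in workers: w.start()
for w in workers: w.join()
print(f"best metric: {max(opt.results):.3f}")
```

In the one-task-per-trial, bulk-synchronous model, an early-stopped slot would sit idle until the whole stage finishes; here the metrics channel and trial hand-out replace that barrier.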
31. Parallel Ablation Studies
[Diagram: Titanic-style features (Pclass, name, sex, survived) selected from the Feature Store for ablation]
Replacing the Maggy Optimizer with an Ablator:
● Feature Ablation using the Feature Store (sketched below)
● Leave-One-Layer-Out Ablation
● Leave-One-Component-Out (LOCO)
Sina Sheikholeslami https://castor-software-days-2019.github.io/sina
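A minimal sketch of leave-one-feature-out ablation (scikit-learn on a Titanic-style DataFrame `df`; illustrative only, not the Maggy Ablator API):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def feature_ablation(df, features, label):
    """Retrain once per dropped feature and compare to the full baseline."""
    def score(cols):
        model = RandomForestClassifier(n_estimators=50, random_state=0)
        return cross_val_score(model, df[cols], df[label], cv=3).mean()

    baseline = score(features)
    for feature in features:
        s = score([f for f in features if f != feature])
        print(f"without {feature!r}: {s:.3f} (baseline {baseline:.3f})")

# Usage (assuming numeric, encoded Titanic features):
# feature_ablation(df, ["pclass", "sex", "age", "fare"], "survived")
```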
34. Hopsworks End-to-End ML Pipelines
[Diagram: Data Pipelines (Ingest & Prep) → Feature Store → Machine Learning Experiments (Hyperparameter Optimization, Ablation Studies, Data-Parallel Training) → Model Serving]
Machine Learning Experiments are the bottleneck, due to
• their iterative nature
• the human-in-the-loop
35. DataPrep Pipelines and Model Training Pipelines
Dataprep Pipeline (Airflow): Feature Engineering → Feature Store
Training and Deployment Pipeline (Airflow): Select Features → Experiment, Train Model → Validate & Deploy Model
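As a minimal sketch, the training-and-deployment half expressed as an Airflow DAG (task names and callables are illustrative, not the Hopsworks pipeline definitions; the operator import path is the Airflow 1.x one current when this talk was given):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x path

# Hypothetical step functions; in Hopsworks these would typically launch
# PySpark jobs rather than run inside the Airflow worker itself.
def train_model():
    print("select features, run experiment, train model")

def validate_and_deploy():
    print("validate model, deploy to model serving")

with DAG("training_and_deployment_pipeline",
         start_date=datetime(2019, 11, 1),
         schedule_interval="@daily") as dag:
    train = PythonOperator(task_id="experiment_train_model",
                           python_callable=train_model)
    deploy = PythonOperator(task_id="validate_and_deploy_model",
                            python_callable=validate_and_deploy)
    train >> deploy  # deploy only after training succeeds
```

The dataprep DAG is analogous, ending in a write to the Feature Store instead of a deployment.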
36. www.hops.site
RISE Data Center: 1 PB storage, 24 GPUs, 2000 CPUs, 1500+ users
Register for a free account with your student/work email address:
www.hops.site
37. Hopsworks
Development & Operations
● Development Environment: first-class Python support
● Version Everything: code, infrastructure, data
● Model Serving on Kubernetes: TF Serving, SkLearn
● End-to-End ML Pipelines: orchestrated by Airflow
Security & Governance
● Secure Multi-Tenancy: project-based restricted access
● Encryption At-Rest, In-Motion: TLS/SSL everywhere
● AI-Asset Governance: models, experiments, data, GPUs
● Data/Model/Feature Lineage: discover/track dependencies
Efficiency & Performance
● Feature Store: data warehouse for ML
● Distributed Deep Learning: faster with more GPUs
● HopsFS: NVMe speed with Big Data
● Horizontally Scalable: ingestion, dataprep, training, serving
38. Acknowledgements and References
Slides and Diagrams from colleagues:
● Maggy: Moritz Meister, Sina Sheikholeslami, Robin Andersson, Kim Hammar
References
● HopsFS: Scaling hierarchical file system metadata …, USENIX FAST 2017.
● Size matters: Improving the performance of small files …, ACM Middleware 2018.
● ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata, CCGrid, 2019.
● Hopsworks Demo, SysML 2019.
40. WASP Course on Large Scale Machine Learning
● http://wasp-sweden.org/large-scale-machine-learning-6-credits/
○ Dr. Raazesh Sainudiin and Dr. Amir Payberah
○ Autumn 2020
41. Thank you!
Register for a free account at
www.hops.site
Twitter
@logicalclocks
@hopsworks
GitHub
https://github.com/logicalclocks/hopsworks
https://github.com/hopshadoop/hops