Hopsworks is a platform for designing and operating end-to-end machine learning pipelines using PySpark and TensorFlow/PyTorch. Early access is now available on GCP. Hopsworks includes the industry's first Feature Store and is open source.
2. Hopsworks Technical Milestones
“If you’re working with big data and Hadoop, this one paper could repay your investment in the Morning Paper many times over... HopsFS is a huge win.”
- Adrian Colyer, The Morning Paper
Timeline of milestones (2017–2019):
● Winner of the IEEE Scale Challenge 2017 with HopsFS (1.2m ops/sec)
● World’s fastest HDFS, published at USENIX FAST with Oracle and Spotify
● World’s first Hadoop platform to support GPUs-as-a-Resource
● World’s first distributed filesystem to store small files in metadata on NVMe disks
● World’s first open-source Feature Store for Machine Learning
● World’s most scalable POSIX-like hierarchical filesystem, with multi-data-center availability and 1.6m ops/sec on GCP
● First non-Google ML platform with TensorFlow Extended (TFX) support through Beam/Flink
● World’s first unified hyperparameter and ablation study parallel programming framework
3. Evolution of Distributed Filesystems
● Past (NFS, HDFS): single-DC, strongly consistent metadata, POSIX filesystem
● Present (S3, GCS): multi-DC, eventually consistent metadata, object store
● Present (HopsFS): multi-DC, strongly consistent metadata, POSIX-like filesystem
4. Why HopsFS?
● Distributed FS
○ Needed for parallel ML experiments, distributed training, and the Feature Store
● Provenance/free-text search
○ Change Data Capture API
● Performance
○ > 1.6m ops/sec over 3 AZs on GCP using Spotify’s Hadoop workload
○ NVMe for small files stored in metadata
● HDFS API
○ TensorFlow/Keras/PySpark/Beam/Flink/PyTorch (Petastorm); see the sketch below
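Because HopsFS implements the HDFS API, existing frameworks can read training data in place. A minimal sketch (the hdfs:// path and file pattern are hypothetical) of TensorFlow reading TFRecords directly from HopsFS:

import tensorflow as tf

# Hypothetical path inside a Hopsworks project; HopsFS speaks the HDFS protocol,
# so HDFS-aware clients (TensorFlow, Spark, Beam, Petastorm) can read it directly.
TRAIN_FILES = "hdfs:///Projects/MyProj/Training_Datasets/titanic_training_dataset/*.tfrecords"

# Build a tf.data input pipeline over the TFRecord shards stored in HopsFS.
files = tf.data.Dataset.list_files(TRAIN_FILES)
dataset = tf.data.TFRecordDataset(files).batch(64).prefetch(1)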
11. What is Hopsworks?
Development & Operations
● Notebooks for Development: first-class Python support
● Version Everything: code, infrastructure, data
● Model Serving on Kubernetes: TF Serving, MLeap, SkLearn
● End-to-End ML Pipelines: orchestrated by Airflow
Governance & Compliance
● Secure Multi-Tenancy: project-based restricted access
● Encryption At-Rest, In-Motion: TLS/SSL everywhere
● AI-Asset Governance: models, experiments, data, GPUs
● Data/Model/Feature Lineage: discover/track dependencies
Elasticity & Performance
● Feature Store: data warehouse for ML
● Distributed Deep Learning: faster with more GPUs
● HopsFS: NVMe speed with Big Data
● Horizontally Scalable: ingestion, data prep, training, serving
12. Which services require Distributed Metadata (HopsFS)?
(Repeats the service map from the previous slide.)
21. Hopsworks Feature Store
Feature Management
● Statistics, discovery, feature CRUD (add/remove features), access control, time travel, and feature data validation
Storage
● MySQL Cluster: metadata and online features
● Apache Hive columnar DB: offline features
● Training data: S3, HDFS
● Models: HopsFS
Access
● Data Engineers ingest feature data from Pandas or PySpark DataFrames, or from an external DB via a feature definition (select ..)
● Data Scientists and batch apps discover features, create training data, save models, and read online/offline/on-demand features and historical feature values
● Online apps read online features; JDBC access for SAS, R, etc.
23. Register a Feature Group with the Feature Store
from hops import featurestore

titanic_df = ...  # Spark or Pandas DataFrame
# Do feature engineering on titanic_df

# Register the DataFrame as a feature group
featurestore.create_featuregroup(titanic_df, "titanic_data_passengers")
24. Create Training Datasets using the Feature Store
from hops import featurestore

sample_data = featurestore.get_features(["name", "Pclass", "Sex", "balance"])

featurestore.create_training_dataset(sample_data, "titanic_training_dataset",
    data_format="tfrecords", training_dataset_version=1)

# Use the training dataset
dataset_dir = featurestore.get_training_dataset_path("titanic_training_dataset")
s = featurestore.get_training_dataset_tf_record_schema("titanic_training_dataset")
31. Explicit vs Implicit Provenance
● Explicit: TFX, MLflow
○ Wrap existing code in components that execute a stage in the pipeline
○ Interceptors in the components inject metadata into a metadata store as data flows through the pipeline
○ Store metadata about artefacts and executions
● Implicit: Hopsworks with ePipe*
(Diagram: DataPrep and Train stages read and write under HopsFS directories such as /Training_Datasets, /Experiments, and /Models; ePipe’s Change Data Capture API replicates the file-system metadata to Elasticsearch.)
*ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata, CCGrid, 2019.
32. Provenance in Hopsworks
● Implicit Provenance
○ Applications read/write files to HopsFS; artefacts are inferred from pathname conventions and xattrs
● Explicit Provenance in Hops API
○ Executions:
hops.experiment.grid_search(…), hops.experiment.collective_allreduce(…)
○ FeatureStore:
hops.featurestore.create_training_dataset(…)
○ Saving and deploying models:
hops.serving.export(export_path, "model_name", 1)
hops.serving.deploy("…/model.pb", destination)
36. /Models
● Named/versioned model management for TensorFlow/Keras and Scikit-Learn (see the export sketch after the layout below)
● A Models dataset can be securely shared with other projects or the whole cluster
● The provenance API returns the conda.yml and the execution used to train a given model
/Projects/MyProj
└ Models
└ <name>
└ <version>
├─ saved_model.pb
└─ variables/
...
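A minimal sketch of exporting a trained Keras model into this versioned layout, using the hops.serving.export call shown on the provenance slide (the tiny model and the model name are illustrative):

import tensorflow as tf
from hops import serving

# Tiny illustrative model; in practice this is the trained model from an experiment.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(3,))])

# Write a SavedModel (saved_model.pb + variables/) to an export path.
export_path = "titanic_model/1"
tf.saved_model.save(model, export_path)

# Register it as version 1 of a named model in the project's Models dataset.
serving.export(export_path, "titanic_model", 1)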
37. Model Serving
● Model Serving on Google Cloud AI Platform
○ Export models to Cloud Storage buckets
● On-Premise
○ Serve models on Kubernetes
○ TensorFlow Serving, Scikit-Learn (see the request sketch below)
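Once a model is deployed behind TensorFlow Serving (on Kubernetes or elsewhere), clients call its standard REST predict endpoint. A minimal sketch, with a hypothetical host and model name:

import requests

# Hypothetical TensorFlow Serving endpoint for a model named "titanic".
url = "http://serving.example.com:8501/v1/models/titanic:predict"
payload = {"instances": [[3, 1, 22.0]]}  # one feature vector per instance

response = requests.post(url, json=payload)
print(response.json()["predictions"])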
39. Principles of Development with Notebooks
● No throwaway code
● Code can be run either in Notebooks or as Jobs (in Pipelines)
● Notebooks/jobs should be parameterizable
● No external configuration for program logic
○ HPARAMs are part of the program
● Core training loop should be the same for all stages of development
○ Data exploration, hyperparameter tuning, distributed training (see the sketch below)
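A minimal sketch of these principles (names and values are illustrative): hyperparameters live in the program as an overridable dictionary, and the same train() entry point is reused from a notebook cell, a parameterized job, or a pipeline step.

# Default hyperparameters are part of the program, not an external config file.
HPARAMS = {"lr": 0.001, "batch_size": 64, "epochs": 5}

def train(overrides=None):
    """Single training entry point reused for exploration, tuning, and jobs."""
    hp = {**HPARAMS, **(overrides or {})}
    # ... build the dataset, model, and optimizer from hp, then fit and evaluate ...
    return {"loss": 0.0}  # placeholder metric

# A notebook cell, job, or pipeline step overrides only what it needs.
train({"lr": 0.01})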
40. Problem in ML Pipeline Development Process?
(Diagram: development moves from "Explore Data, Train Model" in a notebook, to "Hyperparameter Opt.", to "Distributed Training" maintained as Python + YAML in Git; each step requires porting/rewriting code, so iteration is hard, impossible, or a bad idea.)
41. Notebooks offer more than Prototyping
● Familiar web-based development environment
● Interactive development/debugging
● Reporting platform for ML applications (Papermill by Netflix)
● Parameterizable as part of pipelines (Papermill by Netflix)
Disclaimer: Notebooks are not for everyone
42. ML Pipelines of Jupyter Notebooks with Airflow
(Diagram: Airflow orchestrates two pipelines around the Feature Store: a DataPrep pipeline (Feature Engineering) that writes to the Feature Store, and a Training and Deployment pipeline (Select Features and File Format; Experiment, Train Model; Validate & Deploy Model) that reads from it. A sketch of such a DAG follows.)
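A minimal sketch of such a pipeline, driving parameterized notebooks with Papermill from Airflow BashOperator tasks (the notebook paths and parameters are hypothetical):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("titanic_training_pipeline", start_date=datetime(2019, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:

    # Each stage is a parameterized notebook executed with Papermill.
    feature_engineering = BashOperator(
        task_id="feature_engineering",
        bash_command="papermill feature_eng.ipynb out/feature_eng.ipynb -p version 1",
    )
    train_model = BashOperator(
        task_id="train_model",
        bash_command="papermill train.ipynb out/train.ipynb -p dropout 0.4",
    )
    validate_and_deploy = BashOperator(
        task_id="validate_and_deploy",
        bash_command="papermill deploy.ipynb out/deploy.ipynb",
    )

    feature_engineering >> train_model >> validate_and_deploy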
46. GPU(s) in PySpark Executor, Driver coordinates
PySpark makes it easier to write TensorFlow/Keras/PyTorch code that can either run on a single GPU or scale out to many GPUs for parallel experiments or distributed training. (Diagram: the Driver coordinates the Executors, each of which holds the GPU(s).)
47. Need a Distributed Filesystem for Coordination
The Driver and Executors (1..N) coordinate through HopsFS, which stores:
● Training/test datasets
● Model checkpoints and trained models
● Experiment run data
● Provenance data
● Application logs
Downstream consumers include Model Serving and TensorBoard.
50. Leave code unchanged, but configure 4 Executors
(Diagram: the same print("Hello from GPU") code runs unchanged on each of the four Executors, coordinated by the Driver. A sketch follows.)
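A minimal sketch of this pattern with the hops library (experiment.launch is assumed here as the generic entry point that runs a function on the allocated executors; the later slides show the grid_search and collective_allreduce variants):

from hops import experiment

def task():
    # Unchanged user code; whether it runs on 1 or 4 executors is purely
    # a matter of the Spark/executor configuration for the job.
    print("Hello from GPU")

# The driver ships task() to each configured executor and collects the results.
experiment.launch(task)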
55. TensorFlow Distributed Training with PySpark
def train():
# Separate shard of dataset per worker
# create Estimator w/ DistribStrategy
# as CollectiveAllReduce
# train model, evaluate
return loss
# Driver code below: builds TF_CONFIG and shares it with the workers
from hops import experiment
experiment.collective_allreduce(train)
More details: https://github.com/logicalclocks/hops-examples
56. Undirected Hyperparam Search with PySpark
def train(dropout):
# Same dataset for all workers
# create model and optimizer
# add this worker’s value of dropout
# train model and evaluate
return loss
# Driver code below here
from hops import experiment
args = {"dropout": [0.1, 0.4, 0.8]}
experiment.grid_search(train, args)
More details: https://github.com/logicalclocks/hops-examples
57. Directed Hyperparam Search with PySpark
def train(dropout):
# Same dataset for all workers
# create model and optimizer
optimizer.apply(dropout)
# train model and evaluate
return loss
from hops import experiment
args = {"dropout": "0.1-0.8"}
experiment.diff_ev(train, args)
More details: https://github.com/logicalclocks/hops-examples
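For concreteness, a minimal sketch of what such a train(dropout) function might look like with Keras (the dataset, architecture, and epoch count are illustrative; only the returned loss is what the experiment API above consumes):

import tensorflow as tf

def train(dropout):
    # Illustrative data; in practice this would be read from the Feature Store/HopsFS.
    (x_train, y_train), (x_val, y_val) = tf.keras.datasets.mnist.load_data()
    x_train, x_val = x_train / 255.0, x_val / 255.0

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(dropout),  # hyperparameter under search
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=1, verbose=0)

    loss, _ = model.evaluate(x_val, y_val, verbose=0)
    return loss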
60. Iterative Model Development
(Diagram: a dataset feeds a machine-learning model, an optimizer updates it, and the result is evaluated. The loop spans problem definition, data preparation, model selection, hyperparameters, and model training, repeated as needed.)
61. Maggy: Unified Hparam Opt & Ablation Programming
(Diagram: a controller drives the machine-learning system either with new hyperparameter values from a hyperparameter optimizer or with a new dataset/model architecture from an ablation study, and evaluates each trial. Trials can be synchronous or asynchronous, search can be directed or undirected, and search strategies/optimizers are user-defined.)
62. Maggy – Async, Parallel ML experiments using PySpark
● Experiment-driven, interactive development of ML applications
● Parallel experimentation
○ Hyperparameter optimization
○ Ablation studies
○ Leave-one-feature-out
● Interactive debugging
○ Experiment/executor logs shown both in Jupyter notebooks and in the log files (see the sketch below)
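A minimal sketch of a Maggy hyperparameter search (the Searchspace and experiment.lagom names follow Maggy's documented API at the time, but the exact argument names and the reporter interface are assumptions here):

from maggy import experiment, Searchspace

# Define the search space over the dropout hyperparameter.
sp = Searchspace(dropout=("DOUBLE", [0.1, 0.8]))

def train(dropout, reporter):
    # ... build the model with this dropout value, train, and evaluate ...
    accuracy = 0.0  # placeholder metric
    return accuracy

# Run parallel, asynchronous trials on the Spark executors.
experiment.lagom(train, searchspace=sp, optimizer="randomsearch",
                 direction="max", num_trials=10, name="dropout_search")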
63. Feature Ablation
● Uses the Feature Store to access the dataset metadata
● Generates Python callables that, once called, create a dataset
○ Removes one feature at a time (see the sketch below)
(Diagram: the full feature set, e.g. PClass, Sex, name, survive, alongside the same set with one feature removed.)
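A minimal sketch of the idea (illustrative only, not the actual Maggy API): generate one dataset-building callable per left-out feature.

import pandas as pd

# Illustrative feature-ablation helper over the Titanic features from the deck.
FEATURES = ["PClass", "Sex", "name"]
LABEL = "survive"
df = pd.DataFrame(columns=FEATURES + [LABEL])  # placeholder; in practice read from the Feature Store

def make_dataset_fn(dropped_feature):
    """Return a callable that builds the dataset without one feature."""
    def dataset_fn():
        kept = [f for f in FEATURES if f != dropped_feature]
        return df[kept], df[LABEL]
    return dataset_fn

# One trial per left-out feature.
trials = {f: make_dataset_fn(f) for f in FEATURES}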
64. Layer Ablation
● Uses a base model function
● Generates Python callables that, once called, create a modified model (see the sketch below)
○ Uses the model configuration to find and remove layer(s)
○ Removes one layer at a time (or one layer group at a time)
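A minimal Keras sketch of the idea (illustrative, not the actual Maggy API): rebuild the base model from layer configurations while skipping the named layer.

import tensorflow as tf

def base_model():
    """Base architecture whose layers are candidates for ablation."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", name="dense_1"),
        tf.keras.layers.Dense(64, activation="relu", name="dense_2"),
        tf.keras.layers.Dense(1, activation="sigmoid", name="output"),
    ])

def make_ablated_model_fn(layer_name, input_dim=8):
    """Return a callable that rebuilds the base model without one named layer."""
    def model_fn():
        model = tf.keras.Sequential()
        model.add(tf.keras.Input(shape=(input_dim,)))
        for layer in base_model().layers:
            if layer.name != layer_name:
                # Re-create each kept layer from its configuration (fresh weights).
                model.add(layer.__class__.from_config(layer.get_config()))
        return model
    return model_fn

# One trial per removed hidden layer.
trials = {name: make_ablated_model_fn(name) for name in ["dense_1", "dense_2"]}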
70. Hopsworks on GCP
● Hopsworks available as an Image for GCP
○ Blog post coming soon with details
● Next Steps
○ Cloud Native Features
■ Elastic add/remove resources (compute/GPU)
■ API integration with Google Model serving
71. That was Hopsworks
Development & Operations
● Development Environment: first-class Python support
● Version Everything: code, infrastructure, data
● Model Serving on Kubernetes: TF Serving, SkLearn
● End-to-End ML Pipelines: orchestrated by Airflow
Security & Governance
● Secure Multi-Tenancy: project-based restricted access
● Encryption At-Rest, In-Motion: TLS/SSL everywhere
● AI-Asset Governance: models, experiments, data, GPUs
● Data/Model/Feature Lineage: discover/track dependencies
Efficiency & Performance
● Feature Store: data warehouse for ML
● Distributed Deep Learning: faster with more GPUs
● HopsFS: NVMe speed with Big Data
● Horizontally Scalable: ingestion, data prep, training, serving
73. Acknowledgements and References
Slides and Diagrams from colleagues:
● Maggy: Moritz Meister and Sina Sheikholeslami
● Feature Store: Kim Hammar
● Beam/Flink on Hopsworks: Theofilos Kakantousis
References
● HopsFS: Scaling hierarchical file system metadata …, USENIX FAST 2017.
● Size matters: Improving the performance of small files …, ACM Middleware 2018.
● ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata, CCGrid, 2019.
● Hopsworks Demo, SysML 2019.
74. Thank you!
470 Ramona St
Palo Alto
https://www.logicalclocks.com
Register for a free account at
www.hops.site
Twitter
@logicalclocks
@hopsworks
GitHub
https://github.com/logicalclocks/hopsworks
https://github.com/hopshadoop/hops