4. H2O.ai
Machine Intelligence
H2O.ai
H2O Company
H2O Software
• Team: 35. Founded in 2012, Mountain View, CA
• Stanford Math & Systems Engineers
• Open Source Software
• Ease of Use via Web Interface
• R, Python, Scala, Spark & Hadoop Interfaces
• Distributed Algorithms Scale to Big Data
5. H2O.ai
Machine Intelligence
Scientific Advisory Council
Dr. Trevor Hastie
Dr. Rob Tibshirani
Dr. Stephen Boyd
• John A. Overdeck Professor of Mathematics, Stanford University
• PhD in Statistics, Stanford University
• Co-author, The Elements of Statistical Learning: Prediction, Inference and Data Mining
• Co-author with John Chambers, Statistical Models in S
• Co-author, Generalized Additive Models
• 108,404 citations (via Google Scholar)
• Professor of Statistics and Health Research and Policy, Stanford University
• PhD in Statistics, Stanford University
• COPPS Presidents’ Award recipient
• Co-author, The Elements of Statistical Learning: Prediction, Inference and Data Mining
• Author, Regression Shrinkage and Selection via the Lasso
• Co-author, An Introduction to the Bootstrap
• Professor of Electrical Engineering and Computer Science, Stanford University
• PhD in Electrical Engineering and Computer Science, UC Berkeley
• Co-author, Convex Optimization
• Co-author, Linear Matrix Inequalities in System and Control Theory
• Co-author, Distributed Optimization and Statistical Learning via the Alternating Direction
Method of Multipliers
7. H2O.ai
Machine Intelligence
What is Machine Learning?
What it is: ✤ “Field of study that gives computers the ability to learn
without being explicitly programmed.” (Samuel, 1959)
✤ “Machine learning and statistics are closely related
fields. The ideas of machine learning, from
methodological principles to theoretical tools, have
had a long pre-history in statistics.” (Jordan, 2014)
✤ M.I. Jordan also suggested the term data science as
a placeholder to call the overall field.
Unlike rules-based systems which require a human
expert to hard-code domain knowledge directly into
the system, a machine learning algorithm learns how
to make decisions from the data alone.
What it’s not:
8. H2O.ai
Machine Intelligence
Classification
Clustering
Machine Learning Overview
• Predict a real-valued response (viral load, weight)
• Gaussian, Gamma, Poisson and Tweedie
• MSE and R^2
• Multi-class or Binary classification
• Ranking
• Accuracy and AUC
• Unsupervised learning (no training labels)
• Partition the data / identify clusters
• AIC and BIC
Regression
10. H2O.ai
Machine Intelligence
ML Model Performance
Test & Train
• Partition the original data (randomly) into a training set
and a test set. (e.g. 70/30)
• Train a model using the “training set” and evaluate
performance on the “test set” or “validation set.”
• Train & test K
models as shown.
• Average the model
performance over
the K test sets.
• Report cross-
validated metrics.
• Regression: R^2, MSE, RMSE
• Classification: Accuracy, F1, H-measure
• Ranking (Binary Outcome): AUC, Partial AUC
K-fold
Cross-validation
Performance
Metrics
11. H2O.ai
Machine Intelligence
What is Deep Learning?
What it is: ✤ “A branch of machine learning based on a set of
algorithms that attempt to model high-level
abstractions in data by using model architectures,
composed of multiple non-linear
transformations.” (Wikipedia, 2015)
✤ Deep neural networks have more than one hidden
layer in their architecture. That’s what’s “deep.”
✤ Very useful for complex input data such as images,
video, audio.
Deep learning architectures, specifically artificial
neural networks (ANNs) have been around since
1980, so they are not new. However, there were
breakthroughs in training techniques that lead to their
recent resurgence (mid 2000’s). Combined with
modern computing power, they are quite effective.
What it’s not:
13. H2O.ai
Machine Intelligence
What is Ensemble Learning?
What it is: ✤ “Ensemble methods use multiple learning algorithms
to obtain better predictive performance that could be
obtained from any of the constituent learning
algorithms.” (Wikipedia, 2015)
✤ Random Forests and Gradient Boosting Machines
(GBM) are both ensembles of decision trees.
✤ Stacking, or Super Learning, is technique for
combining various learners into a single, powerful
learner using a second-level metalearning algorithm.
Ensembles typically achieve superior model
performance over singular methods. However, this
comes at a price — computation time.
What it’s not:
14. H2O.ai
Machine Intelligence
What is Data Science?
Problem
Formulation
• Identify an outcome of interest and the type of task:
classification / regression / clustering
• Identify the potential predictor variables
• Identify the independent sampling units
• Conduct research experiment (e.g. Clinical Trial)
• Collect examples / randomly sample the population
• Transform, clean, impute, filter, aggregate data
• Prepare the data for machine learning — X, Y
• Modeling using a machine learning algorithm (training)
• Model evaluation and comparison
• Sensitivity & Cost Analysis
• Translate results into action items
• Feed results into research pipeline
Collect &
Process Data
Machine Learning
Insights & Action
18. H2O.ai
Machine Intelligence
H2O Software Overview
Speed Matters!
No Sampling
Interactive UI
Cutting-Edge
Algorithms
• Time is valuable
• In-memory is faster
• Distributed is faster
• High speed AND accuracy
• Scale to big data
• Access data links
• Use all data without sampling
• Web-based modeling with H2O Flow
• Model comparison
• Suite of cutting-edge machine learning algorithms
• Deep Learning & Ensembles
• NanoFast Scoring Engine
19. H2O.ai
Machine Intelligence
Current Algorithm Overview
Statistical Analysis
• Linear Models (GLM)
• Cox Proportional Hazards
• Naïve Bayes
Ensembles
• Random Forest
• Distributed Trees
• Gradient Boosting Machine
• R Package - Super Learner
Ensembles
Deep Neural Networks
• Deep Learning
• Auto-encoder
• Anomaly Detection
• Deep Features
• Feed-Forward Neural Network
Clustering
• K-Means
Dimension Reduction
• Principal Component Analysis
Solvers & Optimization
• Generalized ADMM Solver
• L-BFGS (Quasi Newton
Method)
• Ordinary Least-Square Solver
• Stochastic Gradient Descent
Data Munging
• Plyr
• Integrated R-Environment
• Slice, Log Transform
26. H2O.ai
Machine Intelligence
Breast Cancer Diagnostics
Problem
Data
• Goal is to accurately diagnose breast masses
solely on a Fine Needle Aspiration (FNA).
• Binary outcome: Malignant vs. Benign
• Predictor variables describe characteristics of
the cell nuclei present in the FNA image.
• radius, texture, perimeter, area, smoothness,
compactness, concavity, concave points,
symmetry, fractal dimension (10 attributes).
• Calculate mean, standard error and “worst” for
each for a total of 30 features.
• Each training example is labeled as “Malignant”
or “Benign.”
Source: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
27. H2O.ai
Machine Intelligence
Predict Malignancy
Problem
Formulation
• Outcome of interest: Malignancy (0/1)
• Machine Learning task: Binary Classification
• Sampling units: Randomly sampled patients who
have had a fine needle aspirate (FNA) image taken.
• Each patient in the training set contributes a digitized
image of a FNA of a breast mass.
• Nuclear feature extraction: In each image, calculate
the mean, standard error and largest values for
attributes such as “radius” and “texture.”
• Train several machine learning models (in H2O)
• Model evaluation and comparison
• Models are highly accurate (98-99%)
• This work was turned into software called “Xcyt”, which
estimates of the probability of malignancy for each case.
• The Xcyt system has been used at University of Wisconsin
Hospitals.
Collect &
Process Data
Machine Learning
Insights & Action
31. H2O.ai
Machine Intelligence
Super Learner
Why ensembles?
H2O Ensemble Overview
• Regression
• Binary Classification
• Coming soon: Support for multi-class
• H2O Ensemble implements the Super Learner
algorithm.
• The Super Learner algorithm finds the optimal
(based on defined loss) combination of a
collection of base learning algorithms.
• When a single algorithm does not approximate
the true prediction function well.
• Win Kaggle competitions!
ML Tasks