Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Upcoming SlideShare
Intro to H2O Machine Learning in R at Santa Clara University
Next
Download to read offline and view in fullscreen.

Share

H2O with Erin LeDell at Portland R User Group

Download to read offline

Erin LeDell's presentation on scalable machine learning in R with H2O from the Portland R User Group Meetup in Portland, 08.17.15

Related Books

Free with a 30 day trial from Scribd

See all

Related Audiobooks

Free with a 30 day trial from Scribd

See all

H2O with Erin LeDell at Portland R User Group

  1. 1. H2O.ai
 Machine Intelligence Scalable Machine Learning in R with H2O Erin LeDell Ph.D. Portland R User Group August 2015
  2. 2. H2O.ai
 Machine Intelligence Introduction
  3. 3. H2O.ai
 Machine Intelligence Agenda • What/who is H2O.ai? • An Intro to Machine Learning • H2O Machine Learning & Demo • H2O Ensemble Learning & Demo
  4. 4. H2O.ai
 Machine Intelligence H2O.ai H2O Company H2O Software • Team: 35. Founded in 2012, Mountain View, CA • Stanford Math & Systems Engineers • Open Source Software
 • Ease of Use via Web Interface • R, Python, Scala, Spark & Hadoop Interfaces • Distributed Algorithms Scale to Big Data
  5. 5. H2O.ai
 Machine Intelligence Scientific Advisory Council Dr. Trevor Hastie Dr. Rob Tibshirani Dr. Stephen Boyd • John A. Overdeck Professor of Mathematics, Stanford University • PhD in Statistics, Stanford University • Co-author, The Elements of Statistical Learning: Prediction, Inference and Data Mining • Co-author with John Chambers, Statistical Models in S • Co-author, Generalized Additive Models • 108,404 citations (via Google Scholar) • Professor of Statistics and Health Research and Policy, Stanford University • PhD in Statistics, Stanford University • COPPS Presidents’ Award recipient • Co-author, The Elements of Statistical Learning: Prediction, Inference and Data Mining • Author, Regression Shrinkage and Selection via the Lasso • Co-author, An Introduction to the Bootstrap • Professor of Electrical Engineering and Computer Science, Stanford University • PhD in Electrical Engineering and Computer Science, UC Berkeley • Co-author, Convex Optimization • Co-author, Linear Matrix Inequalities in System and Control Theory • Co-author, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
  6. 6. H2O.ai
 Machine Intelligence Intro to ML Part 1 of 4 Scalable Machine Learning in R with H2O
  7. 7. H2O.ai
 Machine Intelligence What is Machine Learning? What it is: ✤ “Field of study that gives computers the ability to learn without being explicitly programmed.” (Samuel, 1959) ✤ “Machine learning and statistics are closely related fields. The ideas of machine learning, from methodological principles to theoretical tools, have had a long pre-history in statistics.” (Jordan, 2014) ✤ M.I. Jordan also suggested the term data science as a placeholder to call the overall field. Unlike rules-based systems which require a human expert to hard-code domain knowledge directly into the system, a machine learning algorithm learns how to make decisions from the data alone. What it’s not:
  8. 8. H2O.ai
 Machine Intelligence Classification Clustering Machine Learning Overview • Predict a real-valued response (viral load, weight) • Gaussian, Gamma, Poisson and Tweedie • MSE and R^2 • Multi-class or Binary classification • Ranking • Accuracy and AUC • Unsupervised learning (no training labels) • Partition the data / identify clusters • AIC and BIC Regression
  9. 9. H2O.ai
 Machine Intelligence Machine Learning Workflow Source: NLTK Example of a supervised machine learning workflow.
  10. 10. H2O.ai
 Machine Intelligence ML Model Performance Test & Train • Partition the original data (randomly) into a training set and a test set. (e.g. 70/30) • Train a model using the “training set” and evaluate performance on the “test set” or “validation set.” • Train & test K models as shown. • Average the model performance over the K test sets. • Report cross- validated metrics. • Regression: R^2, MSE, RMSE • Classification: Accuracy, F1, H-measure • Ranking (Binary Outcome): AUC, Partial AUC K-fold Cross-validation Performance Metrics
  11. 11. H2O.ai
 Machine Intelligence What is Deep Learning? What it is: ✤ “A branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures, composed of multiple non-linear transformations.” (Wikipedia, 2015) ✤ Deep neural networks have more than one hidden layer in their architecture. That’s what’s “deep.” ✤ Very useful for complex input data such as images, video, audio. Deep learning architectures, specifically artificial neural networks (ANNs) have been around since 1980, so they are not new. However, there were breakthroughs in training techniques that lead to their recent resurgence (mid 2000’s). Combined with modern computing power, they are quite effective. What it’s not:
  12. 12. H2O.ai
 Machine Intelligence Deep Learning Architecture Example of a deep neural net architecture.
  13. 13. H2O.ai
 Machine Intelligence What is Ensemble Learning? What it is: ✤ “Ensemble methods use multiple learning algorithms to obtain better predictive performance that could be obtained from any of the constituent learning algorithms.” (Wikipedia, 2015) ✤ Random Forests and Gradient Boosting Machines (GBM) are both ensembles of decision trees. ✤ Stacking, or Super Learning, is technique for combining various learners into a single, powerful learner using a second-level metalearning algorithm. Ensembles typically achieve superior model performance over singular methods. However, this comes at a price — computation time. What it’s not:
  14. 14. H2O.ai
 Machine Intelligence What is Data Science? Problem Formulation • Identify an outcome of interest and the type of task: classification / regression / clustering • Identify the potential predictor variables • Identify the independent sampling units • Conduct research experiment (e.g. Clinical Trial) • Collect examples / randomly sample the population • Transform, clean, impute, filter, aggregate data • Prepare the data for machine learning — X, Y • Modeling using a machine learning algorithm (training) • Model evaluation and comparison • Sensitivity & Cost Analysis • Translate results into action items • Feed results into research pipeline Collect & Process Data Machine Learning Insights & Action
  15. 15. H2O.ai
 Machine Intelligence Source: marketingdistillery.com
  16. 16. H2O.ai
 Machine Intelligence H2O Platform Part 2 of 4 Scalable Machine Learning in R with H2O
  17. 17. H2O.ai
 Machine Intelligence H2O Software H2O is an open source, distributed, Java machine learning library. APIs are available for: R, Python, Scala & JSON
  18. 18. H2O.ai
 Machine Intelligence H2O Software Overview Speed Matters! No Sampling Interactive UI Cutting-Edge Algorithms • Time is valuable • In-memory is faster • Distributed is faster • High speed AND accuracy • Scale to big data • Access data links • Use all data without sampling • Web-based modeling with H2O Flow • Model comparison • Suite of cutting-edge machine learning algorithms • Deep Learning & Ensembles • NanoFast Scoring Engine
  19. 19. H2O.ai
 Machine Intelligence Current Algorithm Overview Statistical Analysis • Linear Models (GLM) • Cox Proportional Hazards • Naïve Bayes Ensembles • Random Forest • Distributed Trees • Gradient Boosting Machine • R Package - Super Learner Ensembles Deep Neural Networks • Deep Learning • Auto-encoder • Anomaly Detection • Deep Features • Feed-Forward Neural Network Clustering • K-Means Dimension Reduction • Principal Component Analysis Solvers & Optimization • Generalized ADMM Solver • L-BFGS (Quasi Newton Method) • Ordinary Least-Square Solver • Stochastic Gradient Descent Data Munging • Plyr • Integrated R-Environment • Slice, Log Transform
  20. 20. H2O.ai
 Machine Intelligence
  21. 21. H2O.ai
 Machine Intelligence H2O on Amazon EC2 H2O can easily be deployed on an Amazon EC2 cluster. The GitHub repository contains example scripts.
  22. 22. H2O.ai
 Machine Intelligence H2O R Interface R code example: Load data
  23. 23. H2O.ai
 Machine Intelligence H2O R Interface R code example: Train and Test a GBM
  24. 24. H2O.ai
 Machine Intelligence H2O Flow Interface
  25. 25. H2O.ai
 Machine Intelligence Oncology Demo Part 3 of 4 Scalable Machine Learning in R with H2O
  26. 26. H2O.ai
 Machine Intelligence Breast Cancer Diagnostics Problem Data • Goal is to accurately diagnose breast masses solely on a Fine Needle Aspiration (FNA). • Binary outcome: Malignant vs. Benign • Predictor variables describe characteristics of the cell nuclei present in the FNA image. • radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, fractal dimension (10 attributes). • Calculate mean, standard error and “worst” for each for a total of 30 features. • Each training example is labeled as “Malignant” or “Benign.” Source: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
  27. 27. H2O.ai
 Machine Intelligence Predict Malignancy Problem Formulation • Outcome of interest: Malignancy (0/1) • Machine Learning task: Binary Classification • Sampling units: Randomly sampled patients who have had a fine needle aspirate (FNA) image taken. • Each patient in the training set contributes a digitized image of a FNA of a breast mass. • Nuclear feature extraction: In each image, calculate the mean, standard error and largest values for attributes such as “radius” and “texture.” • Train several machine learning models (in H2O) • Model evaluation and comparison • Models are highly accurate (98-99%) • This work was turned into software called “Xcyt”, which estimates of the probability of malignancy for each case. • The Xcyt system has been used at University of Wisconsin Hospitals. Collect & Process Data Machine Learning Insights & Action
  28. 28. H2O.ai
 Machine Intelligence Breast Cancer Data in H2O Flow
  29. 29. H2O.ai
 Machine Intelligence Live H2O Demo! https://gist.github.com/ledell Install H2O: install_h2o_simons.R Demo: wisc_diag_breast_cancer_h2o_demo.R
  30. 30. H2O.ai
 Machine Intelligence H2O Ensemble Part 4 of 4 Scalable Machine Learning in R with H2O
  31. 31. H2O.ai
 Machine Intelligence Super Learner Why ensembles? H2O Ensemble Overview • Regression • Binary Classification • Coming soon: Support for multi-class • H2O Ensemble implements the Super Learner algorithm. • The Super Learner algorithm finds the optimal (based on defined loss) combination of a collection of base learning algorithms. • When a single algorithm does not approximate the true prediction function well. • Win Kaggle competitions! ML Tasks
  32. 32. H2O.ai
 Machine Intelligence Super Learner: The setup
  33. 33. H2O.ai
 Machine Intelligence Super Learner: The Algorithm
  34. 34. H2O.ai
 Machine Intelligence H2O Ensemble R Interface R code example: Set up an H2O Ensemble
  35. 35. H2O.ai
 Machine Intelligence H2O Ensemble R Interface R code example: Train and Test an Ensemble
  36. 36. H2O.ai
 Machine Intelligence Live H2O Demo! https://gist.github.com/ledell Install H2O Ensemble: install_h2oEnsemble.R Demo: h2o_ensemble_higgs_demo.R
  37. 37. H2O.ai
 Machine Intelligence Where to learn more? • H2O Online Training (free): http://learn.h2o.ai • H2O Slidedecks: http://www.slideshare.net/0xdata • H2O Video Presentations: https://www.youtube.com/user/0xdata • H2O Community Events & Meetups: http://h2o.ai/events • Machine Learning & Data Science courses: http://coursebuffet.com
  38. 38. H2O.ai
 Machine IntelligenceCustomers • Community • Evangelists 35 Speakers Training 2-Full Days Nov. 9 - 11 http://world.h2o.ai 20% Discount code: h2ocommunity
  39. 39. H2O.ai
 Machine Intelligence Questions? @ledell on Twitter, GitHub erin@h2o.ai http://www.stat.berkeley.edu/~ledell
  • Vhjkjh

    Jun. 29, 2018
  • JayHyunwooJeong

    Mar. 31, 2016
  • DanielEmaasit

    Nov. 3, 2015
  • SimonHo4

    Oct. 10, 2015
  • BowenLi3

    Aug. 20, 2015
  • alexsbresler

    Aug. 19, 2015

Erin LeDell's presentation on scalable machine learning in R with H2O from the Portland R User Group Meetup in Portland, 08.17.15

Views

Total views

2,177

On Slideshare

0

From embeds

0

Number of embeds

233

Actions

Downloads

35

Shares

0

Comments

0

Likes

6

×