Erin LeDell and Chen Huang's presentations from the Intro to Data Science for Non-Data Scientists Meetup at H2O HQ on 08.20.15
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
2. H2O.ai
Machine Intelligence
H2O.ai
H2O Company
• Team: 35. Founded in 2012, Mountain View, CA
• Stanford Math & Systems Engineers
H2O Software
• Open Source Software
• Ease of Use via Web Interface
• R, Python, Scala, Spark & Hadoop Interfaces
• Distributed Algorithms Scale to Big Data
3. Scientific Advisory Council
Dr. Trevor Hastie
• John A. Overdeck Professor of Mathematics, Stanford University
• PhD in Statistics, Stanford University
• Co-author, The Elements of Statistical Learning: Data Mining, Inference, and Prediction
• Co-author with John Chambers, Statistical Models in S
• Co-author, Generalized Additive Models
• 108,404 citations (via Google Scholar)
Dr. Rob Tibshirani
• Professor of Statistics and Health Research and Policy, Stanford University
• PhD in Statistics, Stanford University
• COPSS Presidents’ Award recipient
• Co-author, The Elements of Statistical Learning: Data Mining, Inference, and Prediction
• Author, Regression Shrinkage and Selection via the Lasso
• Co-author, An Introduction to the Bootstrap
Dr. Stephen Boyd
• Professor of Electrical Engineering and Computer Science, Stanford University
• PhD in Electrical Engineering and Computer Science, UC Berkeley
• Co-author, Convex Optimization
• Co-author, Linear Matrix Inequalities in System and Control Theory
• Co-author, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
4. What is Data Science?
Problem Formulation
• Identify an outcome of interest and the type of task: classification / regression / clustering
• Identify the potential predictor variables
• Identify the independent sampling units
Collect & Process Data
• Conduct research experiment (e.g. Clinical Trial)
• Collect examples / randomly sample the population
• Transform, clean, impute, filter, aggregate data
• Prepare the data for machine learning: X, Y
Machine Learning
• Modeling using a machine learning algorithm (training)
• Model evaluation and comparison
• Sensitivity & Cost Analysis
Insights & Action
• Translate results into action items
• Feed results into research pipeline
6. What is Machine Learning?
What it is:
✤ “Field of study that gives computers the ability to learn without being explicitly programmed.” (Samuel, 1959)
✤ “Machine learning and statistics are closely related fields. The ideas of machine learning, from methodological principles to theoretical tools, have had a long pre-history in statistics.” (Jordan, 2014)
✤ M.I. Jordan also suggested the term data science as a placeholder to call the overall field.
What it’s not:
Unlike rules-based systems, which require a human expert to hard-code domain knowledge directly into the system, a machine learning algorithm learns how to make decisions from the data alone.
7. Machine Learning Overview
Regression
• Predict a real-valued response (viral load, weight)
• Gaussian, Gamma, Poisson and Tweedie
• MSE and R^2
Classification
• Multi-class or Binary classification
• Ranking
• Accuracy and AUC
Clustering
• Unsupervised learning (no training labels)
• Partition the data / identify clusters
• AIC and BIC
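As a rough illustration of the regression metrics named on this slide, the Python sketch below computes MSE and R^2 by hand. The function names and the sample values are made up for this example; they do not come from the talk or from H2O.

```python
def mse(actual, predicted):
    """Mean squared error: the average of the squared residuals."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """R^2: 1 minus (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical weights (kg) and a model's predictions for them.
actual = [60.0, 72.0, 81.0, 95.0]
predicted = [62.0, 70.0, 85.0, 93.0]
print("MSE:", mse(actual, predicted))
print("R^2:", r_squared(actual, predicted))
```

Lower MSE is better, while R^2 closer to 1 means the model explains more of the variation in the response.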
9. ML Model Performance
Test & Train
• Partition the original data (randomly) into a training set and a test set (e.g. 70/30).
• Train a model using the “training set” and evaluate performance on the “test set” or “validation set.”
K-fold Cross-validation
• Train & test K models as shown.
• Average the model performance over the K test sets.
• Report cross-validated metrics.
Performance Metrics
• Regression: R^2, MSE, RMSE
• Classification: Accuracy, F1, H-measure
• Ranking (Binary Outcome): AUC, Partial AUC
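The splitting and cross-validation steps on this slide can be sketched in plain Python. This is a minimal illustration under assumptions: the helper names (`train_test_split`, `k_fold_indices`, `cross_validate`) and the `fit_and_score` callback are hypothetical, and H2O's own implementation may differ.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Randomly partition rows into a training set and a test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        rest = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield rest, held_out


def cross_validate(n_rows, k, fit_and_score):
    """Train and test K models, then average the score over the K test sets.

    fit_and_score is a hypothetical callback: it takes (train_indices,
    test_indices), fits a model on the first and returns a metric on the second.
    """
    scores = [fit_and_score(tr, te) for tr, te in k_fold_indices(n_rows, k)]
    return sum(scores) / len(scores)


# 70/30 split of ten example row ids.
train, test = train_test_split(list(range(10)))
print(len(train), "training rows,", len(test), "test rows")
```

Every row lands in exactly one test fold, so each observation is used for evaluation once and for training K-1 times.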
10. What is Deep Learning?
What it is:
✤ “A branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures, composed of multiple non-linear transformations.” (Wikipedia, 2015)
✤ Deep neural networks have more than one hidden layer in their architecture. That’s what’s “deep.”
✤ Very useful for complex input data such as images, video, audio.
What it’s not:
Deep learning architectures, specifically artificial neural networks (ANNs), have been around since 1980, so they are not new. However, breakthroughs in training techniques led to their recent resurgence (mid-2000s). Combined with modern computing power, they are quite effective.
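To make "multiple non-linear transformations" and "more than one hidden layer" concrete, here is a toy forward pass through a network with two hidden layers. The weights are arbitrary numbers chosen for illustration, not a trained model, and all names are hypothetical.

```python
import math


def dense(inputs, weights, biases):
    """One linear layer: a weighted sum of the inputs plus a bias per unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Non-linear transformation applied after each hidden layer."""
    return [max(0.0, v) for v in values]


def sigmoid(v):
    """Squash the final score into a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-v))


# Arbitrary illustrative weights (assumption: a real network learns these).
W1, B1 = [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]   # hidden layer 1
W2, B2 = [[1.0, 1.0], [-1.0, 1.0]], [0.0, 0.0]   # hidden layer 2
W3, B3 = [[1.0, -1.0]], [-2.0]                   # output layer


def forward(x):
    """Pass an input through two hidden layers, then the output unit."""
    h1 = relu(dense(x, W1, B1))
    h2 = relu(dense(h1, W2, B2))
    return sigmoid(dense(h2, W3, B3)[0])


print(forward([2.0, 1.0]))
```

Training, which is where the breakthroughs mentioned above apply, would adjust the weights; this sketch only shows how a prediction flows through the layers.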
12. What is Ensemble Learning?
What it is:
✤ “Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms.” (Wikipedia, 2015)
✤ Random Forests and Gradient Boosting Machines (GBM) are both ensembles of decision trees.
✤ Stacking, or Super Learning, is a technique for combining various learners into a single, powerful learner using a second-level metalearning algorithm.
What it’s not:
Ensembles typically achieve superior model performance over singular methods. However, this comes at a price: computation time.
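A minimal sketch of the ensemble idea, assuming three hypothetical base classifiers that each output a 0/1 label: the ensemble predicts by majority vote. This shows simple voting only; it is not Random Forest, GBM, or stacking as implemented in H2O.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most base learners."""
    return Counter(predictions).most_common(1)[0][0]


# Three hypothetical base learners, each a simple rule on one feature.
base_learners = [
    lambda x: int(x["age"] > 50),
    lambda x: int(x["bmi"] > 30),
    lambda x: int(x["smoker"]),
]


def ensemble_predict(x):
    """Combine the base learners' votes into a single prediction."""
    return majority_vote([learner(x) for learner in base_learners])


patient = {"age": 63, "bmi": 27, "smoker": True}
print(ensemble_predict(patient))
```

The cost noted on the slide is visible even here: every prediction has to run all of the base learners.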
13. Where to learn more?
• H2O Online Training (free): http://learn.h2o.ai
• H2O Slidedecks: http://www.slideshare.net/0xdata
• H2O Video Presentations: https://www.youtube.com/user/0xdata
• H2O Community Events & Meetups: http://h2o.ai/events
• Machine Learning & Data Science courses: http://coursebuffet.com
14. Customers, Community, Evangelists
November 9, 10, 11
Computer History Museum
H2OWORLD.H2O.AI
20% off registration using code: h2ocommunity
20. Who am I?
• Data Strategist
• Career in Business Intelligence,
Analytics, and Big Data
• Various roles
• Consultant
• Developer
• Business and Data Analyst
• Product Manager
• Functional and Technical Trainer
• Client Services
• Worked in various industries
• Health care, pharmaceutics,
communications and high tech,
consumer products, automotive,
finance, government contracting
August, 2015 – San Francisco, CA
21. Why am I giving this talk?
July, 2011 – Beijing, China
22. Data Science Primer
• What can Data Science do for the Business?
• Applications of Data Science
• Data-Driven Decisions
• What does a Data Scientist do?
• Data Science Skills
23. What can Data Science do for the Business?
A: Data science! Extracting useful information and knowledge from large volumes of data in order to improve business decision-making, or providing the business with insights to make data-driven decisions.
24. What can Data do?
Image: http://www.slideshare.net/andrewgardner5811/big-data-and-the-art-of-data-science
25. Applications of Data Science
Image: http://www.slideshare.net/andrewgardner5811/big-data-and-the-art-of-data-science
26. Data-Driven Decisions
• Practice of basing decisions on data, rather than purely
on intuition
• There is evidence that data-driven decision making and
big data technologies substantially improve business
performance
27. The Art and Science of Data Science
• Discover unknowns in data
• Obtain predictive, actionable insights
• Communicate business data stories
• Build confidence in decision making
• Create valuable Data Products that have business impact
http://www.slideshare.net/datasciencelondon/big-data-sorry-data-science-what-does-a-data-scientist-do
29. What does a Data Scientist do?
• Data curiosity. Explore data. Discover unknowns
• Understand data relationships
• Understands the business, has domain knowledge
• Can tell relevant stories with data
• Holistic view of the business
• Knows machine learning, statistics, probability
• Can hack and code
• Define and test a hypothesis, run experiments
• Asks good questions
http://www.slideshare.net/andrewgardner5811/big-data-and-the-art-of-data-science
39. Machine Learning
Data Science Definition:
• A subfield of computer science and artificial intelligence (AI) that focuses on the design of systems that can learn from and make decisions and predictions based on data.
• Machine learning enables computers to act and make data-driven decisions rather than being explicitly programmed to carry out a certain task.
• Machine Learning programs are also designed to learn and improve over time when exposed to new data.
Business Application:
• Everything!
Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
40. Unsupervised Learning
Data Science Definition:
• Where a program, given a
dataset, can automatically find
patterns and relationships
within the dataset.
• The business decides how deep the grouping goes or how many categories there are.
• Clustering or grouping of like
data.
• Examples: k-means clustering,
hierarchical clustering
Business Application:
• Customer segmentation
• Understanding users and
behaviors
• Classifying unknown and pre-defined images into categories
Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
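As an illustration of k-means clustering, one of the examples named on this slide, here is a small one-dimensional sketch. The spend figures, the starting centers, and the function name are invented for this example, and the number of clusters is chosen up front, in line with the point that the business decides how many categories there are.

```python
def k_means_1d(values, centers, iterations=10):
    """Basic k-means on 1-D data: assign points, then recompute centers."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster.
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical monthly spend per customer; two segments requested.
spend = [5, 7, 6, 48, 52, 50]
centers, clusters = k_means_1d(spend, centers=[0.0, 100.0])
print(centers, clusters)
```

No labels are supplied anywhere: the low-spend and high-spend segments emerge from the data alone.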
41. Supervised Learning
Data Science Definition:
• Where a program is “trained” on a pre-defined dataset.
• Based on its training data, the program can make accurate decisions when given new data.
Business Application:
• Classifying Twitter sentiments
• Recommender systems
Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
42. Score
Data Science Definition:
• A number of ways to evaluate how well the model assigns the correct class value to the test instances.
Business Application:
• Confidence gauge
Definition: https://mlcorner.wordpress.com/tag/scoring/
43. Score Cont.
• True Positive (TP): If the instance is positive and it is classified as positive
• False Negative (FN): If the instance is positive but it is classified as negative
• True Negative (TN): If the instance is negative and it is classified as negative
• False Positive (FP): If the instance is negative but it is classified as positive
• Classification problems:
• Precision = proportion of instances classified as positive that are actually positive = TP/(TP+FP)
• Accuracy = proportion of correctly classified instances = (TP+TN)/(TP+TN+FP+FN)
• Recall or Sensitivity = proportion of all the actual positives that you correctly classify = TP/(TP+FN)
• Specificity = classifier’s ability to identify negative results = TN/(TN+FP)
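The four counts and the formulas above can be worked through in a few lines of Python. This is a sketch with made-up labels (1 = positive, 0 = negative); the function names are illustrative only.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, TN and FP for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    return tp, fn, tn, fp


def classification_metrics(tp, fn, tn, fp):
    """Apply the slide's formulas to the four counts."""
    return {
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical test set: 4 actual positives, 6 actual negatives.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
counts = confusion_counts(actual, predicted)
metrics = classification_metrics(*counts)
print(counts, metrics)
```

With these labels the counts are TP=3, FN=1, TN=5, FP=1, so precision and recall are both 3/4 while accuracy is 8/10.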
44. Classification
Data Science Definition:
• Sub-category of Supervised Learning
• Classification is the process of taking some sort of input and assigning a label to it. The predictions are discrete: categories, or “yes or no” in nature.
• Examples: Logistic Regression, Random Forest
Business Application:
• Which customers should a company target with its marketing campaigns?
• Is this Nigerian prince committing fraud? (Spam classification)
• Is this actually Barack Obama’s Facebook profile and review on Amazon? (Fraud detection)
Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
45. Regression
Data Science Definition:
• Sub-category of Supervised Learning
• Regression is a type of algorithm that predicts continuous values.
Business Application:
• How much would a user spend on a mobile game like CandyCrush?
• How much would someone spend on healthcare out of pocket?
• How many attendees will come to this event based on past registration?
Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
46. Decision Trees
Data Science Definition:
• Using a tree-like graph or model of decisions and their possible consequences.
Business Application:
• Medical Testing (e.g. health incidences, etc.)
• Genealogy breakdowns (e.g. eye color, blood type, etc.)
Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
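To show what a "tree-like model of decisions" looks like in practice, the sketch below hand-writes a tiny tree as nested dictionaries and walks it to reach a decision. The features, thresholds, and outcomes are entirely hypothetical; a real decision tree algorithm would learn them from training data.

```python
# Hypothetical tree: each internal node tests one feature against a threshold.
tree = {
    "feature": "temperature_c",
    "threshold": 38.0,
    "below": {"label": "no further testing"},
    "at_or_above": {
        "feature": "heart_rate",
        "threshold": 100,
        "below": {"label": "monitor"},
        "at_or_above": {"label": "order lab tests"},
    },
}


def predict(node, example):
    """Follow the branches from the root until a leaf label is reached."""
    while "label" not in node:
        if example[node["feature"]] < node["threshold"]:
            node = node["below"]
        else:
            node = node["at_or_above"]
    return node["label"]


print(predict(tree, {"temperature_c": 39.1, "heart_rate": 110}))
```

Each path from the root to a leaf reads as a plain-language rule, which is a large part of why trees are easy to explain to a business audience.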
47. Deep Learning
Data Science Definition:
• A category of machine learning algorithms that often use Artificial Neural Networks to generate models.
Business Application:
• Image classification
• Language processing
• Audio processing
• Outlier and fraud detection
Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple