Machine Learning and Hadoop
Present and Future
Josh Wills, Tom Pierce, and Jeff Hammerbacher
Cloudera Data Science Team
December 17th, 2011
High Availability for Data Scientists




NIPS




                Copyright 2011 Cloudera Inc. All rights reserved
Agenda

• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
  • State of the World
  • Where Things Are Headed
• Part 3: Things Industry Needs From Academia




Industrial Machine Learning




Delta One: Model Evaluation

• ML Systems Are One Piece of a Complex System
• Well-defined objective functions are the exception
   • Multiple, often conflicting goals
   • Weights are fuzzy and shift with business priorities
   • Pareto optimization is the safest play
• Predictive Accuracy Is Only Useful Up to a Point
• Examples
   • Computational advertising
   • Friend recommendations on social networks
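
One way to read the "Pareto optimization" bullet, as a minimal sketch: keep every candidate model that no other candidate beats on all goals at once, and leave the final weighting to the business. The model names, metric names, and scores below are made up for illustration, and every metric is assumed to be "higher is better".

```python
from typing import Dict, List

Metrics = Dict[str, float]

def dominates(a: Metrics, b: Metrics) -> bool:
    """True if a is at least as good as b on every metric and strictly better on one."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)

def pareto_front(candidates: Dict[str, Metrics]) -> List[str]:
    """Names of the candidates that no other candidate dominates."""
    return sorted(
        name
        for name, metrics in candidates.items()
        if not any(
            dominates(other, metrics)
            for other_name, other in candidates.items()
            if other_name != name
        )
    )

# Hypothetical ad-ranking models scored on two conflicting goals.
models = {
    "A": {"revenue": 1.00, "user_satisfaction": 0.70},
    "B": {"revenue": 0.90, "user_satisfaction": 0.85},
    "C": {"revenue": 0.85, "user_satisfaction": 0.80},
}
front = pareto_front(models)
print(front)
```

Here C drops out because B is better on both goals, while A and B both survive because neither beats the other everywhere.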


Delta Two: Systems Precede Algorithms

• Greenfield Projects Hardly Ever Happen
   • (and don’t usually launch)
• Industrial Computational Infrastructure
   • General-purpose
   • Cheap
   • Shared
• Constraints Drive Innovation
   • Vowpal Wabbit Hashing Trick
   • SETI @ Google
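
For readers who have not seen the hashing trick, here is a minimal sketch of the idea: feature names are hashed straight into a fixed number of buckets, so memory stays bounded and no feature dictionary has to be built or shipped. This is not Vowpal Wabbit's code. `zlib.crc32` stands in for whatever hash a real system uses, and `hash_features` and `num_bits` are illustrative names.

```python
import zlib
from typing import Dict, Iterable

def hash_features(tokens: Iterable[str], num_bits: int = 18) -> Dict[int, float]:
    """Map string features into a sparse vector with 2**num_bits slots."""
    size = 1 << num_bits
    vec: Dict[int, float] = {}
    for tok in tokens:
        h = zlib.crc32(tok.encode("utf-8"))  # unsigned 32-bit hash (stand-in choice)
        index = h & (size - 1)               # low bits pick the bucket
        sign = 1.0 if (h >> 31) == 0 else -1.0  # top bit picks a sign so collisions tend to cancel
        vec[index] = vec.get(index, 0.0) + sign
    return vec

# Hypothetical categorical features for one ad impression.
feats = hash_features(["user=42", "ad=7", "user=42"], num_bits=10)
print(feats)
```

The memory cost is fixed by `num_bits` regardless of how many distinct features appear, at the price of occasional collisions.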


Delta Three: Workflow




[Figure: image credited to the Practice Over Theory blog]
Delta Three: Workflow

• Optimize the Overall Process
   • Model fitting is a small piece of the overall flow time
   • Parallelize everything
• Better Features > Better Models
• Fast Model Deployment
   • Common Feature Extraction Logic
   • Servable Models
• Validation as Sanity Checking
   • Deploy to a small subset of real data and evaluate
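
A minimal sketch of what "Common Feature Extraction Logic" might look like in practice: one function owns the feature definitions, and both the offline training path and the online scoring path call it, so the two cannot drift apart. The event fields, function names, and weights are assumptions made for the example.

```python
import math
from typing import Dict, Iterable, List, Sequence

def extract_features(event: Dict[str, float]) -> List[float]:
    """Single source of truth for feature logic, shared offline and online."""
    return [
        math.log1p(event["clicks"]),                        # damped click count
        event["clicks"] / max(event["impressions"], 1.0),   # click-through rate
    ]

def build_training_rows(events: Iterable[Dict[str, float]]) -> List[List[float]]:
    """Offline path: batch feature generation for model fitting."""
    return [extract_features(e) for e in events]

def score_online(event: Dict[str, float], weights: Sequence[float]) -> float:
    """Online path: score one event with a linear model at serving time."""
    return sum(w * x for w, x in zip(weights, extract_features(event)))

event = {"clicks": 3.0, "impressions": 10.0}
rows = build_training_rows([event])
score = score_online(event, weights=[0.0, 1.0])
```

Because both paths go through `extract_features`, a change to a feature definition reaches training and serving together.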


Agenda

• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
  • State of the World
  • Where Things Are Headed
• Part 3: Things Industry Needs From Academia




Hadoop: It’s Where The Data Is




Hadoop Platform: Substrate

• Commodity servers
   • Open Compute
• Open source operating system
   • Linux
• Open source configuration management
   • Puppet
   • Chef
• Coordination service
   • ZooKeeper


Hadoop Platform: Storage

• Distributed schema-less storage
   • HDFS
   • Ceph
• Append-only storage formats and metadata
   • Avro
   • RCFile
   • HCatalog
• Mutable key-value storage and metadata
   • HBase


Hadoop Platform: Integration

• Tool Access
   • FUSE
   • JDBC
   • ODBC
• Data Ingestion
   • Flume
   • Sqoop




ML and Hadoop: The State of the World




Computation: Plain Old MapReduce

• Great for:
   • Data Preparation
   • Feature Engineering
   • Model Validation/Evaluation
• Works For Certain Model Fitting Problems
   • Recommendation Systems
   • Decision Trees (PLANET; Gradient Boosted Decision Trees)
• Not A Practical Option for Online Learning
• Way More Detail from the KDD 2011 Talk
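
To illustrate why model evaluation fits plain MapReduce well, here is a small in-memory sketch: each mapper call looks at one scored record and emits a confusion-matrix key, and the reducer just sums counts per key. The `(label, score)` record format, the 0.5 threshold, and the tiny `run_job` driver are assumptions for the example. On a real cluster the grouping step would be handled by the framework's shuffle.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[int, float]  # (true label, model score); hypothetical input format

def mapper(record: Record, threshold: float = 0.5) -> Iterator[Tuple[str, int]]:
    """Emit one of tp / fp / tn / fn for a single scored record."""
    label, score = record
    predicted = 1 if score >= threshold else 0
    key = ("t" if predicted == label else "f") + ("p" if predicted == 1 else "n")
    yield key, 1

def reducer(key: str, values: List[int]) -> Tuple[str, int]:
    """Sum the counts for one confusion-matrix cell."""
    return key, sum(values)

def run_job(records: Iterable[Record]) -> Dict[str, int]:
    """Local stand-in for the map, shuffle, and reduce phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in groups.items())

records = [(1, 0.9), (1, 0.4), (0, 0.2), (0, 0.7), (1, 0.8)]
counts = run_job(records)
precision = counts["tp"] / (counts["tp"] + counts["fp"])
recall = counts["tp"] / (counts["tp"] + counts["fn"])
```

Each record is processed independently and the reduce step is a plain sum, which is why this kind of job parallelizes without any coordination between mappers.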


Tools for Data Preparation/Feature Engineering

• Languages/Environments
   • PigLatin
   • HiveQL
   • Need to deal with mismatch between offline/online feature
     generation
• Java/Scala APIs
   • Crunch (Cloudera)
   • Scoobi (NICTA)
   • Cascading (Concurrent)
   • Jaql (IBM)

Apache Mahout

• The starting place for MapReduce-based machine
  learning algorithms
   • Not machine-learning-in-a-box
   • Custom tweaks/modifications are the rule
• A disparate collection of algorithms for:
   • Recommendations
   • Clustering
   • Classification
   • Frequent Itemset Mining



Apache Mahout (cont.)

• Best Library: Taste Recommender
   • Oldest project, most widely-deployed in production
   • SVD implementation is particularly active
• Good Libraries: Online SGD
   • Does not use MapReduce
   • Vowpal Wabbit + AllReduce is faster, has L-BFGS option
• Roll Your Own Instead: Naïve Bayes
• Challenges
   • “Secret sauce” effect
   • Delta between Mahout and the cutting edge in ML
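
For context on the "Online SGD" bullet, this is a minimal sketch of online logistic regression trained by stochastic gradient descent: the model sees one example at a time and updates its weights immediately, which is why it needs no MapReduce pass over the data. It is not Mahout's or Vowpal Wabbit's implementation. The feature names, learning rate, and toy stream are assumptions.

```python
import math
from typing import Dict

def predict(weights: Dict[str, float], features: Dict[str, float]) -> float:
    """Logistic prediction for a sparse feature dict."""
    z = sum(weights.get(k, 0.0) * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def sgd_update(weights: Dict[str, float], features: Dict[str, float],
               label: int, lr: float = 0.1) -> float:
    """One online step on the log loss; returns the prediction made before the update."""
    p = predict(weights, features)
    for k, v in features.items():
        weights[k] = weights.get(k, 0.0) - lr * (p - label) * v
    return p

# Toy stream: users who clicked before tend to click again (made-up data).
stream = [
    ({"bias": 1.0, "clicked_before": 1.0}, 1),
    ({"bias": 1.0, "clicked_before": 0.0}, 0),
] * 200

weights: Dict[str, float] = {}
for features, label in stream:
    sgd_update(weights, features, label)

p_returning = predict(weights, {"bias": 1.0, "clicked_before": 1.0})
p_new = predict(weights, {"bias": 1.0, "clicked_before": 0.0})
```

The weight dict grows only as new feature names arrive, so the same loop works on an unbounded stream.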

More Machine Learning Interfaces for Hadoop

• Based on MapReduce
  • SystemML (IBM)
  • AllReduce (Vowpal Wabbit)
• No MapReduce
  • Spark
• R-Based Systems (Augment MapReduce with R)
  • Segue
  • RHIPE
  • RHadoop
  • Ricardo (IBM)

ML and Hadoop: Where Things are Headed




MRv2 and YARN

• Eliminates JobTracker bottleneck
   • Separate Resource Manager/Scheduler
   • Individual jobs have their own task masters
• Moves MapReduce into user-land
• Enables Hadoop clusters to run all sorts of jobs
   • MPI (Hamster; MAPREDUCE-2911)
   • Native BSP (Giraph)
   • Spark
   • AllReduce, GraphLab


Agenda

• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
  • State of the World
  • Where Things Are Headed
• Part 3: Things Industry Needs From Academia




Machine Learning on Multivariate Time Series

• 1e5 writes/sec
• Positive events are relatively rare
• Feature extraction challenge
• May not be clear what the right time horizon is
• Tight SLAs
• Very high stakes
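
When the right time horizon is unclear, one common workaround is to compute the same statistic over several trailing windows and let the model pick. The sketch below does this for a running mean over a single series. The `(timestamp, value)` event format, the horizons, and the feature naming are assumptions, and a production system at 1e5 writes/sec would need a streaming implementation rather than this in-memory loop.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List, Sequence, Tuple

Event = Tuple[float, float]  # (timestamp in seconds, value); hypothetical format

def window_features(events: Iterable[Event], horizons: Sequence[int]) -> List[Dict[str, float]]:
    """For each event, emit the mean value over each trailing window (ts - h, ts].

    Events must arrive sorted by timestamp and horizons must be positive.
    """
    windows: Dict[int, Deque[Event]] = {h: deque() for h in horizons}
    sums: Dict[int, float] = {h: 0.0 for h in horizons}
    rows: List[Dict[str, float]] = []
    for ts, value in events:
        row: Dict[str, float] = {}
        for h in horizons:
            win = windows[h]
            win.append((ts, value))
            sums[h] += value
            # Evict events that fell out of this horizon.
            while win[0][0] <= ts - h:
                _, old_value = win.popleft()
                sums[h] -= old_value
            row[f"mean_{h}s"] = sums[h] / len(win)
        rows.append(row)
    return rows

events = [(0, 1.0), (5, 3.0), (20, 5.0)]
rows = window_features(events, horizons=[10, 60])
```

Each horizon keeps its own deque and running sum, so adding another candidate horizon costs one more window rather than another pass over the data.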

An Academic Language For Feature Engineering

• Feature extraction/selection is as important as model
  fitting
   • e.g., hierarchical feature representation, impact on training
     time and experiment design, feature cost modeling, etc.
• Academic literature on this problem is sparse and
  dispersed across multiple fields
   • NIPS 2003
   • HCI, NLP, Information Retrieval, etc.
• We need a common language for talking about these
  problems across disciplines

A Broader Ontology For Model Selection

• Practical factors that enter into the “best” choice of
  model…
   • Data arrival rate
   • Data volume
   • Scoring latency
   • Model refresh time
   • Robustness/reliability
• …in addition to the standard predictive power/simplicity
  tradeoffs
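
One simple way to fold the practical factors above into model choice is to treat them as hard constraints first and compare predictive power only among the models that pass. This is a sketch under that assumption, not a proposal from the talk. The candidate names, metrics, and limits are invented.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Candidate:
    name: str
    auc: float                 # predictive power on held-out data
    scoring_latency_ms: float  # time to score one request
    refresh_minutes: float     # time to retrain on fresh data

def select_model(candidates: List[Candidate],
                 max_latency_ms: float,
                 max_refresh_minutes: float) -> Optional[Candidate]:
    """Best AUC among candidates that meet the operational limits, else None."""
    feasible = [
        c for c in candidates
        if c.scoring_latency_ms <= max_latency_ms
        and c.refresh_minutes <= max_refresh_minutes
    ]
    return max(feasible, key=lambda c: c.auc, default=None)

candidates = [
    Candidate("boosted_trees", auc=0.82, scoring_latency_ms=40.0, refresh_minutes=240.0),
    Candidate("logistic_regression", auc=0.79, scoring_latency_ms=2.0, refresh_minutes=15.0),
]
chosen = select_model(candidates, max_latency_ms=10.0, max_refresh_minutes=60.0)
nothing = select_model(candidates, max_latency_ms=1.0, max_refresh_minutes=60.0)
```

With a 10 ms latency budget the more accurate model is ruled out before accuracy is even compared, which is the kind of outcome the slide is pointing at.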


Questions?
Want A Job?
  @josh_wills

Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInOutage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInThousandEyes
 
Flow Control | Block Size | ST Min | First Frame
Flow Control | Block Size | ST Min | First FrameFlow Control | Block Size | ST Min | First Frame
Flow Control | Block Size | ST Min | First FrameKapil Thakar
 
Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.IPLOOK Networks
 

Dernier (20)

Keep Your Finger on the Pulse of Your Building's Performance with IES Live
Keep Your Finger on the Pulse of Your Building's Performance with IES LiveKeep Your Finger on the Pulse of Your Building's Performance with IES Live
Keep Your Finger on the Pulse of Your Building's Performance with IES Live
 
Planetek Italia Srl - Corporate Profile Brochure
Planetek Italia Srl - Corporate Profile BrochurePlanetek Italia Srl - Corporate Profile Brochure
Planetek Italia Srl - Corporate Profile Brochure
 
From the origin to the future of Open Source model and business
From the origin to the future of  Open Source model and businessFrom the origin to the future of  Open Source model and business
From the origin to the future of Open Source model and business
 
Graphene Quantum Dots-Based Composites for Biomedical Applications
Graphene Quantum Dots-Based Composites for  Biomedical ApplicationsGraphene Quantum Dots-Based Composites for  Biomedical Applications
Graphene Quantum Dots-Based Composites for Biomedical Applications
 
How to release an Open Source Dataweave Library
How to release an Open Source Dataweave LibraryHow to release an Open Source Dataweave Library
How to release an Open Source Dataweave Library
 
20140402 - Smart house demo kit
20140402 - Smart house demo kit20140402 - Smart house demo kit
20140402 - Smart house demo kit
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
My key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAIMy key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAI
 
Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...
 
Extra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdfExtra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdf
 
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptxEmil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
 
Where developers are challenged, what developers want and where DevEx is going
Where developers are challenged, what developers want and where DevEx is goingWhere developers are challenged, what developers want and where DevEx is going
Where developers are challenged, what developers want and where DevEx is going
 
SheDev 2024
SheDev 2024SheDev 2024
SheDev 2024
 
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptxGraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
 
UiPath Studio Web workshop series - Day 4
UiPath Studio Web workshop series - Day 4UiPath Studio Web workshop series - Day 4
UiPath Studio Web workshop series - Day 4
 
UiPath Studio Web workshop Series - Day 3
UiPath Studio Web workshop Series - Day 3UiPath Studio Web workshop Series - Day 3
UiPath Studio Web workshop Series - Day 3
 
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - TechWebinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
 
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInOutage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
 
Flow Control | Block Size | ST Min | First Frame
Flow Control | Block Size | ST Min | First FrameFlow Control | Block Size | ST Min | First Frame
Flow Control | Block Size | ST Min | First Frame
 
Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.
 

Hadoop and Machine Learning

  • 1. Machine Learning and Hadoop Present and Future Josh Wills, Tom Pierce, and Jeff Hammerbacher Cloudera Data Science Team December 17th, 2011
  • 2. High Availability for Data Scientists NIPS Copyright 2011 Cloudera Inc. All rights reserved
  • 3. Agenda • Part 1: Industrial Machine Learning • Part 2: Machine Learning and Hadoop • State of the World • Where Things Are Headed • Part 3: Things Industry Needs From Academia
  • 4. Industrial Machine Learning
  • 5. Delta One: Model Evaluation • ML Systems Are One Piece of a Complex System • Well-defined objective functions are the exception • Multiple, often conflicting goals • Weights are fuzzy and shift with business priorities • Pareto optimization is the safest play • Predictive Accuracy Is Only Useful Up to a Point • Examples • Computational advertising • Friend recommendations on social networks
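The Pareto point on slide 5 can be illustrated with a minimal sketch. It assumes each candidate model is scored on several goals where higher is better; the model names and the two scores are hypothetical and do not come from the talk.

```python
def dominates(a, b):
    """True if score tuple a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates.

    candidates: dict mapping a model name to a tuple of scores (higher is better).
    """
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other_scores, scores)
            for other, other_scores in candidates.items()
            if other != name
        )
    ]


# Hypothetical scores: (predictive accuracy, some business metric).
models = {"A": (0.90, 0.20), "B": (0.85, 0.35), "C": (0.80, 0.30)}
print(pareto_front(models))  # ['A', 'B']
```

Model C is dropped because B is at least as good on both goals. Choosing between A and B still depends on the business weights, which the slide describes as fuzzy and shifting.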
  • 6. Delta Two: Systems Precede Algorithms • Greenfield Projects Hardly Ever Happen • (and don’t usually launch) • Industrial Computational Infrastructure • General-purpose • Cheap • Shared • Constraints Drive Innovation • Vowpal Wabbit Hashing Trick • SETI @ Google
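The hashing trick named on slide 6 can be sketched roughly as follows. This is an illustration of the general idea, not Vowpal Wabbit's implementation: the MD5 hash, the signed update, and the default bucket count are all assumptions chosen to keep the example dependency-free.

```python
import hashlib


def hashed_features(tokens, num_buckets=2**18):
    """Map string features to a sparse vector with a fixed number of dimensions.

    Returns a dict of bucket index -> value. No feature dictionary is stored,
    so memory stays bounded however many distinct tokens appear.
    """
    vec = {}
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "little")  # 64-bit hash value
        index = h % num_buckets                   # bucket for this token
        sign = 1.0 if (h >> 63) == 0 else -1.0    # top bit picks the sign (assumed variant)
        vec[index] = vec.get(index, 0.0) + sign
    return vec


features = hashed_features(["user=42", "ad=sports", "user=42"], num_buckets=1024)
```

Distinct tokens can collide in the same bucket. The signed update is one common way to keep collisions from biasing the sums in a single direction.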
  • 7. Delta Three: Workflow Practice Over Theory Blog
  • 8. Delta Three: Workflow • Optimize the Overall Process • Model fitting is a small piece of the overall flow time • Parallelize everything • Better Features > Better Models • Fast Model Deployment • Common Feature Extraction Logic • Servable Models • Validation as Sanity Checking • Deploy to a small subset of real data and evaluate
  • 9. Agenda • Part 1: Industrial Machine Learning • Part 2: Machine Learning and Hadoop • State of the World • Where Things Are Headed • Part 3: Things Industry Needs From Academia
  • 10. Hadoop: It’s Where The Data Is
  • 11. Hadoop Platform: Substrate • Commodity servers • Open Compute • Open source operating system • Linux • Open source configuration management • Puppet • Chef • Coordination service • ZooKeeper
  • 12. Hadoop Platform: Storage • Distributed schema-less storage • HDFS • Ceph • Append-only storage formats and metadata • Avro • RCFile • HCatalog • Mutable key-value storage and metadata • HBase
  • 13. Hadoop Platform: Integration • Tool Access • FUSE • JDBC • ODBC • Data Ingestion • Flume • Sqoop
  • 14. ML and Hadoop: The State of the World
  • 15. Computation: Plain Old MapReduce • Great for: • Data Preparation • Feature Engineering • Model Validation/Evaluation • Works For Certain Model Fitting Problems • Recommendation Systems • Decision Trees (PLANET; Gradient Boosted Decision Trees) • Not A Practical Option for Online Learning • Way More Detail from the KDD 2011 Talk
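Slide 15 lists feature engineering as a good fit for plain MapReduce. The sketch below simulates the map, shuffle, and reduce phases in a single Python process to show the shape of such a job. It does not use Hadoop's API, and the record layout (user, event) and the function names are hypothetical.

```python
from collections import defaultdict


def map_fn(record):
    """Emit ((user, event_type), 1) for each raw log record."""
    user, event = record
    yield (user, event), 1


def reduce_fn(key, values):
    """Sum the counts for one (user, event_type) key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # the "shuffle": group values by key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


logs = [("u1", "click"), ("u1", "click"), ("u2", "view")]
counts = run_mapreduce(logs, map_fn, reduce_fn)
```

Each resulting count, such as the number of clicks per user, could then serve as one input feature for a downstream model.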
  • 16. Tools for Data Preparation/Feature Engineering • Languages/Environments • PigLatin • HiveQL • Need to deal with mismatch between offline/online feature generation • Java/Scala APIs • Crunch (Cloudera) • Scoobi (NICTA) • Cascading (Concurrent) • Jaql (IBM)
  • 17. Apache Mahout • The starting place for MapReduce-based machine learning algorithms • Not machine-learning-in-a-box • Custom tweaks/modifications are the rule • A disparate collection of algorithms for: • Recommendations • Clustering • Classification • Frequent Itemset Mining
  • 18. Apache Mahout (cont.) • Best Library: Taste Recommender • Oldest project, most widely-deployed in production • SVD implementation is particularly active • Good Libraries: Online SGD • Does not use MapReduce • Vowpal Wabbit + AllReduce is faster, has L-BFGS option • Roll Your Own Instead: Naïve Bayes • Challenges • “Secret sauce” effect • Delta between Mahout and the cutting edge in ML
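For readers unfamiliar with the online SGD approach mentioned on slide 18, here is a minimal sketch of online logistic regression trained one example at a time. It is a generic illustration and not Mahout's or Vowpal Wabbit's code; the learning rate, the sparse-dict input format, and the clamping bound are assumptions.

```python
import math


def sgd_logistic(examples, dim, lr=0.1, epochs=1):
    """Fit logistic regression weights with one gradient step per example.

    examples: iterable of (x, y), where x is a dict of feature index -> value
    and y is 0 or 1.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in examples:
            z = sum(w[i] * v for i, v in x.items())
            z = max(-30.0, min(30.0, z))    # clamp to avoid overflow in exp
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of y == 1
            g = p - y                       # gradient of log loss w.r.t. z
            for i, v in x.items():
                w[i] -= lr * g * v          # touch only the active features
    return w


data = [({0: 1.0}, 1), ({1: 1.0}, 0)] * 50
weights = sgd_logistic(data, dim=2)
```

Because each update needs only the current example and the current weights, the loop processes a stream in a single pass, which is the property the slide contrasts with batch MapReduce jobs.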
  • 19. More Machine Learning Interfaces for Hadoop • Based on MapReduce • SystemML (IBM) • AllReduce (Vowpal Wabbit) • No MapReduce • Spark • R-Based Systems (Augment MapReduce with R) • Segue • RHIPE • RHadoop • Ricardo (IBM)
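The AllReduce primitive listed on slide 19 can be described by its contract: every worker contributes a vector, and every worker receives the same element-wise sum. The sketch below simulates that contract in one process. It says nothing about how Vowpal Wabbit arranges the communication between nodes, and the gradient values are made up.

```python
def allreduce_sum(worker_vectors):
    """Simulate AllReduce: each worker ends up with the element-wise sum of all inputs."""
    total = [sum(values) for values in zip(*worker_vectors)]
    return [list(total) for _ in worker_vectors]  # one independent copy per worker


# Hypothetical local gradients computed by three workers on their own data shards.
local_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
synced = allreduce_sum(local_grads)
```

After the call, every worker holds the same summed gradient and can apply an identical update, so the model replicas stay in step without a central parameter store.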
  • 20. ML and Hadoop: Where Things are Headed
  • 21. MRv2 and YARN • Eliminates JobTracker bottleneck • Separate Resource Manager/Scheduler • Individual jobs have their own task masters • Moves MapReduce into user-land • Enables Hadoop clusters to run all sorts of jobs • MPI (Hamster; MAPREDUCE-2911) • Native BSP (Giraph) • Spark • AllReduce, GraphLab
  • 22. Agenda • Part 1: Industrial Machine Learning • Part 2: Machine Learning and Hadoop • State of the World • Where Things Are Headed • Part 3: Things Industry Needs From Academia
  • 23. Machine Learning on Multivariate Time Series • 1e5 writes/sec • Positive events are relatively rare • Feature extraction challenge • May not be clear what the right time horizon is • Tight SLAs • Very high stakes
  • 24. An Academic Language For Feature Engineering • Feature extraction/selection is as important as model fitting • e.g., hierarchical feature representation, impact on training time and experiment design, feature cost modeling, etc. • Academic literature on this problem is sparse and dispersed across multiple fields • NIPS 2003 • HCI, NLP, Information Retrieval, etc. • We need a common language for talking about these problems across disciplines
  • 25. A Broader Ontology For Model Selection • Practical factors that enter into the “best” choice of model… • Data arrival rate • Data volume • Scoring latency • Model refresh time • Robustness/reliability • …in addition to the standard predictive power/simplicity tradeoffs
  • 26. Questions? Want A Job? @josh_wills