Hadoop and Machine Learning

Slides for the talk by the Cloudera Data Science team on the state of machine learning and Hadoop at NIPS 2011.

  1. Machine Learning and Hadoop Present and Future Josh Wills, Tom Pierce, and Jeff Hammerbacher Cloudera Data Science Team December 17th, 2011
  2. High Availability for Data Scientists NIPS Copyright 2011 Cloudera Inc. All rights reserved
  3. Agenda • Part 1: Industrial Machine Learning • Part 2: Machine Learning and Hadoop • State of the World • Where Things Are Headed • Part 3: Things Industry Needs From Academia
  4. Industrial Machine Learning
  5. Delta One: Model Evaluation • ML Systems Are One Piece of a Complex System • Well-defined objective functions are the exception • Multiple, often conflicting goals • Weights are fuzzy and shift with business priorities • Pareto optimization is the safest play • Predictive Accuracy Is Only Useful Up to a Point • Examples • Computational advertising • Friend recommendations on social networks
  6. Delta Two: Systems Precede Algorithms • Greenfield Projects Hardly Ever Happen • (and don’t usually launch) • Industrial Computational Infrastructure • General-purpose • Cheap • Shared • Constraints Drive Innovation • Vowpal Wabbit Hashing Trick • SETI @ Google
  7. Delta Three: Workflow Practice Over Theory Blog
  8. Delta Three: Workflow • Optimize the Overall Process • Model fitting is a small piece of the overall flow time • Parallelize everything • Better Features > Better Models • Fast Model Deployment • Common Feature Extraction Logic • Servable Models • Validation as Sanity Checking • Deploy to a small subset of real data and evaluate
  9. Agenda • Part 1: Industrial Machine Learning • Part 2: Machine Learning and Hadoop • State of the World • Where Things Are Headed • Part 3: Things Industry Needs From Academia
  10. Hadoop: It’s Where The Data Is
  11. Hadoop Platform: Substrate • Commodity servers • Open Compute • Open source operating system • Linux • Open source configuration management • Puppet • Chef • Coordination service • ZooKeeper
  12. Hadoop Platform: Storage • Distributed schema-less storage • HDFS • Ceph • Append-only storage formats and metadata • Avro • RCFile • HCatalog • Mutable key-value storage and metadata • HBase
  13. Hadoop Platform: Integration • Tool Access • FUSE • JDBC • ODBC • Data Ingestion • Flume • Sqoop
  14. ML and Hadoop: The State of the World
  15. Computation: Plain Old MapReduce • Great for: • Data Preparation • Feature Engineering • Model Validation/Evaluation • Works For Certain Model Fitting Problems • Recommendation Systems • Decision Trees (PLANET; Gradient Boosted Decision Trees) • Not A Practical Option for Online Learning • Way More Detail from the KDD 2011 Talk
  16. Tools for Data Preparation/Feature Engineering • Languages/Environments • PigLatin • HiveQL • Need to deal with mismatch between offline/online feature generation • Java/Scala APIs • Crunch (Cloudera) • Scoobi (NICTA) • Cascading (Concurrent) • Jaql (IBM)
  17. Apache Mahout • The starting place for MapReduce-based machine learning algorithms • Not machine-learning-in-a-box • Custom tweaks/modifications are the rule • A disparate collection of algorithms for: • Recommendations • Clustering • Classification • Frequent Itemset Mining
  18. Apache Mahout (cont.) • Best Library: Taste Recommender • Oldest project, most widely-deployed in production • SVD implementation is particularly active • Good Libraries: Online SGD • Does not use MapReduce • Vowpal Wabbit + AllReduce is faster, has L-BFGS option • Roll Your Own Instead: Naïve Bayes • Challenges • “Secret sauce” effect • Delta between Mahout and the cutting edge in ML
  19. More Machine Learning Interfaces for Hadoop • Based on MapReduce • SystemML (IBM) • AllReduce (Vowpal Wabbit) • No MapReduce • Spark • R-Based Systems (Augment MapReduce with R) • Segue • RHIPE • RHadoop • Ricardo (IBM)
  20. ML and Hadoop: Where Things are Headed
  21. MRv2 and YARN • Eliminates JobTracker bottleneck • Separate Resource Manager/Scheduler • Individual jobs have their own task masters • Moves MapReduce into user-land • Enables Hadoop clusters to run all sorts of jobs • MPI (Hamster; MAPREDUCE-2911) • Native BSP (Giraph) • Spark • AllReduce, GraphLab
  22. Agenda • Part 1: Industrial Machine Learning • Part 2: Machine Learning and Hadoop • State of the World • Where Things Are Headed • Part 3: Things Industry Needs From Academia
  23. Machine Learning on Multivariate Time Series • 1e5 writes/sec • Positive events are relatively rare • Feature extraction challenge • May not be clear what the right time horizon is • Tight SLAs • Very high stakes
  24. An Academic Language For Feature Engineering • Feature extraction/selection is as important as model fitting • e.g., hierarchical feature representation, impact on training time and experiment design, feature cost modeling, etc. • Academic literature on this problem is sparse and dispersed across multiple fields • NIPS 2003 • HCI, NLP, Information Retrieval, etc. • We need a common language for talking about these problems across disciplines
  25. A Broader Ontology For Model Selection • Practical factors that enter into the “best” choice of model… • Data arrival rate • Data volume • Scoring latency • Model refresh time • Robustness/reliability • …in addition to the standard predictive power/simplicity tradeoffs
  26. Questions? Want A Job? @josh_wills
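
Slide 6 names the Vowpal Wabbit hashing trick as an example of constraints driving innovation. The slides do not show how it works, so the following is only a minimal sketch of the general idea: string features are hashed into a fixed number of buckets, which bounds memory without keeping a feature dictionary. The function name `hash_features`, the default bucket count, and the use of MD5 are assumptions made for illustration; Vowpal Wabbit itself uses a different hash function and a more elaborate layout.

```python
import hashlib


def hash_features(tokens, num_buckets=2 ** 18):
    """Map string features to a sparse vector of fixed dimensionality.

    Hypothetical helper for illustration only. MD5 is a stand-in hash,
    chosen because it ships with the standard library.
    """
    vec = {}
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "little")
        index = h % num_buckets
        # Use the top bit as a sign so that colliding features tend to
        # cancel instead of always adding up.
        sign = 1.0 if (h >> 63) & 1 == 0 else -1.0
        vec[index] = vec.get(index, 0.0) + sign
    return vec


if __name__ == "__main__":
    features = ["user=42", "country=US", "query=hadoop", "query=hadoop"]
    print(hash_features(features, num_buckets=1024))
```

The vector size is fixed up front, so the same function can run offline in a batch job and online at scoring time, which relates to the "Common Feature Extraction Logic" point on slide 8.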
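
Slide 5 calls Pareto optimization "the safest play" when goals conflict and their weights shift, and slide 25 lists practical factors such as scoring latency alongside predictive power. One way to read that advice is to keep every candidate model that no other candidate beats on all metrics at once. The sketch below assumes that reading; the model names and scores are made-up illustration data, not results from the talk.

```python
def dominates(a, b):
    """True if score tuple a is at least as good as b everywhere and
    strictly better somewhere (higher is better on every metric)."""
    return all(x >= y for x, y in zip(a, b)) and any(
        x > y for x, y in zip(a, b)
    )


def pareto_front(candidates):
    """Return the candidates that no other candidate dominates."""
    return [
        c
        for c in candidates
        if not any(dominates(o["scores"], c["scores"]) for o in candidates)
    ]


if __name__ == "__main__":
    # Hypothetical models scored on (accuracy, negative latency in ms),
    # so that higher is better for both entries.
    models = [
        {"name": "A", "scores": (0.90, -50)},
        {"name": "B", "scores": (0.85, -10)},
        {"name": "C", "scores": (0.84, -40)},
    ]
    print([m["name"] for m in pareto_front(models)])  # prints ['A', 'B']
```

Model C drops out because B is both more accurate and faster. Choosing between A and B still depends on business priorities, which is the part the slides describe as fuzzy.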