Big Data [sorry] & Data Science:
What Does a Data Scientist Do?	


                                                                 Carlos Somohano	

                                                         Founder Data Science London	

                                                                         @ds_ldn	

                                                            datasciencelondon.org	





The Cloud and Big Data: HDInsight on Azure London 25/01/13
Man on the Moon – 1969
Man on the Moon – Small Data!

Apollo XI
   Date: 1969
   Speed: 3,500 km/hour
   Distance: 356,000 km
   Weight: 13,500 kg

Computer Program
   64 Kb, 2 Kb RAM, Fortran

Man on the Moon
   Never been there before
   Must work 1st time
   Lots of complex data
   Must return to Earth
Apollo XI, 1969	

    SkyDive Stratos, 2012	


       64 Kb	

            Tens of Gigabytes	





Think About It – We live in Crazy Times!
Big Data is not about Data Volume
What is Big Data? IT mumbo-jumbo	



  A fashionable term typically used by some IT
  vendors to remarket old-fashioned software &
  hardware
What is Big Data? The n-Vs	

        Volume …	

        Variety …	

        Velocity …	

        (add your own V here…)	


        So What?
Change! Water Cooler Chat	

We need to parallelize data operations but it’s too costly & complex …


The business can’t get access to all the relevant data, we need external data…	


We can’t match customer master data to live customer interactions…	


We can’t just force everything into a star-schema…	


These BI reports and charts don’t tell us anything we didn’t know…	


We are missing the ETL window, the data we needed didn’t arrive on time…	


We can’t predict with confidence if we can’t explore data & develop our own models
What is Big Data? Force of Change	



 Big Data forces you to change the way you collect,
 store, manage, analyze and visualize data
Crude Oil
Big Data = Crude Oil [not New Oil]	


Think of data as ‘crude oil.’


Big Data is about extracting the ‘crude oil,’
transporting it in ‘mega-tankers,’ siphoning it through
‘pipelines,’ and storing it in massive ‘silos’… 	


All ‘this’ is about IT Big Data… fine and well…	


… BUT
You need to refine the ‘crude oil’	


       Enter Data Science…
The Science [and Art] of… 	


	

Discovering what we don’t know from data	


	

Obtaining predictive, actionable insight from data	


	

Creating Data Products that have business impact now	


	

Communicating relevant business stories from data	


	

Building confidence in decisions that drive business value
Brief History of Data Science	

6th C BC - 1st C BC – The Greeks! Pyrrhonism, Skepticism & Empiricism…

1974 – Peter Naur @UoC Datalogy & Data Science

2001 – William S. Cleveland @CSU Data Science: An Action Plan …

2002 – Committee on Data for Science & Technology (CODATA)

2003 – Journal of Data Science

2009 – Jeff Hammerbacher @Facebook What does a Data Scientist Do?

2010 – Drew Conway @NYU The Data Science Venn Diagram

2010 – Hilary Mason & Chris Wiggins @Dataists

2010 – Mike Loukides @O’Reilly “What is Data Science?”

2011 – DJ Patil @LinkedIn data scientist vs. data analyst
Jeff Hammerbacher, 2009	

“... on any given day, a team member could author a
multistage processing pipeline in Python, 	

	

design a hypothesis test, perform a regression analysis
over data samples with R, 	

	

design and implement an algorithm for some data-intensive
product or service in Hadoop, or communicate the results of
our analyses to other members of the organization.”
Mike Loukides, 2010	


Data science enables the creation of data
products.	

	

Whether... data is search terms, voice samples, or
product reviews,... users are in a feedback loop in
which they contribute to the products they use. 	

	

That's the beginning of data science.
Hilary Mason & Chris Wiggins, 2010


  Data science is clearly a blend of the hackers’ arts, statistics
  and machine learning...; 	

  	

  and the expertise in mathematics and the domain of the
  data for the analysis to be interpretable... 	

  	

  It requires creative decisions and open-mindedness in a
  scientific context.
Drew Conway, 2010
DJ Patil, 2011	

“We realized that as our organizations grew, we both had to figure out
what to call the people on our teams. ‘Business analyst’ and ‘Data analyst’
seemed too limiting.

   	

The focus of our teams was to work on data applications that would have
an immediate and massive impact on the business. 	

	

The term that seemed to fit best was data scientist: those who use both
data and science to create something new”
What is a Data Scientist?
The Duck – Billed Platypus	





       The Data Scientist – Billed Platypus
The Platypus – Billed Data Scientist	

                                                   Machine Learning	

     Hacking	

                                                        Statistics	




                                                                          Math	

                                                    Visualization	

                Science	


   Programming	

                 Data Mining	



                    The Data Scientist – Billed Platypus
Josh Wills, 2012
Class DataScientist {	

 Is skeptical, curious. Has inquisitive mind 	

 Knows Machine Learning, Statistics, Probability	

 Applies Scientific Method. Runs Experiments	

 Is good at Coding & Hacking

 Able to deal with IT Data Engineering	

 Knows how to build data products	

 Able to find answers to known unknowns	

 Tells relevant business stories from data	

 Has Domain Knowledge 	


}
What Does a Data Scientist Do?
10 Things [most] Data Scientists Do	

      1  Ask Good Questions. What is What… 	

           …we don’t know?	

           …we’d like to know?	

      2  Define and Test a Hypothesis. Run experiments

      3  Scoop, Scrape, Sink & Sample Business-Relevant Data

      4  Munge and Wrestle Data. Tame Data	

      5  Explore Data, Discover Data Playfully. Discover unknowns.	

      6  Model Data. Model Algorithms.	

      7  Understand Data Relationships	

      8  Tell the Machine How to Learn from Data	

      9  Create Data Products that Deliver Actionable Insight 	

      10  Tell Relevant Business Stories from Data
[Sort of a] Data Scientist Toolkit	

   •  Java, R, Python… (bonus: Clojure, Haskell, Scala)

   •  Hadoop, HDFS & MapReduce… (bonus: Spark, Storm)

   •  HBase, Pig & Hive… (bonus: Shark, Impala, Cascalog)

   •  ETL, Web scrapers, Flume, Sqoop… (bonus: Hume)

   •  SQL, RDBMS, DW, OLAP…

   •  Knime, Weka, RapidMiner… (bonus: SciPy, NumPy, scikit-learn, pandas)

   •  D3.js, Gephi, ggplot2, Tableau, Flare, Shiny…

   •  SPSS, Matlab, SAS… (the enterprise man)

   •  NoSQL, MongoDB, Couchbase, Cassandra…

   •  And Yes! … MS-Excel: the most used, most underrated DS tool
Foundations of Data Science
[Some] Data Science Principles	

    1    Socio-Technical Systems (STS) are complex!	

    2    Data is never at rest	

    3    Data is dirty, deal with it	

    4    SVoT = LOL!	

    5    Data munging & data wrestling: 70% of the time

    6    Simplification. Reduction. Distillation	

    7    Curiosity. Empiricism. Skepticism
Knowns & Unknowns


There are known knowns. These are things we know
that we know. 	

There are known unknowns. That is to say, there are
things that we know we don't know.	

But there are also unknown unknowns. There are
things we don't know we don't know	

                                    Donald Rumsfeld
DIKUW FTW!

  D               I               K               U                   W
  Data            Information     Knowledge       Understanding       Wisdom

  PAST                                                                FUTURE

  Data Engineer   Data Analyst    Data Miner      Data Scientist

  Raw             What            How to          Why                 When
  Numbers         Description     Experience      Cause & Effect      Prediction
  Letters         Context         Tested          Proven              What’s best
  Symbols         Relationship    Instruction
  Known Knowns                                    Known Unknowns      Unknown Unknowns
  Signals         Reports         Programs        Models
Data Discovery	



                                      Data Analyst	



                                                        Data Scientist	





The new reality for Business Intelligence and Big Data, Applied Data Labs
Data Models vs. Algorithmic Models

  Data Modeling:         Y ← F(X, random noise, parameters)
  Algorithmic Modeling:  Y ← [Black Box: Random Forests] ← X

  Data Modeling                                 Algorithmic Modeling
  We understand the world                       We don’t understand the world
  How well ‘my data model’ works                The world produces data in a black box
  Statisticians, Data Analysts, Data Miners     Data Scientists
  Linear Regression                             Machine Learning, AI & Neural Nets
  Logistic Regression                           Random Forests, SVM, GBT
  Known Distributions                           Unknown Multivariate Distributions
  Confidence Intervals                          Iterative
  Predictor Variables & Goodness of Fit         Predictive Accuracy

                                             “Statistical Modeling: The Two Cultures” Leo Breiman, 2001
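
A minimal sketch of the two modeling cultures contrasted above, on made-up data. The data model assumes a straight line and reports interpretable parameters; the algorithmic model assumes no functional form and is judged only by predictive accuracy on held-out data. The slide names random forests, SVM and GBT; k-nearest neighbours is swapped in here purely to keep the example within the Python standard library. All names and the sine-wave "world" are assumptions for illustration.

```python
import math
import random


def fit_line(xs, ys):
    """Data model: assume y = a + b*x + noise and estimate a, b by least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def knn_predict(xs, ys, x_new, k=5):
    """Algorithmic model: no assumed form, average the k nearest training targets."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def rmse(predictions, truth):
    """Root mean squared error, used here as the predictive-accuracy measure."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, truth)) / len(truth))


# Hypothetical "world": a sine wave plus noise, which a straight line cannot represent.
rng = random.Random(0)


def sample(n):
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.1) for x in xs]
    return xs, ys


train_x, train_y = sample(200)
test_x, test_y = sample(100)

intercept, slope = fit_line(train_x, train_y)
line_rmse = rmse([intercept + slope * x for x in test_x], test_y)
knn_rmse = rmse([knn_predict(train_x, train_y, x) for x in test_x], test_y)

print(f"data model (line): intercept={intercept:.2f}, slope={slope:.2f}, test RMSE={line_rmse:.3f}")
print(f"algorithmic model (kNN): test RMSE={knn_rmse:.3f}")
```

On data like this the black box predicts better, but only the line gives parameters you can explain. With a truly linear world the ranking could easily flip.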
Learning from Data is Tricky	

      Statistical vs. Machine Learning	

      Supervised vs. Unsupervised Learning	

      Induction vs. Deduction	

      Sampling & Confidence Intervals

      Probability & Distribution

      Deviation & Variance

      Correlation vs. Causation

      Causation & Prediction
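
Two of the traps listed above can be shown in a few lines. This is a sketch on simulated data, not a recipe: the population, the sample sizes, and the ice-cream and sunburn series are all invented for illustration. The first part shows how a confidence interval for a mean narrows as the sample grows; the second shows two series that correlate strongly only because a third variable (temperature) drives both.

```python
import random
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the mean of a sample."""
    n = len(sample)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / n ** 0.5
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(42)

# Sampling & confidence intervals: a larger sample gives a narrower interval.
population = [rng.gauss(100, 15) for _ in range(100_000)]
intervals = {n: mean_confidence_interval(rng.sample(population, n)) for n in (25, 2500)}
for n, (low, high) in intervals.items():
    print(f"n={n}: 95% CI for the mean = ({low:.1f}, {high:.1f})")

# Correlation vs. causation: temperature drives both series, so they are
# correlated although neither one causes the other.
temperature = [rng.uniform(10, 35) for _ in range(1000)]
ice_cream_sales = [5 * t + rng.gauss(0, 10) for t in temperature]
sunburn_cases = [2 * t + rng.gauss(0, 5) for t in temperature]
corr = pearson(ice_cream_sales, sunburn_cases)
print(f"correlation(ice cream, sunburn) = {corr:.2f}")
```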
More Data or Better Models?	

More Data Beats Better Algorithms, Omar Tawakol @BlueKai

	

Better Algorithms Beat More Data, Mark Torrance @RocketFuel

	

More Data or Better Models, Xavier Amatriain @Netflix

	

On Chomsky & 2 Cultures of Statistical Learning, Peter Norvig @Google

	

Specialist Knowledge is Useless & Unhelpful, Jeremy Howard @Kaggle
Data Science Process – An approach
Data Science Process - I

      1  Known Unknowns?

      2  We’d like to know…?

      3  Outcomes?

      4  What Data?

      5  Hypothesis?


  The World               Ingest Raw Data       Munge Data            The Dataset
  Product Manufactured    Transactions          MapReduce             Independency?
  Goods shipped           Web-Scraping          ETL, ELT              Correlation?
  Product purchased       Web-clicks & logs     Data Wrangle          Covariance?
  Phone Calls Made        Sensor Data           Data Cleansing        Causality?
  Energy Consumed         Mobile Data           Data Jujitsu          Dimensionality?
  Fraud Committed         Docs, Emails, XLS     Dim Reduction         Missing Values?
  Repair Requested        Social Feeds, RSS     Sample                Relevant?
  System                  Flume & Sink HDFS     Select, Join, Bind
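
The data cleansing, missing-value and select/join steps listed above might look like this at toy scale. It is a sketch only: the rows, field names and helper functions are hypothetical, and a real pipeline would run on the tools named elsewhere in the deck rather than on Python lists.

```python
# Hypothetical raw rows, as they might arrive from transactions and web logs.
raw_transactions = [
    {"customer": " C001 ", "amount": "19.99"},
    {"customer": "c002", "amount": ""},       # missing value
    {"customer": "C003", "amount": "n/a"},    # dirty value
    {"customer": "C001", "amount": "5.01"},
]
raw_clicks = [
    {"customer": "C001", "clicks": "12"},
    {"customer": "C002", "clicks": "3"},
    {"customer": "C004", "clicks": "7"},
]


def clean_id(value):
    """Normalize customer ids so that ' c001 ' and 'C001' match."""
    return value.strip().upper()


def to_number(value):
    """Parse a numeric string; return None when the value is missing or dirty."""
    try:
        return float(value)
    except ValueError:
        return None


# Data cleansing: normalize keys, parse numbers, and count what is missing.
transactions = [
    {"customer": clean_id(row["customer"]), "amount": to_number(row["amount"])}
    for row in raw_transactions
]
missing = sum(1 for row in transactions if row["amount"] is None)

# Aggregate spend per customer, skipping rows with no usable amount.
spend = {}
for row in transactions:
    if row["amount"] is not None:
        spend[row["customer"]] = spend.get(row["customer"], 0.0) + row["amount"]

clicks = {clean_id(row["customer"]): int(row["clicks"]) for row in raw_clicks}

# Select, join: keep only customers present in both sources (an inner join).
dataset = [
    {"customer": c, "spend": round(spend[c], 2), "clicks": clicks[c]}
    for c in sorted(spend.keys() & clicks.keys())
]

print(f"rows with missing amount: {missing}")
print(dataset)
```

Note how much is lost: four transactions and three click rows shrink to a single joined record. Whether to drop, impute or chase the missing values is a judgment call the code cannot make for you.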
Data Science Process - II

The Dataset

Explore Data
   Represent Data
   Discover Data
   Learn From Data
   Description & Inference
   Data & Algorithm Models
   Machine Learning
   Networks & Graphs
   Regression & Prediction
   Classification & Clustering
   Experiments & Iteration

Data Product
   Objectives
   Levers
   Modeling
   Simulation
   Optimization
   Visualization

Deliver Insight, Visualize Insight
   Actionable
   Predictive
   Immediate Impact
   Business Value
   Easy to explain
What is a Data Product?
A Data Product Is… 	

… Curated and crafted from raw data	

… A result of exploration and iterations	

… A machine that learns from data 	

… An answer to known unknowns or unknown unknowns	

… A mechanism that triggers immediate business value	

… A probabilistic window of future events or behavior
Data Jiu-Jitsu	

                                      Data	


                                                    Jiu Jitsu Fight 	

                     $$$$	


                                                                     Data Product	

 Data Scientist	




Data Jiu-Jitsu: ability to turn big data into data products that generate immediate business value	

                                                                             (DJ Patil @LinkedIn)
Developing Data Products	



       Objectives               Levers                 Data                  Models
       What Outcome Am I        What Inputs Can        What Data Can         How the Levers
       Trying to Achieve?       We Control?            We Collect?           Influence the Objectives





Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products”	

 Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
Objective-Based Data Products	

What Outcome Am I Trying to Achieve?

Data → Modeler → Simulator → Optimizer → Actionable Outcome

The Model Assembly Line





Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products”	

 Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
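
One way to read the Modeler, Simulator, Optimizer chain is as three small functions. The sketch below assumes a pricing problem that is not in the slides: the objective is profit, the lever is price, the data are a handful of invented (price, units sold) observations, and the model is a straight-line demand curve. Every name and number is hypothetical.

```python
def fit_demand(prices, units):
    """Modeler: least-squares fit of units = a + b * price from observed data."""
    n = len(prices)
    mean_p, mean_u = sum(prices) / n, sum(units) / n
    slope = sum((p - mean_p) * (u - mean_u) for p, u in zip(prices, units)) / sum(
        (p - mean_p) ** 2 for p in prices
    )
    return mean_u - slope * mean_p, slope


def simulate_profit(price, demand, unit_cost):
    """Simulator: predicted profit if the lever (price) is set to `price`."""
    intercept, slope = demand
    expected_units = max(0.0, intercept + slope * price)
    return (price - unit_cost) * expected_units


def optimize_price(candidates, demand, unit_cost):
    """Optimizer: pick the lever setting with the best simulated objective."""
    return max(candidates, key=lambda p: simulate_profit(p, demand, unit_cost))


# Data: hypothetical observations of the lever and the outcome it produced.
observed_prices = [4.0, 5.0, 6.0, 7.0, 8.0]
observed_units = [160, 150, 140, 130, 120]
unit_cost = 2.0

demand = fit_demand(observed_prices, observed_units)
candidates = [p / 2 for p in range(4, 41)]  # candidate prices 2.0, 2.5, ..., 20.0
best_price = optimize_price(candidates, demand, unit_cost)
best_profit = simulate_profit(best_price, demand, unit_cost)

print(f"demand model: units = {demand[0]:.0f} + ({demand[1]:.0f}) * price")
print(f"actionable outcome: set price to {best_price} for a simulated profit of {best_profit:.0f}")
```

The recommended price lies outside the observed range, so the simulation is extrapolating. In practice that is the point where you would run an experiment rather than trust the model.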
5 Great Data Products
Customer Lifecycle Value	

             Optimize CLV	

                         Product Recommendations	

                               Visualizer	




                            Data 	

                  Modeler	

                Simulator	

                 Optimizer	



                                  1  Products the customer may like	

                                  2  Price Elasticity	

                                  3  Probability of Purchase w/o Recommendation	

                                  4  Purchase Sequence	

                                  5  Causality Model	

                                  6  Patience Model	



Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products”	

 Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
Automated Fruits Procurement	

                                Confirm Purchase Orders	

                                In less than 2 hours	



                                Safety Stock levels?	

                                Demand vs Stock?	

                                Price vs. Demand?	

 12,000 stores	

               Anomalies?	

 300 Fruits	

                  Fruit Shortages?	

 Avg. Shelf life  3 days 	

   Fruit Write-offs?	



 Adapted from Blueyonder
Strawberries & the Weather


                                         No sales vs X,XXX sales predicted	

Why these huge stock write-offs?	





                                       A Predictive Model that calculates
                                       strawberry purchases based on	

                                          	

                                          Weather forecast	

   Sudden increase in temperature	

      Store temperature	

                                          Freezer sensor data	

                                          Remaining stock per shelf life

                                          Sales TPoS feeds	

                                          Web searches, social mentions 	


   Adapted from Blueyonder
Personalized Social Recommendations	


 Collaborative Filtering: Matching Skills to People	

             Prediction: Personalized Skills Recommendation	





 Adapted from “Developing Data Products” by Peter Skomoroch 5 Dec, 2012 Copyright LinkedIn
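
Collaborative filtering for skills can be sketched in plain Python. This is not LinkedIn's implementation, which the slide does not describe; it is a generic item-based variant under stated assumptions: profiles are sets of skills, two skills are similar when the same members list both (cosine similarity on binary data), and a member is recommended the skills most similar to those they already have. The member names and skills are invented.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical member -> skills data.
profiles = {
    "ana": {"python", "statistics", "machine learning"},
    "ben": {"python", "machine learning", "hadoop"},
    "chloe": {"statistics", "r", "machine learning"},
    "dev": {"python", "hadoop"},
}


def skill_similarities(profiles):
    """Cosine similarity between skills, based on which members list both."""
    owners = defaultdict(set)
    for member, skills in profiles.items():
        for skill in skills:
            owners[skill].add(member)
    sims = defaultdict(dict)
    for a, b in combinations(sorted(owners), 2):
        shared = len(owners[a] & owners[b])
        if shared:
            score = shared / math.sqrt(len(owners[a]) * len(owners[b]))
            sims[a][b] = sims[b][a] = score
    return sims


def recommend_skills(member, profiles, sims, top_n=2):
    """Rank skills the member lacks by summed similarity to skills they have."""
    have = profiles[member]
    scores = defaultdict(float)
    for skill in have:
        for other, score in sims[skill].items():
            if other not in have:
                scores[other] += score
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [skill for skill, _ in ranked[:top_n]]


sims = skill_similarities(profiles)
recommendations = recommend_skills("dev", profiles, sims)
print(recommendations)
```

Ties are broken alphabetically so the output is deterministic. A production system would also need to handle new members and rare skills, where co-occurrence counts are too thin to trust.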
Colas: In Which US State Do I Invest Mktg. $?


            What the Business Analyst Sent	





                                                What the Data Scientist did…
The Great Pop vs. Soda Page	





              http://www.popvssoda.com/
Pop vs. Soda vs. Coke
Raw Data Will Drive Your Car
Interested in Data Science?	

Join our community	

    http://www.meetup.com/Data-Science-London/	

Follow us on Twitter 	

    @ds_ldn	

Check out our blog	

    	

http://datasciencelondon.org
Thanks for your time

Contenu connexe

Tendances

Intro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataIntro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataPaco Nathan
 
Introduction to Data Science.pptx
Introduction to Data Science.pptxIntroduction to Data Science.pptx
Introduction to Data Science.pptxVrishit Saraswat
 
Big data Presentation
Big data PresentationBig data Presentation
Big data PresentationAswadmehar
 
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...Simplilearn
 
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...Edureka!
 
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...Simplilearn
 
Introduction to data science.pptx
Introduction to data science.pptxIntroduction to data science.pptx
Introduction to data science.pptxSadhanaParameswaran
 

Tendances (20)

Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Data Science Project Lifecycle and Skill Set
Data Science Project Lifecycle and Skill SetData Science Project Lifecycle and Skill Set
Data Science Project Lifecycle and Skill Set
 
Data Science
Data ScienceData Science
Data Science
 
Intro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataIntro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big Data
 
Data science
Data scienceData science
Data science
 
Introduction to Data Science.pptx
Introduction to Data Science.pptxIntroduction to Data Science.pptx
Introduction to Data Science.pptx
 
Big data Presentation
Big data PresentationBig data Presentation
Big data Presentation
 
Big Data
Big DataBig Data
Big Data
 
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Data science ppt
Data science pptData science ppt
Data science ppt
 
Data science
Data science Data science
Data science
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
Data analytics
Data analyticsData analytics
Data analytics
 
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
 
Data Science
Data ScienceData Science
Data Science
 
Big Data
Big DataBig Data
Big Data
 
Data Engineering Basics
Data Engineering BasicsData Engineering Basics
Data Engineering Basics
 
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...
 
Introduction to data science.pptx
Introduction to data science.pptxIntroduction to data science.pptx
Introduction to data science.pptx
 

En vedette

Hands-on Deep Learning in Python
Hands-on Deep Learning in PythonHands-on Deep Learning in Python
Hands-on Deep Learning in PythonImry Kissos
 
A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)Prof. Dr. Diego Kuonen
 
Data By The People, For The People
Data By The People, For The PeopleData By The People, For The People
Data By The People, For The PeopleDaniel Tunkelang
 
Hadoop and Machine Learning
Hadoop and Machine LearningHadoop and Machine Learning
Hadoop and Machine Learningjoshwills
 
How to Interview a Data Scientist
How to Interview a Data ScientistHow to Interview a Data Scientist
How to Interview a Data ScientistDaniel Tunkelang
 
10 Lessons Learned from Building Machine Learning Systems
10 Lessons Learned from Building Machine Learning Systems10 Lessons Learned from Building Machine Learning Systems
10 Lessons Learned from Building Machine Learning SystemsXavier Amatriain
 
A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013Philip Zheng
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...Sebastian Raschka
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDevashish Shanker
 
Introduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningIntroduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningVarad Meru
 
Machine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification RulesMachine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification RulesPier Luca Lanzi
 
Myths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsMyths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsDavid Pittman
 
Tutorial on Deep learning and Applications
Tutorial on Deep learning and ApplicationsTutorial on Deep learning and Applications
Tutorial on Deep learning and ApplicationsNhatHai Phan
 
Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitionsOwen Zhang
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networksSi Haem
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningLars Marius Garshol
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural networkDEEPASHRI HK
 
10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle CompetitionsDataRobot
 
Artificial Intelligence Presentation
Artificial Intelligence PresentationArtificial Intelligence Presentation
Artificial Intelligence Presentationlpaviglianiti
 
Impact of big data on analytics
Impact of big data on analyticsImpact of big data on analytics
Impact of big data on analyticsCapgemini
 

En vedette (20)

Hands-on Deep Learning in Python
Hands-on Deep Learning in PythonHands-on Deep Learning in Python
Hands-on Deep Learning in Python
 
A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)
 
Data By The People, For The People
Data By The People, For The PeopleData By The People, For The People
Data By The People, For The People
 
Hadoop and Machine Learning
Hadoop and Machine LearningHadoop and Machine Learning
Hadoop and Machine Learning
 
How to Interview a Data Scientist
How to Interview a Data ScientistHow to Interview a Data Scientist
How to Interview a Data Scientist
 
10 Lessons Learned from Building Machine Learning Systems
10 Lessons Learned from Building Machine Learning Systems10 Lessons Learned from Building Machine Learning Systems
10 Lessons Learned from Building Machine Learning Systems
 
A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
 
Introduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningIntroduction to Mahout and Machine Learning
Introduction to Mahout and Machine Learning
 
Machine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification RulesMachine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification Rules
 
Myths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsMyths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data Scientists
 
Tutorial on Deep learning and Applications
Tutorial on Deep learning and ApplicationsTutorial on Deep learning and Applications
Tutorial on Deep learning and Applications
 
Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitions
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networks
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine Learning
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural network
 
10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions
 
Artificial Intelligence Presentation
Artificial Intelligence PresentationArtificial Intelligence Presentation
Artificial Intelligence Presentation
 
Impact of big data on analytics
Impact of big data on analyticsImpact of big data on analytics
Impact of big data on analytics
 

Similaire à Big Data [sorry] & Data Science: What Does a Data Scientist Do?

Big Data and the Art of Data Science
Big Data and the Art of Data ScienceBig Data and the Art of Data Science
Big Data and the Art of Data ScienceAndrew Gardner
 
Thinkful - Intro to Data Science - Washington DC
Thinkful - Intro to Data Science - Washington DCThinkful - Intro to Data Science - Washington DC
Thinkful - Intro to Data Science - Washington DCTJ Stalcup
 
Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science TJ Stalcup
 
Intro to Data Science
Intro to Data ScienceIntro to Data Science
Intro to Data ScienceTJ Stalcup
 
How Your Data Can Predict The Future
How Your Data Can Predict The FutureHow Your Data Can Predict The Future
Big Data [sorry] & Data Science: What Does a Data Scientist Do?

  • 1. Big Data [sorry] & Data Science: What Does a Data Scientist Do? Carlos Somohano, Founder, Data Science London. @ds_ldn, datasciencelondon.org. The Cloud and Big Data: HDInsight on Azure, London, 25/01/13
  • 2. Man on the Moon – 1969
  • 3. Man on the Moon – Small Data! Computer Program: Date: 1969; 64 Kb, 2Kb RAM, Fortran; Must work 1st time. Apollo XI: Speed: 3,500 km/hour; Weight: 13,500 kg; Lots of complex data. Man on the Moon: Distance: 356,000 km; Never been there before; Must return to Earth
  • 4. Apollo XI, 1969: 64 Kb. SkyDive Stratos, 2012: Tens of Gigabytes. Think About It – We live in Crazy Times!
  • 5. Big Data is not about Data Volume
  • 6. What is Big Data? IT mumbo-jumbo: A fashionable term typically used by some IT vendors to remarket old-fashioned software & hardware
  • 7. What is Big Data? The n-Vs Volume … Variety … Velocity … (add your own V here…) So What?
  • 8. Change! Water Cooler Chat: “We need to parallelize data operations but it’s too costly & complex…” “The business can’t get access to all the relevant data, we need external data…” “We can’t match customer master data to live customer interactions…” “We can’t just force everything into a star-schema…” “These BI reports and charts don’t tell us anything we didn’t know…” “We are missing the ETL window, the data we needed didn’t arrive on time…” “We can’t predict with confidence if we can’t explore data & develop our own models”
  • 9. What is Big Data? Force of Change Big Data forces you to change the way you collect, store, manage, analyze and visualize data
  • 11. Big Data = Crude Oil [not New Oil]. Think of data as ‘crude oil.’ Big Data is about extracting the ‘crude oil,’ transporting it in ‘mega-tankers,’ siphoning it through ‘pipelines,’ and storing it in massive ‘silos’… All ‘this’ is about IT Big Data… fine and well… … BUT
  • 12. You need to refine the ‘crude oil’ Enter Data Science…
  • 13. The Science [and Art] of… Discovering what we don’t know from data Obtaining predictive, actionable insight from data Creating Data Products that have business impact now Communicating relevant business stories from data Building confidence in decisions that drive business value
  • 14. Brief History of Data Science: 6th C BC - 1st C BC – The Greeks! Pyrrhonism, Skepticism, Empiricism…; 1974 – Peter Naur @UoC, Datalogy, Data Science; 2001 – William S. Cleveland @CSU, “Data Science: An Action Plan …”; 2002 – Committee on Data for Science and Technology (CODATA); 2003 – Journal of Data Science; 2009 – Jeff Hammerbacher @Facebook, “What Does a Data Scientist Do?”; 2010 – Drew Conway @NYU, The Data Science Venn Diagram; 2010 – Hilary Mason & Chris Wiggins @Dataists; 2010 – Mike Loukides @O’Reilly, “What is Data Science?”; 2011 – DJ Patil @LinkedIn, data scientist vs. data analyst
  • 15. Jeff Hammerbacher, 2009: “... on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization.”
  • 16. Mike Loukides, 2010 Data science enables the creation of data products. Whether... data is search terms, voice samples, or product reviews,... users are in a feedback loop in which they contribute to the products they use. That's the beginning of data science.
  • 17. Hilary Mason & Chris Wiggins, 2010: Data science is clearly a blend of the hackers’ arts, statistics and machine learning...; and the expertise in mathematics and the domain of the data for the analysis to be interpretable... It requires creative decisions and open-mindedness in a scientific context.
  • 19. DJ Patil, 2011: “We realized that as our organizations grew, we both had to figure out what to call the people on our teams. “Business analyst” and “Data analyst” seemed too limiting. The focus of our teams was to work on data applications that would have an immediate and massive impact on the business. The term that seemed to fit best was data scientist: those who use both data and science to create something new.”
  • 20. What is a Data Scientist?
  • 21. The Duck – Billed Platypus The Data Scientist – Billed Platypus
  • 22. The Platypus – Billed Data Scientist Machine Learning Hacking Statistics Math Visualization Science Programming Data Mining The Data Scientist – Billed Platypus
  • 24. Class DataScientist { Is skeptical, curious. Has inquisitive mind; Knows Machine Learning, Statistics, Probability; Applies Scientific Method. Runs Experiments; Is good at Coding & Hacking; Able to deal with IT & Data Engineering; Knows how to build data products; Able to find answers to known unknowns; Tells relevant business stories from data; Has Domain Knowledge }
  • 25. What Does a Data Scientist Do?
  • 26. 10 Things [most] Data Scientists Do: 1. Ask Good Questions. What is What… …we don’t know? …we’d like to know? 2. Define and Test a Hypothesis. Run experiments. 3. Scoop, Scrape, Sink, Sample Business Relevant Data. 4. Munge and Wrestle Data. Tame Data. 5. Explore Data, Discover Data Playfully. Discover unknowns. 6. Model Data. Model Algorithms. 7. Understand Data Relationships. 8. Tell the Machine How to Learn from Data. 9. Create Data Products that Deliver Actionable Insight. 10. Tell Relevant Business Stories from Data.
  • 27. [Sort of a] Data Scientist Toolkit: Java, R, Python… (bonus: Clojure, Haskell, Scala); Hadoop, HDFS & MapReduce… (bonus: Spark, Storm); HBase, Pig & Hive… (bonus: Shark, Impala, Cascalog); ETL, Webscrapers, Flume, Sqoop… (bonus: Hume); SQL, RDBMS, DW, OLAP…; Knime, Weka, RapidMiner… (bonus: SciPy, NumPy, scikit-learn, pandas); D3.js, Gephi, ggplot2, Tableau, Flare, Shiny…; SPSS, Matlab, SAS… (the enterprise man); NoSQL, MongoDB, Couchbase, Cassandra…; And Yes! … MS-Excel: the most used, most underrated DS tool
  • 29. [Some] Data Science Principles: 1. Socio-Technical Systems (STS) are complex! 2. Data is never at rest. 3. Data is dirty, deal with it. 4. SVoT = LOL! 5. Data munging & data wrestling: 70% of the time. 6. Simplification. Reduction. Distillation. 7. Curiosity. Empiricism. Skepticism.
  • 30. Knowns & Unknowns: “There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don't know. But there are also unknown unknowns. There are things we don't know we don't know.” Donald Rumsfeld
  • 31. DIKUW FTW! D | I | K | U | W: Data | Information | Knowledge | Understanding | Wisdom, running from PAST to FUTURE. Roles: Data Engineer, Data Analyst, Data Miner, Data Scientist. Rows: Raw | What | How to | Why | When; Numbers | Description | Experience | Cause & Effect | Prediction; Letters | Context | Tested | Proven | What’s best; Symbols | Relationship | Instruction; Signals | Reports | Programs | Models. Also labelled: Known Knowns, Known Unknowns, Unknown Unknowns
  • 32. Data Discovery Data Analyst Data Scientist The new reality for Business Intelligence and Big Data, Applied Data Labs
  • 33. Data Models vs. Algorithmic Models. Data Modeling: Y ← F(X, random noise, parameters). Algorithmic Modeling: Y ← Black Box ← X (e.g. Random Forests). Side by side (Data Modeling / Algorithmic Modeling): We understand the world / We don’t understand the world; How well ‘my data model’ works / The world produces data in a black-box; Statisticians, Data Analysts, Data Miners / Data Scientists; Linear Regression, Logistic Regression / Machine Learning, AI, Neural Nets, Random Forests, SVM, GBT; Known Distributions / Unknown Multivariate Distributions; Confidence Intervals / Iterative; Predictor Variables; Goodness of Fit / Predictive Accuracy. “Statistical Modeling: The Two Cultures,” Leo Breiman, 2001
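A minimal sketch of the contrast on this slide, using only the Python standard library. The synthetic data, the function names, and the choice of k-nearest neighbours as the "black box" are all assumptions made for illustration; the slide itself names random forests and other methods. The data-modeling side assumes a straight line and estimates its parameters, while the algorithmic side assumes no functional form and is judged purely on held-out predictive accuracy.

```python
import random

def fit_line(xs, ys):
    """Data-modeling culture: assume y = a + b*x + noise, estimate a and b by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def knn_predict(train_x, train_y, x, k=5):
    """Algorithmic culture: no assumed form, just average the k nearest training targets."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def mse(preds, ys):
    """Mean squared error, used here as the predictive-accuracy score."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

# Synthetic "world" (an assumption for illustration): a nonlinear signal plus noise.
rng = random.Random(0)
xs = [rng.uniform(-3, 3) for _ in range(400)]
ys = [x * x + rng.gauss(0, 0.3) for x in xs]
train_x, test_x = xs[:300], xs[300:]
train_y, test_y = ys[:300], ys[300:]

a, b = fit_line(train_x, train_y)
line_mse = mse([a + b * x for x in test_x], test_y)
knn_mse = mse([knn_predict(train_x, train_y, x) for x in test_x], test_y)
print(f"linear model test MSE: {line_mse:.3f}")
print(f"k-NN black box test MSE: {knn_mse:.3f}")
```

On this deliberately nonlinear data the straight-line model is misspecified, so the black-box predictor should score better on the held-out points; with data that really is linear, the simpler model would be the more interpretable and equally accurate choice.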
  • 34. Learning from Data is Tricky Statistical vs. Machine Learning Supervised vs. Unsupervised Learning Induction vs. Deduction Sampling Confidence Intervals Probability Distribution Deviation Variance Correlation vs. Causation Causation Prediction
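One way to make "correlation vs. causation" and "confidence intervals" concrete is the sketch below. It is an illustrative assumption, not something taken from the deck: a hidden variable `z` drives both `x` and `y`, so the two are strongly correlated although neither causes the other, and a percentile bootstrap puts an interval around the estimated correlation. All names and the synthetic data are hypothetical.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def bootstrap_ci(xs, ys, stat, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample pairs with replacement, recompute the statistic."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(stat([xs[i] for i in idx], [ys[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Synthetic data (assumption): z is a hidden common cause of both x and y.
rng = random.Random(1)
z = [rng.gauss(0, 1) for _ in range(500)]
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

r = pearson(x, y)
lo, hi = bootstrap_ci(x, y, pearson)
print(f"r = {r:.2f}, 95% bootstrap interval = ({lo:.2f}, {hi:.2f})")
```

The interval says how stable the correlation is under resampling; it says nothing about whether changing `x` would change `y`, which here it would not.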
  • 35. More Data or Better Models? “More Data Beats Better Algorithms,” Omar Tawakol @BlueKai; “Better Algorithms Beat More Data,” Mark Torrance @RocketFuel; “More Data or Better Models,” Xavier Amatriain @Netflix; “On Chomsky and the Two Cultures of Statistical Learning,” Peter Norvig @Google; “Specialist Knowledge is Useless & Unhelpful,” Jeremy Howard @Kaggle
  • 36. Data Science Process – An approach
  • 37. Data Science Process - I. Questions: 1. Known Unknowns? 2. We’d like to know…? 3. Outcomes? 4. What Data? 5. Hypothesis? Stages (The World | Ingest Raw Data | Munch Data | The Dataset): Product Manufactured | Transactions | MapReduce | Independency?; Goods shipped | Web-Scraping | ETL, ELT | Correlation?; Product purchased | Web-click logs | Data Wrangle | Covariance?; Phone Calls Made | Sensor Data | Data Cleansing | Causality?; Energy Consumed | Mobile Data | Data Jujitsu | Dimensionality?; Fraud Committed | Docs, Emails, XLS | Dim Reduction | Missing Values?; Repair Requested | Social Feeds, RSS | Sample | Relevant?; System Flume Sink HDFS Select, Join, Bind
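The cleansing, missing-value and sampling steps listed on this slide can be sketched in a few lines of standard-library Python. The records, field names and helper functions below are hypothetical and invented for illustration; a real pipeline would read from the ingest sources the slide lists.

```python
import random

# Hypothetical raw records (invented for illustration), as they might arrive from ingest.
raw = [
    {"store": " London ", "product": "strawberries", "units": "12"},
    {"store": "Leeds", "product": "Strawberries", "units": ""},      # missing value
    {"store": "london", "product": "apples", "units": "7"},
    {"store": "Leeds", "product": "apples", "units": "n/a"},         # unparseable value
    {"store": "Leeds", "product": "strawberries", "units": "5"},
]

def cleanse(rows):
    """Drop rows whose 'units' cannot be parsed; normalise text fields."""
    clean = []
    for row in rows:
        try:
            units = int(row["units"])
        except ValueError:
            continue  # missing or dirty value: skip the row
        clean.append({
            "store": row["store"].strip().lower(),
            "product": row["product"].strip().lower(),
            "units": units,
        })
    return clean

def total_units_by(rows, key):
    """Group by one field and sum the units (a tiny 'select and bind' step)."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0) + row["units"]
    return totals

dataset = cleanse(raw)
sample = random.Random(0).sample(dataset, k=2)  # reproducible random sample
print(total_units_by(dataset, "store"))
print(total_units_by(dataset, "product"))
```

Dropping dirty rows is only one policy; imputing or flagging missing values are alternatives, and which is appropriate depends on the question being asked of the dataset.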
  • 38. Data Science Process - II The Dataset Explore Data Represent Data Discover Data Deliver Insight Learn From Data Data Product Visualize Insight Description Inference Objectives Data Algorithm Models Levers Actionable Machine Learning Modeling Predictive Networks Graphs Simulation Immediate Impact Regression Prediction Optimization Business Value Classification Clustering Visualization Easy to explain Experiments Iteration
  • 39. What is a Data Product?
  • 40. A Data Product Is… … Curated and crafted from raw data … A result of exploration and iterations … A machine that learns from data … An answer to known unknowns or unknown unknowns … A mechanism that triggers immediate business value … A probabilistic window of future events or behavior
  • 41. Data Jiu-Jitsu Data Jiu Jitsu Fight $$$$ Data Product Data Scientist Data Jiu-Jitsu: ability to turn big data into data products that generate immediate business value (DJ Patil @LinkedIn)
  • 42. Developing Data Products. Objectives: What Outcome Am I Trying to Achieve? Levers: What Inputs Can We Control? Data: What Data Can We Collect? Models: How the Levers Influence the Objectives. Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products,” Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
  • 43. Objective-Based Data Products. What Outcome Am I Trying to Achieve? The Model Assembly Line: Data, Modeler, Simulator, Optimizer, Actionable Outcome. Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products,” Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
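The Modeler, Simulator and Optimizer stages named on this slide can be sketched as three small functions. Everything concrete here is an assumption for illustration: the objective is revenue, the single lever is price, the demand model is a straight line, and the observations are invented. It is not the implementation described in the Drivetrain article.

```python
def fit_demand_model(prices, units_sold):
    """Modeler: least-squares fit of units = a + b * price from observed data."""
    n = len(prices)
    mean_p = sum(prices) / n
    mean_u = sum(units_sold) / n
    spp = sum((p - mean_p) ** 2 for p in prices)
    spu = sum((p - mean_p) * (u - mean_u) for p, u in zip(prices, units_sold))
    b = spu / spp
    a = mean_u - b * mean_p
    return a, b

def simulate_revenue(model, price):
    """Simulator: predict the objective (revenue) for one setting of the lever (price)."""
    a, b = model
    units = max(0.0, a + b * price)  # demand cannot go negative
    return price * units

def optimize_price(model, candidate_prices):
    """Optimizer: pick the lever setting with the best simulated objective."""
    return max(candidate_prices, key=lambda p: simulate_revenue(model, p))

# Hypothetical historical observations (invented for illustration).
prices = [1.0, 2.0, 3.0, 4.0, 5.0]
units = [90.0, 80.0, 70.0, 60.0, 50.0]

model = fit_demand_model(prices, units)
candidates = [p / 2 for p in range(1, 21)]  # 0.5, 1.0, ..., 10.0
best = optimize_price(model, candidates)
print(best, simulate_revenue(model, best))
```

Keeping the three stages separate means the demand model can be replaced by something richer without touching the simulator or the optimizer.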
  • 44. 5 Great Data Products
  • 45. Customer Lifecycle Value Optimize CLV Product Recommendations Visualizer Data Modeler Simulator Optimizer 1  Products the customer may like 2  Price Elasticity 3  Probability of Purchase w/o Recommendation 4  Purchase Sequence 5  Causality Model 6  Patience Model Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products” Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
  • 46. Automated Fruits Procurement Confirm Purchase Orders In less than 2 hours Safety Stock levels? Demand vs Stock? Price vs. Demand? 12,000 stores Anomalies? 300 Fruits Fruit Shortages? Avg. Shelf life 3 days Fruit Write-offs? Adapted from Blueyonder
  • 47. Strawberries & the Weather. No sales vs X,XXX sales predicted. Why these huge stock write-offs? A Predictive Model that calculates strawberry purchases based on: Weather forecast; Sudden increase in temperature; Store temperature; Freezer sensor data; Remaining stock per shelf life; Sales TPoS feeds; Web searches, social mentions. Adapted from Blueyonder
  • 48. Personalized Social Recommendations Collaborative Filtering: Matching Skills to People Prediction: Personalized Skills Recommendation Adapted from “Developing Data Products” by Peter Skomoroch 5 Dec, 2012 Copyright LinkedIn
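A toy sketch of user-based collaborative filtering for "matching skills to people", under stated assumptions: profiles are plain sets of skill names, similarity is cosine similarity over those sets treated as binary vectors, and the four profiles are invented. It is not LinkedIn's method, only one simple instance of the technique the slide names.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sets treated as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def recommend_skills(profiles, person, top_n=3):
    """Score each skill the person lacks by the summed similarity of the people who list it."""
    own = profiles[person]
    scores = {}
    for other, skills in profiles.items():
        if other == person:
            continue
        sim = cosine(own, skills)
        for skill in skills - own:
            scores[skill] = scores.get(skill, 0.0) + sim
    # Highest score first; ties broken alphabetically so the output is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [skill for skill, _ in ranked[:top_n]]

# Hypothetical profiles (invented for illustration).
profiles = {
    "ana": {"python", "statistics", "hadoop"},
    "ben": {"python", "statistics", "machine learning"},
    "cho": {"java", "hadoop", "hive"},
    "dee": {"python", "machine learning", "visualization"},
}

recs = recommend_skills(profiles, "ana")
print(recs)
```

Here "machine learning" ranks first for "ana" because the two people who list it are also the ones whose profiles overlap hers the most.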
  • 49. Colas- In Which US State I Invest Mktg. $? What the Business Analyst Sent What the Data Scientist did…
  • 50. The Great Pop vs. Soda Page http://www.popvssoda.com/
  • 51. Pop vs. Soda vs. Coke
  • 52. Raw Data Will Drive Your Car
  • 53. Interested in Data Science? Join our community http://www.meetup.com/Data-Science-London/ Follow us on Twitter @ds_ldn Check out our blog http://datasciencelondon.org