BUILDING
DATA SCIENCE TEAMS
FROM SCRATCH
Klaas Bosteels @klbostee
MY CAREER PATH SO FAR


2007: Began working with big data as PhD student
2009: Embarked on a data science career at Last.fm
2011: Joined Massive Media as Lead Data Scientist


Last.fm: Data company at heart; one of the earliest Hadoop adopters worldwide; inventors of Ketama; organised first “NoSQL” meetup in SF.

Massive Media: Huge audience and tremendous potential, but data science newcomer at the time.
MY TEAM AT MASSIVE MEDIA




                    + interns!
Currently 4 permanent people, so not huge just yet
Relatively big and growing faster than anticipated though
OUR MISSION IS HELPING THE COMPANY...


    MEASURE   metrics dashboards
    EVALUATE  data-driven testing
    DECIDE    ad hoc data insights
    IMPROVE   e.g. abuse detection
    EXTEND    new product features
    PROMOTE   PR via data porn

Annotations on the list: “higher risk but bigger returns” and “very wide range of tasks”
STEP 1
FOLLOW THE MONEY

                   photo by Chris Isherwood
BOOTSTRAP BY SAVING OR GAINING MONEY


You need to get some capital to get started

Saving money tends to be easier in practice

Real-world example:

 • Analyzing CDN logs unveiled abuse
 • Stopping the abuse greatly reduced the bills
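
A minimal sketch of what such a log analysis could look like, not the analysis actually run at the time. The log format, the sample data, the 50% threshold and the names `bytes_per_client` and `heavy_clients` are all assumptions made for illustration; real CDN logs vary per provider.

```python
from collections import Counter

# Hypothetical, simplified log format: "<client_ip> <url> <bytes_sent>".
# Real CDN logs differ per provider; adapt the parsing accordingly.
SAMPLE_LOG = """\
10.0.0.1 /img/a.jpg 500
10.0.0.2 /img/b.jpg 700
10.0.0.9 /video/x.mp4 90000
10.0.0.9 /video/y.mp4 85000
10.0.0.1 /img/c.jpg 300
"""


def bytes_per_client(lines):
    """Sum the bytes served to each client, skipping malformed lines."""
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or not parts[2].isdigit():
            continue
        client, _url, sent = parts
        totals[client] += int(sent)
    return totals


def heavy_clients(totals, share=0.5):
    """Return the clients responsible for more than `share` of all traffic."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return sorted(c for c, b in totals.items() if b / grand_total > share)


totals = bytes_per_client(SAMPLE_LOG.splitlines())
suspects = heavy_clients(totals)
```

On this made-up sample a single client accounts for nearly all bytes served, which is the kind of outlier worth investigating before the next bill arrives.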
STEP 2
EMBRACE HADOOP

                 photo by Doug Kukurudza
HADOOP


Not the holy grail, but deserves a central role

It has a vibrant community and is proven to be:

  ECONOMICAL    runs on commodity hardware
  SCALABLE      smart distributed processing
  MAINTAINABLE  very robust and fault-tolerant
  FLEXIBLE      predefined schemas not required
STEP 3
BUILD DASHBOARDS

                   photo by Dawn Hopkins
STATS PIPELINE BASED ON HADOOP


Main path: Log collector, HDFS, MapReduce, HBase, Dashboards
Also shown: Realtime processing (beside HDFS and MapReduce), Ad-hoc results (beside MapReduce)
Arrow legend: in batches / continuous

Cf. “lambda architecture”, coined by @nathanmarz
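
The idea behind the lambda architecture can be sketched in a few lines: a batch layer recomputes counts from all stored events, a realtime layer covers whatever arrived since the last batch run, and the dashboard merges both at query time. This is a toy in-memory illustration under stated assumptions: the event tuples, the cutoff and the function names are hypothetical, and plain Python stands in for HDFS, MapReduce, the realtime processor and HBase.

```python
from collections import Counter


def batch_view(events, cutoff):
    """Batch layer: recompute counts from all events up to `cutoff`."""
    return Counter(name for ts, name in events if ts <= cutoff)


def realtime_view(events, cutoff):
    """Realtime layer: count only events newer than the last batch run."""
    return Counter(name for ts, name in events if ts > cutoff)


def dashboard_count(metric, batch, realtime):
    """Serving side: merge both views when the dashboard asks for a metric."""
    return batch[metric] + realtime[metric]


# Hypothetical (timestamp, event name) pairs; the batch job last ran at t=3.
events = [(1, "signup"), (2, "login"), (3, "login"), (5, "login"), (6, "signup")]
last_batch_run = 3

batch = batch_view(events, last_batch_run)
realtime = realtime_view(events, last_batch_run)
logins = dashboard_count("login", batch, realtime)
```

The appeal of the split is that the realtime side only has to be roughly right for a short window, since the next batch run recomputes everything from the raw logs.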
PYTHON IS AN AWESOME JACK OF ALL TRADES


It is great for building dashboards:
  • Hadoop support: Dumbo, Python UDFs for Pig, ...
  • Several amazing web frameworks, e.g. Flask
  • Likewise for drawing graphs, e.g. PyCairo
And it covers many other data science needs as well:
  • Scripting, prototyping and full-blown programming
  • NumPy, SciPy, PyLab, Scikit-learn, Pandas, ...
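
To give a flavour of the Hadoop support, here is a page-view count written as a mapper and a reducer in the generator style that Dumbo uses. It is a sketch under assumptions: the "user page" line format is made up, and `run_locally` is a hypothetical stand-in for the Hadoop sort and shuffle so the example runs without a cluster. With Dumbo installed, the same two functions would instead be handed to `dumbo.run(mapper, reducer)`.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    """Emit (page, 1) for each log line; `value` is assumed to be 'user page'."""
    fields = value.split()
    if len(fields) == 2:
        yield fields[1], 1


def reducer(key, values):
    """Sum the counts collected for one page."""
    yield key, sum(values)


def run_locally(lines):
    """Stand-in for the Hadoop shuffle: map, sort by key, group, reduce."""
    mapped = [kv for offset, line in enumerate(lines) for kv in mapper(offset, line)]
    mapped.sort(key=itemgetter(0))
    results = {}
    for key, group in groupby(mapped, key=itemgetter(0)):
        for out_key, out_value in reducer(key, (v for _, v in group)):
            results[out_key] = out_value
    return results


page_views = run_locally(["alice /home", "bob /home", "alice /photos", "broken"])
```

Keeping the mapper and reducer as plain functions means they can be tested on a laptop first and run unchanged on the cluster later.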
STEP 4
ASSEMBLE A TEAM

                  photo by Jean-François Schmitz
THE SECRET IS IN THE MIX


Hadoop’s tricks also apply to data science teams
  • Avoid specialisation to allow easy distribution and scaling
  • Exploit data locality by hiring people with a wide skill set
Great Data Scientists have the right mix of skills
  • Hackers with solid technical background
  • Analytical mind that knows statistics and machine learning
  • Clever and creative in everything they do
STEP 5
EXPLORE & INNOVATE

                     photo by NASAr
SOME TIPS AND TRICKS


Dare to fail and/or start from estimates
Introduce data exploration/innovation days
  • Basically 20% time devoted to playing with data
  • Incorporate brainstorming
  • Encourage collaboration
Communicate findings to the rest of the company
  • Fun and silliness are allowed
  • Prototype early and often
FIVE SIMPLE STEPS IS ALL IT TAKES


1   FOLLOW THE MONEY

2   EMBRACE HADOOP

3   BUILD DASHBOARDS

4   ASSEMBLE A TEAM

5   EXPLORE & INNOVATE
Thanks!
Questions?

Back to Square One: Building a Data Science Team from Scratch
