1. Data Science at Scale:
Using Apache Spark for Data Science at Bitly
Sarah Guido
Spark Summit Europe 2015
2. Overview
• About me/Bitly
• Spark overview
• Using Spark for data science
• When it works, it’s great! When it doesn’t…
3. About me
• Data Scientist at Bitly
• NYC Python/PyGotham co-organizer
• O’Reilly Media author
• @sarah_guido
4. About this talk
• This talk is:
– Description of my workflow
– Exploration of within-Spark tools
• This talk is not:
– In-depth exploration of algorithms
– Building new tools on top of Spark
– Any sort of ground truth for how you should be using Spark
5. A bit of background
• Need for big data analysis tools
• MapReduce is a poor fit for exploratory data analysis
• Need to iterate/prototype quickly
• Overall goal: understand how people use not only our app, but the Internet!
6. Bitly data!
• Legit big data
• 1 hour of decodes is 10 GB
• 1 day is 240 GB
• 1 month is ~7 TB
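The daily and monthly figures follow directly from the 10 GB/hour number; a quick sanity check in Python:

```python
# 1 hour of decodes is 10 GB (figure from the slide above).
gb_per_hour = 10

gb_per_day = gb_per_hour * 24          # 24 hours in a day
tb_per_month = gb_per_day * 30 / 1000  # ~30 days, 1000 GB per TB

print(gb_per_day)    # 240 GB/day
print(tb_per_month)  # 7.2 -> "~7 TB" per month
```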
16. Topic modeling
• Problem: we have so many links but no way to classify them into certain kinds of content
• Solution: LDA (latent Dirichlet allocation)
– Sort of – compare to other solutions
• Spark 1.4
17. Topic modeling
• LDA in Spark
– Generative model
– Several different methods
– Term frequency vector as input
• “Note: LDA is still an experimental feature under active development...”
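Because MLlib’s LDA takes term-frequency vectors rather than raw text, there is a preprocessing step first. A minimal sketch of that step in plain Python, using a hypothetical toy corpus (in Spark proper, the same per-document mapping would run over an RDD, with each vector paired with a document ID):

```python
from collections import Counter

# Hypothetical stand-in for text derived from Bitly links.
docs = [
    "spark spark hadoop data",
    "python data science data",
]

# Fixed vocabulary over the whole corpus; each document becomes one
# count vector indexed by vocabulary position -- the "term frequency
# vector as input" that MLlib's LDA expects.
vocab = sorted({word for doc in docs for word in doc.split()})

def tf_vector(doc, vocab):
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

vectors = [tf_vector(doc, vocab) for doc in docs]
# vocab   -> ['data', 'hadoop', 'python', 'science', 'spark']
# vectors -> [[1, 1, 0, 0, 2], [2, 0, 1, 1, 0]]
```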
25. Architecture
• Right now: not in production
– Buy-in
• Streaming applications for parts of the app
• Python or Scala?
– Scala by force
26. Some issues
• Hadoop servers
• JVM
• gzip
• 1.4/resource allocation/EMR
• Lack of documentation
27. Where to go next?
• Spark in production!
• Use for various parts of our app
• Use for R&D and prototyping purposes, with the potential to expand into the product