1. Data Science at Scale:
Using Apache Spark for Data Science at Bitly
Sarah Guido
Spark Summit Europe 2015
2. Overview
• About me/Bitly
• Spark overview
• Using Spark for data science
• When it works, it’s great! When it doesn’t…
3. About me
• Data Scientist at Bitly
• NYC Python/PyGotham co-organizer
• O’Reilly Media author
• @sarah_guido
4. About this talk
• This talk is:
– Description of my workflow
– Exploration of within-Spark tools
• This talk is not:
– In-depth exploration of algorithms
– Building new tools on top of Spark
– Any sort of ground truth for how you should be using Spark
5. A bit of background
• Need for big data analysis tools
• MapReduce is a poor fit for exploratory data analysis
• Need to iterate/prototype quickly
• Overall goal: understand how people use not only our app, but the Internet!
6. Bitly data!
• Legit big data
• 1 hour of decodes is 10 GB
• 1 day is 240 GB
• 1 month is ~7 TB
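The daily and monthly figures follow directly from the 10 GB/hour number; a quick sanity check in Python:

```python
# 1 hour of decodes is 10 GB (figure from the slide above).
gb_per_hour = 10

gb_per_day = gb_per_hour * 24          # 24 hours in a day
tb_per_month = gb_per_day * 30 / 1000  # ~30 days, 1000 GB per TB

print(gb_per_day)    # 240 GB/day
print(tb_per_month)  # 7.2 -> "~7 TB" per month
```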
16. Topic modeling
• Problem: we have so many links but no way to classify them into certain kinds of content
• Solution: LDA (latent Dirichlet allocation)
– Sort of – compare to other solutions
• Spark 1.4
17. Topic modeling
• LDA in Spark
– Generative model
– Several different methods
– Term frequency vector as input
• “Note: LDA is still an experimental feature under active development...”
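Because MLlib’s LDA takes term-frequency vectors rather than raw text, there is a preprocessing step first. A minimal sketch of that step in plain Python, using a hypothetical toy corpus (in Spark proper, the same per-document mapping would run over an RDD, with each vector paired with a document ID):

```python
from collections import Counter

# Hypothetical stand-in for text derived from Bitly links.
docs = [
    "spark spark hadoop data",
    "python data science data",
]

# Fixed vocabulary over the whole corpus; each document becomes one
# count vector indexed by vocabulary position -- the "term frequency
# vector as input" that MLlib's LDA expects.
vocab = sorted({word for doc in docs for word in doc.split()})

def tf_vector(doc, vocab):
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

vectors = [tf_vector(doc, vocab) for doc in docs]
# vocab   -> ['data', 'hadoop', 'python', 'science', 'spark']
# vectors -> [[1, 1, 0, 0, 2], [2, 0, 1, 1, 0]]
```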
25. Architecture
• Right now: not in production
– Buy-in
• Streaming applications for parts of the app
• Python or Scala?
– Scala by force
26. Some issues
• Hadoop servers
• JVM
• gzip
• 1.4/resource allocation/EMR
• Lack of documentation
27. Where to go next?
• Spark in production!
• Use for various parts of our app
• Use for R&D and prototyping purposes, with the potential to expand into the product