Spark & Storm: When & Where?
www.mammothdata.com | @mammothdataco
The Leader in Big Data Consulting
● BI/Data Strategy
○ Development of a business intelligence/ data architecture strategy.
● Installation
○ Installation of Hadoop or relevant technology.
● Data Consolidation
○ Load data from diverse sources into a single scalable repository.
● Streaming
○ Mammoth will write ingestion and/or analytics code that operates on the data as it comes in, and design dashboards, feeds, or computer-driven decision-making processes to derive insights and make decisions.
● Visualization Tools
○ Mammoth will set up a visualization tool (e.g. Tableau, Pentaho) and will also create initial reports and provide training to the employees who will analyze the data.
Mammoth Data, based in downtown Durham (right above Toast)
● Lead Consultant on all things DevOps and Spark
● @carsondial on Twitter
Me!
● Quick overview of Spark Streaming
● Reasons why Spark Streaming can be tricky in practice
● Performance and tuning tips we’ve learnt over the past two years
● …and when to pack it all in and use Storm instead
What This Talk Is About
This IS WEB SCALE!
● I kid, Rails!
● (mostly)
Beyond Web Scale
● Spark & Storm: millions of requests / second on commodity hardware
● Different problems at different scales!
Beyond Web Scale
● Directed Acyclic Graph Data Processing Engine
● Based around the Resilient Distributed Dataset (RDD) primitive
Spark
Spark Streaming — Overview
Spark Streaming — In Production?
● Yes!
● (Alibaba, AutoTrader, Cisco, Netflix, etc.)
● Streaming by running batches very quickly!
● Batch interval: can be as low as 0.5 s
● Every X seconds, get Y records (DStream/RDDs)
Spark Streaming — Overview
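The micro-batch idea on this slide can be sketched without Spark at all. What follows is a minimal illustration in plain Scala, not Spark code: the `Record` type and the `microBatches` helper are hypothetical names invented for this sketch. It groups timestamped records into fixed-length intervals, which is roughly the shape of what a DStream hands you as a sequence of RDDs.

```scala
object MicroBatchSketch {
  // Hypothetical record type: an arrival time in milliseconds plus a payload.
  final case class Record(arrivalMs: Long, payload: String)

  // Group records into consecutive batches of `batchMs` milliseconds.
  // Batch n holds every record that arrived in [n * batchMs, (n + 1) * batchMs).
  // Intervals with no records are simply absent from the result.
  def microBatches(records: Seq[Record], batchMs: Long): Seq[(Long, Seq[Record])] =
    records
      .groupBy(r => r.arrivalMs / batchMs)
      .toSeq
      .sortBy { case (batchIndex, _) => batchIndex }

  def main(args: Array[String]): Unit = {
    val records = Seq(
      Record(100, "a"), Record(450, "b"), Record(600, "c"), Record(1700, "d")
    )
    // With a 0.5 s interval: batch 0 gets a and b, batch 1 gets c, batch 3 gets d.
    microBatches(records, batchMs = 500).foreach { case (index, batch) =>
      println(s"batch $index: ${batch.map(_.payload).mkString(", ")}")
    }
  }
}
```

The point of the sketch is only that "streaming" here means "a batch job per interval"; everything that follows about tuning is a consequence of that.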
● Using (mostly) the same implementation for batch and stream processing (Lambda Architecture hipster points ahoy!)
● Access to the rest of Spark: DataFrames, MLlib, GraphX, etc.
Spark Streaming — Good Things
● What happens if you can’t process Y records in X seconds?
● What happens if you require sub-second latency?
Spark Streaming — Bad Things!
Spark Streaming — I’m so sorry.
● What happens if you can’t process Y records in X seconds?
● Data builds up in executors
● Executors run out of memory…
Spark Streaming — Bad Things!
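To make the "data builds up" failure mode concrete, here is a toy model in plain Scala. It is an assumption-laden sketch, not a description of Spark internals: `backlogOverTime`, `arrivalsPerBatch` and `capacityPerBatch` are names made up for the illustration, and arrival and processing rates are assumed constant.

```scala
object BacklogSketch {
  // Records still waiting after each batch interval, assuming `arrivalsPerBatch`
  // new records per interval and capacity to process `capacityPerBatch` of them.
  def backlogOverTime(arrivalsPerBatch: Long, capacityPerBatch: Long, batches: Int): Seq[Long] =
    (1 to batches).scanLeft(0L) { (backlog, _) =>
      math.max(0L, backlog + arrivalsPerBatch - capacityPerBatch)
    }.tail

  def main(args: Array[String]): Unit = {
    // Keeping up: nothing accumulates.
    println(backlogOverTime(arrivalsPerBatch = 1000, capacityPerBatch = 1200, batches = 5))
    // Falling behind by 200 records per batch: the backlog grows every interval.
    println(backlogOverTime(arrivalsPerBatch = 1000, capacityPerBatch = 800, batches = 5))
  }
}
```

In the second case the backlog grows linearly and never recovers, which is the road to executors running out of memory.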
● “Hey, we forgot to tell you Ops people that we have a major new client adding stuff into the firehose sometime today. That’s fine, right?”
Spark Streaming — Bad Things!
Spark Streaming — It Will Be Okay
● As a former Ops person:
● WE WILL REMEMBER.
Spark Streaming — Bad Things!
● Do you need low latency?
● If so, a 10-minute nap is advisable!
● Everybody else, let’s dive in…
Spark Streaming — Tuning
● Easiest method — alter the batch window until it’s all fine!
● Tiny batches provide tight execution times!
Spark Streaming — Down In The Hole
● Use Kafka.
● Data source with the most love (e.g. exactly-once semantics without Write Ahead Logs and receiver-less operation in 1.3+)
● (other sources get the features…eventually)
Spark Streaming — Tuning
● Use Scala.
● CPython = slower in execution
● PyPy is much faster…but…
● New features always come to Scala first.
Spark Streaming — Tuning
● (or Java if you really must)
Spark Streaming — Tuning
● Spark Streaming = data receivers + Spark
● spark.cores.max = x * number of receivers
● For Great Data Locality and Parallelism!
Spark Streaming — Cores
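The sizing rule on this slide can be written down as a tiny helper. This is a sketch only: `streamingCoreSettings` and `coresPerReceiver` are hypothetical names (the slide leaves the multiplier x unspecified), and the result is a plain key/value map rather than a real SparkConf. The lower bound of 2 reflects the fact that each receiver occupies a core for as long as it runs, so one core per receiver would leave nothing for processing.

```scala
object CoresSketch {
  // Settings as plain key/value pairs; in a real job these would go into
  // SparkConf or be passed with --conf on spark-submit.
  def streamingCoreSettings(receivers: Int, coresPerReceiver: Int): Map[String, String] = {
    require(receivers > 0, "need at least one receiver")
    // Each receiver pins one core, so anything below 2 leaves no core for processing.
    require(coresPerReceiver >= 2, "leave at least one core per receiver for processing")
    Map("spark.cores.max" -> (receivers * coresPerReceiver).toString)
  }

  def main(args: Array[String]): Unit =
    println(streamingCoreSettings(receivers = 4, coresPerReceiver = 3))
}
```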
● Are you using a foreachRDD loop?
dstream.foreachRDD { rdd =>
  rdd.cache()      // keep this batch in memory while it is reused
  …
  rdd.unpersist()  // release it once the batch is done
}
Spark Streaming — Caching
● If you are routing to multiple stores or iterating over an RDD multiple times, using cache() is a quick win
● It really shouldn’t work so well…
Spark Streaming — Caching
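Why caching helps when a batch is consumed more than once can be shown with an analogy in plain Scala collections. This is not Spark code: a lazy `view` stands in for an uncached RDD (the transformation is re-run on every traversal) and materialising it with `toVector` stands in for `cache()`. The `parsesNeeded` helper is a hypothetical name for this sketch.

```scala
object CacheSketch {
  // Returns how many times `parse` ran when the parsed data is consumed twice,
  // e.g. once per downstream store.
  def parsesNeeded(lines: Seq[String], cached: Boolean): Int = {
    var parses = 0
    def parse(line: String): Int = { parses += 1; line.trim.toInt }

    // A lazy view re-runs `parse` on every traversal...
    val lazyParsed = lines.view.map(parse)
    // ...whereas materialising it once plays the role of cache().
    val parsed: Iterable[Int] = if (cached) lazyParsed.toVector else lazyParsed

    parsed.foreach(_ => ()) // first consumer
    parsed.foreach(_ => ()) // second consumer
    parses
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq(" 1", "2 ", " 3 ")
    println(s"uncached: ${parsesNeeded(lines, cached = false)} parses")
    println(s"cached:   ${parsesNeeded(lines, cached = true)} parses")
  }
}
```

With three input lines and two consumers, the uncached path parses six times and the cached path three; the gap widens with every extra consumer.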
● Hurrah for Spark 1.5!
● spark.streaming.backpressure.enabled = true
● Spark dynamically alters incoming data rates (keeping the data in Kafka rather than in the executors)
● Works for all data sources (for once!)
Spark Streaming — Backpressure
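The effect of backpressure, as described on this slide, can be sketched with a toy model in plain Scala. Everything here is an assumption for illustration: `State` and `simulate` are hypothetical names, rates are constant, and Spark's real rate control is more involved than a simple cap. The sketch only shows where the unprocessed records end up.

```scala
object BackpressureSketch {
  // Records held by executors and records left waiting in the source (e.g. Kafka).
  final case class State(inExecutors: Long, inSource: Long)

  def simulate(arrivalsPerBatch: Long, capacityPerBatch: Long,
               batches: Int, backpressure: Boolean): State =
    (1 to batches).foldLeft(State(0L, 0L)) { (state, _) =>
      val available = state.inSource + arrivalsPerBatch
      // With backpressure, pull only what can be processed in this batch;
      // without it, pull everything that has arrived.
      val pulled = if (backpressure) math.min(available, capacityPerBatch) else available
      val processed = math.min(state.inExecutors + pulled, capacityPerBatch)
      State(state.inExecutors + pulled - processed, available - pulled)
    }

  def main(args: Array[String]): Unit = {
    println(simulate(1000, 800, batches = 5, backpressure = false))
    println(simulate(1000, 800, batches = 5, backpressure = true))
  }
}
```

The total backlog is the same in both runs; backpressure just moves it out of executor memory and back into the source, which is built to hold it.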
● I really need that low-latency response!
Storm
● Directed Acyclic Graph Data Processing Engine
Storm
Spark
“Very Good, Sir”
Storm
“Here you go!”
● Stream of tuples
● Bolts
● Spouts
● Topologies
Storm Concepts
● Unbounded stream of tuples
● Tuples are defined via schema (usual base types plus custom serializers)
Storm — Streams
● Sources of tuples in a topology
● Read from external sources (e.g. Kafka) and emit tuples into the topology
● Can emit multiple streams from a spout!
Storm — Spouts
● Where your processing happens
● Roll your own aggregations / filtering / windowing
● Bolts can feed into other bolts
● Potentially easier to test than Spark Streaming
● Many Bolt connectors for external systems (e.g. Cassandra, Redis, Hive)
Storm — Bolts
● The DAG of the spouts and bolts
● Built programmatically in code and submitted to the Storm cluster
● Flux - Do It In YAML (and then complain about whitespace)
Storm — Topologies
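The spout, bolt and topology concepts can be modelled in a few lines of plain Scala. To be clear, this is not the Storm API: `Tuple`, `Spout`, `Bolt` and `run` are hypothetical stand-ins, and the "topology" here is only a linear chain rather than a general DAG. It shows the tuple-at-a-time flow, with each tuple passing through every bolt as it is emitted.

```scala
object TopologySketch {
  // Hypothetical stand-ins: a spout is a source of tuples, and a bolt turns
  // one tuple into zero or more output tuples.
  type Tuple = Map[String, Any]
  type Spout = () => Iterator[Tuple]
  type Bolt = Tuple => Seq[Tuple]

  // Wire a spout through a chain of bolts. Iterators are lazy, so each tuple
  // flows through all bolts before the next one is pulled from the spout.
  def run(spout: Spout, bolts: Seq[Bolt]): Iterator[Tuple] =
    bolts.foldLeft(spout()) { (stream, bolt) => stream.flatMap(bolt) }

  val wordSpout: Spout =
    () => Iterator("storm", "", "spark").map(w => Map[String, Any]("word" -> w))

  // A filtering bolt: drops tuples whose word is empty.
  val dropEmpty: Bolt =
    t => if (t("word").toString.isEmpty) Seq.empty else Seq(t)

  // An enriching bolt: adds a derived field to each tuple.
  val addLength: Bolt =
    t => Seq(t.updated("length", t("word").toString.length))

  def main(args: Array[String]): Unit =
    run(wordSpout, Seq(dropEmpty, addLength)).foreach(println)
}
```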
● Each bolt or spout runs 'tasks' across the cluster
● How parallelism works in Storm
● Set in topology submission
Storm — Tasks
● Where the topology runs
● 1 worker = 1 JVM
● Tasks run as threads on a worker
● Storm distributes tasks evenly across the cluster
Storm — Workers
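As a rough picture of what "evenly" means for tasks and workers, here is a round-robin assignment in plain Scala. This is an illustrative assumption, not Storm's actual scheduler: `assign` is a hypothetical helper, and tasks and workers are just numbered from zero.

```scala
object TaskPlacementSketch {
  // Assign task ids 0 until `tasks` to workers round-robin.
  // Returns worker id -> task ids; worker sizes differ by at most one.
  def assign(tasks: Int, workers: Int): Map[Int, Seq[Int]] = {
    require(workers > 0, "need at least one worker")
    (0 until tasks).groupBy(_ % workers)
  }

  def main(args: Array[String]): Unit =
    assign(tasks = 8, workers = 3).toSeq.sortBy(_._1).foreach { case (worker, taskIds) =>
      println(s"worker $worker runs tasks ${taskIds.mkString(", ")}")
    }
}
```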
● True Streaming
● Tuples processed as they enter topology - low latency
● Scales far beyond Spark Streaming (currently)
Storm — Good Things
● Battle-tested at Twitter & Yahoo!
● Yahoo! has 300-node clusters and is working to support 1000+ nodes
● Single node clocked at over 1.5m tuples / second at Twitter
Storm — Good Things
● Very DIY (bring your own aggregations, ML, etc)
● Your DAG construction may not be optimal
● Operationally more complex (and Storm WebUI is more primitive)
● Where’s Me REPL?
Storm — Bad Things
Spark or Storm?
● SLA on latency?
Spark or Storm?
● Storm!
● (though simply because it’s possible doesn’t mean you’ll get it!)
Spark or Storm?
● Insane data needs (e.g. ~100m records/second?)
Spark or Storm?
● Storm!
● (though, again, it’s not a magic bullet!)
Spark or Storm?
● For almost anything else? Spark.
● High-level vs. Low-level
● Each new version of Spark delivers improvements!
Spark or Storm?
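The rule of thumb from the last few slides fits in one function. This is a sketch of the talk's heuristic, not a substitute for benchmarking: `choose` and its parameter names are hypothetical, and the default threshold is only the "~100m records/second" figure quoted above.

```scala
object FrameworkChoice {
  sealed trait Framework
  case object Spark extends Framework
  case object Storm extends Framework

  // Storm if there is a latency SLA or the volume is extreme; Spark otherwise.
  // The default threshold mirrors the rough figure from the slides.
  def choose(hasLatencySla: Boolean, recordsPerSecond: Long,
             extremeThroughput: Long = 100000000L): Framework =
    if (hasLatencySla || recordsPerSecond >= extremeThroughput) Storm else Spark

  def main(args: Array[String]): Unit = {
    println(choose(hasLatencySla = false, recordsPerSecond = 50000L))
    println(choose(hasLatencySla = true, recordsPerSecond = 50000L))
  }
}
```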
● Other frameworks that show promise:
○ Flink
○ Apex
○ Samza
○ Heron (Twitter’s not-public Storm replacement)
Other Listing Magazines Are Available
Questions?

All Things Open - Spark & Storm - Where & When?