Bobby Evans and Tom Graves, the engineering leads for Spark and Storm development at Yahoo, will talk about how these technologies are used on Yahoo's grids and why you might choose one over the other.
Bobby Evans is the low latency data processing architect at Yahoo. He is a PMC member on many Apache projects including Storm, Hadoop, Spark, and Tez. His team is responsible for delivering Storm as a service to all of Yahoo and maintaining Spark on Yarn for Yahoo (Although Tom really does most of that work).
Tom Graves is a Senior Software Engineer on the Platform team at Yahoo. He is an Apache PMC member on Hadoop, Spark, and Tez. His team is responsible for delivering and maintaining Spark on Yarn for Yahoo.
Yahoo compares Storm and Spark
1. Spark and Storm at Yahoo
Why choose one over the other?
Presented by Bobby Evans and Tom Graves
2. Tom Graves
Bobby Evans (bobby@apache.org)
2
Committers and PMC/PPMC Members for
› Apache Storm (incubating) (Bobby)
› Apache Hadoop (Tom and Bobby)
› Apache Spark (Tom and Bobby)
› Apache TEZ (Tom and Bobby)
Low Latency Big Data team at Yahoo (Part of the Hadoop Team)
› Apache Storm as a service
• 1,300+ nodes total, 250 node cluster (soon to be 4000 nodes).
› Apache Spark on YARN
• 40,000 nodes total, 5000+ node cluster
› Help with distributed ML and deep learning.
3. Where we come from
Yahoo Champaign:
• 100+ engineers
• Located in UIUC Research Park http://researchpark.illinois.edu/
• Split between Advertising and Data Platform team and Hadoop team.
• Hadoop team provides the Hadoop ecosystem as a service to all of Yahoo.
• Site is 7 years old, and we are building a new building with room for 200.
• We are Hiring
• resume-hadoop@yahoo-inc.com
• http://bit.ly/1ybTXMe
6. Spark Key Concepts
Write programs in terms of transformations on distributed datasets.
Resilient Distributed Datasets (RDDs)
› Collections of objects spread across a cluster, stored in RAM or on disk
› Built through parallel transformations
› Automatically rebuilt on failure
Operations
› Transformations (e.g. map, filter, groupBy)
› Actions (e.g. count, collect, save)
7. Working With RDDs
Transformations chain one RDD to the next; an action returns a value.
textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
linesWithSpark.count()
74
linesWithSpark.first()
# Apache Spark
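The lazy-transformation/eager-action split above can be sketched in plain Python (ordinary Python, not the Spark API; the sample lines are illustrative):

```python
# Plain-Python sketch of the RDD model: transformations are lazy,
# actions force evaluation.
lines = ["# Apache Spark", "Storm does streaming", "Spark on YARN"]

# "Transformation": builds a lazy recipe, nothing is computed yet
linesWithSpark = (line for line in lines if "Spark" in line)

# "Action": forces evaluation and returns concrete values
result = list(linesWithSpark)
print(len(result))   # 2
print(result[0])     # "# Apache Spark"
```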
11. Storm Concepts
1. Streams
› Unbounded sequence of tuples
2. Spout
› Source of Stream
› E.g. Read from Twitter streaming API
3. Bolts
› Processes input streams and produces
new streams
› E.g. Functions, Filters, Aggregation,
Joins
4. Topologies
› Network of spouts and bolts
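The four concepts above can be sketched as a plain Python pipeline (a conceptual sketch only, not the Storm API; the spout here is finite for illustration, where a real spout is unbounded):

```python
# Conceptual sketch of Storm's model in plain Python (not the Storm API).
def sentence_spout():
    # Spout: source of a stream of tuples
    for sentence in ["to be or", "not to be"]:
        yield sentence

def split_bolt(stream):
    # Bolt: consumes a stream of sentences, emits a stream of words
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    # Bolt: aggregates word counts over its input stream
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Topology: the network wiring spout -> split bolt -> count bolt
result = count_bolt(split_bolt(sentence_spout()))
print(result)
```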
13. Trident (Storm) Word Count
TridentTopology topology = new TridentTopology();
TridentState wordCounts = topology.newStream("spout1", spout)
.each(new Fields("sentence"), new Split(), new Fields("word"))
.groupBy(new Fields("word"))
.persistentAggregate(new MemoryMapState.Factory(), new Count(),
new Fields("count")).parallelismHint(6);
“to be or” → “to”, “be”, “or” → (to, 1), (be, 1), (or, 1)
“not to be” → “not”, “to”, “be” → (not, 1), (to, 1), (be, 1)
Aggregated state: (be, 2), (not, 1), (or, 1), (to, 2)
14. Use the Right Tool for the Job
14
https://www.flickr.com/photos/hikingartist/4193330368/
15. Things to Consider
15
Scale
Latency
Iterative Processing
› Are there suitable non-iterative alternatives?
Use What You Know
Code Reuse
Maturity
16. When We Recommend Spark
16
Iterative Batch Processing (most Machine Learning)
› There really is nothing else right now.
› Has some scale issues.
Tried ETL (Not at Yahoo scale yet)
Tried Shark/Interactive Queries (Not at Yahoo scale yet)
< 1 TB (or memory size of your cluster)
Tuning it to run well can be a pain
Databricks and others are working on scaling.
Streaming is all μ-batch so latency is at least 1 sec
Streaming has single points of failure still
All streaming inputs are replicated in memory
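The μ-batch latency point can be illustrated with a plain-Python sketch (not the Spark Streaming API; a count-based batch stands in for a time-based interval): events are buffered per interval and processed a batch at a time, so an event's result can lag by up to a full batch interval.

```python
# Plain-Python sketch of micro-batch streaming (the model behind
# Spark Streaming, not its API).
def micro_batch(events, batch_size):
    """Group an event stream into fixed-size batches (stand-in for a
    time-based batch interval)."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

counts = {}
for batch in micro_batch(["to", "be", "or", "not", "to", "be"], batch_size=3):
    # Each batch is processed as a unit: an event waits until its
    # batch closes before its count becomes visible.
    for word in batch:
        counts[word] = counts.get(word, 0) + 1

print(counts)
```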
17. When We Recommend Storm
17
Latency < 1 second (single event at a time)
› There is little else (especially not open source)
“Real Time” …
› Analytics
› Budgeting
› ML
› Anything
Lower Level API than Spark
No built-in concept of look back aggregations
Takes more effort to combine batch with streaming
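For example, the look-back aggregation that Storm's core API leaves to the user can be sketched like this in plain Python (illustrative only; a real bolt would also need time-based eviction and state persistence):

```python
from collections import deque

class SlidingWindowSum:
    """Keeps the sum of the last `size` values seen on a stream --
    the kind of look-back state a bolt must manage by hand."""
    def __init__(self, size):
        self.window = deque(maxlen=size)

    def update(self, value):
        self.window.append(value)  # oldest value falls out automatically
        return sum(self.window)

w = SlidingWindowSum(size=3)
results = [w.update(v) for v in [1, 2, 3, 4, 5]]
print(results)  # [1, 3, 6, 9, 12]
```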
18. Fictitious Example: My Commute App
18
Mobile App that lets users track their commute.
Cities, users, companies, etc. compete daily for
› Shortest commute time
› Greenest commute
Make money by selling location based ads and aggregate data to
› Governments
› Advertisers
Feel free to steal my crazy idea, I just want to be invited to the launch
party, and I wouldn't say no to some stock.
19. Chicago vs. Champaign Urbana
19
Champaign Urbana: 14-15 min
Chicago: 20-30 min
[Bar chart: commute time in minutes for Bobby, Champaign Urbana vs. Chicago]
Source: http://project.wnyc.org/commute-times-us/embed.html#5.00/42.000/-89.500
20. Things to Consider
20
Scale
› everyone in the world!!!
Latency
› a few seconds max
Iterative Processing
› Possibly for targeting, but there are alternatives
21. Architecture
App Web
Service
(User, Commute
ID, Location
History, MPG)
Kafka Storm
HBase/NoSQL
HDFS Spark
Customer
21
22. Architecture (Alternative)
App Web
Service
(User, Commute
ID, Location
History, MPG)
HBase/NoSQL
HDFS Spark
Customer
22
Go directly to Spark Streaming,
but data loss potential goes up.
23. Architecture (Alternative 2)
App Web
Service
(User, Commute
ID, Location
History, MPG)
Kafka Storm
HBase/NoSQL
Customer
23
Streaming Operations Only
(Kappa Architecture)
24. Fictitious Example 2: Web Scale Monitoring
24
Look for trends that can indicate a problem.
› Alert or provide automated corrections
Provide an interface to visualize
› Current data very quickly
› Historical data in depth
If you commercialize this one please give me/Yahoo a free license for
life (open source works too)
25. Things to Consider
25
Scale
› Lots of events from many different servers
Latency
› a few seconds max, but the fewer the better
Iterative Processing
› For in-depth analysis, definitely
26. Fictitious Example 2: Web Scale Monitoring
26
Servers
HBase
Kafka Storm
HDFS Spark
UI
Alert!!
JDBC
Server
Rules
ML and trend
analysis