Talk I gave at Strata+Hadoop in Barcelona on November 21, 2014.
In this talk I discuss our experience with realtime analysis of high-volume event data streams.
Realtime Data Analysis Patterns
1. Realtime Data Analysis Patterns
Mikio Braun
@mikiobraun
streamdrill & TU Berlin
O'Reilly Strata+Hadoop, Barcelona
Nov 21, 2014
Mikio L. Braun, @mikiobraun Realtime Data Analysis Patterns (c) 2014 by Mikio Braun
2. How it all started: Realtime Twitter Retweet Trends
Rails app + PostgreSQL
About 100 tweets/second, and it got worse
3. Road from there
● Version 1.0: Rails + PostgreSQL
– store and batch
● Version 2.0: Scala + Cassandra
– stream processing & working data on disk
● Version 3.0: streamdrill
– “in-memory realtime analytics database”
– approximate algorithms to bound resources
– moderate parallelism for some things
4. Lessons learned?
Not just one kind of realtime.
6. Two Dimensions of Real-Time
Complexity
● counting
● trends
● outlier detection
● recommendation
● prediction (churn, etc.)
Latency
● now (ms, RTB)
● seconds (fraud)
● hours (monitoring)
● days (reporting)
7. What makes realtime hard
● Many Events
– 100 events / second
– 360k per hour
– 8.6M per day
– 260M per month
– 3.2B per year
● Many Objects
http://www.flickr.com/photos/arenamontanus/269158554/
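The event volumes on this slide follow from plain multiplication. A quick check, assuming a 30-day month and a 365-day year (the variable names are made up for this illustration):

```python
RATE_PER_SECOND = 100  # events per second, as on the slide

HOUR = 3600            # seconds
DAY = 24 * HOUR

# Number of events accumulated over each period at a constant rate.
volumes = {
    "hour": RATE_PER_SECOND * HOUR,
    "day": RATE_PER_SECOND * DAY,
    "month": RATE_PER_SECOND * 30 * DAY,   # assumes a 30-day month
    "year": RATE_PER_SECOND * 365 * DAY,   # assumes a 365-day year
}

for period, count in volumes.items():
    print(f"per {period}: {count:,}")
```

Rounded, these are the 360k, 8.6M, 260M and 3.2B figures listed above.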
8. Classes of Realtime
● Events per second (100s? 1000s? 10k?)
● Number of objects (A few dozen? Millions?)
● Complexity (Counting? Trends?)
● Latency (Milliseconds? Hours?)
10. Data Acquisition
● Flat files / HDFS
● Apache Flume / Logstash
● Apache Kafka for distributed logging
11. Processing
● Depending on Latency: Batch or Streaming
● Batch
– Apache Hadoop
– Apache Spark
– Apache Flink
● Streaming
– Apache Storm
– Apache Samza
12. Query Layer
● Hadoop/Storm/Spark have no query layer
● Use a database backend such as Redis to store the results
13. Lambda Architecture: Mixing Batch & Streaming
http://lambda-architecture.net/
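To make the idea of mixing batch and streaming concrete, here is a toy sketch in plain Python. It assumes the simplest possible workload (counting events per key). The class `LambdaCounter` and its method names are hypothetical and only illustrate the pattern: a batch view recomputed from the full log, a speed view updated per event, and a query that merges both.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda Architecture for per-key counts (illustrative only)."""

    def __init__(self):
        self.master_log = []         # append-only log of all events
        self.batch_view = Counter()  # recomputed from the whole log
        self.speed_view = Counter()  # incremental counts since the last batch run

    def ingest(self, key):
        # Every event goes to the log and to the realtime (speed) layer.
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        # The batch layer recomputes its view from scratch, after which
        # the speed layer can discard what the batch view now covers.
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, key):
        # The query layer merges both views.
        return self.batch_view[key] + self.speed_view[key]


counter = LambdaCounter()
for key in ["a", "a", "b"]:
    counter.ingest(key)
counter.run_batch()
counter.ingest("a")
print(counter.query("a"))
```

In a real deployment the batch run and the ingestion happen concurrently on different systems; this sketch ignores that and runs them one after the other.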
15. Scaling vs. Approximation
● Scaling is expensive
● Not all results are relevant
● Data changes all the time anyway
● Approximate: trade accuracy for resource usage
17. Heavy Hitters
● Count activities over large item sets (millions, even more, e.g. IP addresses, Twitter users)
● Interested in most active elements only.
Fixed table of counts:
frank 15
paul 12
jan 8
felix 5
leo 3
alex 2
Case 1: element already in the table
– paul: 12 → 13
Case 2: new element
– nico replaces the smallest entry (alex, 2) and gets count 3
Metwally, Agrawal, El Abbadi, Efficient Computation of Frequent and Top-k Elements in Data Streams, International Conference on Database Theory (ICDT), 2005
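The two cases on this slide can be sketched in a few lines of Python. This is a minimal illustration of the fixed-size counting scheme, not the data structure from the paper: it finds the smallest entry with a linear scan, which is simple but slow for large tables. The class name `SpaceSaving` and its methods are chosen for this example.

```python
class SpaceSaving:
    """Fixed-size table of approximate counts (illustrative sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}

    def add(self, item):
        if item in self.counts:
            # Case 1: element already in the table, just increment.
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            # Table not yet full: start a new counter.
            self.counts[item] = 1
        else:
            # Case 2: new element evicts the smallest entry and
            # inherits its count plus one.
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

    def top(self, k):
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


# Reproduce the example table from the slide.
hh = SpaceSaving(capacity=6)
hh.counts = {"frank": 15, "paul": 12, "jan": 8, "felix": 5, "leo": 3, "alex": 2}
hh.add("paul")  # case 1
hh.add("nico")  # case 2
print(hh.top(3))
```

Because a new element inherits the evicted count, counts can be overestimated but never underestimated, which is what keeps truly frequent elements in the table.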
18. Count Min Sketch
● Summarize histograms over large feature sets
● Like Bloom filters, but storing counts instead of bits
[Figure: a table of counters with n rows (one per hash function) and m bins per row. A new entry increments one bin in each row. In the example shown, the query result is 1.]
● Query: Take minimum over all hash functions
G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. LATIN 2004, J. Algorithms 55(1): 58-75 (2005).
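A minimal Count-Min Sketch along the lines described above might look as follows. The class name, the parameters `width` (m bins) and `depth` (n hash functions), and the use of salted SHA-256 as the hash family are assumptions made for this sketch; production implementations usually use cheaper hash functions.

```python
import hashlib


class CountMinSketch:
    """depth rows of width counters; one hash function per row."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _bins(self, item):
        # Derive one bin per row by salting the hash with the row index.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        # Update: increment one bin in every row.
        for row, col in self._bins(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Query: take the minimum over all hash functions.
        return min(self.table[row][col] for row, col in self._bins(item))


cms = CountMinSketch(width=1000, depth=4)
for _ in range(5):
    cms.add("paul")
cms.add("nico", 2)
print(cms.estimate("paul"), cms.estimate("nico"))
```

Collisions can only add to a bin, so the estimate is never below the true count; taking the minimum over rows picks the least-polluted bin.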
19. HyperLogLog
● Hash stream to generate random bit strings
● Look for infrequent events
● If an event has probability one hundredth → we should have seen about 100 elements on average by the time it occurs.
● Average to improve estimate.
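The bullets above can be turned into a compact sketch. The "infrequent event" here is a long run of leading zero bits in the hash: a run of length k has probability about 2^-k, so seeing one suggests roughly 2^k distinct elements. The version below follows the standard HyperLogLog estimator with the usual small-range correction. The class name, the default of 2^12 registers, and the choice of SHA-1 as hash are assumptions of this example.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct count with 2**p registers (illustrative sketch)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")        # 64 random-looking bits
        index = x >> (64 - self.p)                   # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)             # bias correction, m >= 128
        # Harmonic mean over registers: the "average to improve estimate" step.
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            return m * math.log(m / zeros)           # small-range correction
        return raw


hll = HyperLogLog()
for i in range(50000):
    hll.add(f"user-{i}")
estimate = hll.count()
print(round(estimate))
```

With 4096 registers the expected relative error is around 1.6%, and adding the same element again never changes the registers, which is what makes this a distinct count.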
20. Comparing Approx. Algorithms
● Heavy Hitters:
– approx. counts + top-k
– large memory requirement
● Count Min Sketch
– approx. counts for all, but no top-k, no elements
– needs to know size beforehand
● HyperLogLog
– approx. number of distinct elements
21. Exponential Decay
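The slide itself carries only a figure. As a rough illustration of the idea named in the title, and not necessarily how streamdrill implements it, a counter can be decayed with a fixed half-life so that old events fade out. The class name `DecayingCounter` and the `half_life` parameter are assumptions made for this sketch, and timestamps are assumed to be non-decreasing.

```python
class DecayingCounter:
    """Counter whose value halves every `half_life` time units."""

    def __init__(self, half_life):
        self.half_life = half_life
        self.value = 0.0
        self.last_update = None

    def read(self, now):
        # Decay the stored value from the last update time to `now`.
        if self.last_update is None:
            return 0.0
        elapsed = now - self.last_update
        return self.value * 0.5 ** (elapsed / self.half_life)

    def add(self, now, weight=1.0):
        # Bring the value up to date, then count the new event.
        self.value = self.read(now) + weight
        self.last_update = now


counter = DecayingCounter(half_life=60.0)
counter.add(0.0)
counter.add(0.0)
print(counter.read(0.0), counter.read(60.0))
```

Only one number and one timestamp are stored per counter, so time windows come for free without keeping the individual events.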
22. Beyond Counting
23. Streamdrill & Demos
● Realtime Analysis Solutions
● Core Engine:
– Heavy Hitters + exponential decay + secondary indices
– Instant counts & top-k results over time windows
– In-memory
– Written in Scala
● Modules
– Profiling and Trending
– Recommendations
– Count Distinct
24. Example: Twitter Stock Analysis
http://play.streamdrill.com/vis/
25. Example: Twitter Stock Analysis
● Trends:
– symbol:combinations $AAPL:$GOOG
– symbol:hashtag $AAPL:#trading
– symbol:keywords $GOOG:disruption
– symbol:mentions $GOOG:WallStreetCom
– symbol trend $AAPL
– symbol:url $FB:http://on.wsj.com/15fHaZW
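To show how a single tweet could be turned into keys for the trends listed above, here is a hypothetical extraction step. The regular expressions, the function name `trend_keys`, and the key format are assumptions based on the examples on the slide, not the demo's actual code. The keyword trend is left out because it would need a stop-word list.

```python
import re
from itertools import combinations

SYMBOL = re.compile(r"\$[A-Z]{1,5}\b")   # stock symbols such as $AAPL
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@(\w+)")          # captured without the leading @
URL = re.compile(r"https?://\S+")


def trend_keys(tweet):
    """Return the trend keys a tweet contributes to, grouped by trend name."""
    urls = URL.findall(tweet)
    text = URL.sub(" ", tweet)           # keep URL fragments from matching as tags
    symbols = sorted(set(SYMBOL.findall(text)))
    hashtags = HASHTAG.findall(text)
    mentions = MENTION.findall(text)
    return {
        "symbol": symbols,
        "symbol:combinations": [f"{a}:{b}" for a, b in combinations(symbols, 2)],
        "symbol:hashtag": [f"{s}:{h}" for s in symbols for h in hashtags],
        "symbol:mentions": [f"{s}:{m}" for s in symbols for m in mentions],
        "symbol:url": [f"{s}:{u}" for s in symbols for u in urls],
    }


keys = trend_keys(
    "$AAPL and $GOOG look strong #trading via @WallStreetCom http://on.wsj.com/15fHaZW"
)
for trend, values in keys.items():
    print(trend, values)
```

Each generated key would then be sent as one update to the corresponding trend, so a single tweet fans out into several counter increments.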
26. Example: Twitter Stock Analysis
27. Example: Twitter Stock Analysis
28. Example: Twitter Stock Analysis
[Diagram: Twitter sends tweets to the Tweet Analyzer, which sends updates to streamdrill; the JavaScript front end queries streamdrill via REST.]
29. Realtime User Profiles
30. Realtime User Profiles
31. Realtime User Profiles
32. Realtime User Profiles
33. Realtime User Profiles
● Process 10k events / second on one machine
● Track about 1 million counts per GB of memory
● Shard by user for higher accuracy
34. Realtime Data Analysis Patterns
● Acquisition / Processing / Query Layer
● Acquisition: Flat files and distributed logs
● Processing: Scaling batch or streaming
● Query Layer: Separate query from processing
● Lambda and Kappa Architecture
● Approximation as alternative to scaling
● Trends with indices as building blocks for data analysis
35. Thank You
Mikio Braun
mikio@streamdrill.com
@mikiobraun