Realtime Data Analysis Patterns
Mikio Braun 
@mikiobraun 
streamdrill & TU Berlin 
O'Reilly Strata+Hadoop, Barcelona
Nov 21, 2014 
Mikio L. Braun, @mikiobraun Realtime Data Analysis Patterns (c) 2014 by Mikio Braun
How it all started: Realtime Twitter Retweet Trends
Rails app + PostgreSQL
About 100 tweets/second, and it got worse
Road from there 
● Version 1.0: Rails + PostgreSQL 
– store and batch 
● Version 2.0: Scala + Cassandra 
– stream processing & working data on disk 
● Version 3.0: streamdrill 
– “in-memory realtime analytics database” 
– approximate algorithms to bound resources
– moderate parallelism for some things 
Lessons learned? 
Not just one kind of realtime.
Applications 
Finance, Gaming, Monitoring
Advertisement, Sensor Networks, Social Media
Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie 
Two Dimensions of Real-Time
● Complexity
– counting
– trends
– outlier detection
– recommendation
– prediction (churn, etc.)
● Latency
– now (ms, RTB)
– seconds (fraud)
– hours (monitoring)
– days (reporting)
What makes realtime hard 
● Many Events 
– 100 events / second 
– 360k per hour 
– 8.6M per day 
– 260M per month 
– 3.2B per year 
● Many Objects 
http://www.flickr.com/photos/arenamontanus/269158554/ 
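The event volumes listed on this slide follow from simple multiplication. A minimal sketch of the arithmetic, assuming a constant rate, a 30-day month and a 365-day year (the slide does not state which month or year length it used):

```python
# Event volumes implied by a constant rate of 100 events per second.
RATE_PER_SECOND = 100

per_hour = RATE_PER_SECOND * 3600
per_day = per_hour * 24
per_month = per_day * 30      # assumption: 30-day month
per_year = per_day * 365      # assumption: 365-day year

for label, value in [("hour", per_hour), ("day", per_day),
                     ("month", per_month), ("year", per_year)]:
    print(f"{value:>13,} events per {label}")
```

Rounded, these are the 360k, 8.6M, 260M and 3.2B figures quoted above.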
Classes of Realtime 
● Events per second (100s? 1000s? 10k?) 
● Number of objects (A few dozen? Millions?) 
● Complexity (Counting? Trends?) 
● Latency (Milliseconds? Hours?) 
General Architecture 
Data Acquisition 
● Flat files / HDFS 
● Apache Flume / Logstash 
● Apache Kafka for distributed logging 
Processing 
● Depending on Latency: Batch or Streaming 
● Batch 
– Apache Hadoop 
– Apache Spark 
– Apache Flink 
● Streaming 
– Apache Storm 
– Apache Samza 
Query Layer 
● Hadoop/Storm/Spark have no query layer 
● Use a database backend such as Redis to store the results
Lambda Architecture: Mixing Batch & Streaming
http://lambda-architecture.net/ 
Kappa Architecture 
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html 
Scaling vs. Approximation 
● Scaling is expensive 
● Not all results are relevant 
● Data changes all the time anyway 
● Approximate: trade accuracy for resource usage
Approximation harmful? 
Heavy Hitters
● Count activities over large item sets (millions or more, e.g. IP addresses, Twitter users)
● Interested in the most active elements only
● Fixed table of counts
– frank 15
– paul 12
– jan 8
– felix 5
– leo 3
– alex 2
● Case 1: element already in the database
– paul: 12 → 13
● Case 2: new element
– nico replaces alex (count 2) and enters with count 3
Metwally, Agrawal, El Abbadi: Efficient Computation of Frequent and Top-k Elements in Data Streams, International Conference on Database Theory, 2005
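The two update cases on this slide can be sketched as follows. This is a minimal illustration in the style of the Space-Saving algorithm from the cited paper, not streamdrill's implementation; the class name `SpaceSaving` and its methods are hypothetical, and the linear scan for the smallest entry is a simplification that a production version would replace with a more efficient structure.

```python
class SpaceSaving:
    """Fixed-size table of counts (hypothetical name, illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}

    def update(self, item):
        if item in self.counts:
            # Case 1: element already in the table, just increment it.
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            # Table not yet full: start a new counter.
            self.counts[item] = 1
        else:
            # Case 2: evict the entry with the smallest count; the new
            # element takes over that count plus one.
            victim = min(self.counts, key=self.counts.get)
            smallest = self.counts.pop(victim)
            self.counts[item] = smallest + 1

    def top(self, k):
        # The k largest entries, highest count first.
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:k]


# Reproduce the slide's example table.
table = SpaceSaving(capacity=6)
table.counts = {"frank": 15, "paul": 12, "jan": 8,
                "felix": 5, "leo": 3, "alex": 2}
table.update("paul")   # Case 1: paul goes from 12 to 13
table.update("nico")   # Case 2: nico replaces alex and starts at 2 + 1 = 3
print(table.top(3))
```

Because a new element inherits the evicted count, counts can be overestimated, which is the accuracy traded for a fixed memory footprint.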
Count Min Sketch
● Summarize histograms over large feature sets
● Like Bloom filters, but better
● Table of counters: n different hash functions × m bins
● Update: a new entry updates one bin per hash function
● Query: take the minimum over all hash functions
[Figure: counter table with the updates for a new entry; query result in the example: 1]
G. Cormode and S. Muthukrishnan: An Improved Data Stream Summary: The Count-Min Sketch and its Applications. LATIN 2004; J. Algorithms 55(1): 58-75 (2005).
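A minimal sketch of the update and query steps described on this slide. The class name `CountMinSketch`, its parameters, and the choice of salted BLAKE2b as the family of hash functions are assumptions made for illustration; the slides do not say which hash functions are used.

```python
import hashlib


class CountMinSketch:
    """n hash functions (rows) times m bins (columns) of counters."""

    def __init__(self, num_hashes, num_bins):
        self.num_hashes = num_hashes
        self.num_bins = num_bins
        self.table = [[0] * num_bins for _ in range(num_hashes)]

    def _bin(self, row, item):
        # Assumption: one hash function per row, derived by salting
        # BLAKE2b with the row number.
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                                 salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little") % self.num_bins

    def add(self, item, count=1):
        # Update: increment one bin in every row.
        for row in range(self.num_hashes):
            self.table[row][self._bin(row, item)] += count

    def query(self, item):
        # Query: the minimum over all rows is the least-collided counter.
        return min(self.table[row][self._bin(row, item)]
                   for row in range(self.num_hashes))


sketch = CountMinSketch(num_hashes=4, num_bins=64)
for name in ["paul", "paul", "paul", "nico"]:
    sketch.add(name)
print(sketch.query("paul"))  # never less than the true count of 3
```

Collisions can only add to a counter, so the estimate never undercounts; taking the minimum picks the row with the fewest collisions.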
HyperLogLog
● Hash the stream to generate random bit strings
● Look for infrequent events
● If an event has probability 1/100, about 100 events should have been seen on average by the time it occurs
● Average over several estimates to improve the result
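The idea above can be sketched in a few lines. This is a simplified illustration, not a tuned implementation: the class name `HyperLogLog`, the parameter `p`, and the use of BLAKE2b as the hash are assumptions, and only the small-range correction is included. The "infrequent event" is a long run of leading zero bits in the hash, and the registers provide the averaging.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct count (simplified, illustrative sketch)."""

    def __init__(self, p=12):
        self.p = p                      # number of bits used as register index
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")            # 64-bit hash value
        index = h >> (64 - self.p)                   # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank: 1-based position of the first 1-bit in `rest`.
        # A high rank is a rare event and hints at many distinct items.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)        # bias constant for m >= 128
        # Harmonic mean over all registers.
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros > 0:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


hll = HyperLogLog(p=12)
for repeat in range(2):                 # every item is seen twice
    for i in range(10000):
        hll.add(f"user-{i}")
print(round(hll.count()))               # close to 10000, not exact
```

Duplicates do not change the registers, so memory stays fixed at 2^p small integers regardless of stream length.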
Comparing Approx. Algorithms 
● Heavy Hitters: 
– approx. counts + top-k 
– large memory requirement 
● Count Min Sketch 
– approx. counts for all, but no top-k, no elements 
– needs to know size beforehand 
● HyperLogLog 
– approx. number of distinct elements 
Exponential Decay 
Beyond Counting 
Streamdrill & Demos 
● Realtime Analysis Solutions 
● Core Engine: 
– Heavy Hitters + exponential decay + secondary indices
– Instant counts & top-k results over time windows 
– In-memory 
– Written in Scala 
● Modules 
– Profiling and Trending 
– Recommendations 
– Count Distinct 
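The slides do not show how streamdrill combines counts with exponential decay. One common way to do it, sketched here under that caveat, is to store a value together with the time of its last update and to scale it down by the elapsed time before adding to it. The class name `DecayingCounter` and the `half_life` parameter are hypothetical.

```python
class DecayingCounter:
    """Counter whose value halves every `half_life` time units."""

    def __init__(self, half_life):
        self.half_life = half_life
        self.value = 0.0
        self.last_update = None

    def _decayed(self, now):
        # Value as seen at time `now`, without modifying the counter.
        if self.last_update is None:
            return 0.0
        elapsed = now - self.last_update
        return self.value * 0.5 ** (elapsed / self.half_life)

    def add(self, now, amount=1.0):
        # Decay the stored value to `now`, then add the new event.
        self.value = self._decayed(now) + amount
        self.last_update = now

    def read(self, now):
        return self._decayed(now)


counter = DecayingCounter(half_life=60.0)   # time measured in seconds
counter.add(now=0.0)
counter.add(now=0.0)
print(counter.read(now=0.0))     # two events just happened
print(counter.read(now=60.0))    # one half-life later
```

Only one number and one timestamp are kept per element, so a table of such counters approximates a count over a recent time window without storing individual events.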
Example: Twitter Stock Analysis 
http://play.streamdrill.com/vis/ 
● Trends: 
– symbol:combinations $AAPL:$GOOG 
– symbol:hashtag $AAPL:#trading 
– symbol:keywords $GOOG:disruption 
– symbol:mentions $GOOG:WallStreetCom 
– symbol trend $AAPL 
– symbol:url $FB:http://on.wsj.com/15fHaZW 
Example: Twitter Stock Analysis 
[Diagram: Twitter sends tweets to the Tweet Analyzer, which sends updates to streamdrill; JavaScript reads the results via REST]
Realtime user profiles 
● Process 10k events / second on one machine 
● Track about 1 Million counts per 1 GB 
● Shard by user for higher accuracy 
Realtime Data Analysis Patterns 
● Acquisition / Processing / Query Layer 
● Acquisition: Flat files and distributed logs 
● Processing: Scaling batch or streaming 
● Query Layer: Separate query from processing 
● Lambda and Kappa Architecture 
● Approximation as alternative to scaling 
● Trends with indices as building blocks for data analysis
Thank You 
Mikio Braun 
mikio@streamdrill.com 
@mikiobraun 
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
 
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
 
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
 
Contact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New DelhiContact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New Delhi
 
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
 
Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...
 
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
 
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja VipCall Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
 

Realtime Data Analysis Patterns

  • 1. Realtime Data Analysis Patterns. Mikio Braun, @mikiobraun, streamdrill & TU Berlin. O'Reilly Strata+Hadoop, Barcelona, Nov 21, 2014. Mikio L. Braun, @mikiobraun Realtime Data Analysis Patterns (c) 2014 by Mikio Braun
  • 2. How it all started: Realtime Twitter Retweet Trends. Rails app + PostgreSQL. About 100 tweets/second, and it got worse.
  • 3. Road from there
      ● Version 1.0: Rails + PostgreSQL
          – store and batch
      ● Version 2.0: Scala + Cassandra
          – stream processing & working data on disk
      ● Version 3.0: streamdrill
          – “in-memory realtime analytics database”
          – approximative algorithms to bound resources
          – moderate parallelism for some things
  • 4. Lessons learned? Not just one kind of realtime.
  • 5. Applications: Finance, Gaming, Monitoring, Advertisement, Sensor Networks, Social Media. Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie
  • 6. Two Dimensions of Real-Time
      Complexity
          ● counting
          ● trends
          ● outlier detection
          ● recommendation
          ● prediction (churn, etc.)
      Latency
          ● now (ms, RTB)
          ● seconds (fraud)
          ● hours (monitoring)
          ● days (reporting)
  • 7. What makes realtime hard
      ● Many Events
          – 100 events / second
          – 360k per hour
          – 8.6M per day
          – 260M per month
          – 3.2B per year
      ● Many Objects
      http://www.flickr.com/photos/arenamontanus/269158554/
  • 8. Classes of Realtime
      ● Events per second (100s? 1000s? 10k?)
      ● Number of objects (A few dozen? Millions?)
      ● Complexity (Counting? Trends?)
      ● Latency (Milliseconds? Hours?)
  • 9. General Architecture
  • 10. Data Acquisition
      ● Flat files / HDFS
      ● Apache Flume / Logstash
      ● Apache Kafka for distributed logging
  • 11. Processing
      ● Depending on Latency: Batch or Streaming
      ● Batch
          – Apache Hadoop
          – Apache Spark
          – Apache Flink
      ● Streaming
          – Apache Storm
          – Apache Samza
  • 12. Query Layer
      ● Hadoop/Storm/Spark have no query layer
      ● Some DB backend like Redis to store the results
  • 13. Lambda Architecture: Mixing Batch & Streaming. http://lambda-architecture.net/
  • 14. Kappa Architecture. http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
  • 15. Scaling vs. Approximation
      ● Scaling is expensive
      ● Not all results are relevant
      ● Data changes all the time anyway
      ● Approximate: Trade accuracy for resource usage
  • 16. Approximation harmful?
  • 17. Heavy Hitters
      ● Count activities over large item sets (millions, even more, e.g. IP addresses, Twitter users)
      ● Interested in most active elements only.
      Fixed table of counts: frank 15, paul 12, jan 8, felix 5, leo 3, alex 2
      Case 1: element already in database: paul 12 → 13
      Case 2: new element: nico replaces alex (count 2) and enters with count 3
      Metwally, Agrawal, Abbadi, Efficient Computation of Frequent and Top-k Elements in Data Streams, International Conference on Database Theory, 2005
  • 18. Count Min Sketch
      ● Summarize histograms over large feature sets
      ● Like bloom filters, but better
      ● Query: Take minimum over all hash functions
      [Figure: table of counters with m bins per row and n different hash functions, one per row; the updates for a new entry touch one bin in each row; example query result: 1]
      G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. LATIN 2004, J. Algorithms 55(1): 58-75 (2005).
  • 19. HyperLogLog
      ● Hash stream to generate random bit strings
      ● Look for infrequent events
      ● If an event has probability one hundredth, about 100 events should have been seen on average by the time it occurs.
      ● Average to improve estimate.
  • 20. Comparing Approx. Algorithms
      ● Heavy Hitters
          – approx. counts + top-k
          – large memory requirement
      ● Count Min Sketch
          – approx. counts for all, but no top-k, no elements
          – needs to know size beforehand
      ● HyperLogLog
          – approx. number of distinct elements
  • 21. Exponential Decay
  • 22. Beyond Counting
  • 23. Streamdrill & Demos
      ● Realtime Analysis Solutions
      ● Core Engine:
          – Heavy Hitters + exponential decay + secondary indices
          – Instant counts & top-k results over time windows
          – In-memory
          – Written in Scala
      ● Modules
          – Profiling and Trending
          – Recommendations
          – Count Distinct
  • 24. Example: Twitter Stock Analysis. http://play.streamdrill.com/vis/
  • 25. Example: Twitter Stock Analysis
      ● Trends:
          – symbol:combinations $AAPL:$GOOG
          – symbol:hashtag $AAPL:#trading
          – symbol:keywords $GOOG:disruption
          – symbol:mentions $GOOG:WallStreetCom
          – symbol trend $AAPL
          – symbol:url $FB:http://on.wsj.com/15fHaZW
  • 26. Example: Twitter Stock Analysis
  • 27. Example: Twitter Stock Analysis
  • 28. Example: Twitter Stock Analysis
      [Diagram with the components Twitter, Tweet Analyzer, streamdrill, and JavaScript via REST; arrows labeled "tweets" and "updates"]
  • 29. Realtime User Profiles
  • 30. Realtime User Profiles
  • 31. Realtime User Profiles
  • 32. Realtime User Profiles
  • 33. Realtime user profiles
      ● Process 10k events / second on one machine
      ● Track about 1 Million counts per 1 GB
      ● Shard by user for higher accuracy
  • 34. Realtime Data Analysis Patterns
      ● Acquisition / Processing / Query Layer
      ● Acquisition: Flat files and distributed logs
      ● Processing: Scaling batch or streaming
      ● Query Layer: Separate query from processing
      ● Lambda and Kappa Architecture
      ● Approximation as alternative to scaling
      ● Trends with indices as building blocks for data analysis
  • 35. Thank You. Mikio Braun, mikio@streamdrill.com, @mikiobraun
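
The event volumes on slide 7 follow directly from the rate of 100 events per second. A small worked sketch of that arithmetic; the 30-day month and 365-day year are assumptions chosen here because they reproduce the rounded slide figures, and the function name `events_per` is illustrative only.

```python
# Rate quoted on slide 7.
RATE_PER_SECOND = 100

# Period lengths in seconds. The 30-day month and 365-day year are
# assumptions; the slide does not say which conventions it used.
SECONDS = {
    "hour": 60 * 60,
    "day": 24 * 60 * 60,
    "month": 30 * 24 * 60 * 60,
    "year": 365 * 24 * 60 * 60,
}


def events_per(period: str, rate: int = RATE_PER_SECOND) -> int:
    """Number of events accumulated over one period at a constant rate."""
    return rate * SECONDS[period]


if __name__ == "__main__":
    for period in SECONDS:
        print(f"{period:>5}: {events_per(period):,}")
```

Rounded, these are the 360k, 8.6M, 260M and 3.2B figures from the slide.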
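
The two update cases on slide 17 (Heavy Hitters) can be sketched in a few lines. This is a minimal illustration in the spirit of the Space-Saving scheme from the cited Metwally et al. paper, not streamdrill's implementation: the class and method names are made up for this sketch, and the minimum is found by a linear scan where a production version would use a faster structure.

```python
class HeavyHitters:
    """Fixed-size table of counts; hypothetical sketch of slide 17."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.counts: dict[str, int] = {}

    def update(self, item: str) -> None:
        if item in self.counts:
            # Case 1: element already in the table, just increment.
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            # Table not yet full: start counting the new element at 1.
            self.counts[item] = 1
        else:
            # Case 2: table full. Evict the element with the smallest
            # count and let the new element take over that count plus one.
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

    def top(self, k: int) -> list[tuple[str, int]]:
        """The k entries with the highest (approximate) counts."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


if __name__ == "__main__":
    hh = HeavyHitters(capacity=6)
    # Table from the slide.
    hh.counts = {"frank": 15, "paul": 12, "jan": 8, "felix": 5, "leo": 3, "alex": 2}
    hh.update("paul")   # case 1
    hh.update("nico")   # case 2
    print(hh.top(6))
```

Because a new element inherits the evicted count, the stored counts can overestimate but the table never grows beyond `capacity`, which is the resource bound the talk is after.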
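
Slide 18 describes the Count Min Sketch as n hash functions over m bins, with queries taking the minimum over all hash functions. A minimal sketch under stated assumptions: the n hash functions are simulated by salting BLAKE2b with the row number, and the names `CountMinSketch`, `add` and `estimate` are chosen for illustration.

```python
import hashlib


class CountMinSketch:
    """n rows (one per hash function) of m bins each; illustrative only."""

    def __init__(self, rows: int = 4, bins: int = 1024):
        self.rows = rows
        self.bins = bins
        self.table = [[0] * bins for _ in range(rows)]

    def _index(self, row: int, item: str) -> int:
        # Assumption: one salted BLAKE2b hash per row stands in for
        # n independent hash functions.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.bins

    def add(self, item: str, count: int = 1) -> None:
        # An update touches exactly one bin in every row.
        for row in range(self.rows):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate a bin, so the minimum over all rows
        # is the tightest available estimate and never an underestimate.
        return min(
            self.table[row][self._index(row, item)] for row in range(self.rows)
        )


if __name__ == "__main__":
    cms = CountMinSketch()
    for word in ["a", "b", "a", "c", "a"]:
        cms.add(word)
    print(cms.estimate("a"), cms.estimate("b"), cms.estimate("zzz"))
```

The table size is fixed up front, which matches the "needs to know size beforehand" caveat on slide 20, and the structure stores no element names, so it cannot list a top-k by itself.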
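
The idea on slide 19 (hash the stream into random bit strings, watch for rare bit patterns, average many estimates) can be sketched as a basic HyperLogLog. This is a simplified illustration, not the exact algorithm used in streamdrill: the hash function (BLAKE2b), the default precision `p=10`, and the class and method names are assumptions; the bias constant and small-range correction follow the standard published HyperLogLog formulas.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct count with 2**p small registers; illustrative."""

    def __init__(self, p: int = 10):
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16 for this sketch")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")       # 64 pseudo-random bits
        index = h >> (64 - self.p)               # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank = position of the first 1 bit in `rest`. A rank of r has
        # probability 2**-r, so seeing it suggests about 2**r distinct items.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)         # bias constant for m >= 128
        # Harmonic mean over registers averages the per-register estimates.
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(10_000):
        hll.add(f"user-{i}")
    print(round(hll.count()))
```

With 1024 registers the expected relative error is on the order of a few percent, and adding the same element again never changes the estimate.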