SlideShare une entreprise Scribd logo
1  sur  51
Télécharger pour lire hors ligne
Scott Haines, Twilio
Streaming Trend Discovery:
real-time discovery in a sea of events
#DDSAIS16
@newfront
#DDSAIS16 Github https://bit.ly/2ISI3BS#DDSAIS16
Agenda
• Intro and use cases for Spark at Twilio (5 min)
• Elements of a Streaming Discovery Engine (5 min)
• Principals of Valuable Data Sets (4 min)
• Refresher on Histograms and Percentiles (2 min)
• Building a Discovery Engine on Spark: Part One (15 min)
• Break?: 10 min
• Building a Discovery Engine on Spark: Part Two (15 min)
• Lessons learned (5 min)
• Questions (5 min)
Project Files on Github @ https://bit.ly/2ISI3BS
2
#DDSAIS16 Github https://bit.ly/2ISI3BS
Scott Haines
• 6+ years working with Streaming Systems
• Yahoo! Games
• Recommendations, Personalization
• Daily Fantasy Growth / Analytics
• Mobile Campaigns, Fraud, Trends
• Flurry Analytics
• Statistics based notification system
• Stats / Data Science Enthusiast
• Drummer / Snowboarder / Hot Sauce
Maker / Teacher
3
Principal Engineer, Voice Insights @ Twilio
@newfront
#DDSAIS16 Github https://bit.ly/2ISI3BS
Twilio
We are a public, large-scale, globally available & distributed
communications as-a-service company.
• Founded in 2008. IPO'd in 2016
• Telecommunications is our bread and butter
• Along with Video / SMS / MMS / Wireless Sim / Authy / Sync
• Find out more and sign up @ twilio.com
4
Where I work
twilio.com
#DDSAIS16 Github https://bit.ly/2ISI3BS
Use cases for Spark @Twilio
• Real-Time Dashboards and Debugging
• Fraud Detection
• Carrier Behavior and Network Anomaly Detection
• Streaming ETL, Metrics and Aggregations
• ASR and Sentiment
5
#DDSAIS16 Github https://bit.ly/2ISI3BS
Voice Insights @Twilio
• Processing >3 TB /d
• >4 billion raw events /d
• >2 billion metric & sessions aggregations /d
• Powers Voice Insights / Carrier Ops / Support and more
6
#DDSAIS16 Github https://bit.ly/2ISI3BS#DDSAIS16
Next up
• Elements of the Streaming Trend Discovery Architecture
7
#DDSAIS16 Github https://bit.ly/2ISI3BS 8
Building a Discovery Engine
End to End Architecture
API raw events Streaming Aggregation metric aggs
S3
ElasticTrend Processing
Recent Metrics
Historic Metrics
events
Historic Statistics <50ms response time
*Generalized ESD or Grubbs & Distance Function
#DDSAIS16 Github https://bit.ly/2ISI3BS
Core Principals of Valuable Data Sets
• Data must be Structured. No JSON!
9
#DDSAIS16 Github https://bit.ly/2ISI3BS
Core Principals of Valuable Data Sets
• Data must be Structured. No JSON!
• JSON ~= Object
10
#DDSAIS16 Github https://bit.ly/2ISI3BS
Core Principals of Valuable Data Sets
• Data must be Structured. No JSON!
• JSON ~= Object
• Avro, Protobuf, Parquet are widely used and can support
backwards compatible updates.
11
#DDSAIS16 Github https://bit.ly/2ISI3BS
Core Principals Establish a Data Contract
12
Structured Data has Types and are part of the Data Contract
• Properties of the Data must be
documented.
• Data must be Accurate and
Complete*
• Data must support backwards and
forwards compatibility.
• Data should be easily flattened.
#DDSAIS16 Github https://bit.ly/2ISI3BS
Core Principals Final Thoughts
13
• Structured Data allows for compile time guarantees
• Write once. Compile across languages
• Non-Java Object Serialization
• External Serialization simplifies deploys given you can
simply use the last stored session state*
* don't corrupt your stored session data
#DDSAIS16 Github https://bit.ly/2ISI3BS#DDSAIS16
Building our mental model
• Quick Refresher on Percentiles and Histograms
14
#DDSAIS16 Github https://bit.ly/2ISI3BS
• Percentiles group numeric data in
a set of observations.
15
Building a Discovery Engine
Primer: What is a Percentile?
S = {1,2,3,4,5,6,7,8,9,10}
#DDSAIS16 Github https://bit.ly/2ISI3BS
• Percentiles group numeric data in
a set of observations.
• Common splits are min, 25%,50%
(median),75% and max
16
Building a Discovery Engine
Primer: What is a Percentile?
S = {1,2,3,4,5,6,7,8,9,10}
min:1
p25: 3.25
median: 5.50
p75: 7.75
max: 10.0
#DDSAIS16 Github https://bit.ly/2ISI3BS
• Histograms are bucketed
distributions of numeric datasets
17
Building a Discovery Engine
Primer: What is a Histogram?
#DDSAIS16 Github https://bit.ly/2ISI3BS
• Histograms are bucketed
distributions of numeric datasets
• In a Normal Distribution, we talk
about the Area Under the Curve
18
Building a Discovery Engine
Primer: What is a Histogram?
#DDSAIS16 Github https://bit.ly/2ISI3BS
• Histograms are bucketed
distributions of numeric datasets
• A Normal Distribution, fit into the
Area Under the Curve
• Histograms allow us to understand
the shape of a distribution
19
Building a Discovery Engine
Primer: What is a Histogram?
68-95-99.7 Rule
- 68%@1sd->mean,95@2x,99.7@3x)
#DDSAIS16 Github https://bit.ly/2ISI3BS 20
t1 t2 t3 t4
Building a Discovery Engine
Goal: Time Series Histogram Analysis
#DDSAIS16 Github https://bit.ly/2ISI3BS
• Limitations: Not easily Aggregated or hard to extract
21
Building a Discovery Engine
Deep Dive: Quantiles / Histograms in Spark
#DDSAIS16 Github https://bit.ly/2ISI3BS
Define a new UDF based off of the
MutableAggregationBuffer's in Spark
22
Building a Discovery Engine
Deep Dive: Aggregation Extension Options
#DDSAIS16 Github https://bit.ly/2ISI3BS
• Upside: Things are human
readable
23
Building a Discovery Engine
Deep Dive: Aggregation Extension Options
#DDSAIS16 Github https://bit.ly/2ISI3BS
• Upside: Things are human
readable
• Downside: Performance* /
Control*
24
Building a Discovery Engine
Deep Dive: Aggregation Extension Options
#DDSAIS16 Github https://bit.ly/2ISI3BS
• Ditch the DataFrame's unstructured API and move over
to Datasets typed API.
• This enables the use of flatMapGroupsWithState
• Now with a little effort you have full control over the
aggregation strategy
25
Building a Discovery Engine
Deep Dive: Aggregation Extension Options : Final Thoughts
#DDSAIS16 Github https://bit.ly/2ISI3BS#DDSAIS16
Next up: after the break
Building a Discovery Engine: Part 2
1. Define the Problem (1m)
2. Build a plan of Attack (1m)
3. Model the Data (2m)
5. Build the Streaming Discovery Engine (~20m)
6. Hopefully time for lessons learned and Questions!
26
#DDSAIS16 Github https://bit.ly/2ISI3BS
Global Telephony is tough.
27
#DDSAIS16 Github https://bit.ly/2ISI3BS
Simply Connecting two endpoints is ripe for failure.
28
#DDSAIS16 Github https://bit.ly/2ISI3BS 29
X
Given Networks can go down at anytime.
#DDSAIS16 Github https://bit.ly/2ISI3BS 30
We should build a system to
auto-discover trends in (~real-time) for our KPMs
and automatically report them...
#DDSAIS16 Github https://bit.ly/2ISI3BS 31
Building a Discovery Engine
End to End Architecture
API raw events Streaming Aggregation metric aggs
S3
ElasticTrend Processing
Recent Metrics
Historic Metrics
events
Historic Statistics <50ms response time
*Generalized ESD or Grubbs & Distance Function
#DDSAIS16 Github https://bit.ly/2ISI3BS
Building a Discovery Engine
32
Define the raw event
• Must provide the information
necessary for aggregation
#DDSAIS16 Github https://bit.ly/2ISI3BS
Building a Discovery Engine
33
Define the raw event
• Must provide the information
necessary for aggregation
• id must be idempotent. Brownie
points if it can be reused in Hash
Partitioner
#DDSAIS16 Github https://bit.ly/2ISI3BS
Building a Discovery Engine
34
Define the raw event
• Must provide the information
necessary for aggregation
• id must be idempotent. *dedup
key - Brownie points if it can be
reused in Hash Partitioner in
spark and kafka!
• event_dimensions + event_group
can create great grouping hash
#DDSAIS16 Github https://bit.ly/2ISI3BS
Building a Discovery Engine
35
Define the raw event
• Allows for event evolution while
supporting backwards compatibility.
#DDSAIS16 Github https://bit.ly/2ISI3BS
Building a Discovery Engine
36
Define the raw event
• Allows for event evolution while
supporting backwards compatibility.
• Enum ordinals can be used in ML
jobs without the need for string
encoding
#DDSAIS16 Github https://bit.ly/2ISI3BS
Building a Discovery Engine
37
Define the aggregation
• Includes logical time series Window
• Stores all Dimensional and Statistic
Information
#DDSAIS16 Github https://bit.ly/2ISI3BS
Building a Discovery Engine
38
Define the aggregation
• Includes logical time series Window
• Stores all Dimensional and Statistic
Information
• Can be used together as a strong
Binary Group Key
#DDSAIS16 Github https://bit.ly/2ISI3BS
Building a Discovery Engine
39
Why hash the dimensions?
• Hashing the dimensions of the
aggregation simplifies the
groupBy or groupByKey
mechanics.
• Use guava since it ships with
Spark and works well at scale for
Hashing.
#DDSAIS16 Github https://bit.ly/2ISI3BS#DDSAIS16
Next up
• Putting together what we've learned with Spark
• Available in the content @ https://bit.ly/2ISI3BS
• Walk through of critical components of the Application
40
#DDSAIS16 Github https://bit.ly/2ISI3BS#DDSAIS16 41
Building a Discovery Engine
Creates the pipeline
* process(KDSR), inject
SparkSession (implicit*)
* write back to Kafka
* checkpoint, etc
#DDSAIS16 Github https://bit.ly/2ISI3BS#DDSAIS16 42
Building a Discovery Engine
1. Transform (protobuf -
> Product)
2. Actual Aggregation
* groupedBy BinaryType
* fMGwS keyed by
BinaryType
* EventTimeTimeout*
#DDSAIS16 Github https://bit.ly/2ISI3BS#DDSAIS16 43
Building a Discovery Engine
• StateSpec takes a
StateFunc
• This is for Key/Value pair
grouped data
• Key is [A]. It can be Binary.
Cool
• Take an Iterator[B]
• Stores a GroupState[C]
• Emits Iterator[D]
#DDSAIS16 Github https://bit.ly/2ISI3BS#DDSAIS16 44
Building a Discovery Engine
metricAggregation
#DDSAIS16 Github https://bit.ly/2ISI3BS#DDSAIS16 45
Building a Discovery Engine
• Data Sketches from Yahoo*
provides serializable Quantile
Sketches with Histogram
capabilities.
• Data Sketches are a light
weight representation of a
complete set of data (on a
diet - think HyperLogLog)
• We use this on our team to
enable performant
aggregation of complicated
statistics without storing all
rows in Spark memory.
#DDSAIS16 Github https://bit.ly/2ISI3BS#DDSAIS16 46
Building a Discovery Engine
• Data Sketches from Yahoo*
provides serializable Quantile
Sketches with Histogram
capabilities.
• Data Sketches are a light
weight representation of a
complete set of data (on a
diet - think HyperLogLog)
• We use this on our team to
enable performant
aggregation of complicated
statistics without storing all
rows in Spark memory.
#DDSAIS16 Github https://bit.ly/2ISI3BS#DDSAIS16 47
Building a Discovery Engine
• Testing should feel
good.
• Ensure you solve
Encoder issues early
• Make sure you don't
add regression's into
your 24/7/365 apps!
#DDSAIS16 Github https://bit.ly/2ISI3BS 48
Building a Discovery Engine
End to End Architecture
API raw events Streaming Aggregation metric aggs
S3
ElasticTrend Processing
Recent Metrics
Historic Metrics
events
Historic Statistics <50ms response time
*Generalized ESD or Grubbs & Distance Function
#DDSAIS16 Github https://bit.ly/2ISI3BS#DDSAIS16
Building the Continuous Trend Detector
1. Read from Kafka (raw events)
2. Look up *historic statistics from Elastic Search or Redis
3. Generate an anomaly stream based off of distance to statistical norm
4. Emit to Kafka for downstream use
49
Building a Discovery Engine
Anatomy of the application: Streaming Aggregation Query
#DDSAIS16 Github https://bit.ly/2ISI3BS
Discovering trends in massive data sets doesn't come
for free. There are core principals and ideologies that
must be followed in order to extract interesting patterns
from your data.
50
#DDSAIS16 Github https://bit.ly/2ISI3BS
Questions? / Thanks
51
@newfront
github@newfront
shaines@twilio.com
Thanks to Marek Radonsky and Rohit Gupta from the Voice Insights team who
helped make these patterns possible.

Contenu connexe

Tendances

Spark Summit San Francisco 2016 - Ali Ghodsi Keynote
Spark Summit San Francisco 2016 - Ali Ghodsi KeynoteSpark Summit San Francisco 2016 - Ali Ghodsi Keynote
Spark Summit San Francisco 2016 - Ali Ghodsi KeynoteDatabricks
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleMLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleDatabricks
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RDatabricks
 
Insights Without Tradeoffs: Using Structured Streaming
Insights Without Tradeoffs: Using Structured StreamingInsights Without Tradeoffs: Using Structured Streaming
Insights Without Tradeoffs: Using Structured StreamingDatabricks
 
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...Khai Tran
 
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyApache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyDatabricks
 
Spark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos ErotocritouSpark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos ErotocritouSpark Summit
 
Pandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySparkPandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySparkLi Jin
 
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...DataWorks Summit
 
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...Databricks
 
Spark Summit 2015 keynote: Making Big Data Simple with Spark
Spark Summit 2015 keynote: Making Big Data Simple with SparkSpark Summit 2015 keynote: Making Big Data Simple with Spark
Spark Summit 2015 keynote: Making Big Data Simple with SparkDatabricks
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
Databricks with R: Deep Dive
Databricks with R: Deep DiveDatabricks with R: Deep Dive
Databricks with R: Deep DiveDatabricks
 
Managing the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowManaging the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowDatabricks
 
H2O World - H2O Rains with Databricks Cloud
H2O World - H2O Rains with Databricks CloudH2O World - H2O Rains with Databricks Cloud
H2O World - H2O Rains with Databricks CloudSri Ambati
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 Databricks
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun JeongSpark Summit
 
From R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep GillFrom R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep GillDatabricks
 
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold XinUnifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold XinDatabricks
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Wee Hyong Tok
 

Tendances (20)

Spark Summit San Francisco 2016 - Ali Ghodsi Keynote
Spark Summit San Francisco 2016 - Ali Ghodsi KeynoteSpark Summit San Francisco 2016 - Ali Ghodsi Keynote
Spark Summit San Francisco 2016 - Ali Ghodsi Keynote
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleMLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
 
Insights Without Tradeoffs: Using Structured Streaming
Insights Without Tradeoffs: Using Structured StreamingInsights Without Tradeoffs: Using Structured Streaming
Insights Without Tradeoffs: Using Structured Streaming
 
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
 
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyApache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
 
Spark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos ErotocritouSpark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos Erotocritou
 
Pandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySparkPandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySpark
 
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...
 
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
 
Spark Summit 2015 keynote: Making Big Data Simple with Spark
Spark Summit 2015 keynote: Making Big Data Simple with SparkSpark Summit 2015 keynote: Making Big Data Simple with Spark
Spark Summit 2015 keynote: Making Big Data Simple with Spark
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Databricks with R: Deep Dive
Databricks with R: Deep DiveDatabricks with R: Deep Dive
Databricks with R: Deep Dive
 
Managing the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowManaging the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflow
 
H2O World - H2O Rains with Databricks Cloud
H2O World - H2O Rains with Databricks CloudH2O World - H2O Rains with Databricks Cloud
H2O World - H2O Rains with Databricks Cloud
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
 
From R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep GillFrom R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep Gill
 
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold XinUnifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425
 

Similaire à Streaming Trend Discovery: Real-Time Discovery in a Sea of Events with Scott Haines

Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and PythonTravis Oliphant
 
Real Time Analytics with Apache Cassandra - Cassandra Day Berlin
Real Time Analytics with Apache Cassandra - Cassandra Day BerlinReal Time Analytics with Apache Cassandra - Cassandra Day Berlin
Real Time Analytics with Apache Cassandra - Cassandra Day BerlinGuido Schmutz
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogC4Media
 
DevOpsGuys FutureDecoded 2016 - is DevOps the Answer
DevOpsGuys FutureDecoded 2016 - is DevOps the AnswerDevOpsGuys FutureDecoded 2016 - is DevOps the Answer
DevOpsGuys FutureDecoded 2016 - is DevOps the AnswerDevOpsGroup
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014ALTER WAY
 
Anaconda and PyData Solutions
Anaconda and PyData SolutionsAnaconda and PyData Solutions
Anaconda and PyData SolutionsTravis Oliphant
 
ClickHouse Analytical DBMS. Introduction and usage, by Alexander Zaitsev
ClickHouse Analytical DBMS. Introduction and usage, by Alexander ZaitsevClickHouse Analytical DBMS. Introduction and usage, by Alexander Zaitsev
ClickHouse Analytical DBMS. Introduction and usage, by Alexander ZaitsevAltinity Ltd
 
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSABetter Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSAPRBETTER
 
From Developer to Data Scientist - Gaines Kergosien
From Developer to Data Scientist - Gaines KergosienFrom Developer to Data Scientist - Gaines Kergosien
From Developer to Data Scientist - Gaines KergosienITCamp
 
Pycon2017 instagram keynote
Pycon2017 instagram keynotePycon2017 instagram keynote
Pycon2017 instagram keynoteLisa Guo
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Codemotion
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixC4Media
 
Workshop: Big Data Visualization for Security
Workshop: Big Data Visualization for SecurityWorkshop: Big Data Visualization for Security
Workshop: Big Data Visualization for SecurityRaffael Marty
 
Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"Demi Ben-Ari
 
The world is not black and white – Impact of decisions over the lifetime of a...
The world is not black and white – Impact of decisions over the lifetime of a...The world is not black and white – Impact of decisions over the lifetime of a...
The world is not black and white – Impact of decisions over the lifetime of a...Eric Reiche
 
Odsc london data science bootcamp with pixie dust
Odsc london data science bootcamp with pixie dustOdsc london data science bootcamp with pixie dust
Odsc london data science bootcamp with pixie dustDavid Taieb
 
Managing your black friday logs Voxxed Luxembourg
Managing your black friday logs Voxxed LuxembourgManaging your black friday logs Voxxed Luxembourg
Managing your black friday logs Voxxed LuxembourgDavid Pilato
 
FluentMigrator - Dayton .NET - July 2023
FluentMigrator - Dayton .NET - July 2023FluentMigrator - Dayton .NET - July 2023
FluentMigrator - Dayton .NET - July 2023Matthew Groves
 
CI-CD Jenkins, GitHub Actions, Tekton
CI-CD Jenkins, GitHub Actions, Tekton CI-CD Jenkins, GitHub Actions, Tekton
CI-CD Jenkins, GitHub Actions, Tekton Araf Karsh Hamid
 

Similaire à Streaming Trend Discovery: Real-Time Discovery in a Sea of Events with Scott Haines (20)

Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
 
Real Time Analytics with Apache Cassandra - Cassandra Day Berlin
Real Time Analytics with Apache Cassandra - Cassandra Day BerlinReal Time Analytics with Apache Cassandra - Cassandra Day Berlin
Real Time Analytics with Apache Cassandra - Cassandra Day Berlin
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
 
DevOpsGuys FutureDecoded 2016 - is DevOps the Answer
DevOpsGuys FutureDecoded 2016 - is DevOps the AnswerDevOpsGuys FutureDecoded 2016 - is DevOps the Answer
DevOpsGuys FutureDecoded 2016 - is DevOps the Answer
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
 
Anaconda and PyData Solutions
Anaconda and PyData SolutionsAnaconda and PyData Solutions
Anaconda and PyData Solutions
 
ClickHouse Analytical DBMS. Introduction and usage, by Alexander Zaitsev
ClickHouse Analytical DBMS. Introduction and usage, by Alexander ZaitsevClickHouse Analytical DBMS. Introduction and usage, by Alexander Zaitsev
ClickHouse Analytical DBMS. Introduction and usage, by Alexander Zaitsev
 
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSABetter Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
 
From Developer to Data Scientist - Gaines Kergosien
From Developer to Data Scientist - Gaines KergosienFrom Developer to Data Scientist - Gaines Kergosien
From Developer to Data Scientist - Gaines Kergosien
 
Pycon2017 instagram keynote
Pycon2017 instagram keynotePycon2017 instagram keynote
Pycon2017 instagram keynote
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
Workshop: Big Data Visualization for Security
Workshop: Big Data Visualization for SecurityWorkshop: Big Data Visualization for Security
Workshop: Big Data Visualization for Security
 
Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"
 
The world is not black and white – Impact of decisions over the lifetime of a...
The world is not black and white – Impact of decisions over the lifetime of a...The world is not black and white – Impact of decisions over the lifetime of a...
The world is not black and white – Impact of decisions over the lifetime of a...
 
Odsc london data science bootcamp with pixie dust
Odsc london data science bootcamp with pixie dustOdsc london data science bootcamp with pixie dust
Odsc london data science bootcamp with pixie dust
 
Monitoring your API
Monitoring your APIMonitoring your API
Monitoring your API
 
Managing your black friday logs Voxxed Luxembourg
Managing your black friday logs Voxxed LuxembourgManaging your black friday logs Voxxed Luxembourg
Managing your black friday logs Voxxed Luxembourg
 
FluentMigrator - Dayton .NET - July 2023
FluentMigrator - Dayton .NET - July 2023FluentMigrator - Dayton .NET - July 2023
FluentMigrator - Dayton .NET - July 2023
 
CI-CD Jenkins, GitHub Actions, Tekton
CI-CD Jenkins, GitHub Actions, Tekton CI-CD Jenkins, GitHub Actions, Tekton
CI-CD Jenkins, GitHub Actions, Tekton
 

Plus de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Plus de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Dernier

Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 

Dernier (20)

Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 

Streaming Trend Discovery: Real-Time Discovery in a Sea of Events with Scott Haines

  • 1. Scott Haines, Twilio Streaming Trend Discovery: real-time discovery in a sea of events #DDSAIS16 @newfront
  • 2. #DDSAIS16 Github https://bit.ly/2ISI3BS#DDSAIS16 Agenda • Intro and use cases for Spark at Twilio (5 min) • Elements of a Streaming Discovery Engine (5 min) • Principals of Valuable Data Sets (4 min) • Refresher on Histograms and Percentiles (2 min) • Building a Discovery Engine on Spark: Part One (15 min) • Break?: 10 min • Building a Discovery Engine on Spark: Part Two (15 min) • Lessons learned (5 min) • Questions (5 min) Project Files on Github @ https://bit.ly/2ISI3BS 2
  • 3. #DDSAIS16 Github https://bit.ly/2ISI3BS Scott Haines • 6+ years working with Streaming Systems • Yahoo! Games • Recommendations, Personalization • Daily Fantasy Growth / Analytics • Mobile Campaigns, Fraud, Trends • Flurry Analytics • Statistics based notification system • Stats / Data Science Enthusiast • Drummer / Snowboarder / Hot Sauce Maker / Teacher 3 Principal Engineer, Voice Insights @ Twilio @newfront
  • 4. #DDSAIS16 Github https://bit.ly/2ISI3BS Twilio We are a public, large-scale, globally available & distributed communications as-a-service company. • Founded in 2008. IPO'd in 2016 • Telecommunications is our bread and butter • Along with Video / SMS / MMS / Wireless Sim / Authy / Sync • Find out more and sign up @ twilio.com 4 Where I work twilio.com
  • 5. #DDSAIS16 Github https://bit.ly/2ISI3BS Use cases for Spark @Twilio • Real-Time Dashboards and Debugging • Fraud Detection • Carrier Behavior and Network Anomaly Detection • Streaming ETL, Metrics and Aggregations • ASR and Sentiment 5
  • 6. #DDSAIS16 Github https://bit.ly/2ISI3BS Voice Insights @Twilio • Processing >3 TB /d • >4 billion raw events /d • >2 billion metric & sessions aggregations /d • Powers Voice Insights / Carrier Ops / Support and more 6
  • 7. #DDSAIS16 Github https://bit.ly/2ISI3BS#DDSAIS16 Next up • Elements of the Streaming Trend Discovery Architecture 7
  • 8. #DDSAIS16 Github https://bit.ly/2ISI3BS 8 Building a Discovery Engine End to End Architecture API raw events Streaming Aggregation metric aggs S3 ElasticTrend Processing Recent Metrics Historic Metrics events Historic Statistics <50ms response time *Generalized ESD or Grubbs & Distance Function
  • 9. #DDSAIS16 Github https://bit.ly/2ISI3BS Core Principals of Valuable Data Sets • Data must be Structured. No JSON! 9
  • 10. #DDSAIS16 Github https://bit.ly/2ISI3BS Core Principals of Valuable Data Sets • Data must be Structured. No JSON! • JSON ~= Object 10
  • 11. #DDSAIS16 Github https://bit.ly/2ISI3BS Core Principals of Valuable Data Sets • Data must be Structured. No JSON! • JSON ~= Object • Avro, Protobuf, Parquet are widely used and can support backwards compatible updates. 11
  • 12. #DDSAIS16 Github https://bit.ly/2ISI3BS Core Principals Establish a Data Contract 12 Structured Data has Types and are part of the Data Contract • Properties of the Data must be documented. • Data must be Accurate and Complete* • Data must support backwards and forwards compatibility. • Data should be easily flattened.
  • 13. #DDSAIS16 Github https://bit.ly/2ISI3BS Core Principals Final Thoughts 13 • Structured Data allows for compile time guarantees • Write once. Compile across languages • Non-Java Object Serialization • External Serialization simplifies deploys given you can simply use the last stored session state* * don't corrupt your stored session data
  • 14. #DDSAIS16 Github https://bit.ly/2ISI3BS#DDSAIS16 Building our mental model • Quick Refresher on Percentiles and Histograms 14
  • 15. #DDSAIS16 Github https://bit.ly/2ISI3BS • Percentiles group numeric data in a set of observations. 15 Building a Discovery Engine Primer: What is a Percentile? S = {1,2,3,4,5,6,7,8,9,10}
  • 16. #DDSAIS16 Github https://bit.ly/2ISI3BS • Percentiles group numeric data in a set of observations. • Common splits are min, 25%,50% (median),75% and max 16 Building a Discovery Engine Primer: What is a Percentile? S = {1,2,3,4,5,6,7,8,9,10} min:1 p25: 3.25 median: 5.50 p75: 7.75 max: 10.0
  • 17. #DDSAIS16 Github https://bit.ly/2ISI3BS • Histograms are bucketed distributions of numeric datasets 17 Building a Discovery Engine Primer: What is a Histogram?
  • 18. #DDSAIS16 Github https://bit.ly/2ISI3BS • Histograms are bucketed distributions of numeric datasets • In a Normal Distribution, we talk about the Area Under the Curve 18 Building a Discovery Engine Primer: What is a Histogram?
  • 19. #DDSAIS16 Github https://bit.ly/2ISI3BS • Histograms are bucketed distributions of numeric datasets • A Normal Distribution, fit into the Area Under the Curve • Histograms allow us to understand the shape of a distribution 19 Building a Discovery Engine Primer: What is a Histogram? 68-95-99.7 Rule - 68%@1sd->mean,95@2x,99.7@3x)
  • 20. #DDSAIS16 Github https://bit.ly/2ISI3BS 20 t1 t2 t3 t4 Building a Discovery Engine Goal: Time Series Histogram Analysis
  • 21. #DDSAIS16 Github https://bit.ly/2ISI3BS • Limitations: Not easily Aggregated or hard to extract 21 Building a Discovery Engine Deep Dive: Quantiles / Histograms in Spark
  • 22. #DDSAIS16 Github https://bit.ly/2ISI3BS Define a new UDF based off of the MutableAggregationBuffer's in Spark 22 Building a Discovery Engine Deep Dive: Aggregation Extension Options
  • 23. #DDSAIS16 Github https://bit.ly/2ISI3BS • Upside: Things are human readable 23 Building a Discovery Engine Deep Dive: Aggregation Extension Options
  • 24. #DDSAIS16 Github https://bit.ly/2ISI3BS • Upside: Things are human readable • Downside: Performance* / Control* 24 Building a Discovery Engine Deep Dive: Aggregation Extension Options
  • 25. #DDSAIS16 Github https://bit.ly/2ISI3BS • Ditch the DataFrame's unstructured API and move over to Datasets typed API. • This enables the use of flatMapGroupsWithState • Now with a little effort you have full control over the aggregation strategy 25 Building a Discovery Engine Deep Dive: Aggregation Extension Options : Final Thoughts
  • 26. #DDSAIS16 Github https://bit.ly/2ISI3BS#DDSAIS16 Next up: after the break Building a Discovery Engine: Part 2 1. Define the Problem (1m) 2. Build a plan of Attack (1m) 3. Model the Data (2m) 5. Build the Streaming Discovery Engine (~20m) 6. Hopefully time for lessons learned and Questions! 26
  • 28. #DDSAIS16 Github https://bit.ly/2ISI3BS Simply Connecting two endpoints is ripe for failure. 28
  • 29. #DDSAIS16 Github https://bit.ly/2ISI3BS 29 X Given Networks can go down at anytime.
  • 30. #DDSAIS16 Github https://bit.ly/2ISI3BS 30 We should build a system to auto-discover trends in (~real-time) for our KPMs and automatically report them...
  • 31. #DDSAIS16 Github https://bit.ly/2ISI3BS 31 Building a Discovery Engine End to End Architecture API raw events Streaming Aggregation metric aggs S3 ElasticTrend Processing Recent Metrics Historic Metrics events Historic Statistics <50ms response time *Generalized ESD or Grubbs & Distance Function
  • 32. #DDSAIS16 Github https://bit.ly/2ISI3BS Building a Discovery Engine 32 Define the raw event • Must provide the information necessary for aggregation
  • 33. #DDSAIS16 Github https://bit.ly/2ISI3BS Building a Discovery Engine 33 Define the raw event • Must provide the information necessary for aggregation • id must be idempotent. Brownie points if it can be reused in Hash Partitioner
  • 34. #DDSAIS16 Github https://bit.ly/2ISI3BS Building a Discovery Engine 34 Define the raw event • Must provide the information necessary for aggregation • id must be idempotent. *dedup key - Brownie points if it can be reused in Hash Partitioner in spark and kafka! • event_dimensions + event_group can create great grouping hash
  • 35. #DDSAIS16 Github https://bit.ly/2ISI3BS Building a Discovery Engine 35 Define the raw event • Allows for event evolution while supporting backwards compatibility.
  • 36. #DDSAIS16 Github https://bit.ly/2ISI3BS Building a Discovery Engine 36 Define the raw event • Allows for event evolution while supporting backwards compatibility. • Enum ordinals can be used in ML jobs without the need for string encoding
  • 37. #DDSAIS16 Github https://bit.ly/2ISI3BS Building a Discovery Engine 37 Define the aggregation • Includes logical time series Window • Stores all Dimensional and Statistic Information
  • 38. #DDSAIS16 Github https://bit.ly/2ISI3BS Building a Discovery Engine 38 Define the aggregation • Includes logical time series Window • Stores all Dimensional and Statistic Information • Can be used together as a strong Binary Group Key
  • 39. #DDSAIS16 Github https://bit.ly/2ISI3BS Building a Discovery Engine 39 Why hash the dimensions? • Hashing the dimensions of the aggregation simplifies the groupBy or groupByKey mechanics. • Use guava since it ships with Spark and works well at scale for Hashing.
  • 40. #DDSAIS16 Github https://bit.ly/2ISI3BS#DDSAIS16 Next up • Putting together what we've learned with Spark • Available in the content @ https://bit.ly/2ISI3BS • Walk through of critical components of the Application 40
  • 41. #DDSAIS16 Github https://bit.ly/2ISI3BS#DDSAIS16 41 Building a Discovery Engine Creates the pipeline * process(KDSR), inject SparkSession (implicit*) * write back to Kafka * checkpoint, etc
  • 42. #DDSAIS16 Github https://bit.ly/2ISI3BS#DDSAIS16 42 Building a Discovery Engine 1. Transform (protobuf - > Product) 2. Actual Aggregation * groupedBy BinaryType * fMGwS keyed by BinaryType * EventTimeTimeout*
  • 43. #DDSAIS16 Github https://bit.ly/2ISI3BS#DDSAIS16 43 Building a Discovery Engine • StateSpec takes a StateFunc • This is for Key/Value pair grouped data • Key is [A]. It can be Binary. Cool • Take an Iterator[B] • Stores a GroupState[C] • Emits Iterator[D]
  • 44. #DDSAIS16 Github https://bit.ly/2ISI3BS#DDSAIS16 44 Building a Discovery Engine metricAggregation
  • 45. #DDSAIS16 Github https://bit.ly/2ISI3BS#DDSAIS16 45 Building a Discovery Engine • Data Sketches from Yahoo* provides serializable Quantile Sketches with Histogram capabilities. • Data Sketches are a light weight representation of a complete set of data (on a diet - think HyperLogLog) • We use this on our team to enable performant aggregation of complicated statistics without storing all rows in Spark memory.
  • 46. #DDSAIS16 Github https://bit.ly/2ISI3BS#DDSAIS16 46 Building a Discovery Engine • Data Sketches from Yahoo* provides serializable Quantile Sketches with Histogram capabilities. • Data Sketches are a light weight representation of a complete set of data (on a diet - think HyperLogLog) • We use this on our team to enable performant aggregation of complicated statistics without storing all rows in Spark memory.
  • 47. #DDSAIS16 Github https://bit.ly/2ISI3BS#DDSAIS16 47 Building a Discovery Engine • Testing should feel good. • Ensure you solve Encoder issues early • Make sure you don't add regression's into your 24/7/365 apps!
  • 48. #DDSAIS16 Github https://bit.ly/2ISI3BS 48 Building a Discovery Engine End to End Architecture API raw events Streaming Aggregation metric aggs S3 ElasticTrend Processing Recent Metrics Historic Metrics events Historic Statistics <50ms response time *Generalized ESD or Grubbs & Distance Function
  • 49. #DDSAIS16 Github https://bit.ly/2ISI3BS#DDSAIS16 Building the Continuous Trend Detector 1. Read from Kafka (raw events) 2. Look up *historic statistics from Elastic Search or Redis 3. Generate an anomaly stream based off of distance to statistical norm 4. Emit to Kafka for downstream use 49 Building a Discovery Engine Anatomy of the application: Streaming Aggregation Query
  • 50. #DDSAIS16 Github https://bit.ly/2ISI3BS Discovering trends in massive data sets doesn't come for free. There are core principals and ideologies that must be followed in order to extract interesting patterns from your data. 50
  • 51. #DDSAIS16 Github https://bit.ly/2ISI3BS Questions? / Thanks 51 @newfront github@newfront shaines@twilio.com Thanks to Marek Radonsky and Rohit Gupta from the Voice Insights team who helped make these patterns possible.