© 2017 MapR Technologies 1
Detecting Change
Contact Information
Ted Dunning, PhD
Chief Application Architect, MapR Technologies
Board member, Apache Software Foundation
O’Reilly author
Email tdunning@mapr.com tdunning@apache.org
Twitter @ted_dunning
Who We Are
• MapR Technologies
– We make a kick-ass platform for big data computing
– Support many workloads including Hadoop / Spark / HPC / Other
– Extended to allow streams and tables in basic platform
– Free for academic research / training
• Apache Software Foundation
– Culture hub for building open source communities
– Shared values around openness for contribution as well as use
– Many major projects are part of Apache
– Even more minor ones!
Basic Outline
• Goal Setting
• Basic Ideas
– LLR (finding changes in counts)
– Poisson rate change detection (finding changes in event timing)
– Distribution estimation / visualization
– Labeled events and adding labels
• Free Improvisation on Themes
Why Is This Practically Important
• The novice came to the master and said “something is broken”
• The master replied “What has changed?”
• And the student was enlightened
The Second Student
• Another student said to the master, “I see something has
changed … something may have broken”
• The master replied, “You have no question to ask. You have no
need of enlightenment”
• And thus the student was enlightened
• There are some very powerful techniques available, some only
very recently, that can make the detection of change much
easier than you might think. I will describe the practical use of
several of these techniques including t-digest, non-linear
histograms, variable rate Poisson models and combinations of
these.
Comparing Counts
• Suppose we have two situations A and B, each with many
observations, nA and nB
• And some event x occurred n1A times in situation A and n1B times in situation B
          x        other
A         n1A      nA - n1A
B         n1B      nB - n1B
Comparing Counts
• Have we seen a change in the frequency of x?
• Frequency ratios?
– Breaks with small counts
• χ² test?
– Breaks with small counts
Log-Likelihood Ratio Test (Root LLR)
• In R
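# N times the Shannon entropy of the counts in k;
# the (k==0) term makes empty cells contribute 0 instead of NaN
entropy = function(k) {
  -sum(k*log((k==0)+(k/sum(k))))
}
# log-likelihood ratio statistic for a matrix of counts k
llr = function(k) {
  (entropy(rowSums(k))+entropy(colSums(k))
    -entropy(k))*2
}
• Like mutual information × 2N, where N is the total count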
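The same computation can be sketched in Python, with the square root added to give the "root LLR" named in the slide title. The function names and the sign convention (positive when x is over-represented in the first row) are assumptions made for this illustration, not taken from the slides.

```python
import math

def entropy(counts):
    """Unnormalized entropy, -sum(k * log(k / N)); zero counts contribute 0."""
    total = sum(counts)
    return -sum(k * math.log(k / total) for k in counts if k > 0)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio statistic for a 2x2 table of counts."""
    rows = [k11 + k12, k21 + k22]
    cols = [k11 + k21, k12 + k22]
    cells = [k11, k12, k21, k22]
    # Clamp at zero to guard against tiny negative values from round-off.
    return max(0.0, 2.0 * (entropy(rows) + entropy(cols) - entropy(cells)))

def root_llr(k11, k12, k21, k22):
    """Signed square root of the LLR score."""
    score = math.sqrt(llr(k11, k12, k21, k22))
    # Assumed sign convention: negative when the first column is rarer
    # in row 1 than in row 2.
    if k11 * (k21 + k22) < k21 * (k11 + k12):
        score = -score
    return score
```

The signed root keeps the "roughly like standard deviations" reading while also saying in which direction the count moved.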
© 2017 MapR Technologies 15
Spot the Anomaly
• Root LLR is roughly like standard deviations

            A        not A
B           13       1000          root LLR = 0.89
not B       1000     100,000

            A        not A
B           1        0             root LLR = 1.95
not B       0        2

            A        not A
B           1        0             root LLR = 4.51
not B       0        10,000

            A        not A
B           10       0             root LLR = 14.29
not B       0        100,000
How Does it Work
Empirical fit to asymptotic
distribution is very good
OK
We can detect changes in counts
Real-life Example
• Query: “Paco de Lucia”
• Conventional meta-data search results:
– “hombres de paco” times 400
– not much else
• Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff
Example 2 - Common Point of Compromise
• Scenario:
– Merchant 0 is compromised, leaks account data during compromise
– Fraud committed elsewhere during exploit
– High background level of fraud
– Limited detection rate for exploits
• Goal:
– Find merchant 0
• Meta-goal:
– Screen algorithms for this task without leaking sensitive data
Example 2 - Common Point of Compromise
[Diagram: a skim at Merchant 0 steals card data; the skimmed data is later used in an exploit, with frauds committed at other merchants (Merchant n).]
Simulation Setup
[Figure: daily counts of compromises and frauds over days 0 to 100, with the compromise period and the exploit period marked. x axis: day; y axis: count.]
Detection Strategy
• Select histories that precede non-fraud
• And histories that precede fraud detection
• Analyze 2x2 cooccurrence of merchant n versus fraud
detection
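A minimal sketch of this strategy, assuming each history is a set of merchants visited plus a flag saying whether fraud was later detected. The data layout, the function names and the toy histories are all made up for illustration; only the 2x2 cooccurrence test itself comes from the slides.

```python
import math
from collections import Counter

def _h(counts):
    total = sum(counts)
    return -sum(k * math.log(k / total) for k in counts if k > 0)

def root_llr(k11, k12, k21, k22):
    g2 = 2.0 * (_h([k11 + k12, k21 + k22]) + _h([k11 + k21, k12 + k22])
                - _h([k11, k12, k21, k22]))
    sign = -1.0 if k11 * (k21 + k22) < k21 * (k11 + k12) else 1.0
    return sign * math.sqrt(max(0.0, g2))

def score_merchants(histories):
    """histories: iterable of (merchants_in_history, fraud_detected) pairs."""
    fraud_total = clean_total = 0
    fraud_seen, clean_seen = Counter(), Counter()
    for merchants, is_fraud in histories:
        if is_fraud:
            fraud_total += 1
            fraud_seen.update(set(merchants))
        else:
            clean_total += 1
            clean_seen.update(set(merchants))
    scores = {}
    for m in set(fraud_seen) | set(clean_seen):
        k11 = fraud_seen[m]          # merchant in history, fraud detected
        k12 = clean_seen[m]          # merchant in history, no fraud
        k21 = fraud_total - k11      # merchant absent, fraud detected
        k22 = clean_total - k12      # merchant absent, no fraud
        scores[m] = root_llr(k11, k12, k21, k22)
    return scores

# Toy data: "m0" appears mostly in histories that precede fraud.
histories = ([({"m0", "m1"}, True)] * 8 + [({"m1", "m2"}, False)] * 50
             + [({"m2"}, True)] * 2 + [({"m0", "m2"}, False)])
scores = score_merchants(histories)
suspect = max(scores, key=scores.get)
```

Ranking merchants by the signed score puts the likely point of compromise at the top and pushes merchants that mostly precede clean histories to the bottom.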
What about the
real world?
[Figure: LLR score for real data. Breach score (LLR), from 0 to 80, plotted against number of merchants (log scale, 10^0 to 10^6); the few highest-scoring merchants are labeled “Really truly bad guys”.]
What about time?
Finding Changes in Timing
• Suppose our input is events embedded in time
• Suppose we want to find changes in our input in real-time
• Waiting and counting is fine if we don’t have to react now
• We can do much better
Poisson Event Rate Change
• Detection of fallout
– Time since last is very sensitive for complete failure
• Detection of change relative to reference
– Time since n-th most recent
– LLR with time
• Have to trade detection speed versus false positive rate and
size of change
• Can run multiple detectors at once
Basic idea:
Time interval is better than counts
Sporadic Events: Finding Normal and Anomalous Patterns
• Time between events is much more usable than absolute
times
• Counts don’t link as directly to probability models
• Time interval is proportional to -log p
• This is a big deal
Event Stream (timing)
• Events of various types arrive at irregular intervals
– we can assume Poisson distribution
• The key question is whether frequency has changed relative to
expected values
– This shows up as a change in interval
• Want alert as soon as possible
Converting Event Times to Anomaly
[Figure: event times converted to anomaly scores, with the 99.9%-ile and 99.99%-ile levels marked.]
In the real world,
event rates often vary
Time Intervals Are Key to Modeling Sporadic Events
[Figure: interval between events, dt (min), versus time, t (days), over days 0 to 4.]
Poisson Distribution
• Time between events is exponentially distributed
• This means that long delays are exponentially rare
• If we know λ we can select a good threshold
– or we can pick a threshold empirically
$\Delta t \sim \lambda e^{-\lambda t}$
$P(\Delta t > T) = e^{-\lambda T}$
$-\log P(\Delta t > T) = \lambda T$
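As a small worked example of these formulas, the sketch below picks a threshold for a known rate λ by inverting the tail probability, and scores an observed gap by its surprise. The function names are hypothetical.

```python
import math

def interval_threshold(rate, tail_prob):
    """Gap length T such that P(dt > T) = tail_prob for exponential gaps."""
    return -math.log(tail_prob) / rate

def surprise(dt, rate):
    """-log P(gap > dt) = rate * dt, the anomaly score for a gap of length dt."""
    return rate * dt

# Threshold for a 1-in-1000 false alarm rate per gap at 2 events per minute.
threshold = interval_threshold(rate=2.0, tail_prob=0.001)
```

Because the surprise is linear in the gap, doubling the rate halves the time needed to reach the same confidence.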
After Rate Correction
[Figure: rate-corrected interval (axis label dt/rate) versus t (days), with the 99.9%-ile and 99.99%-ile levels marked.]
Detecting Anomalies in Sporadic Events
[Diagram: incoming events with times t_i pass through an n-th order difference (Δn); a rate predictor, fed by the rate history, supplies λ; the scaled interval δ = λ(t_i - t_{i-n}) goes to a t-digest, whose 99.97%-ile gives the threshold t; an alarm is raised when δ > t.]
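The detector in the diagram can be sketched roughly as follows. This is an illustration under stated assumptions: a sorted list stands in for the t-digest, the predicted rate is passed in by the caller, and the class name, the `warmup` parameter and the quantile indexing are all hypothetical. The sketch also only checks for an alarm when an event arrives.

```python
import bisect
from collections import deque

class IntervalDetector:
    def __init__(self, n=1, quantile=0.9997, warmup=100):
        self.n = n
        self.quantile = quantile
        self.warmup = warmup              # scores to collect before alarming
        self.recent = deque(maxlen=n)     # the last n event times
        self.history = []                 # sorted past scores (t-digest stand-in)

    def observe(self, t, rate):
        """Feed one event time and the predicted rate; return True on alarm."""
        alarm = False
        if len(self.recent) == self.n:
            # delta = lambda * (t_i - t_{i-n}); recent[0] is t_{i-n}
            delta = rate * (t - self.recent[0])
            if len(self.history) >= self.warmup:
                idx = min(len(self.history) - 1,
                          int(self.quantile * len(self.history)))
                alarm = delta > self.history[idx]
            bisect.insort(self.history, delta)
        self.recent.append(t)
        return alarm

# Steady traffic followed by one long gap.
detector = IntervalDetector(n=1, quantile=0.999, warmup=50)
steady = [detector.observe(float(t), 1.0) for t in range(200)]
late = detector.observe(209.0, 1.0)
```

Several detectors with different n can run side by side on the same stream, since each only keeps its own short window and score history.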
Seasonality Poses a Challenge
[Figure: Christmas traffic, Hits/1000 by date, Nov 17 to Dec 27.]
Something more is needed …
We need a better rate predictor…
[Diagram: the detector pipeline from the earlier slide, repeated.]
Idea: Predict log(rate) from lagged log(rate)
• Predict log because
– Peak to valley ratio
– Traffic grew by 30 %
– All rates are positive
– Just because I said so
• Let model see many lagged values
• Use L1 regularized linear model to pick important historical
values
– We would have moved to something fancier if this hadn’t worked
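A rough sketch of the idea, assuming hourly rates and a hand-picked set of candidate lags. The coordinate-descent lasso below is a generic textbook solver written in plain Python, not the code behind the slides, and the synthetic traffic, lag list and penalty value are invented for illustration.

```python
import math

def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values within [-lam, lam] become 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def lasso(X, y, lam, sweeps=200):
    """Coordinate descent for (1/2n)*sum((y - b0 - X.w)^2) + lam*sum(|w|)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered columns
    resid = [yi - y_mean for yi in y]                           # residual at w = 0
    col_sq = [sum(row[j] ** 2 for row in Xc) / n for j in range(p)]
    w = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant column carries no information
            # Correlation of column j with the residual that leaves feature j out.
            rho = sum(Xc[i][j] * (resid[i] + w[j] * Xc[i][j]) for i in range(n)) / n
            new_w = soft_threshold(rho, lam) / col_sq[j]
            if new_w != w[j]:
                for i in range(n):
                    resid[i] += (w[j] - new_w) * Xc[i][j]
                w[j] = new_w
    b0 = y_mean - sum(wj * mj for wj, mj in zip(w, x_mean))
    return b0, w

def lagged_design(log_rates, lags):
    """Rows of lagged log-rates and the matching targets."""
    start = max(lags)
    X = [[log_rates[t - k] for k in lags] for t in range(start, len(log_rates))]
    return X, log_rates[start:]

# Synthetic hourly traffic with a daily cycle (made-up data).
rates = [100.0 * (1.0 + 0.5 * math.sin(2 * math.pi * t / 24)) for t in range(24 * 14)]
log_rates = [math.log(r) for r in rates]
lags = [1, 2, 3, 24, 48, 168]
X, y = lagged_design(log_rates, lags)
b0, w = lasso(X, y, lam=0.01)
predicted_rate = math.exp(b0 + sum(wj * xj for wj, xj in zip(w, X[-1])))
```

Exponentiating the prediction guarantees a positive rate, and the L1 penalty pushes the weights of unhelpful lags to exactly zero.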
A New Rate Predictor for Sporadic Events
Improved Prediction with Adaptive Modeling
[Figure: Christmas prediction, Hits (x1000) by date, Dec 17 to Dec 29.]
Some days the magic works
Some days ...
We use slightly different magic
Detecting More Subtle Changes
• Time-since-last finds complete failures well
• Nth order time finds more subtle rate changes
• But that subtlety delays detection of complete failure
– First order delay has 99.9% confidence at 6.5 units
– 10th order delay has 99.9% confidence at 12.5 units
• But 10th order delay can find speedups, first order cannot
10th order difference of
Poisson distribution
Finding Changes in Time Series
• So far, we only have times
• What about when we have times and measurements together?
– These are called time-series!
• First step can be to discretize the measurement
– Quintiles or deciles are good candidates
– Multi-scale discretization is a fine thing to do
• That gives us arrival times for measurements in each bin
– And these can be handled by the rate model on the previous slides
Finding Changes in Time Series
• Comprehensive approaches also possible (for counts)
• Time aware variant of G-test is possible
Ted Dunning. Accurate methods for the statistics of surprise and coincidence. Comput. Linguist. 19, 1 (March 1993).
http://bit.ly/surprise-and-coincidence
Propagation Anomalies
• What happens when something shadows part of the coverage
field for mobile telecom?
– Can happen in urban areas with a construction crane
• Can solve heuristically
– Subtract from reference image composed by long term averages
– Doesn’t deal well with weak signal regions and low S/N
• Can solve probabilistically
– Compute anomaly for each measurement, use mean of log(p)
Variable Signal/Noise Makes Heuristic Tricky
Far from the transmitter,
received signal is dominated by
noise. This makes subtraction of
average value a bad algorithm.
Other Issues
• Finding changes in coverage area is similarly tricky
• Coverage area is roughly where tower signal strength is higher
than neighbors
• Except for fuzziness due to hand-off delays
• Except for bias due to large-scale caller motions
– Rush hour
– Event mobs
Simple Answer for Propagation Anomalies
• Cluster signal strength reports
• Cluster locations using k-means, large k
• Model report rate anomaly using discrete event models
• Model signal strength anomaly using percentile model
• Trade larger k against higher report rates, faster detection
• Overall anomaly is sum of individual log(p) anomalies
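The last bullet can be sketched as below, assuming one report-rate score and one signal-strength score per cluster. The two-sided empirical tail with add-one smoothing is an illustrative choice for the percentile model, and all names are hypothetical.

```python
import bisect
import math

def gap_surprise(dt, rate):
    """-log P(gap > dt) for Poisson arrivals at the given report rate."""
    return rate * dt

def percentile_surprise(value, sorted_reference):
    """-log of a two-sided empirical tail probability for `value`."""
    n = len(sorted_reference)
    below = bisect.bisect_left(sorted_reference, value)
    above = n - bisect.bisect_right(sorted_reference, value)
    # Add-one smoothing keeps the probability away from zero.
    tail = (min(below, above) + 1) / (n + 1)
    return -math.log(min(1.0, 2.0 * tail))

def composite_anomaly(cluster_reports):
    """Sum of per-cluster scores; each item is (dt, rate, strength, reference)."""
    return sum(gap_surprise(dt, rate) + percentile_surprise(strength, ref)
               for dt, rate, strength, ref in cluster_reports)

# Made-up reference signal strengths for one cluster.
reference = [float(v) for v in range(100)]
typical = percentile_surprise(50.0, reference)
extreme = percentile_surprise(1000.0, reference)
```

Adding the -log(p) terms treats the clusters as independent, which is the assumption behind summing them in the first place.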
Tower Coverage Areas
Just One Tower
Cluster Reports for That Tower
[Figure: reports for one tower grouped into clusters numbered 1 to 9.]
Can also sub-divide each cluster
into signal strength ranges
Multiple scales of clustering
can also be used to trade off
geographic versus temporal
resolution
Example
[Figure: three panels, each showing the event intervals dt for one cluster.]
Each cluster gives us a
sequence of events.
Individual anomaly scores can
be scaled and added to get
composite anomaly score
Optimality of combined signal
derives from optimality of
components.
Characterizing Distributions
• What about sequences of values from arbitrary distributions?
– Can we find changes in the distribution?
– For instance, what about latencies?
• Non-linear histogram - FloatHistogram
• Fully Adaptive histogram – t-digest
FloatHistogram
• Assume all measurements are in the range
• Divide this range into power of 2 sub-ranges
• Sub-divide each sub-range evenly with steps
• Relative error is bounded in measurement space
• Bin index can be computed using FP representation!
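A minimal sketch of such a bin index, using `math.frexp` to read the exponent and mantissa instead of manipulating the bits directly. The range limits (`min_exp`, `max_exp`) and the number of steps per power-of-2 sub-range are hypothetical defaults, since the slides do not give values here.

```python
import math

def float_bin(x, min_exp=-10, max_exp=10, steps=8):
    """Bin index for x in [2**min_exp, 2**max_exp), `steps` bins per octave."""
    if not (2.0 ** min_exp <= x < 2.0 ** max_exp):
        raise ValueError("x outside histogram range")
    mantissa, exponent = math.frexp(x)   # x == mantissa * 2**exponent, 0.5 <= mantissa < 1
    octave = exponent - 1 - min_exp      # which power-of-2 sub-range x falls in
    step = int((2.0 * mantissa - 1.0) * steps)   # even subdivision inside it
    return octave * steps + step

def bin_lower_edge(index, min_exp=-10, steps=8):
    """Smallest value that maps to the given bin index."""
    octave, step = divmod(index, steps)
    return 2.0 ** (min_exp + octave) * (1.0 + step / steps)
```

Every bin is at most 1/steps of its lower edge wide, which is where the bounded relative error comes from.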
T-digest
• Or we can talk about small errors in q
• Accumulate samples, sort, merge
• Merge if k-size < 1
• Interpolate using centroids in x
• Very good near extremes, no dynamic allocation
[Figure: scale function k plotted against q, with q from 0 to 1 and k from 0 to 10.]
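The sort-and-merge step can be sketched as follows. This is a simplified single-pass illustration, not the t-digest library itself: the arcsine scale function and `delta=10` are assumptions chosen so that k runs from 0 to 10 as q goes from 0 to 1, and the function names are hypothetical.

```python
import math

def k_scale(q, delta):
    """Assumed scale function: steep near q=0 and q=1, flatter in the middle."""
    return delta * (math.asin(2.0 * q - 1.0) / math.pi + 0.5)

def merge_digest(samples, delta=10.0):
    """Greedy merge of sorted samples into (mean, weight) centroids."""
    data = sorted(samples)
    total = len(data)
    centroids = []
    if total == 0:
        return centroids
    mean, weight, seen = data[0], 1, 0   # `seen` = weight left of the current centroid
    for x in data[1:]:
        q_left = seen / total
        q_right = (seen + weight + 1) / total
        # Merge only while the centroid's size in k units stays below 1.
        if k_scale(q_right, delta) - k_scale(q_left, delta) < 1.0:
            weight += 1
            mean += (x - mean) / weight
        else:
            centroids.append((mean, weight))
            seen += weight
            mean, weight = x, 1
    centroids.append((mean, weight))
    return centroids

centroids = merge_digest([i * 0.5 for i in range(1000)])
```

Because the scale function is steep near q = 0 and q = 1, centroids at the extremes stay small, which is why accuracy is best in the tails.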
Finding Change with Histograms
• With fixed bins, we can simply count and compare counts for
different bins
• Thus, histogram change reduces to count change
• Or to changes in event times
Visualizing Histograms
• We want to detect small changes
– Consider log-scale for Y
• Non-linear bin spacing is really good for increasing counts
– Reweight by bin-width
– Changing x axis changes y axis
[Figures, in order: Good Results; Bad Results; With Better Scaling; Bad Results; With FloatHistogram.]
Summary
• Counts – LLR
• Events – Poisson + nth-order diffs
• Decimate in space
• Decimate in measurement space
– t-digest, FloatHistogram
• Don’t forget visualization
Q & A
© 2017 MapR Technologies 88
Contact Information
Ted Dunning, PhD
Chief Application Architect, MapR Technologies
Board member, Apache Software Foundation
O’Reilly author
Email tdunning@mapr.com tdunning@apache.org
Twitter @ted_dunning

Contenu connexe

Tendances

Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossibleTed Dunning
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningTed Dunning
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015Ted Dunning
 
What is the past future tense of data?
What is the past future tense of data?What is the past future tense of data?
What is the past future tense of data?Ted Dunning
 
Cognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approachesCognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approachesTed Dunning
 
Strata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionStrata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionTed Dunning
 
Which Algorithms Really Matter
Which Algorithms Really MatterWhich Algorithms Really Matter
Which Algorithms Really MatterTed Dunning
 
Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0Ted Dunning
 
How to tell which algorithms really matter
How to tell which algorithms really matterHow to tell which algorithms really matter
How to tell which algorithms really matterDataWorks Summit
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningTed Dunning
 
My talk about recommendation and search to the Hive
My talk about recommendation and search to the HiveMy talk about recommendation and search to the Hive
My talk about recommendation and search to the HiveTed Dunning
 
What's new in Apache Mahout
What's new in Apache MahoutWhat's new in Apache Mahout
What's new in Apache MahoutTed Dunning
 
Recommendation Techn
Recommendation TechnRecommendation Techn
Recommendation TechnTed Dunning
 
Building multi-modal recommendation engines using search engines
Building multi-modal recommendation engines using search enginesBuilding multi-modal recommendation engines using search engines
Building multi-modal recommendation engines using search enginesTed Dunning
 
Using Mahout and a Search Engine for Recommendation
Using Mahout and a Search Engine for RecommendationUsing Mahout and a Search Engine for Recommendation
Using Mahout and a Search Engine for RecommendationTed Dunning
 
Buzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learningBuzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learningTed Dunning
 
Polyvalent recommendations
Polyvalent recommendationsPolyvalent recommendations
Polyvalent recommendationsTed Dunning
 
Mathematical bridges From Old to New
Mathematical bridges From Old to NewMathematical bridges From Old to New
Mathematical bridges From Old to NewMapR Technologies
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...MapR Technologies
 

Tendances (20)

Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossible
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine Learning
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015
 
What is the past future tense of data?
What is the past future tense of data?What is the past future tense of data?
What is the past future tense of data?
 
Cognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approachesCognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approaches
 
Strata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionStrata 2014 Anomaly Detection
Strata 2014 Anomaly Detection
 
Which Algorithms Really Matter
Which Algorithms Really MatterWhich Algorithms Really Matter
Which Algorithms Really Matter
 
Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0
 
How to tell which algorithms really matter
How to tell which algorithms really matterHow to tell which algorithms really matter
How to tell which algorithms really matter
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine Learning
 
My talk about recommendation and search to the Hive
My talk about recommendation and search to the HiveMy talk about recommendation and search to the Hive
My talk about recommendation and search to the Hive
 
Dunning ml-conf-2014
Dunning ml-conf-2014Dunning ml-conf-2014
Dunning ml-conf-2014
 
What's new in Apache Mahout
What's new in Apache MahoutWhat's new in Apache Mahout
What's new in Apache Mahout
 
Recommendation Techn
Recommendation TechnRecommendation Techn
Recommendation Techn
 
Building multi-modal recommendation engines using search engines
Building multi-modal recommendation engines using search enginesBuilding multi-modal recommendation engines using search engines
Building multi-modal recommendation engines using search engines
 
Using Mahout and a Search Engine for Recommendation
Using Mahout and a Search Engine for RecommendationUsing Mahout and a Search Engine for Recommendation
Using Mahout and a Search Engine for Recommendation
 
Buzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learningBuzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learning
 
Polyvalent recommendations
Polyvalent recommendationsPolyvalent recommendations
Polyvalent recommendations
 
Mathematical bridges From Old to New
Mathematical bridges From Old to NewMathematical bridges From Old to New
Mathematical bridges From Old to New
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
 

Similaire à Finding Changes in Real Data

Big Data LDN 2017: Machine Learning: What Works And What They Won’t Tell You
Big Data LDN 2017: Machine Learning: What Works And What They Won’t Tell YouBig Data LDN 2017: Machine Learning: What Works And What They Won’t Tell You
Big Data LDN 2017: Machine Learning: What Works And What They Won’t Tell YouMatt Stubbs
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationMapR Technologies
 
Deep Learning vs. Cheap Learning
Deep Learning vs. Cheap LearningDeep Learning vs. Cheap Learning
Deep Learning vs. Cheap LearningMapR Technologies
 
Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15
Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15
Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15MLconf
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with HadoopDataWorks Summit
 
The Hive Think Tank: Rendezvous Architecture Makes Machine Learning Logistics...
The Hive Think Tank: Rendezvous Architecture Makes Machine Learning Logistics...The Hive Think Tank: Rendezvous Architecture Makes Machine Learning Logistics...
The Hive Think Tank: Rendezvous Architecture Makes Machine Learning Logistics...The Hive
 
How to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterHow to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterDataWorks Summit
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forTed Dunning
 
Map r chicago_advanalytics_oct_meetup
Map r chicago_advanalytics_oct_meetupMap r chicago_advanalytics_oct_meetup
Map r chicago_advanalytics_oct_meetupAlan Iovine
 
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor DataState of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor DataMathieu Dumoulin
 
Ted Dunning, Chief Application Architect, MapR at MLconf SF
Ted Dunning, Chief Application Architect, MapR at MLconf SFTed Dunning, Chief Application Architect, MapR at MLconf SF
Ted Dunning, Chief Application Architect, MapR at MLconf SFMLconf
 
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...Carol McDonald
 
Fighting financial fraud at Danske Bank with artificial intelligence
Fighting financial fraud at Danske Bank with artificial intelligenceFighting financial fraud at Danske Bank with artificial intelligence
Fighting financial fraud at Danske Bank with artificial intelligenceRon Bodkin
 
Hadoop and R Go to the Movies
Hadoop and R Go to the MoviesHadoop and R Go to the Movies
Hadoop and R Go to the MoviesDataWorks Summit
 
Big Data LDN 2017: Real World Impact of a Global Data Fabric
Big Data LDN 2017: Real World Impact of a Global Data FabricBig Data LDN 2017: Real World Impact of a Global Data Fabric
Big Data LDN 2017: Real World Impact of a Global Data FabricMatt Stubbs
 
Predictive Maintenance Using Recurrent Neural Networks
Predictive Maintenance Using Recurrent Neural NetworksPredictive Maintenance Using Recurrent Neural Networks
Predictive Maintenance Using Recurrent Neural NetworksJustin Brandenburg
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMapR Technologies
 
Spark and MapR Streams: A Motivating Example
Spark and MapR Streams: A Motivating ExampleSpark and MapR Streams: A Motivating Example
Spark and MapR Streams: A Motivating ExampleIan Downard
 
Real-Time Robot Predictive Maintenance in Action
Real-Time Robot Predictive Maintenance in ActionReal-Time Robot Predictive Maintenance in Action
Real-Time Robot Predictive Maintenance in ActionDataWorks Summit
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsMapR Technologies
 

Similaire à Finding Changes in Real Data (20)

Big Data LDN 2017: Machine Learning: What Works And What They Won’t Tell You
Big Data LDN 2017: Machine Learning: What Works And What They Won’t Tell YouBig Data LDN 2017: Machine Learning: What Works And What They Won’t Tell You
Big Data LDN 2017: Machine Learning: What Works And What They Won’t Tell You
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
 
Deep Learning vs. Cheap Learning
Deep Learning vs. Cheap LearningDeep Learning vs. Cheap Learning
Deep Learning vs. Cheap Learning
 
Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15
Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15
Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with Hadoop
 
The Hive Think Tank: Rendezvous Architecture Makes Machine Learning Logistics...
The Hive Think Tank: Rendezvous Architecture Makes Machine Learning Logistics...The Hive Think Tank: Rendezvous Architecture Makes Machine Learning Logistics...
The Hive Think Tank: Rendezvous Architecture Makes Machine Learning Logistics...
 
How to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterHow to Determine which Algorithms Really Matter
How to Determine which Algorithms Really Matter
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look for
 
Map r chicago_advanalytics_oct_meetup
Map r chicago_advanalytics_oct_meetupMap r chicago_advanalytics_oct_meetup
Map r chicago_advanalytics_oct_meetup
 
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor DataState of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
 
Ted Dunning, Chief Application Architect, MapR at MLconf SF
Ted Dunning, Chief Application Architect, MapR at MLconf SFTed Dunning, Chief Application Architect, MapR at MLconf SF
Ted Dunning, Chief Application Architect, MapR at MLconf SF
 
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
 
Fighting financial fraud at Danske Bank with artificial intelligence
Fighting financial fraud at Danske Bank with artificial intelligenceFighting financial fraud at Danske Bank with artificial intelligence
Fighting financial fraud at Danske Bank with artificial intelligence
 
Hadoop and R Go to the Movies
Hadoop and R Go to the MoviesHadoop and R Go to the Movies
Hadoop and R Go to the Movies
 
Big Data LDN 2017: Real World Impact of a Global Data Fabric
Big Data LDN 2017: Real World Impact of a Global Data FabricBig Data LDN 2017: Real World Impact of a Global Data Fabric
Big Data LDN 2017: Real World Impact of a Global Data Fabric
 
Predictive Maintenance Using Recurrent Neural Networks
Predictive Maintenance Using Recurrent Neural NetworksPredictive Maintenance Using Recurrent Neural Networks
Predictive Maintenance Using Recurrent Neural Networks
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
 
Spark and MapR Streams: A Motivating Example
Spark and MapR Streams: A Motivating ExampleSpark and MapR Streams: A Motivating Example
Spark and MapR Streams: A Motivating Example
 
Real-Time Robot Predictive Maintenance in Action
Real-Time Robot Predictive Maintenance in ActionReal-Time Robot Predictive Maintenance in Action
Real-Time Robot Predictive Maintenance in Action
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
 

Finding Changes in Real Data

  • 1. © 2017 MapR Technologies 1 Detecting Change
  • 2. © 2017 MapR Technologies 2 Contact Information Ted Dunning, PhD Chief Application Architect, MapR Technologies Board member, Apache Software Foundation O’Reilly author Email tdunning@mapr.com tdunning@apache.org Twitter @ted_dunning
  • 3. © 2017 MapR Technologies 3 Who We Are • MapR Technologies – We make a kick-ass platform for big data computing – Support many workloads including Hadoop / Spark / HPC / Other – Extended to allow streams and tables in basic platform – Free for academic research / training • Apache Software Foundation – Culture hub for building open source communities – Shared values around openness for contribution as well as use – Many major projects are part of Apache – Even more minor ones!
  • 4. © 2017 MapR Technologies 4 Basic Outline • Goal Setting • Basic Ideas – LLR (finding changes in counts) – Poisson rate change detection (finding changes in events timing) – Distribution estimation / visualization – Labeled events and adding labels • Free Improvisation on Themes
  • 5. © 2017 MapR Technologies 5 Why Is This Practically Important • The novice came to the master and said “something is broken”
  • 6. © 2017 MapR Technologies 6 Why Is This Practically Important • The novice came to the master and said “something is broken” • The master replied “What has changed?”
  • 7. © 2017 MapR Technologies 7 Why Is This Practically Important • The novice came to the master and said “something is broken” • The master replied “What has changed?” • And the student was enlightened
  • 8. © 2017 MapR Technologies 8 The Second Student • Another student said to the master, “I see something has changed … something may have broken”
  • 9. © 2017 MapR Technologies 9 The Second Student • Another student said to the master, “I see something has changed … something may have broken” • The master replied, “You have no question to ask. You have no need of enlightenment”
  • 10. © 2017 MapR Technologies 10 The Second Student • Another student said to the master, “I see something has changed … something may have broken” • The master replied, “You have no question to ask. You have no need of enlightenment” • And thus the student was enlightened
  • 11. © 2017 MapR Technologies 11 • There are some very powerful techniques available, some only very recently, that can make the detection of change much easier than you might think. I will describe the practical use of several of these techniques including t-digest, non-linear histograms, variable rate Poisson models and combinations of these.
  • 12. © 2017 MapR Technologies 12 Comparing Counts • Suppose we have two situations A and B, each with many observations, n_A and n_B • And some event x occurred n_1A and n_1B times in each situation. Contingency table (rows: situation; columns: x, other): row A: n_1A, n_A - n_1A; row B: n_1B, n_B - n_1B
  • 13. © 2017 MapR Technologies 13 Comparing Counts • Have we seen a change in the frequency of x? • Frequency ratios? – Breaks with small counts • χ² test? – Breaks with small counts
  • 14. © 2017 MapR Technologies 14 Log-Likelihood Ratio Test (Root LLR) • In R:
        entropy = function(k) { -sum(k*log((k==0)+(k/sum(k)))) }
        llr = function(k) { (entropy(rowSums(k))+entropy(colSums(k))-entropy(k))*2 }
    • Like mutual information * 2N
  • 15. © 2017 MapR Technologies 15 Spot the Anomaly • Root LLR is roughly like standard deviations. Four 2x2 tables (rows: B, not B; columns: A, not A) with their root LLR scores: [13, 1000; 1000, 100,000] gives 0.89; [1, 0; 0, 2] gives 1.95; [1, 0; 0, 10,000] gives 4.51; [10, 0; 0, 100,000] gives 14.29
  • 16. © 2017 MapR Technologies 16 How Does it Work Empirical fit to asymptotic distribution is very good
  • 17. © 2017 MapR Technologies 17 How Does it Work?
  • 18. © 2017 MapR Technologies 18 OK We can detect changes in counts
  • 19. © 2017 MapR Technologies 19 Real-life Example • Query: “Paco de Lucia” • Conventional meta-data search results: – “hombres de paco” times 400 – not much else • Recommendation based search: – Flamenco guitar and dancers – Spanish and classical guitar – Van Halen doing a classical/flamenco riff
  • 20. © 2017 MapR Technologies 20 Real-life Example
  • 21. © 2017 MapR Technologies 21 Example 2 - Common Point of Compromise • Scenario: – Merchant 0 is compromised, leaks account data during compromise – Fraud committed elsewhere during exploit – High background level of fraud – Limited detection rate for exploits • Goal: – Find merchant 0 • Meta-goal: – Screen algorithms for this task without leaking sensitive data
  • 22. © 2017 MapR Technologies 22 Example 2 - Common Point of Compromise [Diagram: data is skimmed at Merchant 0, and the skimmed data is exploited at Merchant n] Card data is stolen from Merchant 0. That data is used in frauds at other merchants.
  • 23. © 2017 MapR Technologies 23 Simulation Setup [Figure: count versus day (0 to 100), showing compromises during the compromise period and frauds during the exploit period]
  • 24. © 2017 MapR Technologies 24 Detection Strategy • Select histories that precede non-fraud • And histories that precede fraud detection • Analyze 2x2 cooccurrence of merchant n versus fraud detection
  • 25. © 2017 MapR Technologies 25
  • 26. © 2017 MapR Technologies 26 What about the real world?
  • 27. © 2017 MapR Technologies 27 [Figure: LLR score for real data; breach score (LLR) plotted against number of merchants on a log scale from 10^0 to 10^6, with the highest-scoring merchants labeled “Really truly bad guys”]
  • 28. © 2017 MapR Technologies 28 What about time?
  • 29. © 2017 MapR Technologies 29 Finding Changes in Timing • Suppose our input is events embedded in time • Suppose we want to find changes in our input in real-time • Waiting and counting is fine if we don’t have to react now • We can do much better
  • 30. © 2017 MapR Technologies 30 Poisson Event Rate Change • Detection of fallout – Time since last is very sensitive for complete failure • Detection of change relative to reference – Time since n-th most recent – LLR with time • Have to trade detection speed versus false positive rate and size of change • Can run multiple detectors at once
  • 31. © 2017 MapR Technologies 31 Basic idea: Time interval is better than counts
  • 32. © 2017 MapR Technologies 32 Sporadic Events: Finding Normal and Anomalous Patterns • Time between intervals is much more usable than absolute times • Counts don’t link as directly to probability models • Time interval is log ρ • This is a big deal
  • 33. © 2017 MapR Technologies 33 Event Stream (timing) • Events of various types arrive at irregular intervals – we can assume Poisson distribution • The key question is whether frequency has changed relative to expected values – This shows up as a change in interval • Want alert as soon as possible
  • 34. © 2017 MapR Technologies 34 Converting Event Times to Anomaly [Figure: event intervals with 99.9%-ile and 99.99%-ile thresholds marked]
  • 35. © 2017 MapR Technologies 35 In the real world, event rates often vary
  • 36. © 2017 MapR Technologies 36 Time Intervals Are Key to Modeling Sporadic Events [Figure: interval dt (min) versus t (days)]
  • 37. © 2017 MapR Technologies 37 Time Intervals Are Key to Modeling Sporadic Events [Figure: interval dt (min) versus t (days)]
  • 38. © 2017 MapR Technologies 38 Poisson Distribution • Time between events is exponentially distributed • This means that long delays are exponentially rare • If we know λ we can select a good threshold – or we can pick a threshold empirically. Equations: $\Delta t \sim \lambda e^{-\lambda t}$; $P(\Delta t > T) = e^{-\lambda T}$; $-\log P(\Delta t > T) = \lambda T$
  • 39. © 2017 MapR Technologies 39 After Rate Correction [Figure: dt/rate versus t (days), with 99.9%-ile and 99.99%-ile thresholds marked]
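  • 40. © 2017 MapR Technologies 40 Detecting Anomalies in Sporadic Events [Diagram: incoming event times t_i feed an n-th order difference Δn; a rate predictor, fed by rate history, supplies λ; the score δ = λ(t_i - t_{i-n}) goes to a t-digest, and an alarm is raised when δ exceeds the 99.97%-ile threshold]
  • 41. © 2017 MapR Technologies 41 Detecting Anomalies in Sporadic Events [Diagram: incoming event times t_i feed an n-th order difference Δn; a rate predictor, fed by rate history, supplies λ; the score δ = λ(t_i - t_{i-n}) goes to a t-digest, and an alarm is raised when δ exceeds the 99.97%-ile threshold]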
  • 42. © 2017 MapR Technologies 42 Seasonality Poses a Challenge [Figure: Christmas Traffic; hits/1000 versus date, Nov 17 to Dec 27]
  • 43. © 2017 MapR Technologies 43 Something more is needed … [Figure: Christmas Traffic; hits/1000 versus date, Nov 17 to Dec 27]
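  • 44. © 2017 MapR Technologies 44 We need a better rate predictor… [Diagram: the same sporadic-event anomaly detector as on the earlier slides, with incoming events, n-th order difference Δn, rate predictor and rate history, score δ = λ(t_i - t_{i-n}), t-digest, and alarm at the 99.97%-ile threshold]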
  • 45. © 2017 MapR Technologies 45 Idea: Predict log(rate) from lagged log(rate) • Predict log because – Peak to valley ratio – Traffic grew by 30 % – All rates are positive
  • 46. © 2017 MapR Technologies 46 Idea: Predict log(rate) from lagged log(rate) • Predict log because – Peak to valley ratio – Traffic grew by 30 % – All rates are positive – Just because I said so
  • 47. © 2017 MapR Technologies 47 Idea: Predict log(rate) from lagged log(rate) • Predict log because – Peak to valley ratio – Traffic grew by 30 % – All rates are positive – Just because I said so • Let model see many lagged values • Use L1 regularized linear model to pick important historical values – We would have moved to something fancier if this hadn’t worked
  • 48. © 2017 MapR Technologies 48 A New Rate Predictor for Sporadic Events
  • 49. © 2017 MapR Technologies 49 Improved Prediction with Adaptive Modeling [Figure: Christmas Prediction; hits (x1000) versus date, Dec 17 to Dec 29]
  • 50. © 2017 MapR Technologies 50 Some days the magic works Some days ... We use slightly different magic
  • 51. © 2017 MapR Technologies 51 Detecting More Subtle Changes • Time-since-last finds complete failures well • Nth order time finds more subtle rate changes • But that subtlety delays detection of complete failure – First order delay has 99.9% confidence at 6.5 units – 10th order delay has 99.9% confidence at 12.5 units • But 10th order delay can find speedups, first order cannot
  • 52. © 2017 MapR Technologies 57 10th order difference of Poisson distribution
  • 53. © 2017 MapR Technologies 58 Finding Changes in Time Series • So far, we only have times • What about when we have times and measurements together? – These are called time-series! • First step can be to discretize the measurement – Quintiles or deciles are good candidates – Multi-scale discretization is a fine thing to do • That gives us arrival times for measurements in each bin – And this is susceptible to the rate model on previous slides
  • 54. © 2017 MapR Technologies 59 Finding Changes in Time Series • Comprehensive approaches also possible (for counts) • Time aware variant of G-test is possible. Reference: Ted Dunning. Accurate methods for the statistics of surprise and coincidence. Comput. Linguist. 19, 1 (March 1993) http://bit.ly/surprise-and-coincidence
  • 55. © 2017 MapR Technologies 60 Propagation Anomalies • What happens when something shadows part of the coverage field for mobile telecom? – Can happen in urban areas with a construction crane • Can solve heuristically – Subtract from reference image composed by long term averages – Doesn’t deal well with weak signal regions and low S/N • Can solve probabilistically – Compute anomaly for each measurement, use mean of log(p)
  • 56. © 2017 MapR Technologies 61
  • 57. © 2017 MapR Technologies 62
  • 58. © 2017 MapR Technologies 63 Variable Signal/Noise Makes Heuristic Tricky Far from the transmitter, received signal is dominated by noise. This makes subtraction of average value a bad algorithm.
  • 59. © 2017 MapR Technologies 64 Other Issues • Finding changes in coverage area is similarly tricky • Coverage area is roughly where tower signal strength is higher than neighbors • Except for fuzziness due to hand-off delays • Except for bias due to large-scale caller motions – Rush hour – Event mobs
  • 60. © 2017 MapR Technologies 65 Simple Answer for Propagation Anomalies • Cluster signal strength reports • Cluster locations using k-means, large k • Model report rate anomaly using discrete event models • Model signal strength anomaly using percentile model • Trade larger k against higher report rates, faster detection • Overall anomaly is sum of individual log(p) anomalies
  • 61. © 2017 MapR Technologies 66 Tower Coverage Areas
  • 62. © 2017 MapR Technologies 67 Just One Tower
  • 63. © 2017 MapR Technologies 68 Cluster Reports for That Tower
  • 64. © 2017 MapR Technologies 69 Cluster Reports for That Tower [Figure: reports for the tower grouped into clusters numbered 1 to 9] Can also sub-divide each cluster into signal strength ranges. Multiple scales of clustering can also be used to trade off geographic versus temporal resolution.
  • 65. © 2017 MapR Technologies 70 Example [Figure: three panels of dt values, one per cluster] Each cluster gives us a sequence of events. Individual anomaly scores can be scaled and added to get a composite anomaly score. Optimality of combined signal derives from optimality of components.
  • 66. © 2017 MapR Technologies 71 Characterizing Distributions • What about sequences of values from arbitrary distributions – Can we find changes in the distribution? – For instance, what about latencies? • Non-linear histogram - FloatHistogram • Fully Adaptive histogram – t-digest
  • 67. © 2017 MapR Technologies 72 FloatHistogram • Assume all measurements are in the range • Divide this range into power of 2 sub-ranges • Sub-divide each sub-range evenly with steps • Relative error is bounded in measurement space
  • 68. © 2017 MapR Technologies 73 FloatHistogram • Assume all measurements are in the range • Divide this range into power of 2 sub-ranges • Sub-divide each sub-range evenly with steps • Relative error is bounded in measurement space • Bin index can be computed using FP representation!
  • 69. © 2017 MapR Technologies 74 T-digest • Or we can talk about small errors in q • Accumulate samples, sort, merge • Merge if k-size < 1
  • 70. © 2017 MapR Technologies 75 T-digest • Or we can talk about small errors in q • Accumulate samples, sort, merge • Merge if k-size < 1 [Figure: k versus q, for q from 0 to 1]
  • 71. © 2017 MapR Technologies 76 T-digest • Or we can talk about small errors in q • Accumulate samples, sort, merge • Merge if k-size < 1 • Interpolate using centroids in x • Very good near extremes, no dynamic allocation [Figure: k versus q, for q from 0 to 1]
  • 72. © 2017 MapR Technologies 77 Finding Change with Histograms • With fixed bins, we can simply count and compare counts for different bins • Thus, histogram change reduces to count change • Or to changes in event times
  • 73. © 2017 MapR Technologies 78 Visualizing Histograms • We want to detect small changes – Consider log-scale for Y • Non-linear bin spacing is really good for increasing counts – Reweight by bin-width – Changing x axis changes y axis
  • 74. © 2017 MapR Technologies 79 Good Results
  • 75. © 2017 MapR Technologies 80 Bad Results
  • 76. © 2017 MapR Technologies 81 Bad Results
  • 77. © 2017 MapR Technologies 82 With Better Scaling
  • 78. © 2017 MapR Technologies 83 Bad Results
  • 79. © 2017 MapR Technologies 84
  • 80. © 2017 MapR Technologies 85 With FloatHistogram
  • 81. © 2017 MapR Technologies 86 Summary • Counts – LLR • Events – Poisson + nth-order diffs • Decimate in space • Decimate in measurement space – t-digest, FloatHistogram • Don’t forget visualization [Figures repeated from earlier slides: the sporadic-event anomaly detector diagram and the t-digest plot of k versus q]
  • 82. © 2017 MapR Technologies 87 Q & A
  • 83. © 2017 MapR Technologies 88 Contact Information Ted Dunning, PhD Chief Application Architect, MapR Technologies Board member, Apache Software Foundation O’Reilly author Email tdunning@mapr.com tdunning@apache.org Twitter @ted_dunning
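
The root LLR score from slides 14 and 15 can be sketched in Python as a direct port of the R snippet on slide 14. This is only an illustrative translation: the function names `entropy`, `llr` and `root_llr` are chosen here, zero cells are skipped explicitly rather than handled with the `(k==0)` trick, and the root is left unsigned.

```python
import math


def entropy(counts):
    """Unnormalized entropy: -sum(k * log(k / total)), skipping zero cells."""
    total = sum(counts)
    return -sum(k * math.log(k / total) for k in counts if k > 0)


def llr(table):
    """Log-likelihood ratio (G statistic) for a table given as a list of rows."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    cells = [k for row in table for k in row]
    return 2.0 * (entropy(row_sums) + entropy(col_sums) - entropy(cells))


def root_llr(table):
    """Square root of LLR; reads roughly like a number of standard deviations."""
    # max() guards against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, llr(table)))


if __name__ == "__main__":
    # The four 2x2 tables from slide 15.
    tables = [
        [[13, 1000], [1000, 100000]],
        [[1, 0], [0, 2]],
        [[1, 0], [0, 10000]],
        [[10, 0], [0, 100000]],
    ]
    for t in tables:
        print(t, round(root_llr(t), 2))
```

Run on the four tables from slide 15, the scores should come out close to the values listed there; the slide appears to truncate rather than round its figures.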

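The FloatHistogram slides say that the range is divided into power-of-2 sub-ranges, each split evenly into steps, and that the bin index can be computed from the floating-point representation. A minimal sketch of that idea follows. It is not the t-digest library's implementation: the names `bin_index` and `lower_edge` and the parameters `min_value` and `steps` are hypothetical, and `math.frexp` stands in for direct access to the exponent and mantissa bits.

```python
import math


def bin_index(x, min_value=1.0, steps=4):
    """Bin index for x >= min_value, with `steps` bins per power-of-2 sub-range."""
    if x < min_value:
        raise ValueError("x is below the assumed lower bound")
    # x / min_value == mantissa * 2**exponent, with mantissa in [0.5, 1).
    mantissa, exponent = math.frexp(x / min_value)
    octave = exponent - 1                         # which power-of-2 sub-range
    within = int((2.0 * mantissa - 1.0) * steps)  # even step inside that sub-range
    return octave * steps + within


def lower_edge(index, min_value=1.0, steps=4):
    """Lower boundary of the bin with the given index."""
    octave, within = divmod(index, steps)
    return min_value * (2.0 ** octave) * (1.0 + within / steps)


if __name__ == "__main__":
    for x in (1.0, 1.5, 2.0, 3.0, 1000.0):
        i = bin_index(x)
        print(x, i, lower_edge(i), lower_edge(i + 1))
```

Because every bin spans at most 1/steps of its lower edge, the relative error stays bounded in measurement space, which is the property the slide names.
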
Editor's notes

  1. Talk track: This is what it looks like to have events, such as those on a website, that come in at randomized times (people come when they want to) while the underlying average rate is constant; in other words, a fairly steady stream of traffic. This looks a lot like the first signal we talked about: a randomized but even signal. We can use t-digest on it to set thresholds, and everything works just grand. (Like the clicks of a Geiger counter measuring radioactivity.)
  2. Talk track: (Describe figure) Horizontal axis is days, with noon in the middle of each day. The faint shadow shows the underlying rate of events. The vertical axis is the time interval between events. Notice that when the rate of events is high, the time interval between events is small, but when the rate of events slows down, the time between events is much larger. Ellen: For this reason, we cannot set a simple threshold: if we set it low enough for daytime, we get an alert every night even though we expect a longer interval then. If we set it too high, we miss the real problems when traffic really is abnormally delayed or stopped altogether. What can you do to solve this? Ted: We build a model and multiply the modelled rate by the interval; that gives a number we can threshold accurately.
  3. Talk track: (Describe figure) Horizontal axis is days, with noon in the middle of each day. The faint shadow shows the underlying rate of events. The vertical axis is the time interval between events. Notice that when the rate of events is high, the time interval between events is small, but when the rate of events slows down, the time between events is much larger. Ellen: For this reason, we cannot set a simple threshold: if we set it low enough for daytime, we get an alert every night even though we expect a longer interval then. If we set it too high, we miss the real problems when traffic really is abnormally delayed or stopped altogether. What can you do to solve this? Ted: We build a model and multiply the modelled rate by the interval; that gives a number we can threshold accurately.
  4. Talk track: This slide is here for reference when you download the slides
  5. Ted: this was figure 5-2 in the book
  6. Talk track: You need a rate predictor Ellen: sometimes simple is good enough
  7. Ted: This was figure 5.4
  8. Ted: This was figure 5.4
  9. Ted: this was figure 5-2 in the book
  10. We can look at yesterday and day before but need to look at the shape from previous days … but look at today for whether traffic is scaling
  11. Ted: This was figure 5.4
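
The rate correction described in notes 2 and 3 (multiply the modelled rate by the interval, then threshold) can be sketched as below. This is a simplified illustration under stated assumptions: the rate is taken as given rather than predicted, only the time since the last event is scored, and the threshold comes from the exponential formula on the Poisson Distribution slide instead of an empirical t-digest quantile. The function names are hypothetical.

```python
import math


def interval_score(dt, rate):
    """Rate-corrected interval: rate * dt, which equals -log P(interval > dt)."""
    return rate * dt


def threshold_for_quantile(q):
    """Score exceeded with probability 1 - q under the exponential model."""
    return -math.log(1.0 - q)


def is_anomalous(dt, rate, q=0.999):
    """True if a gap of dt is longer than expected at the given rate."""
    return interval_score(dt, rate) > threshold_for_quantile(q)


if __name__ == "__main__":
    # Same 8-minute gap, judged against a busy daytime rate and a quiet night rate.
    print(is_anomalous(8.0, rate=1.0))   # daytime: one event per minute expected
    print(is_anomalous(8.0, rate=0.1))   # night: one event per ten minutes expected
```

The same gap scores very differently once the expected rate is taken into account, which is why a single fixed threshold on the raw interval either alarms every night or misses real daytime outages.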