In this tutorial, we present an in-depth overview of the streaming analytics landscape: applications, algorithms, and platforms. We walk through how the field has evolved over the last decade and then discuss current challenges, namely the impact of the other three Vs (Volume, Variety, and Veracity) on Big Data streaming analytics.
6. Explosive Content Creation
Challenge: Surfacing Relevant Content
- Large variety of media: blogs, reviews, news articles, streaming content
- > 500M Tweets every day
- > 300 hrs of video uploaded every minute
- > 1.8B photos uploaded online in 2014 [1]
[1] http://www.kpcb.com/blog/2014-internet-trends
7. High Volume Content Consumption
- WhatsApp messages per day: >30B [1]
- Pandora listener hours (Q2 2015): 5.3B [3]
- Skype calls per month: 4.76B
- E-mails per second: >2.2M
- Google searches per year: >1T [2]
- Netflix hours streamed per month: >1B
[1] https://www.facebook.com/jan.koum/posts/10152994719980011?pnref=story
[2] http://searchengineland.com/google-1-trillion-searches-per-year-212940
[3] http://press.pandora.com/phoenix.zhtml?c=251764&p=irol-newsArticle&ID=2070623
8. A New World: Mobile, Mobile, Mobile
Anywhere, Anytime, Any Device: availability, performance, reliability
- 5.4B mobile phone users [1]
- 2.1B smartphone subscriptions in 2014 [1]
- 69% Y/Y growth in data traffic
- 55% of traffic is mobile video
- 34% of global e-commerce [2]
[1] http://www.kpcb.com/blog/2015-internet-trends
[2] http://www.criteo.com/media/1894/criteo-state-of-mobile-commerce-q1-2015-ppt.pdf
9. Market Pulse: Finance/Investing
- 1 minute of bids and offers, March 8, 2011 [1]
- Mobile trading on the rise [2]
- NSE: 48% increase in turnover, Jan'14 -> Dec'14
- BSE: 0.25% (Jan'14) -> 0.5% (Nov'14) of total volume
[1] Image borrowed from http://www.bloomberg.com/bw/articles/2013-06-06/how-the-robots-lost-high-frequency-tradings-rise-and-fall
[2] http://articles.economictimes.indiatimes.com/2014-12-26/news/57420480_1_ravi-varanasi-mobile-platform-nse
10. Entertainment: MMOs
Game of War: largest single-world concurrent mobile game in the world
- "Real-time Many-to-Many is Tomorrow's Internet" - Francois Orsini
- Global scale
- Collaborative: make alliances
- Real-time messaging
- Chat translation in multiple languages
11. On the Rise: Cybersecurity
2014: Michaels (Jan'14), PF Changs (June'14), New York (July'14), UPS (Aug'14), Home Depot (Sept'14), JP Morgan (Oct'14), Sony (Nov'14), Staples (Dec'14)
2015: OPM, Anthem, UCLA
Estimated global cost of cybercrime: $400B [1]
[1] http://www.mcafee.com/us/resources/reports/rp-economic-impact-cybercrime2.pdf
12. Hardware Innovations: Supporting Higher Volume and Speed
- Massively parallel: Intel's "Knights Landing" Xeon Phi, 72 cores [1]
- High speed, low power: Intel and Micron's 3D XPoint technology, 1000x faster than NAND [2]
"... quickly identify fraud detection patterns in financial transactions; healthcare researchers could process and analyze larger data sets in real time, accelerating complex tasks such as genetic analysis and disease tracking." [3]
[1] http://www.anandtech.com/show/9436/quick-note-intel-knights-landing-xeon-phi-omnipath-100-isc-2015
[2] Intel IDS'15
[3] http://newsroom.intel.com/community/intel_newsroom/blog/2015/07/28/intel-and-micron-produce-breakthrough-memory-technology
13. Hardware Innovations: Hardware Support for Apps
Image and touch processing support in Intel's Skylake [1]
[1] Images borrowed from Julius Madelblat's and Andy Vargas, Rajeev Nalawadi and Shane Abreu's Technology Insight talks at IDF'15.
15. Real-Time Video Streams: User Experience, Productivity
- Drones (news, delivery/monitoring): $1.7B for 2015 [1]
- Robotics [2] (industry): $40B by 2020 [3]
[1] http://www.kpcb.com/blog/2015-internet-trends
[2] http://www.bostondynamics.com/robot_Atlas.html
[3] http://www.marketsandmarkets.com/Market-Reports/Industrial-Robotics-Market-643.html
16. Internet of Things: Large Market Potential [1]
- $1.9T in value by 2020: Mfg (15%), Health Care (15%), Insurance (11%)
- 26B - 75B units [2, 3, 4, 5]
- Improve operational efficiencies, customer experience, new business models
- Beacons: retailers and bank branches, a 60M-unit market by 2019 [6]
- Smart buildings: reduce energy costs, cut maintenance costs, increase safety & security
[1] Background image taken from https://www.uspsoig.gov/sites/default/files/document-library-files/2015/rarc-wp-15-013.pdf
[2] http://www.gartner.com/newsroom/id/2636073
[3] https://www.abiresearch.com/press/more-than-30-billion-devices-will-wirelessly-conne
[4] http://newsroom.cisco.com/feature-content?type=webcontent&articleId=1208342
[5] http://www.businessinsider.com/75-billion-devices-will-be-connected-to-the-internet-by-2020-2013-10
[6] https://www.abiresearch.com/press/ibeaconble-beacon-shipments-to-break-60-million-by/
17. The Future: Mobile Sensor Network
- Exponential growth [1]
- Biostamps [2]
[1] http://opensignal.com/assets/pdf/reports/2015_08_fragmentation_report.pdf
[2] http://www.ericsson.com/thinkingahead/networked_society/stories/#/film/mc10-biostamp
18. Intelligent Health Care: Continuous Monitoring
- Tracking movements: measure the effect of social influences
- Google lens: measure glucose level in tears
- Watch/wristband
- Smart textiles: skin temperature, perspiration
- Ingestible sensors: medication compliance, heart function [1]
[1] http://www.hhnmag.com/Magazine/2015/Apr/cover-medical-technology
19. Connected World: Internet of Things
- 30B connected devices by 2020
- Health care: 153 exabytes (2013) -> 2,314 exabytes (2020)
- Machine data: 40% of the digital universe by 2020
- Connected vehicles: data transferred per vehicle per month, 4 MB -> 5 GB
- Digital assistants (predictive analytics): $2B (2012) -> $6.5B (2019) [1]; Siri/Cortana/Google Now
- Augmented/virtual reality: $150B by 2020 [2]; Oculus/HoloLens/Magic Leap
[1] http://www.siemens.com/innovation/en/home/pictures-of-the-future/digitalization-and-software/digital-assistants-trends.html
[2] http://techcrunch.com/2015/04/06/augmented-and-virtual-reality-to-hit-150-billion-by-2020/#.7q0heh:oABw
21. What is Analytics?
According to Wikipedia:
- Discovery: the ability to identify patterns in data
- Communication: provide insights in a meaningful way
22. Types of Analytics
- Cube analytics: business intelligence
- Predictive analytics: statistics and machine learning
23. What is Real-Time Analytics? It's contextual
- Batch: high throughput, latency > 1 hour; e.g., monthly active users, relevance for ads, ad hoc queries
- Near real time: low latency, approximate, > 1 sec; e.g., ad impression counts, hash tag trends
- Online non-transactional: latency sensitive, < 500 ms, deterministic workflows; e.g., fanout of Tweets, search for Tweets
- Online transactional: < 1 ms; e.g., financial trading
28. Key Characteristics: It's Different
- Fault tolerance [1]
- Availability
- Scale out
- High performance
- Robust to incomplete data
[1] Byzantine failures are described in: Driscoll, Kevin; Hall, Brendan; Sivencrona, Håkan; Zumsteg, Phil (2003). "Byzantine Fault Tolerance, from Theory to Reality". LNCS 2788, pp. 235-248.
33. Sampling
- Obtain a representative sample from a data stream; maintain a dynamic sample
- A data stream is a continuous process: it is not known in advance how many points may elapse before an analyst needs to use a representative sample
- Reservoir sampling [1]: probabilistic insertions and deletions on arrival of new stream points; the probability of inserting successive new points reduces with the progression of the stream
- An unbiased sample contains a larger and larger fraction of points from the distant history of the stream
- Practical perspective: the data stream may evolve, and hence the majority of the points in the sample may represent stale history
[1] J. S. Vitter. Random Sampling with a Reservoir. ACM Transactions on Mathematical Software, 11(1):37-57, March 1985.
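The insertion rule sketched above (fill the reservoir first, then insert with decreasing probability, evicting a random resident) is Vitter's Algorithm R. A minimal sketch, with illustrative names:

```python
import random

def reservoir_sample(stream, k):
    """One-pass uniform sample of k items from a stream of unknown length
    (Vitter's Algorithm R)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = random.randint(0, i)    # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item     # evict a uniformly chosen resident
    return reservoir
```

Each item ends up in the final sample with probability exactly k/N, which is why the fraction of points from the distant history keeps growing as the stream progresses.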
34. Sampling: Sliding-Window Approaches [1] (sample size k, window width n)
- Sequence-based: replace each expired element with the newly arrived element; disadvantage: highly periodic
- Chain-sample approach: select the ith element with probability 1/min(i, n); when selected, choose uniformly at random an index from [i+1, i+n] for the element that will replace the ith item; maintain k independent chain samples
- Timestamp-based: the number of elements in a moving window may vary over time; priority-sample approach
[1] B. Babcock et al. Sampling From a Moving Window Over Streaming Data. In Proceedings of SODA, 2002.
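A sketch of the chain-sample idea for one sample over a window of the last n items (run k independent instances for sample size k); class and method names are illustrative:

```python
import random
from collections import deque

class ChainSample:
    """Single chain sample over a sliding window of width n
    (after Babcock et al.); keep k independent copies for sample size k."""
    def __init__(self, n):
        self.n = n
        self.i = 0            # arrival index of the latest item
        self.chain = deque()  # (index, value) of current sample and successors
        self.next_idx = None  # pre-chosen index of the next chain link

    def add(self, value):
        self.i += 1
        if random.random() < 1.0 / min(self.i, self.n):
            # this item becomes the sample; pre-choose its replacement index
            self.chain = deque([(self.i, value)])
            self.next_idx = random.randint(self.i + 1, self.i + self.n)
        elif self.next_idx == self.i:
            # the pre-chosen successor arrives: chain it, pick its successor
            self.chain.append((self.i, value))
            self.next_idx = random.randint(self.i + 1, self.i + self.n)
        # drop the head once it slides out of the window
        while self.chain and self.chain[0][0] <= self.i - self.n:
            self.chain.popleft()

    def sample(self):
        return self.chain[0][1] if self.chain else None
```

Because each replacement index is chosen before the replacement arrives, the chain guarantees a valid in-window sample is always available when the current one expires, without storing the whole window.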
35. Sampling: Biased Reservoir Sampling [1]
- Use a temporal bias function: recent points have a higher probability of being represented in the sample reservoir
- Memory-less bias functions: the future probability of retaining a current point in the reservoir is independent of its past history or arrival time
- The probability of the rth point belonging to the reservoir at time t is proportional to the bias function; with an exponential bias function, f(r, t) = e^(-λ(t-r)) for the rth data point at time t, where r ≤ t and λ ∈ [0, 1] is the bias rate
- The maximum reservoir requirement R(t) is bounded
[1] C. C. Aggarwal. On Biased Reservoir Sampling in the Presence of Stream Evolution. In Proceedings of VLDB, 2006.
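One plausible one-pass realization of the memory-less exponential bias above: always insert the arriving point, and replace a random resident with probability equal to the current fill fraction. This sketch assumes a reservoir capacity of roughly 1/λ; names are illustrative:

```python
import random

def biased_reservoir(stream, capacity):
    """Sketch of memory-less biased reservoir sampling with exponential
    bias rate lambda ~ 1/capacity: every arriving point is inserted; it
    replaces a random resident with probability equal to the fill fraction,
    so older points decay exponentially."""
    reservoir = []
    for item in stream:
        fill = len(reservoir) / capacity
        if random.random() < fill:
            reservoir[random.randrange(len(reservoir))] = item  # replace
        else:
            reservoir.append(item)                              # grow
    return reservoir
```

Once the reservoir is full, every arrival evicts a uniformly chosen resident, so a point's survival probability decays geometrically with its age, realizing the exponential bias without remembering arrival times.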
36. Sampling: General Problem
- Input: tuples of n components; a subset are key components, the basis for sampling
- To sample a fraction a/b: hash the key to b buckets; accept a tuple if its hash value < a
- Under a space constraint: set a <- a - 1 and remove tuples whose keys hash to bucket a
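A hedged sketch of the hash-based a/b sampling rule above; the hash choice (MD5) and names are illustrative:

```python
import hashlib

def keep(key, a, b):
    """Hash-based a/b sampling: accept a tuple iff its key hashes into
    the first a of b buckets. The decision is a pure function of the key,
    so all tuples sharing a key are kept or dropped together."""
    bucket = int(hashlib.md5(str(key).encode()).hexdigest(), 16) % b
    return bucket < a
```

Under space pressure, decrement a and evict already-stored tuples whose keys hash to the removed bucket; the surviving set is still a consistent (a-1)/b sample.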
37. Set Membership: Filtering
Determine, with some false positive probability, whether an item in a data stream has been seen before.
Used in databases (e.g., to speed up semi-join operations), caches, routers, and storage systems:
- Reduce the space requirement of probabilistic routing tables
- Speed up longest-prefix matching of IP addresses
- Encode multicast forwarding information in packets
- Summarize content to aid collaboration in overlay and peer-to-peer networks
- Improve network state management and monitoring
38. Set Membership: Filtering
- Application to hyphenation programs
- Early UNIX spell checkers
[1] Illustration borrowed from http://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf
39. Set Membership: Filtering (Bloom Filters)
- A natural generalization of hashing
- False positives are possible; no false negatives; no deletions allowed
- For false positive rate ε, # hash functions k = log2(1/ε)
- False positive rate ε ≈ (1 - e^(-kn/m))^k, where n = # elements, k = # hash functions, m = # bits in the array
40. Set Membership: Filtering (Bloom Filters)
Minimizing the false positive rate ε w.r.t. k [1]:
- k = ln 2 * (m/n)
- ε = (1/2)^k ≈ (0.6185)^(m/n)
- 1.44 * log2(1/ε) bits per item, independent of item size or # items
- Information-theoretic minimum: log2(1/ε) bits per item, i.e., 44% overhead
- X = # of 0 bits, where E[X] = m(1 - 1/m)^(kn) ≈ m·e^(-kn/m)
[1] A. Broder and M. Mitzenmacher. Network Applications of Bloom Filters: A Survey. Internet Mathematics, 1(4), 2005.
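Putting the properties on the last two slides together, a textbook Bloom filter sketch; the hash construction and names are illustrative:

```python
import hashlib

class BloomFilter:
    """Textbook Bloom filter: k hash positions over an m-bit array.
    False positives possible, no false negatives, no deletions."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)   # one byte per bit, for clarity

    def _positions(self, item):
        # derive k positions from k salted hashes of the item
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))
```

For example, at m/n = 10 bits per item the optimal choice is k = ln 2 * 10 ≈ 7 hash functions, giving ε ≈ 0.6185^10 ≈ 0.8%.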
41. Set Membership: Filtering (Derivatives)
- Counting Bloom filters: support deletion; each bit becomes a small counter (typically 4 bits per counter suffice); increment/decrement
- Blocked Bloom filters
- d-left counting Bloom filters
- Quotient filters
- Rank-indexed hashing
42. Set Membership: Filtering (Cuckoo Filter [1])
Key highlights:
- Add and remove items dynamically
- For false positive rate ε < 3%, more space efficient than a Bloom filter
- Higher performance than a Bloom filter for many real workloads
- Asymptotically worse than a Bloom filter: minimum fingerprint size ∝ log(# entries in table)
Overview:
- Stores only a fingerprint of each inserted item; the original key and value bits are not retrievable
- Set membership query for item x: search the hash table for the fingerprint of x
[1] Fan et al., Cuckoo Filter: Practically Better Than Bloom. In Proceedings of the 10th ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2014.
43. Set Membership: Filtering
Cuckoo hashing [1]:
- High space occupancy
- Practical implementations: multiple items per bucket
- Example uses: software-based Ethernet switches
Cuckoo filter [2]:
- Uses a multi-way associative cuckoo hash table
- Employs partial-key cuckoo hashing: relocate existing fingerprints to their alternative locations
[1] R. Pagh and F. Rodler. Cuckoo hashing. Journal of Algorithms, 51(2):122-144, 2004.
[2] Illustration borrowed from Fan et al., Cuckoo Filter: Practically Better Than Bloom. In Proceedings of the 10th ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2014.
44. Set Membership: Filtering (Cuckoo Filter)
Partial-key cuckoo hashing:
- Fingerprint hashing ensures a uniform distribution of items in the table
- Length of the fingerprint << size of h1 or h2
- Possible to have multiple entries of a fingerprint in a bucket
- Deletion: the item must have been previously inserted
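The partial-key trick hinges on deriving an item's alternate bucket from its stored fingerprint alone, via i2 = i1 XOR hash(fingerprint). A minimal sketch, assuming a power-of-two table size; the constants and names are illustrative:

```python
import hashlib

NUM_BUCKETS = 1 << 16   # must be a power of two for the XOR trick
FP_BITS = 8             # short fingerprint, much smaller than a full hash

def _h(s):
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

def fingerprint(item):
    # reserve 0 to mean "empty slot"
    return _h(f"fp:{item}") % (1 << FP_BITS) or 1

def buckets(item):
    """Partial-key cuckoo hashing: the two candidate buckets of an item."""
    fp = fingerprint(item)
    i1 = _h(f"bucket:{item}") % NUM_BUCKETS
    i2 = (i1 ^ _h(f"alt:{fp}")) % NUM_BUCKETS
    return fp, i1, i2

def alt_bucket(fp, i):
    """Given either bucket index and the fingerprint, recover the other:
    XOR is an involution, so no original key is needed to relocate."""
    return (i ^ _h(f"alt:{fp}")) % NUM_BUCKETS
```

This is exactly what makes cuckoo-style relocation possible after the key has been discarded: when a bucket overflows, a resident fingerprint can be kicked to `alt_bucket(fp, i)` using only the bits kept in the table.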
45. Estimating Cardinality: # Distinct Elements
A large set of real-world applications:
- Database systems/search engines: # distinct queries
- Network monitoring applications
- Natural language processing
- # distinct motifs in a DNA sequence
- # distinct elements in RFID/sensor networks
46. Estimating Cardinality: # Distinct Elements (N ≤ 10^9)
Historical context:
- Probabilistic counting [Flajolet and Martin, 1983]
- LogLog counting [Durand and Flajolet, 2003]
- HyperLogLog [Flajolet et al., 2007]
- Sliding HyperLogLog [Chabchoub and Hebrail, 2010]
- HyperLogLog in Practice [Heule et al., 2013]
- Self-Organizing Bitmap [Chen and Cao, 2009]
- Discrete Max-Count [Ting, 2014]: the sequence of sketches forms a Markov chain when h is a strongly universal hash; estimate cardinality using a martingale
47. Estimating Cardinality: HyperLogLog
- Apply hash function h to every element in a multiset
- The cardinality of the multiset is estimated as 2^max(ϱ), where 0^(ϱ-1)1 is the bit pattern observed at the beginning of a hash value
- The above suffers from high variance; employ stochastic averaging: partition the input stream into m sub-streams S_i using the first p bits of the hash values (m = 2^p)
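A minimal sketch of the two bullets above (stochastic averaging over m = 2^p registers, each keeping the maximum leading-zero rank seen in its sub-stream); the hash choice and names are illustrative, and the small- and large-range corrections of the full algorithm are omitted:

```python
import hashlib

def hll_estimate(items, p=14):
    """Minimal HyperLogLog sketch with m = 2**p registers."""
    m = 1 << p
    registers = [0] * m
    for item in items:
        # 64-bit hash of the item
        x = int(hashlib.sha256(str(item).encode()).hexdigest()[:16], 16)
        idx = x & (m - 1)                      # first p bits pick a register
        w = x >> p                             # remaining 64 - p bits
        rank = (64 - p) - w.bit_length() + 1   # position of leftmost 1-bit
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)           # bias-correction constant
    return alpha * m * m / sum(2.0 ** -r for r in registers)
```

The harmonic mean over the m registers is what tames the high variance of the single-register 2^max(ϱ) estimate, giving a standard error of about 1.04/sqrt(m).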
48. Estimating Cardinality: HyperLogLog in Practice - Optimizations
- Use of a 64-bit hash function: total memory requirement grows from 5·2^p to 6·2^p, where p is the precision
- Empirical bias correction: uses empirically determined data for cardinalities smaller than 5m, and the unmodified raw estimate otherwise
- Sparse representation: for n ≪ m, store an integer obtained by concatenating the bit patterns for idx and ϱ(w); use variable-length encoding (a variable number of bytes per integer) and difference encoding (store the difference between successive elements)
- Other optimizations [1, 2]
[1] http://druid.io/blog/2014/02/18/hyperloglog-optimizations-for-real-world-systems.html
[2] http://antirez.com/news/75
49. Estimating Cardinality: Self-Learning Bitmap (S-bitmap) [1]
- Achieves constant relative estimation error for unknown cardinalities over a wide range, say from 10s to >10^6
- The bitmap is obtained via an adaptive sampling process: bits corresponding to sampled items are set to 1
- Sampling rates are learned from the # distinct items already passed, and are reduced sequentially as more bits are set to 1
- For given input parameters Nmax and estimation precision ε, with r = 1 - 2ε²(1 + ε²)^(-1), the sampling probability is p_k = m(m + 1 - k)^(-1)(1 + ε²)r^k for k ∈ [1, m], where m is the size of the bit mask; the relative error ≈ ε
[1] Chen et al. Distinct counting with a self-learning bitmap. Journal of the American Statistical Association, 106(495):879-890, 2011.
50. Estimating Quantiles: Quantiles, Histograms, Icebergs
A large set of real-world applications: database applications, sensor networks, operations
Properties:
- Provide tunable and explicit guarantees on the precision of approximation
- Single pass
Early work:
- [Greenwald and Khanna, 2001]: worst-case space requirement
- [Arasu and Manku, 2004]: sliding-window model, worst-case space requirement
51. Estimating Quantiles: q-digest [1]
- Groups values into variable-size buckets of almost equal weights; unlike a traditional histogram, buckets can overlap
- Key features: detailed information about frequent values is preserved; less frequent values are lumped into larger buckets
- Using a message of size m, answers quantile queries within an error of O(log(σ)/m), where σ is the max signal value
- Buckets are nodes of a complete binary tree over the value domain; except for the root and leaf nodes, a node v ∈ q-digest iff count(v) ≤ n/k and count(v) + count(parent(v)) + count(sibling(v)) > n/k, where n = # elements and k = compression factor
[1] Shrivastava et al., Medians and Beyond: New Aggregation Techniques for Sensor Networks. In Proceedings of SenSys, 2004.
52. Estimating Quantiles: q-digest
- Building a q-digest
- q-digests can be constructed in a distributed fashion by merging q-digests
53. Frequent Elements: A Core Streaming Problem
Applications:
- Track bandwidth hogs
- Determine popular tourist destinations
- Itemset mining
- Entropy estimation
- Compressed sensing
- Search log mining
- Network data analysis
- DBMS optimization
54. Frequent Elements: Count-Min Sketch [1]
- A two-dimensional array of counts ("sketch") with w columns and d rows; each entry is initially zero
- d hash functions h_1, ..., h_d : {1...n} -> {1...w} are chosen uniformly at random from a pairwise independent family
- Update: for a new element i, for each row j and k = h_j(i), increment the kth column by one
- Point query: â_i = min_j sketch[j, h_j(i)]
- Parameters (ε, δ): w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉
[1] Cormode, Graham; S. Muthukrishnan (2005). "An Improved Data Stream Summary: The Count-Min Sketch and its Applications". J. Algorithms 55: 29-38.
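A hedged sketch of the update and point-query rules, with w and d derived from (ε, δ) as above; the hash construction and names are illustrative:

```python
import hashlib
import math

class CountMinSketch:
    """Count-min sketch: w = ceil(e/eps) columns, d = ceil(ln(1/delta)) rows.
    Point queries never underestimate, and overestimate by at most eps * N
    (N = total count) with probability >= 1 - delta."""
    def __init__(self, eps=0.001, delta=0.01):
        self.w = math.ceil(math.e / eps)
        self.d = math.ceil(math.log(1 / delta))
        self.table = [[0] * self.w for _ in range(self.d)]

    def _cols(self, item):
        # one salted hash per row stands in for the pairwise independent family
        for j in range(self.d):
            h = hashlib.sha256(f"{j}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.w

    def update(self, item, count=1):
        for j, k in enumerate(self._cols(item)):
            self.table[j][k] += count

    def query(self, item):
        return min(self.table[j][k] for j, k in enumerate(self._cols(item)))
```

Taking the minimum across the d rows is what bounds the overestimate: an item's count is inflated only by colliding items, and the chance that all d rows collide badly is at most δ.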
55. Frequent Elements: Variants of the Count-Min Sketch [1]
Count-Min sketch with conservative update (CU sketch):
- When updating an item with frequency c, avoid unnecessary updating of counter values => reduces over-estimation error
- Still prone to over-estimation error on low-frequency items
Lossy Conservative Update (LCU) - SWS:
- Divide the stream into windows
- At window boundaries, for all 1 ≤ i ≤ w and 1 ≤ j ≤ d, decrement sketch[i, j] if 0 < sketch[i, j] ≤ the window threshold
[1] Cormode, G. 2009. Encyclopedia entry on 'Count-Min Sketch'. In Encyclopedia of Database Systems. Springer, 511-516.
56. Anomaly Detection: Researched Over > 50 Years
A large set of real-world applications:
- Social media: trending analysis
- Fraud detection: insurance, e-commerce, marketing
- Network intrusion detection
- Health care
- Sensor networks: anomalous state detection (e.g., wind turbines)
- Operations: a metric space spanning system, application, and data center; anomalies potentially impact performance, availability, and reliability
57. Anomaly Detection: Anomaly is Contextual
Studied across many fields:
- Manufacturing
- Statistics
- Econometrics, financial engineering
- Signal processing
- Control systems, autonomous systems: fault detection [1]
- Networking
- Computational biology (e.g., microarray analysis)
- Computer vision
[1] A. S. Willsky, "A survey of design methods for failure detection systems," Automatica, vol. 12, pp. 601-611, 1976.
59. Anomaly Detection: Traditional Approaches
- Rule based: μ ± σ thresholds; manufacturing, statistical process control [1]
- Moving averages: SMA, EWMA, PEWMA
- Assumption: normal distribution, which mostly does not hold in real life
[1] W. A. Shewhart. Economic Quality Control of Manufactured Product. Bell System Technical Journal, 9(2):364-389, 1930.
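As an illustration of the moving-average family, a hedged sketch of an EWMA-based detector (parameter names illustrative); note that it inherits the normality assumption criticized above:

```python
def ewma_anomalies(points, alpha=0.3, k=3.0):
    """Flag points deviating more than k standard deviations from an
    exponentially weighted moving average; assumes roughly normal noise."""
    mean, var, flagged = points[0], 0.0, []
    for i, x in enumerate(points[1:], start=1):
        resid = x - mean
        if var > 0 and abs(resid) > k * var ** 0.5:
            flagged.append(i)
        # standard EWMA updates for the running mean and variance
        mean += alpha * resid
        var = (1 - alpha) * (var + alpha * resid * resid)
    return flagged
```

A single large spike also corrupts the running mean and variance for many subsequent points, which is exactly the robustness problem the next slide addresses with the median and MAD.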
60. Anomaly Detection: In Practice
- Robustness: μ and σ are not robust in the presence of anomalies; use the median and MAD (Median Absolute Deviation)
- Seasonality, trend, and multi-modal distributions: use time series decomposition
- AnomalyDetection R package [1]
[1] https://github.com/twitter/AnomalyDetection
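A minimal sketch of the robust median/MAD rule (the 1.4826 factor makes the MAD a consistent estimator of σ under normality); names are illustrative:

```python
import statistics

def mad_anomalies(points, k=3.0):
    """Robust rule: flag points whose distance from the median exceeds
    k scaled MADs."""
    med = statistics.median(points)
    mad = 1.4826 * statistics.median(abs(x - med) for x in points)
    return [i for i, x in enumerate(points)
            if mad > 0 and abs(x - med) > k * mad]
```

On [10, 12, 11, 10, 13, 11, 100, 12, 10] this flags only the spike at index 6, whereas a μ ± 3σ rule misses it: the spike inflates σ to about 28, leaving the spike's own z-score near 2.8.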
61. Anomaly Detection: Marrying Time Series Decomposition and Robust Statistics
- Trend smoothing distortion creates "phantom" anomalies
- The median is free from distortion
62. Anomaly Detection: Real-Time Challenges
- Adaptive learning
- Automated modeling
- Marrying theory with contextual relevance
- Operations: a large set of different services in a technology stack; different stacks use different services
- Promising products such as OpsClarity
63. Anomaly Detection: Anomalies in Operational Data - Challenges
Evolving needs of modern operations:
- Contextual application topology map, hierarchical: Datacenter -> Applications -> Services -> Hosts
- Automatically discover the developer/architect's view of the application for the operations team: a framework for system config and context
- Real-time, streaming architecture: keeps up with today's elastic infrastructure
- Scale to 1000s of hosts and 100s of (micro)services
- Present the evolution of system state over time: DVR-like replay of health, system changes, and failures
64. Anomaly Detection: Anomalies in Operational Data - Challenges
- Automatically learn baselines for metrics
- Data variety requires advanced statistical approaches
- Detect issues earlier; proactive alerting
- Example: detecting disk-full issues early
66. Requirements of Stream Processing: The Key Aspects
- In-stream: process data as it passes by
- Handle imperfections: delayed, missing, and out-of-order data
- Predictable performance: repeatable results
- Scalability
The 8 Requirements of Stream Processing, Mike Stonebraker et al., SIGMOD Record 2005
67. Requirements of Stream Processing: The Key Aspects
- High-level languages: SQL or a DSL
- Integrate stored and streaming data: for comparing the present with the past
- Data safety and availability
- Process and respond: the application should keep up at high volumes
The 8 Requirements of Stream Processing, Mike Stonebraker et al., SIGMOD Record 2005
68. Stream Processing: Window Processing
T. Akidau et al., The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing. In VLDB, 2015.
69. Three Generations
- First generation: extensions to existing database engines or simplistic engines; dedicated to specific applications or use cases
- Second generation: enhanced methods for language expressiveness; distributed processing, load balancing, and fault tolerance
- Third generation: massive parallelization for processing large data sets; dedicated towards cloud computing
http://www.slideshare.net/zbigniew.jerzak/cloudbased-data-stream-processing
73. 1st Generation Systems: Notable Features
- Early: active DBs, ECA (Event-Condition-Action) rules, triggers, publish-subscribe
- Pipeline: event source -> signaling -> event occurrences -> triggering -> triggered rules -> evaluation -> evaluated rules -> scheduling -> selected rules -> execution
- Systems: HiPAC, Starburst, Postgres, ODE
"Active Database Systems", Paton and Diaz, ACM Computing Surveys, 1999
Slide from Mike Franklin's VLDB 2015 BIRTE talk on real-time analytics
74. 1st Generation Applications
- Actuation (also IoT?)
- Finance
- Enforcing database integrity constraints
- Monitoring the physical world (IoT?)
- Supply chain
- News and update dissemination
- Battlefield awareness
- Health monitoring
Slide from Mike Franklin's VLDB 2015 BIRTE talk on real-time analytics
75. 1st Generation Systems: Issues
- Rules were (are) hard to program or understand
- Smart engineering of traditional approaches can get you close enough?!
- Little commercial activity
Slide from Mike Franklin's VLDB 2015 BIRTE talk on real-time analytics
77. 2nd Generation Systems: Early 2000s to Late 2000s
- NiagaraCQ [Jianjun Chen et al., 2000]
- Telegraph, TelegraphCQ [Hellerstein et al., 2000; Chandrasekaran et al., 2003]
80. Stream Query Processing: The Basic Idea
- Repeatedly apply generic SQL to the results of window operators
- Support the full SQL language and ecosystem
- A table is a set of records; a stream is an unbounded sequence of records
- Window operators convert streams to tables; each window outputs a set of records
Rstream semantics in CQL: Arvind Arasu et al., VLDB Journal 2006
Slide from Mike Franklin's VLDB 2015 BIRTE talk on real-time analytics
81. TelegraphCQ
Developed at the University of California, Berkeley
- Data stream query processor
- Continuous and adaptive query processing
- Built by modifying PostgreSQL
Slide from Mike Franklin's VLDB 2015 BIRTE talk on real-time analytics
82. NiagaraCQ
Developed at UW-Madison: a distributed database system for continuous queries over changing data sets, using a query language like XML-QL
- Incremental group optimization strategy; incremental evaluation of continuous queries
- Query grouping: allows sharing common parts of two or more queries
- Caching: for performance
- Push/pull data ingestion for detected changes in data
- Change-based and timer-based CQ: continuous queries trigger on data changes and on regular timers
84. Borealis
Developed at MIT, Brown, and Brandeis: a low-latency distributed stream processing engine with a focus on fault tolerance and distribution
- Load-aware distribution
- Fine-grained high availability
- Load shedding mechanisms
- Dynamic query modification
- Dynamic system optimization
- Dynamic revision of results
85. 2nd Generation Systems: Summary
- Can reuse many relational operators; streams can be processed using relational operators
- Historical comparison becomes a join of a stream and its history table
- Views on streams can be created
- Can leverage an RDBMS: streams and stream results can be stored in tables for later querying
Slide from Mike Franklin's VLDB 2015 BIRTE talk on real-time analytics
86. 2nd Generation Systems: Issues
- Despite significant commercial activity, no real breakout
- No standardization and no comprehensive benchmarks
- The value proposition for learning new concepts was not clear
Slide from Mike Franklin's VLDB 2015 BIRTE talk on real-time analytics
88. Streaming Platforms: The Last Decade
- S4 (Yahoo!)
- Storm (Twitter)
- Heron (Twitter)
- MillWheel (Google)
- Flink (Apache)
- Spark (Databricks)
- Samza (LinkedIn)
- Pulsar (eBay)
- S-Store (ISTC: Intel, MIT, Brown, CMU, Portland State)
- Trill (Microsoft)
89. Apache S4: Earliest Distributed Stream System
- Scalable: throughput is linear as additional nodes are added
- Cluster management: hides management using a layer over ZooKeeper
- Decentralized: all nodes are symmetric; no centralized service
- Extensible: building blocks of the platform can be replaced by custom implementations
- Fault tolerance: standby servers take over when a node fails
- Proven: deployed at Yahoo!, processing thousands of search queries per second
91. Storm Terminology
- Topology: a directed acyclic graph; vertices = computation, edges = streams of data tuples
- Spouts: sources of data tuples for the topology; examples: Kafka/Kestrel/MySQL/Postgres
- Bolts: process incoming tuples and emit outgoing tuples; examples: filtering/aggregation/join/any function
93. Tweet Word Count Topology
Tweet Spout (live stream of Tweets) -> Parse Tweet Bolt -> Word Count Bolt (#worldcup: 1M, soccer: 400K, ...)
94. Tweet Word Count Topology
When a parse tweet bolt task emits a tuple, which word count bolt task should it send it to?
95. Storm Groupings
- Shuffle grouping: random distribution of tuples
- Fields grouping: group tuples by a field or multiple fields
- All grouping: replicate tuples to all tasks
- Global grouping: send the entire stream to one task
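The routing question from the word-count topology is answered by the grouping. A hedged sketch of shuffle and fields grouping (the hash choice and names are illustrative): fields grouping on "word" guarantees that every occurrence of a given word reaches the same word-count task, so each task owns a disjoint set of counters.

```python
import hashlib
import random

def shuffle_grouping(num_tasks):
    """Shuffle grouping: route each tuple to a random downstream task."""
    return random.randrange(num_tasks)

def fields_grouping(tup, fields, num_tasks):
    """Fields grouping: hash the selected field values so that tuples
    with the same key always reach the same task."""
    key = "|".join(str(tup[f]) for f in fields)
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_tasks
```

For example, the parse bolt would call `fields_grouping({"word": "soccer"}, ["word"], num_tasks)` for each emitted word, while a load-balancing stage with no per-key state would use shuffle grouping.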
108. Storm Overheads: Some Experiments
- Baseline Java program: read from a Kafka cluster and serialize in a loop; sustains input rates of 300K msgs/sec from a Kafka topic
- 1-stage topology: no acks (at most once semantics); Storm processes co-located using the isolation scheduler
- 1-stage topology with acks: acks enabled for at least once semantics
115. MillWheel: Core Concepts
- Computations: arbitrary user logic, applied per key
- Persistent state: key/value API, backed by BigTable
- Streams: identified by names; unbounded
- Keys: operations on the same key are serial; different keys run in parallel
116. MillWheel: Low Watermark - The Concept of Time
- Defined per computation
- Discard late data: ~0.001% late at Google
- Seeded by injectors (input sources)
- Monotonic: makes life easy for users
117. MillWheel: Strong and Weak Productions
- Strong productions: checkpoint at the same time as user state
- Weak productions: no checkpoint, simpler API; can double count (no dedup)
125. Apache Flink: One Size Fits All
- General-purpose analytics engine
- Open source and community driven
- Works well with the Hadoop ecosystem
- Came out of Stratosphere
126. Apache Flink: Ambitious Goal - One Size Fits All
- Fast runtime: complex DAG operators; data streamed to operators
- Iterative algorithms: much faster in-memory operations
- Intuitive APIs: Java/Scala/Python; concise queries, familiar from the OLTP world
129. Spark: One System to Replace Them All!
- General-purpose compute engine: MapReduce, streaming, SQL, ...!
- Open source, big community
- Integrates well with the Hadoop ecosystem
Lots
Huge
CollecAon
with
Lineage
info
Resilient
Lost
DataSets
are
re-‐
computed
Distributed
Across
the
cluster
Core Concept: Lots of RDDS
t
(
)DataSet
Input
Data
divided
into
Batches
$
Streaming
134. 134
T0
to
T1 T1
to
T2 T2
to
T3
T0
to
T1 T1
to
T2 T2
to
T3
lines
words
flatMap
Series of RDDs
5
Window FunctionsA
Can Create other Dstreamsq
Streaming: With Dstreams
Streaming
136. 136
Basic Sources
HDFS,
S3,
…
É
Reliability
ack
vs
noAck
sources
VCustom
Implement
Interface
J
^ Advanced
Ka•a,
TwiPerUAls
u
Input DStreams: Sources of Data
Streaming
137. Spark Streaming: Basic Premise - One Size Fits All
- Exactly once: confident about results
- Ecosystem: Hadoop, YARN, Kafka, ...
- Scalable: RDDs as the unit of scale
- Single system: batch + streaming
138. Stream Processing: With SQL
- Processing logic in SQL
- Annotation plugin framework to extend SQL
- Clustering with elastic scaling
- No downtime during upgrades
141. Messaging Models
- Push (at most once): used for low latency; the producer pushes data to the consumer; write to Kafka if the consumer is down or unable to keep up, for replay later
- Pull (at least once): the producer writes events to Kafka; the consumer consumes from Kafka; storing to Kafka allows for replay
142. Deployment Architecture
- Events are partitioned: all events with the same key are routed to the same cell
- Scaling: more cells are added to the pipeline; Pulsar automatically detects new cells and rebalances traffic
147. Heron: Design Goals
- Fully API compatible with Storm: directed acyclic graph; topologies, spouts, and bolts
- Task isolation: ease of debug-ability/isolation/profiling
- Batching of tuples: amortizing the cost of transferring tuples
- Support for back pressure: topologies should be self-adjusting
- Use of mainstream languages: C++, Java, and Python
- Efficiency: reduce resource consumption
168. 3rd Generation Systems: Issues
- A bit early to tell
- Still no standardization, and too many systems
Slide from Mike Franklin's VLDB 2015 BIRTE talk on real-time analytics
172. Lambda Architecture - The Good
Message broker -> collection pipeline -> Lambda architecture analytics pipeline -> results
173. Lambda Architecture - The Bad
- Have to write everything twice!
- Have to fix everything (maybe twice)!
- How much duct tape is required?
- Subtle differences in semantics
- What about graphs, ML, SQL, etc.?
176. The Road Ahead: Technology Challenges
- Auto-scaling the system in the presence of unpredictability
- Auto-tuning of real-time analytics jobs/queries
- Exploiting faster networks for efficiently moving data
177. The Road Ahead: Applications
- Real-time personalization: preferences, time, location, and social
- Wearable computing: screen size fragmentation
- Analytics over image, video, and touch: pattern recognition, anomaly detection