Real-time Analytics
Algorithms and Systems
Arun Kejariwal*, Sanjeev Kulkarni+, Karthik Ramasamy☨
*Machine Zone, +PeerNova, ☨Twitter
@arun_kejariwal, @sanjeevrk, @karthikz
2
A look at our presentation agenda
Outline
Motivation
Why bother?
Emerging Applications
IoT, Health Care, Machine data
Connected vehicles
3
Algorithms: I
Classification
Systems: II
3rd Generation
Systems: I
1st & 2nd Generation
Algorithms: II
Deep Dive
4
The Road Ahead
Challenges
Closing
Q&A
5
Real-time is key
Information Age
6
Large variety of media
Blogs, reviews, news articles, streaming content
> 500M
Tweets every day
Challenge: Surfacing Relevant Content
Explosive Content Creation
[1] http://www.kpcb.com/blog/2014-internet-trends
> 300 hrs
Video uploaded every minute
> 1.8 B
Photos uploaded online in 2014 [1]
7
High Volume
Content Consumption
WhatsApp: > 30B messages per day [1]
Pandora: 5.3B listener hours (Q2 2015) [3]
Skype: 4.76B calls per month
E-mails: > 2.2M per second
Google: > 1T searches per year [2]
Netflix: > 1B hours streamed per month
[1] https://www.facebook.com/jan.koum/posts/10152994719980011?pnref=story
[2] http://searchengineland.com/google-1-trillion-searches-per-year-212940
[3] http://press.pandora.com/phoenix.zhtml?c=251764&p=irol-newsArticle&ID=2070623
8
A New World
Mobile, Mobile, Mobile
5.4 B Mobile Phone Users [1]
69% Y/Y Growth Data Traffic
55% Mobile Video Traffic
34% Global e-Commerce [2]
AVAILABILITY
PERFORMANCE
RELIABILITY
Anywhere, Anytime, Any Device
Smartphone Subscriptions in 2014 [1]
2.1B
[1] http://www.kpcb.com/blog/2015-internet-trends
[2] http://www.criteo.com/media/1894/criteo-state-of-mobile-commerce-q1-2015-ppt.pdf
9
Market pulse
Finance/Investing
[1] Image borrowed from http://www.bloomberg.com/bw/articles/2013-06-06/how-the-robots-lost-high-frequency-tradings-rise-and-fall
[2] http://articles.economictimes.indiatimes.com/2014-12-26/news/57420480_1_ravi-varanasi-mobile-platform-nse
1 minute bids and offers, March 8, 2011 [1]
Mobile trading on the rise [2]
NSE: 48% increase in turnover, Jan’14 -> Dec’14
BSE: 0.25% (Jan’14) -> 0.5% (Nov’14) of total volume
10
Entertainment: MMOs
Game of War
Largest single world concurrent mobile game in the world
“Real-time Many-to-Many is Tomorrow's Internet” - Francois Orsini
Global scale
Collaborative: make alliances
Real-time messaging
Chat translation in multiple languages
11
On the rise
Cybersecurity
2014
Staples: Dec’14
JP Morgan: Oct’14
New York: July’14
Michaels: Jan’14
PF Changs: June’14
Home Depot: Sept’14
UPS: Aug’14
Sony: Nov’14
2015
OPM, Anthem, UCLA
[1] http://www.mcafee.com/us/resources/reports/rp-economic-impact-cybercrime2.pdf
400 B [1]
12
Supporting higher volume and speed
Hardware Innovations
Massively parallel
Intel’s “Knights Landing” Xeon Phi - 72 cores [1]
High speed
Low Power
“… quickly identify fraud detection patterns in financial transactions; healthcare researchers could process and analyze larger data sets in real time, accelerating complex tasks such as genetic analysis and disease tracking.” [3]
Intel and Micron’s 3D XPoint Technology
1000x faster than NAND
[1] http://www.anandtech.com/show/9436/quick-note-intel-knights-landing-xeon-phi-omnipath-100-isc-2015
[2] Intel IDF’15
[3] http://newsroom.intel.com/community/intel_newsroom/blog/2015/07/28/intel-and-micron-produce-breakthrough-memory-technology
13
Hardware support for apps
Hardware Innovations
[1] Images borrowed from Julius Mandelblat’s and Andy Vargas, Rajeev Nalawadi and Shane Abreu’s Technology Insight at IDF’15.
Image and Touch processing support in Intel’s Skylake [1]
Emerging Applications
Overview
15
Real time
User Experience, Productivity
Real-time Video Streams
NEWS
Drones Robotics
INDUSTRY
$40 B by 2020 [3]
DELIVERY/MONITORING
$1.7B for 2015 [1]
[1] http://www.kpcb.com/blog/2015-internet-trends
[2] http://www.bostondynamics.com/robot_Atlas.html
[3] http://www.marketsandmarkets.com/Market-Reports/Industrial-Robotics-Market-643.html
16
$1.9 T in value by 2020 - Mfg (15%), Health Care (15%), Insurance (11%)
26 B - 75 B units [2, 3, 4, 5]
[1] Background image taken from https://www.uspsoig.gov/sites/default/files/document-library-files/2015/rarc-wp-15-013.pdf
[2] http://www.gartner.com/newsroom/id/2636073
[3] https://www.abiresearch.com/press/more-than-30-billion-devices-will-wirelessly-conne
[4] http://newsroom.cisco.com/feature-content?type=webcontent&articleId=1208342
[5] http://www.businessinsider.com/75-billion-devices-will-be-connected-to-the-internet-by-2020-2013-10
[6] https://www.abiresearch.com/press/ibeaconble-beacon-shipments-to-break-60-million-by/
Improve operational efficiencies, customer experience, new business models
Beacons: Retailers and bank branches
60M units market by 2019 [6]
Smart buildings: Reduce energy costs, cut maintenance costs
Increase safety & security
Large Market Potential
Internet of Things
17
The Future
Biostamps [2]
Mobile
Sensor Network
Exponential growth [1]
[1] http://opensignal.com/assets/pdf/reports/2015_08_fragmentation_report.pdf
[2] http://www.ericsson.com/thinkingahead/networked_society/stories/#/film/mc10-biostamp
18
Continuous Monitoring
Intelligent Health Care
Tracking Movements
Measure effect of social influences
Google Lens
Measure glucose level in tears
Watch/Wristband
Smart Textiles
Skin temperature
Perspiration
Ingestible Sensors
Medication compliance [1]
Heart function
[1] http://www.hhnmag.com/Magazine/2015/Apr/cover-medical-technology
19
Connected World
Internet of Things
30 B connected devices by 2020
Health Care
153 Exabytes (2013) -> 2314 Exabytes (2020)
Machine Data
40% of digital universe by 2020
Connected Vehicles
Data transferred per vehicle per month
4 MB -> 5 GB
Digital Assistants (Predictive Analytics)
$2B (2012) -> $6.5B (2019) [1]
Siri/Cortana/Google Now
Augmented/Virtual Reality
$150B by 2020 [2]
Oculus/HoloLens/Magic Leap
[1] http://www.siemens.com/innovation/en/home/pictures-of-the-future/digitalization-and-software/digital-assistants-trends.html
[2] http://techcrunch.com/2015/04/06/augmented-and-virtual-reality-to-hit-150-billion-by-2020/#.7q0heh:oABw
ANALYTICS
What is
Real-Time Analytics?
21
What is Analytics?
According to Wikipedia
DISCOVERY
Ability to identify patterns in data
COMMUNICATION
Provide insights in a meaningful way
22
Types of Analytics
CUBE ANALYTICS
Business Intelligence
PREDICTIVE ANALYTICS
Statistics and Machine learning
23
What is Real-Time Analytics?
BATCH
high throughput
> 1 hour
monthly active users
relevance for ads
adhoc
queries
NEAR
REAL TIME
low latency
< 1 ms
Financial
Trading
ad impressions count
hash tag trends
approximate
> 1 sec
Online
Non-Transactional
latency sensitive
< 500 ms
fanout Tweets
search for Tweets
deterministic
workflows
Online
Transactional
It’s contextual
24
What is Real-Time Analytics?
It’s contextual
[Figure: value of data to decision-making versus time (seconds, minutes, hours, days, months). Labels: preventive/predictive, actionable, reactive, historical; real-time, time-critical decisions; traditional “batch” business intelligence; information half-life in decision-making.] [1]
[1] Courtesy Michael Franklin, BIRTE, 2015.
25
Real Time Analytics
STREAMING
Analyze data as it is being produced
INTERACTIVE
Store data and provide results instantly when a query is posed
ALGORITHMS
Mining
Streaming Data
27
It’s different
Key Characteristics
APPROXIMATE
HIGH VELOCITY
ONE PASS
LOW LATENCY
DISTRIBUTED
HIGH VOLUME
28
It’s different
Key Characteristics
FAULT TOLERANCE [1]
AVAILABILITY
SCALE OUT
HIGH PERFORMANCE
ROBUST
INCOMPLETE DATA
[1] Byzantine failures are described in the following paper: Driscoll, Kevin; Hall, Brendan; Sivencrona, Håkan; Zumsteg, Phil (2003). "Byzantine Fault Tolerance, from Theory to Reality". 2788. pp. 235–248.
29
Categorization
Sampling
A/B Testing
Filtering
Set Membership
Correlation
Fraud Detection
30
Estimating Cardinality
Site audience analysis
Estimating Quantiles
Network analysis
Estimating Moments
Databases
Frequent Elements
Trending hashtags
31
Counting Inversions
Measure sortedness of data
Finding Subsequences
Traffic analysis
Path Analysis
Web graph analysis
Clustering
Medical imaging
32
Data Prediction
Financial trading
Anomaly Detection
Sensor networks
33
Sampling
Obtain a representative sample from a data stream
Maintain dynamic sample
A data stream is a continuous process
Not known in advance how many points may elapse before an analyst may need to use a representative sample
Reservoir sampling [1]
Probabilistic insertions and deletions on arrival of new stream points
Probability of successive insertion of new points reduces with progression of the stream
An unbiased sample contains a larger and larger fraction of points from the distant history of the stream
Practical perspective
Data stream may evolve and hence, the majority of the points in the sample may represent the stale history
[1] J. S. Vitter. Random Sampling with a Reservoir. ACM Transactions on Mathematical Software, Vol. 11(1):37–57, March 1985.
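The reservoir idea can be sketched in a few lines of Python. This is a minimal illustration of the classic one-pass scheme (often called Algorithm R), not necessarily the exact variant the slide had in mind; the function name `reservoir_sample` and its parameters are assumptions made for this example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # The first k points always enter the reservoir.
            reservoir.append(item)
        else:
            # Point i is inserted with probability k/i, so insertions
            # become rarer as the stream progresses.
            j = rng.randrange(i)
            if j < k:
                reservoir[j] = item  # evict a uniformly chosen resident
    return reservoir
```

Because the insertion probability decays as k/i, an unbiased reservoir is dominated by old points on a long stream, which is the staleness concern raised above.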
34
Sampling
	
Sliding window approach (sample size k, window width n) [1]
Sequence-based
Replace expired element with newly arrived element
Disadvantage: highly periodic
Chain-sample approach
Select the ith element as the sample with probability 1/min(i, n)
Select uniformly at random an index from [i+1, i+n] of the element which will replace the ith item
Maintain k independent chain samples
Timestamp-based
# elements in a moving window may vary over time
Priority-sample approach
[1] B. Babcock. Sampling From a Moving Window Over Streaming Data. In Proceedings of SODA, 2002.
[Figure: chain-sample illustration over the stream 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3]
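A single chain sample over a sequence-based window can be sketched as follows. This is a simplified reading of the chain-sample approach, assuming the ith element becomes the sample with probability 1/min(i, n); the class name `ChainSample` and its methods are hypothetical. Running k independent instances would give a sample of size k.

```python
import random
from collections import deque


class ChainSample:
    """One chain sample over the last n elements of a stream (illustrative sketch)."""

    def __init__(self, n, rng=None):
        self.n = n
        self.rng = rng or random.Random()
        self.i = 0                # number of elements seen so far (1-indexed positions)
        self.chain = deque()      # (index, value) pairs; the head is the current sample
        self.next_index = None    # index of the element that will extend the chain

    def add(self, value):
        self.i += 1
        i, n = self.i, self.n
        if self.rng.random() < 1.0 / min(i, n):
            # The new element becomes the sample; the old chain is discarded.
            self.chain.clear()
            self.chain.append((i, value))
            self.next_index = self.rng.randint(i + 1, i + n)
        elif i == self.next_index:
            # The pre-selected replacement has arrived: extend the chain
            # and pre-select a replacement for it as well.
            self.chain.append((i, value))
            self.next_index = self.rng.randint(i + 1, i + n)
        # Drop elements that have slid out of the window [i - n + 1, i].
        while self.chain and self.chain[0][0] <= i - n:
            self.chain.popleft()

    def sample(self):
        return self.chain[0] if self.chain else None
```

The replacement index always lies within n positions of the element it replaces, so a successor is already in the chain by the time the head expires.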
35
Sampling
	
Biased Reservoir Sampling [1]
Use a temporal bias function - recent points have higher probability of being represented in the sample reservoir
Memory-less bias functions
Future probability of retaining a current point in the reservoir is independent of its past history or arrival time
Probability of an rth point belonging to the reservoir at the time t is proportional to the bias function $f(r, t)$
Exponential bias function for the rth data point at time t: $f(r, t) = e^{-\lambda (t - r)}$, where $r \le t$ and $\lambda \in [0, 1]$ is the bias rate
Maximum reservoir requirement R(t) is bounded
[1] C. C. Aggarwal. On Biased Reservoir Sampling in the Presence of Stream Evolution. In Proceedings of VLDB, 2006.
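For the exponential bias function, one simple replacement policy keeps a reservoir of roughly 1/λ points. The sketch below follows that policy as I understand it from the cited paper; the function name `biased_reservoir` and the capacity choice `ceil(1/λ)` are assumptions for illustration.

```python
import math
import random


def biased_reservoir(stream, lam, rng=None):
    """Maintain an exponentially biased sample with bias rate lam (0 < lam <= 1)."""
    rng = rng or random.Random()
    capacity = math.ceil(1.0 / lam)   # assumed reservoir size for bias rate lam
    reservoir = []
    for point in stream:
        fill = len(reservoir) / capacity   # fraction of the reservoir in use
        if rng.random() < fill:
            # With probability equal to the fill fraction, the new point
            # overwrites a uniformly chosen resident point.
            reservoir[rng.randrange(len(reservoir))] = point
        else:
            # Otherwise the new point is simply appended.
            reservoir.append(point)
    return reservoir
```

Every arriving point enters the reservoir, and once it is full each resident survives an arrival with probability 1 - 1/capacity, which yields the memory-less exponential decay.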
36
Sampling
General problem
	
Input: Tuples of n components
A subset of the components forms the key - the basis for sampling
Sample a fraction a/b of the keys
Hash key to b buckets
Accept a tuple if hash value < a
Space constraint
a <- a - 1
Remove tuples whose keys hash to a
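The key-based scheme above can be sketched as follows. The helper `bucket`, the class `KeySampler`, and the `max_tuples` space limit are hypothetical names introduced for this example; SHA-256 is used only as a convenient stable hash.

```python
import hashlib


def bucket(key, b):
    """Hash a key to one of b buckets (stable across runs)."""
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % b


class KeySampler:
    """Keep all tuples whose key hashes below a; shrink a when space runs out."""

    def __init__(self, a, b, max_tuples):
        self.a, self.b, self.max_tuples = a, b, max_tuples
        self.sample = []

    def offer(self, key, tup):
        if bucket(key, self.b) < self.a:
            self.sample.append((key, tup))
        # Space constraint: lower the threshold and evict the tuples whose
        # keys hash to the bucket that just fell out of the accepted range.
        while len(self.sample) > self.max_tuples and self.a > 0:
            self.a -= 1
            self.sample = [(k, t) for k, t in self.sample
                           if bucket(k, self.b) < self.a]
```

Sampling on the key rather than on individual tuples means a key is either fully represented or absent, so per-key statistics computed on the sample remain meaningful.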
37
Set Membership
Filtering
Determine, with some false-positive probability, if an item in a data stream has been seen before
Databases (e.g., speed up semi-join operations), Caches, Routers, Storage Systems
Reduce space requirement in probabilistic routing tables
Speed up longest-prefix matching of IP addresses
Encode multicast forwarding information in packets
Summarize content to aid collaborations in overlay and peer-to-peer networks
Improve network state management and monitoring
38
Set Membership
Filtering
[1] Illustration borrowed from http://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf
Application to hyphenation programs
Early UNIX spell checkers
39
Set Membership
Filtering
	
Natural generalization of hashing
False positives are possible
No false negatives
No deletions allowed
For false positive rate ε, # hash functions = $\log_2(1/\varepsilon)$
False positive rate $\varepsilon \approx (1 - e^{-kn/m})^k$, where n = # elements, k = # hash functions, m = # bits in the array
40
Set Membership
Filtering
	
Minimizing false positive rate ε w.r.t. k [1]
$k = \ln 2 \cdot (m/n)$
$\varepsilon = (1/2)^k \approx (0.6185)^{m/n}$
$1.44 \cdot \log_2(1/\varepsilon)$ bits per item
Independent of item size or # items
Information-theoretic minimum: $\log_2(1/\varepsilon)$ bits per item
44% overhead
where X = # 0 bits
[1] A. Broder and M. Mitzenmacher. Network Applications of Bloom Filters: A Survey. In Internet Mathematics Vol. 1, No. 4, 2005.
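The sizing rules above translate directly into code. The following Bloom filter is a minimal sketch: the class name and constructor arguments are hypothetical, and deriving the k positions from two halves of a SHA-256 digest (double hashing) is an implementation shortcut assumed here, not something the slides prescribe.

```python
import hashlib
import math


class BloomFilter:
    """Bloom filter sized for n expected items and target false positive rate eps."""

    def __init__(self, n, eps):
        # m = 1.44 * log2(1/eps) bits per item, i.e. -n ln(eps) / (ln 2)^2 in total.
        self.m = max(1, math.ceil(-n * math.log(eps) / (math.log(2) ** 2)))
        # Optimal number of hash functions: k = ln 2 * (m / n).
        self.k = max(1, round((self.m / n) * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # odd step avoids a zero stride
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

For n = 1000 and ε = 1%, this allocates about 9.6 bits per item and uses 7 hash positions, in line with the formulas above.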
41
Set Membership
Filtering
Derivatives
Counting Bloom filters: Support deletion
Bit -> small counter
Typically, 4 bits per counter suffice
Increment, Decrement
Blocked Bloom filters
d-left Counting Bloom filters
Quotient filters
Rank-Indexed Hashing
42
Set Membership
Filtering
Cuckoo Filter [1]
	
Key Highlights
Add and remove items dynamically
For false positive rate ε < 3%, more space efficient than Bloom filter
Higher performance than Bloom filter for many real workloads
Asymptotically worse performance than Bloom filter
Min fingerprint size ∝ log (# entries in table)
Overview
Stores only a fingerprint of an item inserted
Original key and value bits of each item not retrievable
Set membership query for item x: search hash table for fingerprint of x
[1] Fan et al., Cuckoo Filter: Practically Better Than Bloom. In Proceedings of the 10th ACM International on Conference on Emerging Networking Experiments and Technologies, 2014.
43
Set Membership
Filtering
[1] R. Pagh and F. Rodler. Cuckoo hashing. Journal of Algorithms, 51(2):122-144, 2004.
[2] Illustration borrowed from “Fan et al., Cuckoo Filter: Practically Better Than Bloom. In Proceedings of the 10th ACM International on Conference on Emerging Networking Experiments and Technologies, 2014.”
Illustration of Cuckoo hashing [2]
Cuckoo Hashing [1]
High space occupancy
Practical implementations: multiple items/bucket
Example uses: Software-based Ethernet switches
Cuckoo Filter
Uses a multi-way associative Cuckoo hash table
Employs partial-key cuckoo hashing
Relocate existing fingerprints to their alternative locations
44
Set Membership
Filtering
Cuckoo Filter
	
Partial-key cuckoo hashing
Fingerprint hashing ensures uniform distribution of items in the table
Length of fingerprint << Size of h1 or h2
Possible to have multiple entries of a fingerprint in a bucket
Deletion
Item must have been previously inserted
Comparison
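Partial-key cuckoo hashing can be illustrated with a small Python sketch. It assumes a power-of-two number of buckets so that the alternate bucket can be derived by XOR-ing the current index with a hash of the fingerprint; the class name, default sizes, and use of SHA-256 are choices made for this example, not details from the cited paper's implementation.

```python
import hashlib
import random


class CuckooFilter:
    """Illustrative cuckoo filter using partial-key cuckoo hashing."""

    def __init__(self, num_buckets=1024, bucket_size=4, fp_bits=12,
                 max_kicks=500, rng=None):
        # A power-of-two table keeps index XOR hash(fingerprint) within range.
        assert num_buckets & (num_buckets - 1) == 0
        self.num_buckets, self.bucket_size = num_buckets, bucket_size
        self.fp_bits, self.max_kicks = fp_bits, max_kicks
        self.rng = rng or random.Random()
        self.buckets = [[] for _ in range(num_buckets)]

    @staticmethod
    def _hash(data):
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, item):
        return self._hash(b"fp:" + str(item).encode("utf-8")) % (1 << self.fp_bits)

    def _index(self, item):
        return self._hash(b"ix:" + str(item).encode("utf-8")) % self.num_buckets

    def _alt(self, index, fp):
        # Only the fingerprint is needed to find the other candidate bucket.
        return (index ^ self._hash(b"alt:" + fp.to_bytes(4, "big"))) % self.num_buckets

    def insert(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets are full: relocate existing fingerprints.
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            j = self.rng.randrange(len(self.buckets[i]))
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = self._alt(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # table considered full; the last evicted fingerprint is lost

    def __contains__(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt(i1, fp)]

    def delete(self, item):
        """Only safe for items that were previously inserted."""
        fp = self._fingerprint(item)
        i1 = self._index(item)
        for i in (i1, self._alt(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False
```

Deleting an item that was never inserted could remove another item's matching fingerprint, which is why the slide requires prior insertion.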
45
Estimating Cardinality
Large set of real-world applications
Database systems/Search engines
# distinct queries
Network monitoring applications
Natural language processing
# distinct motifs in a DNA sequence
# distinct elements of RFID/sensor networks
# Distinct Elements
46
Estimating Cardinality
Historical context
Probabilistic counting [Flajolet and Martin, 1983]
LogLog counting [Durand and Flajolet, 2003]
HyperLogLog [Flajolet et al., 2007]
Sliding HyperLogLog [Chabchoub and Hebrail, 2010]
HyperLogLog in Practice [Heule et al., 2013]
Self-Learning Bitmap [Chen and Cao, 2009]
Discrete Max-Count [Ting, 2014]
Sequence of sketches forms a Markov chain when h is a strong universal hash
Estimate cardinality using a martingale
# Distinct Elements
$N \le 10^9$
47
Estimating Cardinality
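HyperLogLog
Apply hash function h to every element in a multiset
Cardinality of multiset is estimated as $2^{\max(\varrho)}$, where $0^{\varrho-1}1$ is the bit pattern observed at the beginning of a hash value
Above suffers from high variance
Employ stochastic averaging
Partition input stream into m sub-streams $S_i$ using first p bits of hash values ($m = 2^p$)
Estimate $E = \alpha_m \, m^2 \left( \sum_{j=1}^{m} 2^{-M[j]} \right)^{-1}$, where $M[j] = \max_{x \in S_j} \varrho(x)$ and $\alpha_m$ is a bias-correction constant
# Distinct Elements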
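The stochastic-averaging scheme can be sketched as below. This is a bare-bones HyperLogLog assuming a 64-bit hash taken from SHA-256, the commonly used approximation α_m ≈ 0.7213 / (1 + 1.079/m) for m ≥ 128, and a linear-counting fallback for small cardinalities; it omits the further corrections used in production implementations, and the class name is hypothetical.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog with m = 2**p registers (illustrative sketch)."""

    def __init__(self, p=12):
        assert p >= 7  # the alpha approximation below assumes m >= 128
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        x = int.from_bytes(hashlib.sha256(str(item).encode("utf-8")).digest()[:8], "big")
        idx = x >> (64 - self.p)                  # first p bits select the sub-stream
        w = x & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        rho = (64 - self.p) - w.bit_length() + 1  # position of the leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rho)

    def estimate(self):
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw
```

With p = 12 the sketch uses 4096 small registers and has a typical relative error of about 1.04/√m, roughly 1.6%.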
48
Estimating Cardinality
HyperLogLog in Practice: Optimizations
Use of 64-bit hash function
Total memory requirement $5 \cdot 2^p$ -> $6 \cdot 2^p$ bits, where p is the precision
Empirical bias correction
Uses empirically determined data for cardinalities smaller than 5m and uses the unmodified raw estimate otherwise
Sparse representation
For n ≪ m, store an integer obtained by concatenating the bit patterns for idx and ϱ(w)
Use variable length encoding for integers that uses variable number of bytes to represent integers
Use difference encoding - store the difference between successive elements
Other optimizations [1, 2]
# Distinct Elements
[1] http://druid.io/blog/2014/02/18/hyperloglog-optimizations-for-real-world-systems.html
[2] http://antirez.com/news/75
49
Estimating Cardinality
Self-Learning Bitmap (S-bitmap) [1]
Achieve constant relative estimation errors for unknown cardinalities in a wide range, say from 10s to > $10^6$
Bitmap obtained via adaptive sampling process
Bits corresponding to the sampled items are set to 1
Sampling rates are learned from # distinct items already passed and reduced sequentially as more bits are set to 1
For given input parameters $N_{max}$ and estimation precision ε, size of bit mask
For $r = 1 - 2\varepsilon^2 (1+\varepsilon^2)^{-1}$ and sampling probability $p_k = m \, (m+1-k)^{-1} (1+\varepsilon^2) \, r^k$, where $k \in [1, m]$
Relative error ≡ ε
# Distinct Elements
[1] Chen et al. “Distinct counting with a self-learning bitmap”. Journal of the American Statistical Association, 106(495):879–890, 2011.
50
Estimating Quantiles
Large set of real-world applications
Database applications
Sensor networks
Operations
Properties
Provide tunable and explicit guarantees on the precision of approximation
Single pass
Early work
[Greenwald and Khanna, 2001] - worst case space requirement
[Arasu and Manku, 2004] - sliding window based model, worst case space requirement
Quantiles, Histograms, Icebergs
51
Estimating Quantiles
q-digest [1]
Groups values in variable size buckets of almost equal weights
Unlike a traditional histogram, buckets can overlap
Key features
Detailed information about frequent values preserved
Less frequent values lumped into larger buckets
Using message of size m, answer within an error of $O(\log(\sigma)/m)$
Except root and leaf nodes, a node v ∈ q-digest iff $\mathrm{count}(v) \le \lfloor n/k \rfloor$ and $\mathrm{count}(v) + \mathrm{count}(v_p) + \mathrm{count}(v_s) > \lfloor n/k \rfloor$
where σ = max signal value, n = # elements, k = compression factor, and $v_p$, $v_s$ are the parent and sibling of v in a complete binary tree over the value range
Quantiles, Histograms, Icebergs
[1] Shrivastava et al., Medians and Beyond: New Aggregation Techniques for Sensor Networks. In Proceedings of SenSys, 2004.
52
Estimating Quantiles
q-digest
Building a q-digest
q-digests can be constructed in a distributed fashion
Merge q-digests
Quantiles, Histograms, Icebergs
Applications
Track bandwidth hogs
Determine popular tourist destinations
Itemset mining
Entropy estimation
Compressed sensing
Search log mining
Network data analysis
DBMS optimization
53
Frequent Elements
A core streaming problem
Count-min Sketch [1]
A two-dimensional array of counts with w columns and d rows
Each entry of the array is initially zero
d hash functions are chosen uniformly at random from a pairwise independent family
Update
For a new element i, for each row j and $k = h_j(i)$, increment the kth column by one
Point query: $\hat{a}_i = \min_j \mathrm{sketch}[j, h_j(i)]$, where sketch is the table
Parameters
54
Frequent Elements
A core streaming problem
[1] Cormode, Graham; S. Muthukrishnan (2005). "An Improved Data Stream Summary: The Count-Min Sketch and its Applications". J. Algorithms 55: 29–38.
$(\varepsilon, \delta)$
$h_1, \ldots, h_d : \{1, \ldots, n\} \to \{1, \ldots, w\}$
$w = \lceil e / \varepsilon \rceil$
$d = \lceil \ln(1/\delta) \rceil$
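A compact Count-min sketch following the parameters above might look like this. Items are assumed to be non-negative integers, and the pairwise-independent hash family is realized as (a·x + b) mod P mod w with the Mersenne prime P = 2^61 - 1; both are choices made for this illustration, as is the class name.

```python
import math
import random


class CountMinSketch:
    """Count-min sketch with w = ceil(e/eps) columns and d = ceil(ln(1/delta)) rows."""

    PRIME = (1 << 61) - 1  # modulus for the (a*x + b) mod P hash family

    def __init__(self, eps, delta, rng=None):
        rng = rng or random.Random()
        self.w = math.ceil(math.e / eps)
        self.d = math.ceil(math.log(1.0 / delta))
        self.table = [[0] * self.w for _ in range(self.d)]
        # One (a, b) pair per row, drawn at random.
        self.hashes = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(self.d)]

    def _col(self, j, item):
        a, b = self.hashes[j]
        return ((a * item + b) % self.PRIME) % self.w

    def update(self, item, c=1):
        # Increment one counter per row.
        for j in range(self.d):
            self.table[j][self._col(j, item)] += c

    def query(self, item):
        # Collisions only inflate counters, so the minimum never underestimates.
        return min(self.table[j][self._col(j, item)] for j in range(self.d))
```

For non-negative updates the estimate is at least the true count, and it exceeds the true count by more than ε times the total stream weight with probability at most δ.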
Variants of Count-min Sketch [1]
Count-Min sketch with conservative update (CU sketch)
Update an item with frequency c
Avoid unnecessary updating of counter values => Reduce over-estimation error
Prone to over-estimation error on low-frequency items
Lossy Conservative Update (LCU) - SWS
Divide stream into windows
At window boundaries, ∀ 1 ≤ i ≤ w, 1 ≤ j ≤ d, decrement sketch[i,j] if 0 < sketch[i,j] ≤
55
Frequent Elements
A core streaming problem
[1] Cormode, G. 2009. Encyclopedia entry on ’Count-MinSketch’. In Encyclopedia of Database Systems. Springer, 511–516.
56
Anomaly Detection
Large set of real-world applications
Social media: Trending analysis
Fraud detection: Insurance, E-commerce, Marketing
Network intrusion detection
Health care
Sensor networks
Anomalous state detection (e.g., wind turbines)
Operations
Metric space: System, Application, Data Center
Potentially impact performance, availability, reliability
Researched over > 50 yrs
57
Anomaly Detection
Anomaly is contextual
Manufacturing
Statistics
Econometrics, Financial engineering
Signal processing
Control systems, Autonomous systems - fault detection [1]
Networking
Computational biology (e.g., microarray analysis)
Computer vision
Researched over > 50 yrs
[1] A. S. Willsky, “A survey of design methods for failure detection systems,” Automatica, vol. 12, pp. 601–611, 1976.
58
Anomaly Detection
Characterization
Magnitude
Width
Frequency
Direction
Flavors
Global
Local
Researched over > 50 yrs
59
Anomaly Detection
Traditional Approaches
Rule based: μ ± σ
Manufacturing, Statistical Process Control [1]
Moving averages
SMA
EWMA
PEWMA
Assumption: Normal distribution
Mostly does not hold in real life
Researched over > 50 yrs
[1] W. A. Shewhart. Economic Quality Control of Manufactured Product, The Bell System Technical Journal, 9(2):364-389, 1930.
60
Anomaly Detection
In Practice
Robustness
μ and σ are not robust in presence of anomalies
Use median and MAD (Median Absolute Deviation)
Seasonality
Trend
Multi-modal distribution
Time series decomposition
AnomalyDetection R package [1]
Researched over > 50 yrs
[1] https://github.com/twitter/AnomalyDetection
Marrying Time Series Decomposition and Robust Statistics
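The robustness point can be made concrete with a small sketch that flags outliers using the median and MAD instead of μ and σ. It covers only the robust-statistics half (no seasonality or trend decomposition) and is not the method of the AnomalyDetection package; the function name, the threshold of 3.5, and the 1.4826 consistency constant for normally distributed data are assumptions of this example.

```python
import statistics


def mad_anomalies(values, threshold=3.5):
    """Return indices whose robust z-score (based on median and MAD) exceeds threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    scale = 1.4826 * mad  # makes MAD comparable to a standard deviation
    return [i for i, v in enumerate(values) if abs(v - med) / scale > threshold]
```

For example, `mad_anomalies([10, 11, 9, 10, 12, 10, 11, 95, 10, 9])` returns `[7]`: the single spike barely moves the median or the MAD, whereas it would inflate both the mean and the standard deviation.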
  
61
Anomaly Detection
Researched over > 50 yrs
Trend Smoothing Distortion
Creates “Phantom” Anomalies
Median is Free from Distortion
62
Anomaly Detection
Real-Time
Challenges
Adaptive learning
Automated modeling
Marrying theory with contextual relevance
Operations
Large set of different services in a technology stack
Different stacks use different services
Promising products such as Opsclarity
Researched over > 50 yrs
63
Anomaly Detection
Researched over > 50 yrs
Anomalies in operational data: Challenges
Contextual Application Topology Map
Hierarchical
Datacenter → Applications → Services → Hosts
•  Automatically discover Developer / Architect’s view of the application - for the Operations team
-  Framework for system config and context
•  Real-time, streaming architecture
-  Keeps up with today’s elastic infrastructure
•  Scale to 1000s of hosts, 100s of (micro) services
•  Present evolution of system state over time
-  DVR-like replay of health, system changes, failures
Evolving Needs of Modern Operations
64
Anomaly Detection
Researched over > 50 yrs
Anomalies in operational data: Challenges
Automatically learn base-lines for metrics
Data variety requires advanced statistical approaches
Detect issues earlier, proactive alerting
Example: Detecting Disk Full Issues Early
SYSTEMS
Overview
66
The Key Aspects
Requirements of Stream Processing
In-stream: Process data as it passes by
Handle imperfections: Delayed, missing and out-of-order data
Predictable and Repeatable
Performance and Scalability
8 Requirements of Stream Processing, Mike Stonebraker et al., SIGMOD Record 2005
67
The Key Aspects
Requirements of Stream Processing
High level languages: SQL or DSL
Integrate stored and streaming data: for comparing present with the past
Data safety and availability
Process and respond: Application should keep up at high volumes
8 Requirements of Stream Processing, Mike Stonebraker et al., SIGMOD Record 2005
68
Window Processing
Stream Processing
T. Akidau et al., The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing, In VLDB, 2015.
69
Three Generations
First Generation
Extensions to existing database engines or simplistic engines
Dedicated to specific applications or use cases
Second Generation
Enhanced methods regarding language expressiveness
Distributed processing, load balancing and fault tolerance
Third Generation
Massive parallelization for processing large data sets
Dedicated towards cloud computing
http://www.slideshare.net/zbigniew.jerzak/cloudbased-data-stream-processing
1st generation - Active Database Systems
SYSTEMS
Late 1980s to Late 1990s
1st Generation Systems
HiPAC [Dayal et al., 1988]
Starburst [Widom/Finkelstein et al., 1990]
Postgres [Stonebraker/Kemnitz et al., 1991]
ODE [Gehani/Jagadish et al., 1991]
Notable features
1st Generation Systems
Early: Active DBs, ECA rules, triggers, publish-subscribe
Event-Condition-Action
Figure: Event Source -> Signaling -> Event Occurrences -> Triggering -> Triggered Rules -> Evaluation -> Evaluated Rules -> Scheduling -> Selected Rules -> Execution
Systems - HiPAC, Starburst, Postgres, ODE
"Active Database Systems", Paton and Diaz, ACM Computing Surveys, 1999
Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
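The Event-Condition-Action stages named on this slide (signaling, triggering, evaluation, scheduling, execution) can be sketched in a few lines. This is an illustrative toy, not the design of HiPAC or any other system listed; the `Rule` and `RuleEngine` names and the priority-based scheduling are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Event = Dict[str, object]


@dataclass
class Rule:
    name: str
    event_type: str                      # Event: what the rule listens for
    condition: Callable[[Event], bool]   # Condition: checked when triggered
    action: Callable[[Event], None]      # Action: run if the condition holds
    priority: int = 0                    # hypothetical scheduling hint


class RuleEngine:
    def __init__(self) -> None:
        self.rules: List[Rule] = []

    def register(self, rule: Rule) -> None:
        self.rules.append(rule)

    def signal(self, event: Event) -> List[str]:
        """Run one event through triggering, evaluation, scheduling, execution."""
        triggered = [r for r in self.rules if r.event_type == event["type"]]
        evaluated = [r for r in triggered if r.condition(event)]
        selected = sorted(evaluated, key=lambda r: -r.priority)
        for rule in selected:
            rule.action(event)
        return [r.name for r in selected]
```

A database trigger is the familiar special case: the event is an update, the condition is a predicate on the new row, and the action is a stored procedure.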
Notable features
1st Generation Applications
Actuation (also IoT?)
Finance
Enforcing database integrity constraints
Monitoring the physical world (IoT?)
Supply chain
News and update dissemination
Battlefield awareness
Health monitoring
Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
Issues
1st Generation Systems
Rules were (are) hard to program or understand
Smart engineering of traditional approaches can get you close enough?!
Little commercial activity
Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
2nd generation - Streaming Database Systems
SYSTEMS
Early 2000s to Late 2000s
2nd Generation Systems
Niagara CQ [Jianjun Chen et al., 2000]
Telegraph, Telegraph CQ [Hellerstein et al., 2000] [Chandrasekaran et al., 2003]
STREAM [Arasu et al., 2003]
Aurora [Abadi et al., 2003]
Borealis [Abadi et al., 2005]
Cayuga [Demers et al., 2007]
MCOPE [Park et al., 2009]
The basic idea
Stream Query Processing
Support full SQL language and ecosystem
A table is a set of records and a stream is an unbounded sequence of records
Window operators convert streams to tables
Each window outputs a set of records
Repeatedly apply generic SQL to the results of window operators
Figure labels: Streams, Window Operators, Tables, SQL
Rstream semantics in CQL, Arvind Arasu et al., VLDB Journal 2006
Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
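The idea of turning each window into a table and re-running ordinary SQL on it can be tried with nothing more than the standard library. The sketch below uses an in-memory SQLite table as the "table" side; the `(timestamp, symbol, price)` schema, the tumbling window and the function names are assumptions chosen for illustration, and the input is assumed to arrive in timestamp order.

```python
import sqlite3
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[int, str, float]  # (timestamp_sec, symbol, price): hypothetical schema


def tumbling_windows(stream: Iterable[Record],
                     size_sec: int) -> Iterator[Tuple[int, List[Record]]]:
    """Window operator: cut an ordered stream into finite sets of records."""
    window: List[Record] = []
    window_start = None
    for rec in stream:
        start = rec[0] - rec[0] % size_sec
        if window_start is not None and start != window_start:
            yield window_start, window
            window = []
        window_start = start
        window.append(rec)
    if window:
        yield window_start, window


def run_continuous_query(stream: Iterable[Record], size_sec: int, sql: str):
    """Load each window into table `w`, then apply the same SQL to it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE w (ts INTEGER, symbol TEXT, price REAL)")
    results = []
    for start, rows in tumbling_windows(stream, size_sec):
        conn.execute("DELETE FROM w")
        conn.executemany("INSERT INTO w VALUES (?, ?, ?)", rows)
        results.append((start, conn.execute(sql).fetchall()))
    conn.close()
    return results
```

Emitting one result set per window is roughly what a relation-to-stream operator does; real engines evaluate incrementally instead of reloading a table.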
Telegraph CQ
Developed at University of California, Berkeley
01 Data stream query processor
02 Continuous and adaptive query processing
03 Built by modifying PostgreSQL
Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
Niagara CQ
Developed at UW-Madison
A distributed database system for continuous queries over changing data sets, using a query language like XML-QL
Incremental group optimization strategy
Incremental evaluation of continuous queries
01 Query Grouping: Allows for sharing common parts of two or more queries
02 Caching: For performance
03 Push/Pull data ingestion: for detected changes in data
04 Change based and Timer CQ: Continuous queries triggered by data changes or at regular timed intervals
Niagara CQ
Query grouping and sharing
Before grouping:
quotes.xml -> Select Symbol = INTC -> Trigger Action 1
quotes.xml -> Select Symbol = MSFT -> Trigger Action 2
After grouping:
quotes.xml -> Select against Constant Table (INTC/MSFT) -> Split -> Trigger Action 1, Trigger Action 2
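The grouping shown here replaces many near-identical selections with one scan plus a lookup in a table of constants, followed by a split to each query's action. A minimal sketch of that idea follows; the `GroupedSelect` class and its method names are hypothetical and do not reflect NiagaraCQ's actual implementation.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

Quote = Dict[str, object]
Action = Callable[[Quote], None]


class GroupedSelect:
    """One shared selection for all queries of the form `field = <constant>`."""

    def __init__(self, field: str) -> None:
        self.field = field
        # Constant table: maps each query's constant to the actions to trigger.
        self.constant_table: Dict[object, List[Action]] = defaultdict(list)

    def add_query(self, constant: object, action: Action) -> None:
        self.constant_table[constant].append(action)

    def process(self, quotes: Iterable[Quote]) -> int:
        """Scan the source once; split each matching record to its actions."""
        fired = 0
        for quote in quotes:
            for action in self.constant_table.get(quote.get(self.field), []):
                action(quote)
                fired += 1
        return fired
```

Adding a thousand more symbol queries adds rows to the constant table but still costs one pass over `quotes.xml`.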
Borealis
Developed at MIT, Brown and Brandeis
A low latency stream processing engine with a focus on fault tolerance and distribution
Load aware distribution
Fine grained high availability
Load shedding mechanisms
01 Distributed stream engine
02 Dynamic query modification
03 Dynamic system optimization
04 Dynamic revision of results
Summary
2nd Generation Systems
Streams can be processed using relational operators
Can reuse many of the relational operators
Historical comparison becomes a join of a stream and its history table
Views on streams can be created
Can leverage an RDBMS
Stream and stream results can be stored in tables for later querying
Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
Issues
2nd Generation Systems
Despite significant commercial activity, no real breakout
No standardization or comprehensive benchmarks
Value proposition for learning new concepts was not clear
Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
SYSTEMS
3rd generation
The last decade
Streaming Platforms
S4 (Yahoo!)
Flink (Apache)
Storm (Twitter)
Spark (Databricks)
Samza (LinkedIn)
Heron (Twitter)
MillWheel (Google)
Pulsar (eBay)
S-Store (ISTC, Intel, MIT, Brown, CMU, Portland State)
Trill (Microsoft)
Earliest distributed stream system
Apache S4
Scalable: Throughput increases linearly as additional nodes are added
Cluster management: Hides management using a layer in ZooKeeper
Decentralized: All nodes are symmetric, with no centralized service
Extensible: Building blocks of the platform can be replaced by custom implementations
Fault tolerance: Standby servers take over when a node fails
Proven: Deployed at Yahoo!, processing thousands of search queries per second
Twitter Storm
Guaranteed Message Passing
Horizontal Scalability
Robust Fault Tolerance
Concise Code - Focus on Logic
Storm Terminology
Topology: Directed acyclic graph; vertices = computation, and edges = streams of data tuples
Spouts: Sources of data tuples for the topology. Examples - Kafka/Kestrel/MySQL/Postgres
Bolts: Process incoming tuples, and emit outgoing tuples. Examples - filtering/aggregation/join/any function
Storm Topology
Figure: example topology with Spout 1, Spout 2 and Bolt 1 to Bolt 5
Tweet Word Count Topology
Tweet Spout -> Parse Tweet Bolt -> Word Count Bolt
Live stream of Tweets
#worldcup: 1M
soccer: 400K
….
Tweet Word Count Topology
Tweet Spout -> Parse Tweet Bolt -> Word Count Bolt
When a parse tweet bolt task emits a tuple, which word count bolt task should it send to?
Storm Groupings
01 Shuffle Grouping: Random distribution of tuples
02 Fields Grouping: Group tuples by a field or multiple fields
03 All Grouping: Replicates tuples to all tasks
04 Global Grouping: Send the entire stream to one task
Tweet Word Count Topology
Tweet Spout -> (Shuffle Grouping) -> Parse Tweet Bolt -> (Fields Grouping) -> Word Count Bolt
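To see why the word-count stage needs a fields grouping while the parse stage can use a shuffle grouping, it helps to simulate the routing. The sketch below is plain Python rather than the Storm API; the function names and the CRC32-based hash are assumptions chosen so the routing is deterministic.

```python
import random
import zlib
from collections import Counter
from typing import Iterable, List


def shuffle_grouping(num_tasks: int, rng: random.Random) -> int:
    """Pick any task at random: fine for stateless work such as parsing."""
    return rng.randrange(num_tasks)


def fields_grouping(word: str, num_tasks: int) -> int:
    """Hash the grouping field so equal values always reach the same task."""
    return zlib.crc32(word.encode("utf-8")) % num_tasks


def run_word_count(tweets: Iterable[str], parse_tasks: int = 2,
                   count_tasks: int = 3, seed: int = 0) -> List[Counter]:
    rng = random.Random(seed)
    # Spout -> parse bolt tasks (shuffle grouping).
    parse_inboxes: List[List[str]] = [[] for _ in range(parse_tasks)]
    for tweet in tweets:
        parse_inboxes[shuffle_grouping(parse_tasks, rng)].append(tweet)
    # Parse bolt -> word count bolt tasks (fields grouping on the word).
    counters = [Counter() for _ in range(count_tasks)]
    for inbox in parse_inboxes:
        for tweet in inbox:
            for word in tweet.lower().split():
                counters[fields_grouping(word, count_tasks)][word] += 1
    return counters
```

Because every occurrence of a word lands on one task, each task's local count for that word is already the global count. With a shuffle grouping at that stage the counts would be split across tasks.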
Storm Architecture
Figure: Topology Submission -> Nimbus (Master Node); Nimbus and Supervisors coordinate through the ZK Cluster (Assignment Maps); Supervisors Sync Code from Nimbus; each Slave Node runs a Supervisor with workers W1 to W4
Storm Worker
Figure: a worker contains several EXECUTORs, each running one or more TASKs
Data Flow in Storm Workers
Figure: Global Receive Thread -> In Queue -> User Logic Thread -> Out Queue -> Send Thread -> Outgoing Message Buffer -> Global Send Thread
Storm Metrics
Support and troubleshooting
Continuous performance
Cluster availability
Collecting Topology Metrics
Figure: Tweet Spout -> Parse Tweet Bolt -> Word Count Bolt, with a Metrics Bolt writing to Scribe
Topology Dashboard
Overloaded Zookeeper
Figure (three build steps): STORM supervisors S1 to S3 with their workers (W) and other SERVICES all talk to ZooKeeper (zk); the later steps show a second zk
Analyzing Zookeeper Traffic
Overloaded Zookeeper
Kafka Spout (67%): Offset/Partition is written every 2 secs
Storm Runtime (33%): Workers write heartbeats every 3 secs
Heartbeat Daemons
Overloaded Zookeeper
Figure: STORM (S1 to S3, workers W) and SERVICES with zk, plus a Heartbeat Cluster backed by a Key Value Store
Some experiments
Storm Overheads
Java program: Read from Kafka cluster and serialize in a loop; sustain input rates of 300K msgs/sec from Kafka topic
1-stage topology: No acks, so no at-least-once semantics; Storm processes were co-located using the isolation scheduler
1-stage topology with acks: Enable acks for at-least-once semantics
Performance comparison
Storm Overheads
Figure: bar chart of Machines Used (axis 0 to 3) and Average CPU Utilization (axis 0% to 80%) for JAVA, 1-STAGE and 1-STAGE-ACK
Data labels as extracted: Avg. CPU 77%, 58.2%, 58.3%; Machines 3, 1, 1
Storm Deployment
Figure (built up over four slides): a storm cluster with a shared pool and isolated pools for joe's topology, jane's topology and dave's topology
MillWheel
DAG Processing: Streams, Computations
Cloud Dataflow: Uses MillWheel
From Google: Not open source
Exactly Once: Checkpoint user state
MillWheel
Core Concepts
Computations: Arbitrary user logic; per-key operation
Persistent State: Key/Value API; backed by BigTable
Streams: Identified by names; unbounded
Keys: Per-key operations are serial; different keys run in parallel
MillWheel
Low Watermark: The Concept of Time
Caught up Time: Defined per computation
Discard Late Data: ~0.001% at Google
Seeded by Injectors: Input sources
Monotonic: Makes life easy for users
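The low watermark can be read as a recursive minimum: a computation is "caught up" to the oldest timestamp among its own unfinished work and the watermarks of everything upstream, with injectors seeding the value. The sketch below illustrates that reading; the class and attribute names are hypothetical, and clamping with `max` is one assumed way to keep the reported value monotonic.

```python
from typing import List, Optional


class Computation:
    def __init__(self, name: str,
                 upstream: Optional[List["Computation"]] = None) -> None:
        self.name = name
        self.upstream = upstream or []
        self.pending: List[float] = []           # timestamps of unfinished work
        self.source_watermark = float("inf")     # set on injectors only
        self._reported = float("-inf")

    def low_watermark(self) -> float:
        """min(own pending work, upstream watermarks), never moving backwards."""
        candidates = list(self.pending)
        if self.upstream:
            candidates.extend(c.low_watermark() for c in self.upstream)
        else:
            candidates.append(self.source_watermark)  # seeded by the injector
        self._reported = max(self._reported, min(candidates))
        return self._reported

    def is_late(self, timestamp: float) -> bool:
        """Records older than the last reported watermark count as late data."""
        return timestamp < self._reported
```

A window ending at time T can be closed once `low_watermark()` passes T, and anything arriving afterwards with an older timestamp is the late data the slide says is discarded.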
MillWheel
Strong and Weak Productions
Checkpoint: Same time as user state
DoubleCount: No dedup
Seeded by Injectors: Input sources
No checkpoint: Simpler API
MillWheel
Computation State: Exactly Once Semantics
Key/Value Abstractions: Computations
Persistence Layer: BigTable
Idempotent: No side effects
Batched: Efficient
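One common way to get exactly-once state updates on top of at-least-once delivery is to make each update idempotent: remember which record IDs have been applied and commit that set together with the state. The sketch below shows the idea with an in-memory dictionary standing in for a durable store; all names are hypothetical and this is not MillWheel's API.

```python
from typing import Dict, Hashable, Set


class KeyState:
    """Per-key state plus the IDs of records already folded into it."""

    def __init__(self) -> None:
        self.value = 0
        self.seen_ids: Set[Hashable] = set()


class ExactlyOnceCounter:
    def __init__(self) -> None:
        # Stand-in for a durable key/value store.
        self.store: Dict[Hashable, KeyState] = {}

    def process(self, record_id: Hashable, key: Hashable, amount: int) -> bool:
        """Apply a record once; return False for a redelivered duplicate."""
        state = self.store.setdefault(key, KeyState())
        if record_id in state.seen_ids:
            return False  # retry of a record that was already applied
        # In a real system these two updates would be a single atomic write.
        state.value += amount
        state.seen_ids.add(record_id)
        return True
```

A production system would also garbage-collect old record IDs once the sender can no longer retry them.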
PubSub weds Processing
Exactly Once Processing
Tightly Integrated with Kafka
Open Sourced by LinkedIn
Durability via YARN
Samza
Streams: Partitioned (Partition 0, Partition 1, Partition 2)
Task: Work on a single partition (Partition 0 -> Task)
Job: Collection of Tasks
Figure: Stream A and Stream B, Job 1 with Task 1, Task 2 and Task 3, and Stream C
Samza
Stateful Tasks: Exactly Once Semantics
Samza State API: key value store
State As a Stream: persist on Kafka
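"State as a stream" means every write to a task's local key-value store is also appended to a log, so a restarted task can rebuild its state by replaying that log. The following sketch uses a Python list in place of a Kafka changelog partition; the `ChangelogStore` class is hypothetical and not the Samza API.

```python
from typing import Dict, Hashable, List, Optional, Tuple

Change = Tuple[Hashable, Optional[object]]  # value None marks a delete


class ChangelogStore:
    def __init__(self, changelog: List[Change]) -> None:
        self.changelog = changelog   # stands in for a Kafka topic partition
        self.state: Dict[Hashable, object] = {}
        for key, value in changelog:  # replay on startup to restore state
            self._apply(key, value)

    def _apply(self, key: Hashable, value: Optional[object]) -> None:
        if value is None:
            self.state.pop(key, None)
        else:
            self.state[key] = value

    def put(self, key: Hashable, value: object) -> None:
        self.changelog.append((key, value))  # persist the change first
        self._apply(key, value)

    def delete(self, key: Hashable) -> None:
        self.changelog.append((key, None))
        self._apply(key, None)

    def get(self, key: Hashable, default: object = None) -> object:
        return self.state.get(key, default)
```

Since only the latest value per key matters for recovery, such a log can be compacted without losing the ability to restore.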
Samza
Tight Coupling: Queue and Processing
Kafka based Streams: Persistence
Simple API: Single node job
Stateful: Exactly once
YARN Friendly: Durability
One Size Fits All
Apache Flink
General Purpose Analytics Engine
Open Source and Community Driven
Works well with Hadoop Ecosystem
Came out of Stratosphere
Apache Flink
Ambitious Goal: One Size Fits All
Fast Runtime: Complex DAG operators; data streamed to operators
Iterative Algorithms: Much faster in-memory operations
Intuitive APIs: Java/Scala/Python; concise
Query: Coming from OLTP world
Apache Flink
Distributed Runtime: Scale
Data Streamed: between operators
Master: Submission and scheduling
Workers: Do actual work
Apache Flink
Stack: Co-Exist with Hadoop
Spark
One system to replace them all!
General purpose Compute Engine
Open Source/Big Community
MapReduce, Streaming, SQL, …
Integrates well with Hadoop Ecosystem
Core Concept: Lots of RDDs
Lots: Huge collection with lineage info
Resilient: Lost datasets are recomputed
Distributed: Across the cluster
DataSet: Input data divided into batches
RDDs In Action: WordCount
Figure: lines of words (W1 to W8) pass through FlatMap (one word per record), Map (pairs such as W1:1, W2:1) and reduceByKey (counts such as W1:3, W2:4, W3:1, W4:1, W5:4, W6:2)
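The three stages in the figure can be mimicked on ordinary Python lists to make the data flow concrete. This is a single-process sketch, not Spark code; the helper names `flat_map`, `map_records` and `reduce_by_key` are stand-ins for the RDD operators of the same purpose.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain
from typing import Callable, Dict, Iterable, List, Tuple


def flat_map(fn: Callable, records: Iterable) -> List:
    """Apply fn to each record and flatten the resulting sequences."""
    return list(chain.from_iterable(fn(r) for r in records))


def map_records(fn: Callable, records: Iterable) -> List:
    return [fn(r) for r in records]


def reduce_by_key(fn: Callable, pairs: Iterable[Tuple]) -> Dict:
    """Group values by key, then fold each group with fn."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(fn, values) for key, values in groups.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    words = flat_map(str.split, lines)             # line -> words
    pairs = map_records(lambda w: (w, 1), words)   # word -> (word, 1)
    return reduce_by_key(lambda a, b: a + b, pairs)
```

In Spark the same pipeline runs per partition, with a shuffle before the reduce so that equal keys meet on one node.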
Scala: Functional and Concise
Streaming: Fits Naturally
Figure: an input stream of records (W1, W2, W3, W4, ...) enters Spark Streaming, which passes a DStream of batches to the Spark Engine
Streaming: With DStreams
Series of RDDs
Window Functions
Can create other DStreams
Figure: the lines DStream (batches T0 to T1, T1 to T2, T2 to T3) is transformed by flatMap into the words DStream over the same intervals
DStream: Operators
Regular Spark Operators: map, flatMap, filter, …
Transform: RDD -> RDD
Window Operators: countByWindow, reduceByWindow
Join: join multiple DStreams
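A window operator on a DStream combines the last few micro-batches and recomputes at a fixed slide. The sketch below imitates that behaviour on a list of batches; it is plain Python, and `reduce_by_window` here is a hypothetical helper, not the Spark Streaming method of a similar name.

```python
from collections import deque
from functools import reduce
from typing import Callable, Iterable, Iterator, List


def reduce_by_window(batches: Iterable[List], fn: Callable,
                     window_len: int, slide: int) -> Iterator:
    """Fold fn over the last `window_len` batches, once every `slide` batches."""
    window: deque = deque(maxlen=window_len)  # old batches fall off the left
    for i, batch in enumerate(batches, start=1):
        window.append(batch)
        if i % slide == 0:
            items = [x for b in window for x in b]
            if items:
                yield reduce(fn, items)
```

Counting over a window is the same pattern with each element mapped to 1 before the fold.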
Input DStreams: Sources of Data
Basic Sources: HDFS, S3, …
Advanced: Kafka, TwitterUtils
Custom: Implement interface
Reliability: ack vs noAck sources
Basic Premise: One Size Fits All
Exactly Once: Confident about results
Ecosystem: Hadoop, YARN, Kafka, …
Scalable: RDDs as scale unit
Single System: Batch + Streaming
Pulsar
Stream Processing: With SQL
Processing logic in SQL
Annotation plugin framework to extend SQL
Clustering with elastic scaling
No downtime during upgrades
Core Concept: CEP Cell
Channels: Key/Value API
Processor: SQL, Custom
Figure: CEP Cell = Inbound Channel -> Processor -> Outbound Channel
Example Pipeline: Stitching Cells
Messaging Models
Push (at most once): Used for low latency. Producer pushes data to consumer. Events are written to Kafka for later replay if the consumer is down or unable to keep up.
Pull (at least once): Producer writes events to Kafka. Consumer consumes from Kafka. Storing to Kafka allows for replay.
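The difference between the two models comes down to when progress is recorded. In the sketch below a small `Log` class stands in for a Kafka partition: the pull consumer commits its offset only after the handler succeeds, so a crash causes redelivery, while the push path tries once and spills to a log on failure. All names are hypothetical and this is not Pulsar's API.

```python
from typing import Callable, List


class Log:
    """Stand-in for a Kafka topic partition with one committed offset."""

    def __init__(self) -> None:
        self.events: List[object] = []
        self.committed = 0

    def append(self, event: object) -> None:
        self.events.append(event)


def pull_consume(log: Log, handler: Callable[[object], None]) -> int:
    """At least once: the offset advances only after the handler returns."""
    processed = 0
    while log.committed < len(log.events):
        handler(log.events[log.committed])  # may raise; offset stays put
        log.committed += 1
        processed += 1
    return processed


def push_deliver(event: object, consumer: Callable[[object], None],
                 overflow_log) -> bool:
    """Push once; if the consumer cannot take it, spill for later replay."""
    try:
        consumer(event)
        return True
    except Exception:
        overflow_log.append(event)
        return False
```

If the handler fails partway through an event, the next `pull_consume` call hands it the same event again, which is why at-least-once consumers need idempotent handlers.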
Deployment Architecture
Events are partitioned: All events with the same key are routed to the same cell
Scaling: More cells are added to the pipeline for scaling. Pulsar automatically detects new cells and rebalances traffic
SQL: Event filtering and routing
SQL: Top N items
Better Storm
Twitter Heron
Container Based Architecture
Separate Monitoring and Scheduling
Simplified Execution Model
Much Better Performance
Storm: Issues
Heron
Poor Performance: Queue contention, multiple languages
Lack of Backpressure: Unpredictable drops
Complex Execution Environment: Hard to tune
SPOF: Overloaded Nimbus
Heron
Design: Goals
Fully API compatible with Storm: Directed acyclic graph; Topologies, Spouts and Bolts
Task isolation: Ease of debuggability/isolation/profiling
Use of mainstream languages: C++, Java and Python
Support for back pressure: Topologies should be self-adjusting
Batching of tuples: Amortizing the cost of transferring tuples
Efficiency: Reduce resource consumption
Heron
Architecture: High Level
Figure: Topology Submission -> Scheduler -> Topology 1, Topology 2, ..., Topology N
Heron
Architecture: Topology
Figure: the Topology Master keeps the Logical Plan, Physical Plan and Execution State in the ZK Cluster and syncs the Physical Plan to each CONTAINER; each container runs a Stream Manager, a Metrics Manager and instances I1 to I4
Heron
Topology Master
Assigns role
Monitoring of containers
Gateway for metrics
Heron
Topology Master
Figure: Topology Master and ZK Cluster, which holds the Logical Plan, Physical Plan and Execution State
01 Prevent multiple TMs from becoming masters
02 Allows other processes to discover the TM
Heron
Stream Manager: BackPressure
Figure: a topology with spout S1 and bolts B2, B3 and B4, deployed across four containers, each with its own Stream Manager and a mix of S1, B2, B3 and B4 instances
Stream Manager: TCP BackPressure
Slows upstream and downstream instances
Stream Manager: Spout BackPressure
Figure: the same four containers, showing only the bolt instances B2, B3 and B4
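Spout backpressure can be pictured as a buffer with two thresholds: when it fills past a high mark the spouts are told to stop emitting, and they resume once it drains below a low mark. The simulation below is a single-process sketch of that control loop; the thresholds, rates and names are assumptions, not Heron's actual configuration or code.

```python
from collections import deque
from typing import Iterable, List, Tuple


class BoundedBuffer:
    """Queue between spout and bolt that raises a backpressure flag."""

    def __init__(self, high: int, low: int) -> None:
        assert 0 <= low < high
        self.high, self.low = high, low
        self.items: deque = deque()
        self.backpressure = False

    def offer(self, item: object) -> None:
        self.items.append(item)
        if len(self.items) >= self.high:
            self.backpressure = True   # ask spouts to stop emitting

    def poll(self) -> object:
        item = self.items.popleft()
        if self.backpressure and len(self.items) <= self.low:
            self.backpressure = False  # drained enough: spouts may resume
        return item


def run(spout_items: Iterable, buffer: BoundedBuffer,
        spout_rate: int, bolt_rate: int) -> Tuple[List, int]:
    """Tick loop: spout emits unless clamped, bolt drains at its own rate."""
    assert bolt_rate >= 1
    pending = deque(spout_items)
    processed: List = []
    max_depth = 0
    while pending or buffer.items:
        for _ in range(spout_rate):
            if buffer.backpressure or not pending:
                break
            buffer.offer(pending.popleft())
        max_depth = max(max_depth, len(buffer.items))
        for _ in range(min(bolt_rate, len(buffer.items))):
            processed.append(buffer.poll())
    return processed, max_depth
```

Nothing is dropped and the buffer never grows past the high mark, in contrast to the unpredictable drops listed among Storm's issues.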
Heron
Instance: Worker Bee
Runs only one task (spout/bolt)
Exposes Storm and Heron API
Collects several metrics
Heron
Instance: Worker Bee
Figure: a Gateway Thread talks to the Stream Manager and Metrics Manager; a Task Execution Thread runs the user code; the two threads communicate through the data-in, data-out and metrics-out queues
Heron
Deployment
Figure labels: Topology 1 to Topology N, Aurora Services, ZK Cluster, Heron Tracker, Heron VIZ, Heron Web, Observability
Heron
Sample Topologies
Visualization
Heron
Performance: Settings
COMPONENTS          EXPT #1  EXPT #2  EXPT #3  EXPT #4
Spout               25       100      200      300
Bolt                25       100      200      300
# Heron containers  25       100      200      300
# Storm workers     25       100      200      300
Heron
Performance: At-least Once
Figure: Throughput (million tuples/min, axis 0 to 1400) and Latency (ms, axis 0 to 2500) versus Spout Parallelism (25, 100, 200, 500) for Storm and Heron
Throughput: 10-14x; Latency: 5-15x
Heron
Performance: CPU Usage
Figure: # cores used (axis 0 to 2500) versus Spout Parallelism (25, 100, 200, 500) for Storm and Heron
CPU usage: 2-3x
Heron
Performance: At-most Once
Figure: Throughput (million tuples/min, axis 0 to 5000) and CPU usage (# cores used, axis 0 to 2500) versus Spout Parallelism (25, 100, 200, 500) for Storm and Heron
Heron Performance
Performance: RTAC Topology
Client Event Spout -> (Shuffle Grouping) -> Distributor Bolt -> (Fields Grouping) -> User Count Bolt -> (Fields Grouping) -> Aggregator Bolt
Heron
Performance: RTAC At-least Once
Figure: Latency (ms, axis 0 to 70) and CPU usage (# cores used, axis 0 to 400) for Storm and Heron
Heron
Performance: RTAC At-most Once
Figure: CPU usage (# cores used, axis 0 to 250) for Storm and Heron
Issues
3rd Generation Systems
Bit early to tell
Still no standardization and too many systems
Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
Growing set
Commercial Platforms
Infosphere, Vibe, Apama, Event Processor
Data Torrent, Vitria OI, Blaze, StreamBase
Practical Deployments
Combining batch and real time
Lambda Architecture
Figure labels: New Data, Client
Lambda Architecture - The Good
Figure labels: Message Broker, Collection Pipeline, Lambda Architecture, Analytics Pipeline, Results
Lambda Architecture - The Bad
Have to write everything twice!
Have to fix everything (maybe twice)!
Subtle differences in semantics
How much duct tape is required?
What about graphs, ML, SQL, etc.?
Summingbird
Figure: one Summingbird Program produces a Map Reduce Job (reading HDFS, writing a Batch key value result store) and a Storm/Heron Topology (reading a Message broker, writing an Online key value result store); the Client reads from both stores
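The point of writing the logic once is that the batch and online paths then produce partial results that combine cleanly, provided the aggregation is associative. The sketch below illustrates this with word counts; it is plain Python, the function names are hypothetical, and it is not the Summingbird API.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_words(events: Iterable[str]) -> Counter:
    """The single piece of logic shared by the batch and the online path."""
    counts: Counter = Counter()
    for text in events:
        counts.update(text.split())
    return counts


def client_read(key: str, batch_store: Mapping[str, int],
                online_store: Mapping[str, int]) -> int:
    """Merge at read time: addition is associative, so the parts just sum."""
    return batch_store.get(key, 0) + online_store.get(key, 0)
```

Here `batch_store` would hold counts up to the last completed batch run and `online_store` the counts since then, so the sum equals a count over all events.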
Near real-time processing
SQL-on-Hadoop
Figure labels: Commercial (three times), Apache; Cloudera, Hortonworks, Pivotal, MammothDB
Technology Challenges
The Road Ahead
Auto scaling the system in the presence of unpredictability
Auto tuning of real time analytics jobs/queries
Exploiting faster networks for efficiently moving data
Applications
The Road Ahead
Real-time personalization: Preferences, time, location and social
Wearable computing: Screen size fragmentation
Analytics (Image, Video, Touch): Pattern Recognition, Anomaly Detection
WHAT WHY WHERE WHEN WHO HOW
Any Questions?
Get in Touch
@arun_kejariwal, @sanjeevrk, @karthikz
THANKS FOR ATTENDING!!!

Contenu connexe

Tendances

Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataDataWorks Summit/Hadoop Summit
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Geospatial Advancements in Elasticsearch
Geospatial Advancements in ElasticsearchGeospatial Advancements in Elasticsearch
Geospatial Advancements in ElasticsearchElasticsearch
 
Apache kafka performance(latency)_benchmark_v0.3
Apache kafka performance(latency)_benchmark_v0.3Apache kafka performance(latency)_benchmark_v0.3
Apache kafka performance(latency)_benchmark_v0.3SANG WON PARK
 
Real time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafkaReal time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafkaTimothy Spann
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101Data Con LA
 
The Top 5 Apache Kafka Use Cases and Architectures in 2022
The Top 5 Apache Kafka Use Cases and Architectures in 2022The Top 5 Apache Kafka Use Cases and Architectures in 2022
The Top 5 Apache Kafka Use Cases and Architectures in 2022Kai Wähner
 
Real Time Processing Using Twitter Heron by Karthik Ramasamy
Real Time Processing Using Twitter Heron by Karthik RamasamyReal Time Processing Using Twitter Heron by Karthik Ramasamy
Real Time Processing Using Twitter Heron by Karthik RamasamyData Con LA
 
Tutorial - Modern Real Time Streaming Architectures
Tutorial - Modern Real Time Streaming ArchitecturesTutorial - Modern Real Time Streaming Architectures
Tutorial - Modern Real Time Streaming ArchitecturesKarthik Ramasamy
 
Lessons learned processing 70 billion data points a day using the hybrid cloud
Lessons learned processing 70 billion data points a day using the hybrid cloudLessons learned processing 70 billion data points a day using the hybrid cloud
Lessons learned processing 70 billion data points a day using the hybrid cloudDataWorks Summit
 
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Amy W. Tang
 
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Flink Forward
 
Pinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberPinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberXiang Fu
 
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...Databricks
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayDataWorks Summit
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotXiang Fu
 
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaKai Wähner
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
State of the Trino Project
State of the Trino ProjectState of the Trino Project
State of the Trino ProjectMartin Traverso
 

Tendances (20)

Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Geospatial Advancements in Elasticsearch
Geospatial Advancements in ElasticsearchGeospatial Advancements in Elasticsearch
Geospatial Advancements in Elasticsearch
 
Apache kafka performance(latency)_benchmark_v0.3
Apache kafka performance(latency)_benchmark_v0.3Apache kafka performance(latency)_benchmark_v0.3
Apache kafka performance(latency)_benchmark_v0.3
 
Real time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafkaReal time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafka
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101
 
The Top 5 Apache Kafka Use Cases and Architectures in 2022
The Top 5 Apache Kafka Use Cases and Architectures in 2022The Top 5 Apache Kafka Use Cases and Architectures in 2022
The Top 5 Apache Kafka Use Cases and Architectures in 2022
 
Real Time Processing Using Twitter Heron by Karthik Ramasamy
Real Time Processing Using Twitter Heron by Karthik RamasamyReal Time Processing Using Twitter Heron by Karthik Ramasamy
Real Time Processing Using Twitter Heron by Karthik Ramasamy
 
Tutorial - Modern Real Time Streaming Architectures
Tutorial - Modern Real Time Streaming ArchitecturesTutorial - Modern Real Time Streaming Architectures
Tutorial - Modern Real Time Streaming Architectures
 
Lessons learned processing 70 billion data points a day using the hybrid cloud
Lessons learned processing 70 billion data points a day using the hybrid cloudLessons learned processing 70 billion data points a day using the hybrid cloud
Lessons learned processing 70 billion data points a day using the hybrid cloud
 
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
 
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
 
Pinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberPinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ Uber
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per day
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
 
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
State of the Trino Project
State of the Trino ProjectState of the Trino Project
State of the Trino Project
 

En vedette

eServices-Chp6: WOA
eServices-Chp6: WOAeServices-Chp6: WOA
eServices-Chp6: WOALilia Sfaxi
 
Getting Started with Real-time Analytics
Getting Started with Real-time AnalyticsGetting Started with Real-time Analytics
Getting Started with Real-time AnalyticsAmazon Web Services
 
eServices-Tp1: Web Services
eServices-Tp1: Web ServiceseServices-Tp1: Web Services
eServices-Tp1: Web ServicesLilia Sfaxi
 
eServices-Chp1: Introduction
eServices-Chp1: IntroductioneServices-Chp1: Introduction
eServices-Chp1: IntroductionLilia Sfaxi
 
eServices-Chp2: SOA
eServices-Chp2: SOAeServices-Chp2: SOA
eServices-Chp2: SOALilia Sfaxi
 
eServices-Chp3: Composition de Services
eServices-Chp3: Composition de ServiceseServices-Chp3: Composition de Services
eServices-Chp3: Composition de ServicesLilia Sfaxi
 
eServices-Chp5: Microservices et API Management
eServices-Chp5: Microservices et API ManagementeServices-Chp5: Microservices et API Management
eServices-Chp5: Microservices et API ManagementLilia Sfaxi
 
eServices-Chp4: ESB
eServices-Chp4: ESBeServices-Chp4: ESB
eServices-Chp4: ESBLilia Sfaxi
 
eServices-Tp4: esb++
eServices-Tp4: esb++eServices-Tp4: esb++
eServices-Tp4: esb++Lilia Sfaxi
 
eServices-Tp5: api management
eServices-Tp5: api managementeServices-Tp5: api management
eServices-Tp5: api managementLilia Sfaxi
 
Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time ApplicationsDataWorks Summit
 
eServices-Tp2: bpel
eServices-Tp2: bpeleServices-Tp2: bpel
eServices-Tp2: bpelLilia Sfaxi
 
eServices-Tp3: esb
eServices-Tp3: esbeServices-Tp3: esb
eServices-Tp3: esbLilia Sfaxi
 

Real Time Analytics: Algorithms and Systems

  • 1. Real-time Analytics: Algorithms and Systems. Arun Kejariwal*, Sanjeev Kulkarni+, Karthik Ramasamy☨ (*Machine Zone, +PeerNova, ☨Twitter). @arun_kejariwal, @sanjeevrk, @karthikz
  • 2. Outline: a look at our presentation agenda. Motivation: why bother? Emerging Applications: IoT, health care, machine data, connected vehicles.
  • 3. Algorithms I: Classification. Systems II: 3rd Generation. Systems I: 1st & 2nd Generation. Algorithms II: Deep Dive.
  • 6. Explosive Content Creation. Challenge: surfacing relevant content. Large variety of media: blogs, reviews, news articles, streaming content. > 500M Tweets every day; > 300 hrs of video uploaded every minute; > 1.8B photos uploaded online in 2014 [1]. [1] http://www.kpcb.com/blog/2014-internet-trends
  • 7. High Volume Content Consumption. WhatsApp: > 30B messages per day [1]. Pandora: 5.3B listener hours (Q2 2015) [3]. Skype: 4.76B calls per month. Google: > 1T searches per year [2]. E-mails: > 2.2M per second. Netflix: > 1B hours streamed per month. [1] https://www.facebook.com/jan.koum/posts/10152994719980011?pnref=story [2] http://searchengineland.com/google-1-trillion-searches-per-year-212940 [3] http://press.pandora.com/phoenix.zhtml?c=251764&p=irol-newsArticle&ID=2070623
  • 8. A New World: Mobile, Mobile, Mobile. Anywhere, anytime, any device: availability, performance, reliability. 5.4B mobile phone users [1]; 2.1B smartphone subscriptions in 2014 [1]; 69% Y/Y growth in data traffic; 55% mobile video traffic; 34% of global e-commerce [2]. [1] http://www.kpcb.com/blog/2015-internet-trends [2] http://www.criteo.com/media/1894/criteo-state-of-mobile-commerce-q1-2015-ppt.pdf
  • 9. Finance/Investing: market pulse. 1-minute bids and offers, March 8, 2011 [1]. Mobile trading on the rise [2]: NSE, 48% increase in turnover (Jan’14 -> Dec’14); BSE, 0.25% (Jan’14) -> 0.5% (Nov’14) of total volume. [1] Image borrowed from http://www.bloomberg.com/bw/articles/2013-06-06/how-the-robots-lost-high-frequency-tradings-rise-and-fall [2] http://articles.economictimes.indiatimes.com/2014-12-26/news/57420480_1_ravi-varanasi-mobile-platform-nse
  • 10. Entertainment: MMOs. Game of War: largest single-world concurrent mobile game in the world. “Real-time Many-to-Many is Tomorrow's Internet” (Francois Orsini). Global scale. Collaborative: make alliances. Real-time messaging: chat translation in multiple languages.
  • 11. Cybersecurity: on the rise. 2014: Michaels (Jan’14), PF Changs (June’14), New York (July’14), UPS (Aug’14), Home Depot (Sept’14), JP Morgan (Oct’14), Sony (Nov’14), Staples (Dec’14). 2015: OPM, Anthem, UCLA. 400 B [1]. [1] http://www.mcafee.com/us/resources/reports/rp-economic-impact-cybercrime2.pdf
  • 12. Hardware Innovations: supporting higher volume and speed. Massively parallel: Intel’s “Knights Landing” Xeon Phi, 72 cores [1]. High speed, low power. “… quickly identify fraud detection patterns in financial transactions; healthcare researchers could process and analyze larger data sets in real time, accelerating complex tasks such as genetic analysis and disease tracking.” [3] Intel and Micron’s 3D XPoint Technology: 1000x faster than NAND. [1] http://www.anandtech.com/show/9436/quick-note-intel-knights-landing-xeon-phi-omnipath-100-isc-2015 [2] Intel IDS’15 [3] http://newsroom.intel.com/community/intel_newsroom/blog/2015/07/28/intel-and-micron-produce-breakthrough-memory-technology
  • 13. Hardware Innovations: hardware support for apps. Image and touch processing support in Intel’s Skylake [1]. [1] Images borrowed from Julius Madelblat’s and Andy Vargas, Rajeev Nalawadi and Shane Abreu’s Technology Insight at IDF’15.
  • 15. Real-time Video Streams: real-time user experience, productivity. News. Drones (delivery/monitoring): $1.7B for 2015 [1]. Robotics [2] (industry): $40B by 2020 [3]. [1] http://www.kpcb.com/blog/2015-internet-trends [2] http://www.bostondynamics.com/robot_Atlas.html [3] http://www.marketsandmarkets.com/Market-Reports/Industrial-Robotics-Market-643.html
  • 16. Internet of Things: large market potential. $1.9T in value by 2020: manufacturing (15%), health care (15%), insurance (11%). 26B to 75B units [2, 3, 4, 5]. Improve operational efficiencies, customer experience, new business models. Beacons: retailers and bank branches; 60M-unit market by 2019 [6]. Smart buildings: reduce energy costs, cut maintenance costs, increase safety & security. [1] Background image taken from https://www.uspsoig.gov/sites/default/files/document-library-files/2015/rarc-wp-15-013.pdf [2] http://www.gartner.com/newsroom/id/2636073 [3] https://www.abiresearch.com/press/more-than-30-billion-devices-will-wirelessly-conne [4] http://newsroom.cisco.com/feature-content?type=webcontent&articleId=1208342 [5] http://www.businessinsider.com/75-billion-devices-will-be-connected-to-the-internet-by-2020-2013-10 [6] https://www.abiresearch.com/press/ibeaconble-beacon-shipments-to-break-60-million-by/
  • 17. 17 The Future Mobile Sensor Network Exponential growth [1] Biostamps [2] [1] http://opensignal.com/assets/pdf/reports/2015_08_fragmentation_report.pdf [2] http://www.ericsson.com/thinkingahead/networked_society/stories/#/film/mc10-biostamp
  • 18. 18 Continuous Monitoring Intelligent Health Care Tracking Movements: Measure effect of social influences Google Lens: Measure glucose level in tears Watch/Wristband, Smart Textiles: Skin temperature, Perspiration Ingestible Sensors: Medication compliance [1], Heart function [1] http://www.hhnmag.com/Magazine/2015/Apr/cover-medical-technology
  • 19. 19 Connected World Internet of Things: 30 B connected devices by 2020 Health Care: 153 Exabytes (2013) -> 2314 Exabytes (2020) Machine Data: 40% of digital universe by 2020 Connected Vehicles: Data transferred per vehicle per month, 4 MB -> 5 GB Digital Assistants (Predictive Analytics): $2B (2012) -> $6.5B (2019) [1], Siri/Cortana/Google Now Augmented/Virtual Reality: $150B by 2020 [2], Oculus/HoloLens/Magic Leap [1] http://www.siemens.com/innovation/en/home/pictures-of-the-future/digitalization-and-software/digital-assistants-trends.html [2] http://techcrunch.com/2015/04/06/augmented-and-virtual-reality-to-hit-150-billion-by-2020/#.7q0heh:oABw
  • 21. 21 What is Analytics? According to Wikipedia DISCOVERY: Ability to identify patterns in data COMMUNICATION: Provide insights in a meaningful way
  • 22. 22 Types of Analytics CUBE ANALYTICS: Business Intelligence PREDICTIVE ANALYTICS: Statistics and Machine learning
  • 23. 23 What is Real-Time Analytics? BATCH high throughput > 1 hour monthly active users relevance for ads adhoc queries NEAR REAL TIME low latency < 1 ms Financial Trading ad impressions count hash tag trends approximate > 1 sec Online Non-Transactional latency sensitive < 500 ms fanout Tweets search for Tweets deterministic workflows Online Transactional It’s contextual
  • 24. 24 What is Real-Time Analytics? It’s contextual [Figure: value of data to decision-making (Preventive/Predictive, Actionable, Reactive, Historical) plotted against time (Real-Time, Seconds, Minutes, Hours, Days, Months); time-critical decisions sit at the real-time end and traditional “batch” business intelligence at the far end; the curve is labeled “Information Half-Life in Decision-Making”] [1] Courtesy Michael Franklin, BIRTE, 2015.
  • 25. 25 Real Time Analytics STREAMING: Analyze data as it is being produced INTERACTIVE: Store data and provide results instantly when a query is posed
  • 27. 27 It’s different Key Characteristics APPROXIMATE HIGH VELOCITY ONE PASS LOW LATENCY DISTRIBUTED HIGH VOLUME
  • 28. 28 It’s different Key Characteristics FAULT TOLERANCE [1] AVAILABILITY SCALE OUT HIGH PERFORMANCE ROBUST INCOMPLETE DATA [1] Byzantine failures are described in the following journal paper: J. Driscoll, Kevin; Hall, Brendan; Sivencrona, Håkan; Zumsteg, Phil (2003). "Byzantine Fault Tolerance, from Theory to Reality" 2788. pp. 235–248.
  • 30. 30 Estimating Cardinality: Site audience analysis Estimating Quantiles: Network analysis Estimating Moments: Databases Frequent Elements: Trending hashtags
  • 31. 31 Counting Inversions Measure  sortedness  of  data Finding Subsequences Traffic  analysis Path Analysis Web  graph  analysis Clustering Medical  imaging
  • 32. 32 Data Prediction Financial  trading Anomaly Detection Sensor  networks
  • 33. 33 Sampling Obtain a representative sample from a data stream Maintain a dynamic sample: a data stream is a continuous process, and it is not known in advance how many points may elapse before an analyst needs to use a representative sample Reservoir sampling [1]: probabilistic insertions and deletions on arrival of new stream points; the probability of successive insertion of new points reduces with progression of the stream; an unbiased sample contains a larger and larger fraction of points from the distant history of the stream Practical perspective: the data stream may evolve, and hence the majority of the points in the sample may represent the stale history [1] J. S. Vitter. Random Sampling with a Reservoir. ACM Transactions on Mathematical Software, Vol. 11(1):37–57, March 1985.
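The reservoir idea on this slide can be sketched in a few lines of Python. This is a minimal illustration of the simplest variant (often called Algorithm R), not necessarily the exact variant the slide had in mind; the function name `reservoir_sample` and the `rng` parameter are assumptions made for the example.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for t, item in enumerate(stream, start=1):
        if t <= k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item t is inserted with probability k/t, so the insertion
            # probability shrinks as the stream progresses.
            j = rng.randrange(t)  # uniform in [0, t)
            if j < k:
                reservoir[j] = item  # evicts a uniformly chosen old item
    return reservoir

sample = reservoir_sample(range(1000), k=10, rng=random.Random(1))
```

Because every item ends up in the sample with equal probability, an evolving stream leaves the reservoir dominated by old points, which is the staleness concern raised above.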
  • 34. 34 Sampling Sliding window approach (sample size k, window width n) [1] Sequence-based: replace an expired element with the newly arrived element (disadvantage: highly periodic) Chain-sample approach: select the ith element with probability 1/min(i, n); select uniformly at random an index from [i+1, i+n] of the element which will replace the ith item when it expires; maintain k independent chain samples Timestamp-based: # elements in a moving window may vary over time Priority-sample approach [1] B. Babcock. Sampling From a Moving Window Over Streaming Data. In Proceedings of SODA, 2002.
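A rough sketch of the chain-sample approach for a single sample (k = 1) follows. It assumes the selection probability 1/min(i, n) from the cited Babcock et al. paper; the class name `ChainSample` and its fields are hypothetical, and a sample of size k would run k independent instances.

```python
import random

class ChainSample:
    """One chain sample over a sequence-based sliding window of width n."""

    def __init__(self, n, rng=None):
        self.n = n
        self.rng = rng or random.Random()
        self.i = 0                # index of the most recent element (1-based)
        self.chain = []           # (index, value) pairs; the head is the sample
        self.next_index = None    # index of the element that will extend the chain

    def add(self, value):
        self.i += 1
        i, n = self.i, self.n
        if self.rng.random() < 1.0 / min(i, n):
            # The new element becomes the sample and the old chain is dropped.
            self.chain = [(i, value)]
            self.next_index = self.rng.randint(i + 1, i + n)
        elif i == self.next_index:
            # This element was pre-selected as the replacement of the chain tail.
            self.chain.append((i, value))
            self.next_index = self.rng.randint(i + 1, i + n)
        # Expire the head once it has fallen out of the window.
        if self.chain[0][0] <= i - n:
            self.chain.pop(0)

    def sample(self):
        return self.chain[0][1]

cs = ChainSample(n=10, rng=random.Random(7))
for x in range(1, 101):
    cs.add(x)
current = cs.sample()  # always one of the 10 most recent elements
```

The replacement index is drawn from [i+1, i+n], so a successor has always arrived by the time the head expires and the chain never runs empty.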
  • 35. 35 Sampling Biased Reservoir Sampling [1] Use a temporal bias function: recent points have a higher probability of being represented in the sample reservoir Memory-less bias functions: the future probability of retaining a current point in the reservoir is independent of its past history or arrival time Probability of the rth point belonging to the reservoir at time t is proportional to the bias function $f(r, t)$ Exponential bias function for the rth data point at time t: $f(r, t) = e^{-\lambda (t - r)}$, where r ≤ t and λ ∈ [0, 1] is the bias rate Maximum reservoir requirement R(t) is bounded [1] C. C. Aggarwal. On Biased Reservoir Sampling in the Presence of Stream Evolution. In Proceedings of VLDB, 2006.
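One way to realize an exponential bias is the simple replacement policy described in the cited Aggarwal paper: with a reservoir of capacity ⌊1/λ⌋, every arriving point is inserted, and with probability equal to the current fill fraction it overwrites a random old point. The sketch below assumes that policy; the class name `BiasedReservoir` is made up for the example.

```python
import random

class BiasedReservoir:
    """Reservoir whose contents are biased toward recent stream points."""

    def __init__(self, lam, rng=None):
        if not 0 < lam <= 1:
            raise ValueError("bias rate must be in (0, 1]")
        self.capacity = int(1 / lam)   # reservoir size bounded by 1/lambda
        self.points = []
        self.rng = rng or random.Random()

    def add(self, point):
        fill = len(self.points) / self.capacity
        if self.rng.random() < fill:
            # Success: the new point replaces a randomly chosen old point.
            victim = self.rng.randrange(len(self.points))
            self.points[victim] = point
        else:
            # Failure: nothing is deleted and the reservoir grows by one.
            self.points.append(point)

res = BiasedReservoir(lam=0.05, rng=random.Random(3))
for t in range(10_000):
    res.add(t)
```

Every new point is always admitted, so the newest point is always present, and older points are progressively overwritten.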
  • 36. 36 Sampling General problem Input: tuples of n components; a subset are key components, the basis for sampling To obtain a sample of size a/b: hash the key to b buckets and accept a tuple if its hash value < a Under a space constraint: a <- a - 1, then remove tuples whose keys hash to a
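The key-based scheme above can be sketched as follows. The helper `bucket_of`, the class `KeySampler`, and the `max_tuples` budget are hypothetical names chosen for the illustration, and MD5 is used only as a convenient stable hash.

```python
import hashlib

def bucket_of(key, b):
    """Map a key to one of b buckets with a hash that is stable across runs."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % b

class KeySampler:
    """Keep all tuples whose key hashes below a; shrink a when space runs out."""

    def __init__(self, a, b, max_tuples, key_index=0):
        self.a, self.b = a, b
        self.max_tuples = max_tuples
        self.key_index = key_index
        self.sample = []

    def _accepts(self, tup):
        return bucket_of(tup[self.key_index], self.b) < self.a

    def offer(self, tup):
        if self._accepts(tup):
            self.sample.append(tup)
        # Space constraint: lower the threshold and evict the dropped bucket.
        while len(self.sample) > self.max_tuples and self.a > 0:
            self.a -= 1
            self.sample = [t for t in self.sample if self._accepts(t)]

sampler = KeySampler(a=5, b=10, max_tuples=60)
for query_no in range(5):
    for user in range(50):
        sampler.offer((f"user{user}", query_no))
```

Sampling on the key rather than on individual tuples means a given key is either fully represented or entirely absent, which keeps per-key statistics meaningful.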
  • 37. 37 Set Membership Filtering Determine, with some false positive probability, if an item in a data stream has been seen before Databases (e.g., speed up semi-join operations), Caches, Routers, Storage Systems Reduce space requirement in probabilistic routing tables Speed up longest-prefix matching of IP addresses Encode multicast forwarding information in packets Summarize content to aid collaborations in overlay and peer-to-peer networks Improve network state management and monitoring
  • 38. 38 Set Membership Filtering Application to hyphenation programs Early UNIX spell checkers [1] [1] Illustration borrowed from http://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf
  • 39. 39 Set Membership Filtering Natural generalization of hashing False positives are possible No false negatives No deletions allowed False positive probability: $\varepsilon \approx (1 - e^{-kn/m})^k$, where n = # elements, k = # hash functions, m = # bits in the array For false positive rate ε, # hash functions = $\log_2(1/\varepsilon)$
  • 40. 40 Set Membership Filtering Minimizing false positive rate ε w.r.t. k [1]: $k = \ln 2 \cdot (m/n)$, $\varepsilon = (1/2)^k \approx (0.6185)^{m/n}$ $1.44 \cdot \log_2(1/\varepsilon)$ bits per item, independent of item size or # items Information-theoretic minimum: $\log_2(1/\varepsilon)$ bits per item, i.e. 44% overhead where X = # 0 bits [1] A. Broder and M. Mitzenmacher. Network Applications of Bloom Filters: A Survey. In Internet Mathematics Vol. 1, No. 4, 2005.
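The sizing rules on these two slides translate directly into a small Bloom filter. The sketch below picks m and k from a target n and ε using the formulas above; deriving the k positions from two halves of one SHA-256 digest (double hashing) is an implementation shortcut assumed for the example, not something the slides prescribe.

```python
import hashlib
import math

class BloomFilter:
    """Bloom filter sized for n expected items and false positive rate eps."""

    def __init__(self, n, eps):
        # m = n * log2(1/eps) / ln 2, i.e. about 1.44 * log2(1/eps) bits per item
        self.m = max(1, math.ceil(n * math.log2(1 / eps) / math.log(2)))
        # k = ln 2 * (m / n), rounded to the nearest integer
        self.k = max(1, round(math.log(2) * self.m / n))
        self.bits = bytearray(self.m)  # one byte per bit, for readability

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step, never zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(n=1000, eps=0.01)
for i in range(1000):
    bf.add(f"url-{i}")
seen = "url-42" in bf
```

For n = 1000 and ε = 1%, this sizing gives roughly 9.6 bits per item and 7 hash functions.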
  • 41. 41 Set Membership Filtering Derivatives Counting Bloom filters: support deletion; bit -> small counter (typically, 4 bits per counter suffice); Increment, Decrement Blocked Bloom filters d-left Counting Bloom filters Quotient filters Rank-Indexed Hashing
  • 42. 42 Set Membership Filtering Cuckoo Filter [1] Key Highlights: add and remove items dynamically; for false positive rate ε < 3%, more space efficient than Bloom filter; higher performance than Bloom filter for many real workloads; asymptotically worse performance than Bloom filter; min fingerprint size ∝ log (# entries in table) Overview: stores only a fingerprint of an item inserted; original key and value bits of each item not retrievable; set membership query for item x: search hash table for fingerprint of x [1] Fan et al., Cuckoo Filter: Practically Better Than Bloom. In Proceedings of the 10th ACM International on Conference on Emerging Networking Experiments and Technologies, 2014.
  • 43. 43 Set Membership Filtering Cuckoo Hashing [1]: high space occupancy; practical implementations: multiple items/bucket; example uses: software-based Ethernet switches Cuckoo Filter: uses a multi-way associative Cuckoo hash table; employs partial-key cuckoo hashing; relocate existing fingerprints to their alternative locations [Figure: Illustration of Cuckoo hashing [2]] [1] R. Pagh and F. Rodler. Cuckoo hashing. Journal of Algorithms, 51(2):122-144, 2004. [2] Illustration borrowed from “Fan et al., Cuckoo Filter: Practically Better Than Bloom. In Proceedings of the 10th ACM International on Conference on Emerging Networking Experiments and Technologies, 2014.”
  • 44. 44 Set Membership Filtering Cuckoo Filter Partial-key cuckoo hashing: fingerprint hashing ensures uniform distribution of items in the table; length of fingerprint << size of h1 or h2; possible to have multiple entries of a fingerprint in a bucket Deletion: item must have been previously inserted Comparison
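A compact sketch of partial-key cuckoo hashing is given below. It follows the scheme in the cited Fan et al. paper, where the alternate bucket is the current bucket XOR a hash of the fingerprint; the 8-bit fingerprint, 4-slot buckets, SHA-256 helper and the class name `CuckooFilter` are all choices assumed for this illustration.

```python
import hashlib
import random

def _h(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

class CuckooFilter:
    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500, seed=0):
        # A power-of-two table size keeps the XOR trick inside the table.
        assert num_buckets & (num_buckets - 1) == 0
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    def _fingerprint(self, item):
        return _h(b"fp:" + item.encode()) % 255 + 1  # 8-bit, never zero

    def _index(self, item):
        return _h(item.encode()) % self.num_buckets

    def _alt_index(self, index, fp):
        # Depends only on (index, fingerprint), so it works without the key.
        return (index ^ _h(fp.to_bytes(1, "big"))) % self.num_buckets

    def insert(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets full: evict a resident fingerprint and relocate it.
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = self.rng.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # table considered full; the last evicted fingerprint is dropped

    def contains(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]

    def delete(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        for i in (i1, self._alt_index(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False

cf = CuckooFilter()
cf.insert("10.0.0.1")
present = cf.contains("10.0.0.1")
```

Deleting an item that was never inserted can remove another item's matching fingerprint, which is why the slide requires that deleted items were previously inserted.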
  • 45. 45 Estimating Cardinality # Distinct Elements Large set of real-world applications Database systems/Search engines: # distinct queries Network monitoring applications Natural language processing # distinct motifs in a DNA sequence # distinct elements of RFID/sensor networks
  • 46. 46 Estimating Cardinality # Distinct Elements Historical context Probabilistic counting [Flajolet and Martin, 1983] LogLog counting [Durand and Flajolet, 2003] HyperLogLog [Flajolet et al., 2007] Sliding HyperLogLog [Chabchoub and Hebrail, 2010] HyperLogLog in Practice [Heule et al., 2013] Self-Learning Bitmap [Chen and Cao, 2009] Discrete Max-Count [Ting, 2014]: the sequence of sketches forms a Markov chain when h is a strong universal hash; estimate cardinality using a martingale N ≤ 10^9
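  • 47. 47 Estimating Cardinality # Distinct Elements HyperLogLog Apply hash function h to every element in a multiset Cardinality of the multiset is estimated as $2^{\max(\varrho)}$, where $0^{\varrho-1}1$ is the bit pattern observed at the beginning of a hash value The above suffers from high variance Employ stochastic averaging: partition the input stream into m sub-streams $S_i$ using the first p bits of hash values ($m = 2^p$) Estimate: $E = \alpha_m m^2 \left( \sum_{j=1}^{m} 2^{-M[j]} \right)^{-1}$, where $M[j]$ is the maximum ϱ observed in sub-stream $S_j$ and $\alpha_m$ is a bias-correction constant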
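The steps above can be sketched as a basic HyperLogLog. The code assumes a 64-bit hash taken from SHA-256, the commonly quoted approximation α_m ≈ 0.7213 / (1 + 1.079/m) for m ≥ 128, and a fallback to linear counting for small cardinalities; it leaves out the empirical bias correction and sparse representation discussed on the next slide.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                  # number of sub-streams / registers
        self.registers = [0] * self.m    # M[j] = max rho seen in sub-stream j

    def add(self, item):
        x = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
        idx = x >> (64 - self.p)                  # first p bits pick the register
        w = x & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # rho = position of the leftmost 1-bit in w (pattern 0^(rho-1) 1)
        rho = (64 - self.p) - w.bit_length() + 1
        if rho > self.registers[idx]:
            self.registers[idx] = rho

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)          # approximation for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)        # linear counting, small range
        return raw

hll = HyperLogLog(p=12)
for i in range(50_000):
    hll.add(f"user-{i}")
approx = hll.estimate()  # expected standard error is about 1.04 / sqrt(m)
```

With p = 12 the sketch holds 4096 registers, and re-adding items that were already seen leaves the estimate unchanged.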
  • 48. 48 Estimating Cardinality # Distinct Elements HyperLogLog in Practice: Optimizations Use of a 64-bit hash function: total memory requirement $5 \cdot 2^p$ -> $6 \cdot 2^p$ bits, where p is the precision Empirical bias correction: uses empirically determined data for cardinalities smaller than 5m and uses the unmodified raw estimate otherwise Sparse representation: for n ≪ m, store an integer obtained by concatenating the bit patterns for idx and ϱ(w); use variable length encoding for integers that uses a variable number of bytes to represent integers; use difference encoding - store the difference between successive elements Other optimizations [1, 2] [1] http://druid.io/blog/2014/02/18/hyperloglog-optimizations-for-real-world-systems.html [2] http://antirez.com/news/75
  • 49. 49 Estimating Cardinality # Distinct Elements Self-Learning Bitmap (S-bitmap) [1] Achieve constant relative estimation errors for unknown cardinalities in a wide range, say from 10s to >10^6 Bitmap obtained via an adaptive sampling process: bits corresponding to the sampled items are set to 1; sampling rates are learned from # distinct items already passed and reduced sequentially as more bits are set to 1 For given input parameters $N_{max}$ and estimation precision ε, size of bit mask For $r = 1 - 2\varepsilon^2 (1+\varepsilon^2)^{-1}$ and sampling probability $p_k = m (m+1-k)^{-1} (1+\varepsilon^2) r^k$, where k ∈ [1, m], relative error ≡ ε [1] Chen et al. “Distinct counting with a self-learning bitmap”. Journal of the American Statistical Association, 106(495):879–890, 2011.
  • 50. 50 Estimating Quantiles Quantiles, Histograms, Icebergs Large set of real-world applications: Database applications, Sensor networks, Operations Properties: provide tunable and explicit guarantees on the precision of approximation; single pass Early work: [Greenwald and Khanna, 2001] - worst case space requirement; [Arasu and Manku, 2004] - sliding window based model, worst case space requirement
  • 51. 51 Estimating Quantiles Quantiles, Histograms, Icebergs q-digest [1] Groups values in variable size buckets of almost equal weights Unlike a traditional histogram, buckets can overlap Key features: detailed information about frequent values preserved; less frequent values lumped into larger buckets Using a message of size m, answer within an error of Except root and leaf nodes, a node v ∈ q-digest iff [Figure annotations: Max signal value, # Elements, Compression Factor, Complete binary tree] [1] Shrivastava et al., Medians and Beyond: New Aggregation Techniques for Sensor Networks. In Proceedings of SenSys, 2004.
  • 52. 52 Estimating Quantiles Quantiles, Histograms, Icebergs q-digest Building a q-digest q-digests can be constructed in a distributed fashion Merge q-digests
  • 53. 53 Frequent Elements A core streaming problem Applications Track bandwidth hogs Determine popular tourist destinations Itemset mining Entropy estimation Compressed sensing Search log mining Network data analysis DBMS optimization
  • 54. 54 Frequent Elements A core streaming problem Count-min Sketch [1] A two-dimensional array of counts with w columns and d rows; each entry of the array is initially zero d hash functions $h_1, \ldots, h_d : \{1 \ldots n\} \to \{1 \ldots w\}$ are chosen uniformly at random from a pairwise independent family Update: for a new element i, for each row j and $k = h_j(i)$, increment the kth column of row j by one Point query: $\hat{a}_i = \min_j \mathrm{sketch}[j, h_j(i)]$, where sketch is the table Parameters $(\varepsilon, \delta)$: $w = \lceil e/\varepsilon \rceil$, $d = \lceil \ln(1/\delta) \rceil$ [1] Cormode, Graham; S. Muthukrishnan (2005). "An Improved Data Stream Summary: The Count-Min Sketch and its Applications". J. Algorithms 55: 29–38.
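A minimal Count-min sketch along the lines of this slide is shown below. It assumes items are non-negative integers (for example pre-hashed IDs) and uses the ((a·x + b) mod p) mod w family with a Mersenne prime p as one common pairwise-independent construction; the class and method names are illustrative.

```python
import math
import random

class CountMinSketch:
    P = (1 << 61) - 1  # prime modulus for the hash family

    def __init__(self, eps, delta, seed=0):
        self.w = math.ceil(math.e / eps)          # columns: w = ceil(e / eps)
        self.d = math.ceil(math.log(1 / delta))   # rows:    d = ceil(ln(1 / delta))
        rng = random.Random(seed)
        self.hashes = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(self.d)]
        self.sketch = [[0] * self.w for _ in range(self.d)]

    def _col(self, j, item):
        a, b = self.hashes[j]
        return ((a * item + b) % self.P) % self.w

    def update(self, item, c=1):
        # Increment one counter per row.
        for j in range(self.d):
            self.sketch[j][self._col(j, item)] += c

    def query(self, item):
        # Collisions only add to counters, so the minimum is the tightest bound.
        return min(self.sketch[j][self._col(j, item)] for j in range(self.d))

cms = CountMinSketch(eps=0.01, delta=0.01)   # 272 columns, 5 rows
for x in [7, 7, 7, 3, 9, 7]:
    cms.update(x)
freq = cms.query(7)  # never below the true count of 4
```

The estimate never undercounts; with probability at least 1 − δ it overcounts by at most ε times the total stream length.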
  • 55. 55 Frequent Elements A core streaming problem Variants of Count-min Sketch [1] Count-Min sketch with conservative update (CU sketch): update an item with frequency c; avoid unnecessary updating of counter values => reduce over-estimation error; prone to over-estimation error on low-frequency items Lossy Conservative Update (LCU) - SWS: divide the stream into windows; at window boundaries, ∀ 1 ≤ i ≤ w, 1 ≤ j ≤ d, decrement sketch[i,j] if 0 < sketch[i,j] ≤ … [1] Cormode, G. 2009. Encyclopedia entry on ’Count-MinSketch’. In Encyclopedia of Database Systems. Springer., 511–516.
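Conservative update changes only the update rule: a counter is raised no further than the new lower bound for the item requires. The sketch below is one reading of that rule, again assuming integer items and an ((a·x + b) mod p) mod w hash family; the lossy (LCU) windowed decrement is not implemented because its threshold is not given here.

```python
import random

class ConservativeUpdateSketch:
    P = (1 << 61) - 1  # prime modulus for the hash family

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.hashes = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(d)]
        self.sketch = [[0] * w for _ in range(d)]

    def _cols(self, item):
        return [((a * item + b) % self.P) % self.w for a, b in self.hashes]

    def query(self, item):
        return min(self.sketch[j][k] for j, k in enumerate(self._cols(item)))

    def update(self, item, c=1):
        cols = self._cols(item)
        # The item's count can be at most (current estimate + c) ...
        target = min(self.sketch[j][k] for j, k in enumerate(cols)) + c
        # ... so only counters below that value need to be raised.
        for j, k in enumerate(cols):
            if self.sketch[j][k] < target:
                self.sketch[j][k] = target

cu = ConservativeUpdateSketch(w=8, d=3)
for x in [1, 2, 1, 3, 1]:
    cu.update(x)
estimate = cu.query(1)  # still never below the true count of 3
```

Counters that already exceed the target are left untouched, which is where the reduction in over-estimation comes from.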
  • 56. 56 Anomaly Detection Researched over > 50 yrs Large set of real-world applications Social media: Trending analysis Fraud detection: Insurance, E-commerce, Marketing Network intrusion detection Health care Sensor networks: Anomalous state detection (e.g., wind turbines) Operations: Metric space: System, Application, Data Center; potentially impact performance, availability, reliability
  • 57. 57 Anomaly Detection Researched over > 50 yrs Anomaly is contextual Manufacturing Statistics Econometrics, Financial engineering Signal processing Control systems, Autonomous systems - fault detection [1] Networking Computational biology (e.g., microarray analysis) Computer vision [1] A. S. Willsky, “A survey of design methods for failure detection systems,” Automatica, vol. 12, pp. 601–611, 1976.
  • 58. 58 Anomaly Detection Researched over > 50 yrs Characterization: Magnitude, Width, Frequency, Direction Flavors: Global, Local
  • 59. 59 Anomaly Detection Researched over > 50 yrs Traditional Approaches Rule based: μ ± σ Manufacturing, Statistical Process Control [1] Moving averages: SMA, EWMA, PEWMA Assumption: Normal distribution, which mostly does not hold in real life [1] W. A. Shewhart. Economic Quality Control of Manufactured Product, The Bell Labs Technical Journal, 9(2):364-389, 1930.
  • 60. 60 Anomaly Detection Researched over > 50 yrs In Practice Robustness: μ and σ are not robust in presence of anomalies; use median and MAD (Median Absolute Deviation) Seasonality Trend Multi-modal distribution Time series decomposition AnomalyDetection R package [1] [1] https://github.com/twitter/AnomalyDetection
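The robustness point can be illustrated with a small median/MAD detector. This is a sketch of the general robust-statistics idea only, not the algorithm used by the AnomalyDetection R package; the function name, the threshold of 3, and the 1.4826 scaling constant (which makes MAD comparable to σ under normality) are assumptions of the example.

```python
import statistics

def mad_anomalies(values, threshold=3.0):
    """Return indices of points far from the median, measured in scaled MADs."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    scale = 1.4826 * mad
    return [i for i, v in enumerate(values) if abs(v - med) / scale > threshold]

latencies = [10, 11, 9, 10, 12, 10, 11, 100, 10, 9]
outliers = mad_anomalies(latencies)  # -> [7]
```

Here the median is 10 and the MAD is 1, so the spike at index 7 stands out clearly, whereas the same spike inflates both the mean and the standard deviation that a μ ± σ rule relies on.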
  • 61. 61 Anomaly Detection Researched over > 50 yrs Marrying Time Series Decomposition and Robust Statistics Trend Smoothing Distortion Creates “Phantom” Anomalies Median is Free from Distortion
  • 62. 62 Anomaly Detection Researched over > 50 yrs Real-Time Challenges Adaptive learning Automated modeling Marrying theory with contextual relevance Operations: large set of different services in a technology stack; different stacks use different services Promising products such as Opsclarity
  • 63. 63 Anomaly Detection Researched over > 50 yrs Anomalies in operational data: Challenges Contextual Application Topology Map Hierarchical: Datacenter -> Applications -> Services -> Hosts Evolving Needs of Modern Operations • Automatically discover Developer / Architect’s view of the application - for the Operations team - Framework for system config and context • Real-time, streaming architecture - Keeps up with today’s elastic infrastructure • Scale to 1000s of hosts, 100s of (micro) services • Present evolution of system state over time - DVR-like replay of health, system changes, failures
  • 64. 64 Anomaly Detection Researched over > 50 yrs Anomalies in operational data: Challenges Automatically learn base-lines for metrics Data variety requires advanced statistical approaches Detect issues earlier, proactive alerting Example: Detecting Disk Full Issues Early
  • 66. 66 The Key Aspects Requirements of Stream Processing In-stream: Process data as it passes by Handle imperfections: Delayed, missing and out-of-order data Predictable and Repeatable Performance and Scalability 8 Requirements of Stream Processing, Mike Stonebraker et al., SIGMOD Record 2005
  • 67. 67 The Key Aspects Requirements of Stream Processing High level languages: SQL or DSL Integrate stored and streaming data: for comparing present with the past Data safety and availability and Repeatable Process and respond: Application should keep up at high volumes 8 Requirements of Stream Processing, Mike Stonebraker et al., SIGMOD Record 2005
  • 68. 68 Window Processing Stream Processing T. Akidau et al., The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing, In VLDB, 2015.
  • 69. 69 Three Generations First Generation: Extensions to existing database engines or simplistic engines; dedicated to specific applications or use cases Second Generation: Enhanced methods regarding language expressiveness; distributed processing, load balancing and fault tolerance Third Generation: Massive parallelization for processing large data sets; dedicated towards cloud computing http://www.slideshare.net/zbigniew.jerzak/cloudbased-data-stream-processing
  • 70. 1st generation - Active Database Systems SYSTEMS
  • 71. 71 Late 1980s Late 1990s 1st Generation Systems HiPAC [Dayal et al., 1988] Starburst [Widom/Finkelstein et al., 1990]
  • 72. 72 Postgres [Stonebraker/Kemnitz  et  al.,  1991] ODE [Gehani/Jagadish  et  al.,  1991]
  • 73. 73 Notable features 1st Generation Systems Early: Active DBs, ECA (Event-Condition-Action) rules, triggers, publish-subscribe [Figure: rule processing pipeline: Event Source -> Signaling -> Event Occurrences -> Triggering -> Triggered Rules -> Evaluation -> Evaluated Rules -> Scheduling -> Selected Rules -> Execution] Systems - HiPAC, Starburst, Postgres, ODE Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics “Active Database Systems”, Paton and Diaz, ACM Computing Surveys, 1999
  • 74. 74 Notable features 1st Generation Applications Actuation (also IoT?) Finance Enforcing database integrity constraints Monitoring the physical world (IoT?) Supply chain News and update dissemination Battlefield awareness Health monitoring Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
  • 75. 75 Issues 1st Generation Systems Rules were (are) hard to program or understand Smart engineering of traditional approaches can get you close enough?! Little commercial activity Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
  • 76. 2nd generation - Streaming Database Systems SYSTEMS
  • 77. 77 Early 2000s Late 2000s 2nd Generation Systems Niagara CQ [Jianjun Chen et al., 2000] Telegraph, Telegraph CQ [Hellerstein et al., 2000] [Chandrasekaran et al., 2003]
  • 78. 78 STREAM [Arasu et al., 2003] Aurora [Abadi et al., 2003] Borealis [Abadi et al., 2005]
  • 79. 79 Cayuga [Demers et al., 2007] MCOPE [Park et al., 2009]
  • 80. 80 The basic idea Stream Query Processing Support full SQL language and ecosystem A table is a set of records and a stream is an unbounded sequence of records Window operators convert streams to tables: each window outputs a set of records Repeatedly apply generic SQL to the results of window operators [Figure labels: Streams, Window Operators, Tables] Rstream semantics in CQL, Arvind Arasu et al. VLDB Journal 2006 Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
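The stream-to-table idea can be mimicked in plain Python: a window operator cuts the stream into finite sets of records, and an ordinary relational-style query runs on each set. The sketch below uses fixed-width (tumbling) windows over a finite list for simplicity; the function names and the `symbol` field are assumptions of the example, and a real engine would emit each window incrementally rather than buffer the whole input.

```python
from collections import Counter, defaultdict

def tumbling_windows(stream, width):
    """Window operator: turn (timestamp, record) pairs into per-window tables."""
    windows = defaultdict(list)
    for ts, record in stream:
        windows[ts // width].append(record)
    return [(k * width, windows[k]) for k in sorted(windows)]

def count_by_symbol(table):
    """Stand-in for: SELECT symbol, COUNT(*) FROM window GROUP BY symbol."""
    return dict(Counter(rec["symbol"] for rec in table))

quotes = [
    (0, {"symbol": "INTC"}),
    (3, {"symbol": "MSFT"}),
    (4, {"symbol": "INTC"}),
    (11, {"symbol": "MSFT"}),
]
results = [(start, count_by_symbol(table))
           for start, table in tumbling_windows(quotes, width=10)]
# results == [(0, {"INTC": 2, "MSFT": 1}), (10, {"MSFT": 1})]
```

The query itself knows nothing about streams; all stream-specific behavior lives in the window operator.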
  • 81. 81 Telegraph CQ Developed at University of California, Berkeley 01 Data stream query processor 02 Continuous and adaptive query processing 03 Built by modifying PostgreSQL Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
  • 82. 82 Niagara CQ Developed at UW-Madison A distributed database system for continuous queries using a query language like XML-QL for changing data sets Incremental group optimization strategy Incremental evaluation of continuous queries 01 Query Grouping: Allows for sharing common parts of two or more queries 02 Caching: For performance 03 Push/Pull data ingestion: for detected changes in data 04 Change based and Timer CQ: Continuous queries triggered on data changes and on a regular timed basis
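  • 83. 83 Niagara CQ Query grouping and sharing [Figure: two separate plans over quotes.xml (Select Symbol = INTC -> Trigger Action 1; Select Symbol = MSFT -> Trigger Action 2) are grouped into one shared plan in which quotes.xml and a constant table (INTC/MSFT) feed a single Select, followed by a Split into Trigger Action 1 and Trigger Action 2]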
  • 84. 84 Borealis Developed at MIT, Brown and Brandeis A low latency stream processing engine with a focus on fault tolerance and distribution Load aware distribution Fine grained high availability Load shedding mechanisms 01 Distributed stream engine: Allows for sharing common parts of two or more queries 02 Dynamic query modification: For performance 03 Dynamic system optimization: for detected changes in data 04 Dynamic revision of results: Continuous queries triggered on data changes and on a regular timed basis
  • 85. 85 Summary 2nd Generation Systems Can reuse many of the relational operators Historical comparison becomes a join of a stream and its history table Views on streams can be created Streams can be processed using relational operators Can leverage an RDBMS system Stream and stream results can be stored in tables for later querying Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
  • 86. 86 Issues 2nd Generation Systems Despite significant commercial activity, no real breakout No standardization and comprehensive benchmarks Value proposition for learning new concepts was not clear Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
  • 88. 88 The last decade Streaming Platforms S4 (Yahoo!) Flink (Apache) Storm (Twitter) Spark (Databricks) Samza (LinkedIn) Heron (Twitter) MillWheel (Google) Pulsar (eBay) S-Store (ISTC, Intel, MIT, Brown, CMU, Portland State) Trill (Microsoft)
  • 89. 89 Earliest distributed stream system Apache S4 Scalable: Throughput is linear as additional nodes are added Cluster management: Hides management using a layer in ZooKeeper Decentralized: All nodes are symmetric and no centralized service Extensible: Building blocks of platform can be replaced by custom implementations Fault tolerance: Standby servers take over when a node fails Proven: Deployed in Yahoo processing thousands of search queries per second
  • 91. 91 Storm Terminology Topology: Directed acyclic graph; vertices = computation, and edges = streams of data tuples Spouts: Sources of data tuples for the topology; Examples - Kafka/Kestrel/MySQL/Postgres Bolts: Process incoming tuples, and emit outgoing tuples; Examples - filtering/aggregation/join/any function
  • 92. 92 Storm Topology Spout 1 Spout 2 Bolt 1 Bolt 2 Bolt 3 Bolt 4 Bolt 5
  • 93. 93 Tweet Word Count Topology Tweet Spout Parse Tweet Bolt Word Count Bolt Live stream of Tweets #worldcup : 1M soccer: 400K ….
  • 94. 94 Tweet Word Count Topology Tweet Spout Parse Tweet Bolt Word Count Bolt When a parse tweet bolt task emits a tuple, which word count bolt task should it send to?
  • 95. 95 Storm Groupings 01 Shuffle Grouping: Random distribution of tuples 02 Fields Grouping: Group tuples by a field or multiple fields 03 All Grouping: Replicates tuples to all tasks 04 Global Grouping: Send the entire stream to one task
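The four groupings amount to four routing functions from a tuple to the downstream task(s) that receive it. The simulation below is plain Python and does not use the Storm API; the function names, the CRC32 hash, and the choice of task 0 as the single global target are assumptions made for illustration.

```python
import random
import zlib

def shuffle_grouping(tup, num_tasks, rng):
    """Random distribution: any task may receive the tuple."""
    return [rng.randrange(num_tasks)]

def fields_grouping(tup, num_tasks, fields):
    """Tuples with equal values in the chosen fields go to the same task."""
    key = "|".join(str(tup[f]) for f in fields)
    return [zlib.crc32(key.encode()) % num_tasks]

def all_grouping(tup, num_tasks):
    """Replicate the tuple to every task."""
    return list(range(num_tasks))

def global_grouping(tup, num_tasks):
    """Send the entire stream to a single task (task 0 in this sketch)."""
    return [0]

rng = random.Random(0)
tweet_word = {"word": "soccer"}
parse_target = shuffle_grouping(tweet_word, 4, rng)        # load balancing
count_target = fields_grouping(tweet_word, 4, ["word"])    # same word, same task
```

In the tweet word count topology, shuffle grouping suits the stateless parse bolt, while the counting bolt needs fields grouping on the word so that each word's running total lives in exactly one task.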
  • 96. 96 Tweet Word Count Topology Tweet Spout -> (Shuffle Grouping) -> Parse Tweet Bolt -> (Fields Grouping) -> Word Count Bolt
  • 97. 97 Storm Architecture Nimbus ZK Cluster Supervisor W1 W2 W3 W4 Supervisor W1 W2 W3 W4 Topology Submission Assignment Maps Sync Code Slave Node Slave Node Master Node
  • 98. 98 Storm Worker [Figure: a worker contains several executors, and each executor runs one or more tasks]
  • 99. 99 Data Flow in Storm Workers Global  Receive   Thread Global  Send   Thread In  Queue User  Logic     Thread Out  Queue Send   Thread Outgoing   Message  Buffer
  • 100. 100 Storm Metrics Support and troubleshooting Continuous performance Cluster availability
  • 101. 101 Collecting Topology Metrics Tweet Spout Parse Tweet Bolt Word Count Bolt Metrics Bolt Scribe
  • 106. 106 Analyzing Zookeeper Traffic Overloaded Zookeeper Kafka Spout (67%): Offset/Partition is written every 2 secs Storm Runtime (33%): Workers write heart beats every 3 secs
  • 108. 108 Some experiments Storm Overheads Java program: Read from Kafka cluster and serialize in a loop; sustain input rates of 300K msgs/sec from Kafka topic 1-stage topology: No acks to achieve at-least-once semantics; Storm processes were co-located using isolation scheduler 1-stage topology with acks: Enable acks for at-least-once semantics
  • 112. 112 Storm Deployment shared pool storm cluster joe’s topology isolated pools jane’s topology
  • 113. 113 Storm Deployment shared pool storm cluster joe’s topology isolated pools jane’s topology dave’s topology
  • 114. 114 MillWheel DAG Processing: Streams, Computations From Google: Cloud DataFlow uses MillWheel Not Open Source Exactly Once: Checkpoint User State
  • 115. 115 MillWheel Core Concepts Computations: Arbitrary User Logic, Per Key Operation Persistent State: Key/Value API, Backed by BigTable Streams: Identified By Names, Unbounded Keys: Per Key Operation Serial, Different Keys Parallel
  • 116. 116 MillWheel Low Watermark: The Concept of Time Caught up Time: Defined per computation Discard Late Data: ~0.001% at Google Seeded by Injectors: Input Sources Monotonic: Makes life easy for users
  • 117. 117 MillWheel Strong and Weak: Productions Checkpoint: Same time as User State DoubleCount: No Dedup Seeded by Injectors: Input Sources No checkpoint: Simpler API
  • 118. 118 MillWheel Computation State: Exactly Once Semantics Key/Value Abstractions: Computations Persistence Layer: BigTable Idempotent: No Side Effects Batched: Efficient
  • 119. 119 PubSub weds Processing Exactly Once Processing Tightly Integrated with Kafka Open Sourced by LinkedIn Durability via Yarn
  • 120. 120 Samza Streams: Partitioned Partition 0 Partition 1 Partition 2
  • 121. 121 Samza Task: Work on a single partition Partition 0 Task
  • 122. 122 Samza Job: Collection of Tasks Stream A Stream B Task 1 Task 2 Task 3 Stream C Job 1
  • 123. 123 Samza Stateful Tasks: Exactly Once Semantics Samza State API: key value store State As a Stream: persist on Kafka
  • 124. 124 Samza Tight Coupling: Queue and Processing Kafka based: Streams Persistence Simple API: Single Node Job Stateful: Exactly Once Yarn Friendly: Durability
  • 125. 125 One Size Fits All Apache Flink General Purpose Analytics Engine Open Source and Community Driven Works well with Hadoop Ecosystem Came out of Stratosphere
  • 126. 126 Apache Flink Ambitious Goal: One Size Fits All Fast RunTime: Complex DAG Operators, Streamed Data to Op Iterative Algorithms: Much Faster In-Memory Operations Intuitive APIs: Java/Scala/Python Concise Query: Coming from OLTP World
  • 127. 127 Apache Flink Distributed Runtime: Scale Data: Streamed between operators Master: Submission and Scheduling Workers: Do Actual Work
  • 129. 129 One system to replace them all! General purpose Compute Engine Open Source/Big Community MapReduce, Streaming, SQL, …! Integrates well with Hadoop Ecosystem
  • 130. 130 Core Concept: Lots of RDDs Lots: Huge Collection with Lineage info Resilient: Lost DataSets are re-computed Distributed: Across the cluster DataSet: Input Data divided into Batches Streaming
  • 131. 131 RDDs in Action: WordCount. [Figure: input lines of words (W1 to W8) flow through FlatMap, Map, and reduceByKey to produce per-word counts]
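The FlatMap, Map, reduceByKey pipeline named on this slide can be sketched without a cluster. The following is plain Python that mimics the three stages on in-memory lists; it does not use Spark, and the helper names `flat_map` and `reduce_by_key` as well as the grouping of the sample words into three lines are assumptions for illustration.

```python
from itertools import chain


def flat_map(fn, items):
    """Apply fn to every item and flatten the resulting sequences."""
    return list(chain.from_iterable(fn(item) for item in items))


def reduce_by_key(fn, pairs):
    """Combine the values of equal keys with the binary function fn."""
    result = {}
    for key, value in pairs:
        result[key] = fn(result[key], value) if key in result else value
    return result


lines = ["W1 W2 W1", "W3 W2 W1", "W2 W1 W3"]
words = flat_map(str.split, lines)                  # FlatMap: lines -> words
pairs = [(word, 1) for word in words]               # Map: word -> (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)   # reduceByKey: sum per word
```

In an actual distributed engine the reduce step would run per partition and then across partitions; the sketch collapses that into a single loop.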
  • 132. 132 Scala: Functional and Concise
  • 133. 133 Streaming: Fits Naturally. [Figure: a DStream of words (W1 to W4) flowing through Spark Streaming and the Spark Engine]
  • 134. 134 Streaming: With DStreams. Series of RDDs. Window functions. Can create other DStreams. [Figure: the lines DStream (RDDs for T0 to T1, T1 to T2, T2 to T3) is transformed by flatMap into the words DStream over the same intervals]
  • 135. 135 DStream: Operators. Regular Spark operators: map, flatMap, filter, … Transform: RDD -> RDD. Window operators: countByWindow, reduceByWindow. Join: join multiple DStreams
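To make the window operators concrete, here is a small sketch of a count over a sliding window of micro-batches. It is plain Python rather than the Spark Streaming API: `windowed_counts` is a hypothetical helper, and both the window length and the slide interval are expressed in number of batches, which is a simplifying assumption.

```python
from collections import deque


def windowed_counts(batches, window_length, slide_interval):
    """Return the element count of each window over a sequence of batches.

    A window covers the last `window_length` batches and is evaluated
    every `slide_interval` batches.
    """
    window = deque(maxlen=window_length)  # old batches fall out automatically
    results = []
    for index, batch in enumerate(batches, start=1):
        window.append(batch)
        if index % slide_interval == 0:
            results.append(sum(len(b) for b in window))
    return results


batches = [["a", "b"], ["c"], ["d", "e", "f"], [], ["g"]]
every_batch = windowed_counts(batches, window_length=3, slide_interval=1)
every_other = windowed_counts(batches, window_length=3, slide_interval=2)
```

Swapping `len` and `sum` for an arbitrary reduce function would give the analogue of a reduce over the window.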
  • 136. 136 Basic sources: HDFS, S3, … Reliability: ack vs noAck sources. Custom: implement interface. Advanced: Kafka, TwitterUtils. Input DStreams: Sources of Data
  • 137. 137 Exactly once: confident about results. Ecosystem: Hadoop, Yarn, Kafka, … Scalable: RDDs as scale unit. Single system: batch + streaming. Basic Premise: One Size Fits All
  • 138. 138 Stream Processing: With SQL. Processing logic in SQL. Annotation plugin framework to extend SQL. Clustering with elastic scaling. No downtime during upgrades
  • 139. 139 Channels: key/value API. Processor: SQL, custom. Core Concept: CEP Cell. [Figure: a CEP cell with an inbound channel, a processor, and an outbound channel]
  • 141. 141 Messaging Models. Push (at most once): used for low latency. The producer pushes data to the consumer, and writes to Kafka for later replay if the consumer is down or unable to keep up. Pull (at least once): the producer writes events to Kafka and the consumer consumes from Kafka. Storing to Kafka allows for replay
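The difference between the two messaging models can be sketched as a small simulation. Everything here is an assumption for illustration, not Pulsar or Kafka code: the `Log` class stands in for a Kafka topic, `push` models direct delivery with spill-over when the consumer is unavailable, and `pull` models a consumer that can crash after processing an event but before committing its offset.

```python
class Log:
    """Append-only list standing in for a Kafka topic partition."""

    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)


def push(events, consumer_up, spill_log):
    """Deliver each event directly; spill to the log if the consumer is down."""
    delivered = []
    for event, up in zip(events, consumer_up):
        if up:
            delivered.append(event)   # sent once, never retried
        else:
            spill_log.append(event)   # kept for later replay
    return delivered


def pull(log, committed_offset, crash_before_commit_at=None):
    """Process events from the committed offset; return (processed, new offset)."""
    processed = []
    offset = committed_offset
    for event in log.events[committed_offset:]:
        processed.append(event)
        if offset == crash_before_commit_at:
            return processed, offset  # crashed: this event's offset is not committed
        offset += 1
    return processed, offset


# Push: the consumer is down for the second event.
spill = Log()
delivered = push(["e0", "e1", "e2"], [True, False, True], spill)

# Pull: the consumer crashes after processing "e1", then restarts.
topic = Log()
for name in ["e0", "e1", "e2"]:
    topic.append(name)
first_run, offset = pull(topic, 0, crash_before_commit_at=1)
second_run, offset = pull(topic, offset)
```

In the pull case the event processed just before the crash shows up again after the restart, which is why that model is labelled at least once.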
  • 142. 142 Deployment Architecture. Events are partitioned: all events with the same key are routed to the same cell. Scaling: more cells are added to the pipeline for scaling. Pulsar automatically detects new cells and rebalances traffic
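Key-based routing of events to cells can be sketched with a stable hash. The slides do not say which partitioning scheme Pulsar uses, so the modulo-of-CRC32 choice below, the `route` helper, and the cell names are assumptions for illustration only.

```python
import zlib
from collections import defaultdict


def route(key, cells):
    """Pick the cell for a key using a stable hash (CRC32 modulo cell count)."""
    return cells[zlib.crc32(key.encode("utf-8")) % len(cells)]


cells = ["cell-0", "cell-1", "cell-2"]
events = [("user-1", "click"), ("user-2", "view"), ("user-1", "buy")]

# Group events by the cell they are routed to.
assignments = defaultdict(list)
for key, payload in events:
    assignments[route(key, cells)].append((key, payload))

# Scaling out: routing against a larger cell list is still deterministic per key.
bigger = cells + ["cell-3"]
moved = [key for key, _ in events if route(key, cells) != route(key, bigger)]
```

With plain modulo hashing, adding a cell can remap many keys at once; a scheme such as consistent hashing is a common way to limit that movement when traffic is rebalanced.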
  • 145. 145 Better Storm: Twitter Heron. Container based architecture. Separate monitoring and scheduling. Simplified execution model. Much better performance
  • 146. 146 Storm: Issues. Heron Poor performance: queue contentions, multiple languages. Lack of back pressure: unpredictable drops. Complex execution environment: hard to tune. SPOF: overloaded Nimbus
  • 147. 147 Heron Batching of tuples: amortizing the cost of transferring tuples. Task isolation: ease of debug-ability/isolation/profiling. Fully API compatible with Storm: directed acyclic graph; topologies, spouts and bolts. Support for back pressure: topologies should be self-adjusting. Use of mainstream languages: C++, Java and Python. Efficiency: reduce resource consumption. Design: Goals
  • 149. 149 Heron Architecture: Topology. [Figure: a Topology Master and a ZK cluster (logical plan, physical plan, and execution state); the physical plan is synced to the containers, each of which runs a Stream Manager, a Metrics Manager, and instances I1 to I4]
  • 150. 150 Heron Gateway for metrics. Assigns role. Monitoring of containers. Topology Master
  • 151. 151 Heron Topology Master, ZK Cluster: logical plan, physical plan, and execution state. 01: prevents multiple TMs from becoming masters. 02: allows other processes to discover the TM. Topology Master
  • 152. 152 Heron Stream Manager: BackPressure. [Figure: topology graph with nodes S1, B2, B3, and B4]
  • 153. 153 Stream Manager: BackPressure. [Figure: four Stream Managers with their S1, B2, B3, and B4 instances]
  • 154. 154 Heron Slows upstream and downstream instances. Stream Manager: TCP BackPressure. [Figure: four Stream Managers with their S1, B2, B3, and B4 instances]
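The effect of back pressure, where a slow consumer throttles its producer instead of dropping tuples, can be sketched with a bounded buffer and two watermarks. This is not Heron's implementation (the slide describes a TCP-based mechanism between Stream Managers); the `BoundedBuffer` class, the high and low watermarks, and the tick-based `simulate` loop are assumptions chosen to keep the example deterministic.

```python
from collections import deque


class BoundedBuffer:
    """Buffer that signals back pressure between a high and a low watermark."""

    def __init__(self, high_water, low_water):
        self.queue = deque()
        self.high_water = high_water
        self.low_water = low_water
        self.back_pressure = False

    def offer(self, item):
        """Accept an item unless back pressure is active; report acceptance."""
        if self.back_pressure:
            return False
        self.queue.append(item)
        if len(self.queue) >= self.high_water:
            self.back_pressure = True   # tell the producer to stop emitting
        return True

    def poll(self):
        """Remove and return the oldest item, or None if the buffer is empty."""
        item = self.queue.popleft() if self.queue else None
        if self.back_pressure and len(self.queue) <= self.low_water:
            self.back_pressure = False  # drained enough: producer may resume
        return item


def simulate(ticks, produce_per_tick, consume_per_tick, buffer):
    """Run a fast producer against a slow consumer; return (emitted, consumed)."""
    emitted = consumed = 0
    for _ in range(ticks):
        for _ in range(produce_per_tick):
            if buffer.offer(emitted):
                emitted += 1
        for _ in range(consume_per_tick):
            if buffer.poll() is not None:
                consumed += 1
    return emitted, consumed


buf = BoundedBuffer(high_water=4, low_water=2)
emitted, consumed = simulate(ticks=10, produce_per_tick=3,
                             consume_per_tick=1, buffer=buf)
```

The producer ends up emitting at roughly the consumer's rate, and every emitted item is either consumed or still in the buffer, so nothing is dropped.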
  • 156. 156 Heron Exposes Storm and Heron API. Collects several metrics. Runs only one task (spout/bolt). Instance: Worker Bee
  • 158. 158 Heron Deployment. Topology 1, Topology 2, ..., Topology N. Aurora Services. ZK Cluster. Observability: Heron Tracker, Heron VIZ, Heron Web
  • 161. 161 Heron Performance: Settings. Components per experiment (EXPT #1 / #2 / #3 / #4): Spout 25 / 100 / 200 / 300; Bolt 25 / 100 / 200 / 300; # Heron containers 25 / 100 / 200 / 300; # Storm workers 25 / 100 / 200 / 300
  • 162. 162 Heron Performance: At Least Once. [Figure: throughput (million tuples/min, axis 0 to 1400) and latency (ms, axis 0 to 2500) versus spout parallelism (25, 100, 200, 500), Storm vs Heron; annotations: 10-14x (throughput), 5-15x (latency)]
  • 163. 163 Heron Performance: CPU Usage. [Figure: # cores used (axis 0 to 2500) versus spout parallelism (25, 100, 200, 500), Storm vs Heron; annotation: 2-3x]
  • 164. 164 Heron Performance: At Most Once. [Figure: throughput (million tuples/min, axis 0 to 5000) and CPU usage (# cores used, axis 0 to 2500) versus spout parallelism (25, 100, 200, 500), Storm vs Heron]
  • 165. 165 Heron Performance: RTAC Topology. [Figure: Client Event Spout, Distributor Bolt, User Count Bolt, Aggregator Bolt, connected by shuffle grouping, fields grouping, and fields grouping]
  • 168. 168 Issues: 3rd Generation Systems. Bit early to tell. Still no standardization and too many systems. Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
  • 169. 169 Growing set: Commercial Platforms. Infosphere, Vibe, Apama, Event Processor, Data Torrent, Vitria OI, Blaze, StreamBase
  • 171. 171 Combining batch and real time: Lambda Architecture. [Figure: New Data, Client]
  • 172. 172 Lambda Architecture - The Good. [Figure: Message Broker, Collection Pipeline, Lambda Architecture Analytics Pipeline, Results]
  • 173. 173 Lambda Architecture - The Bad. Have to fix everything (maybe twice)! How much duct tape is required? Have to write everything twice! Subtle differences in semantics. What about graphs, ML, SQL, etc.?
  • 174. 174 Summingbird. [Figure: Summingbird Program, Map Reduce Job, HDFS, Message broker, Storm/Heron Topology, Online key-value result store, Batch key-value result store, Client]
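The point of writing the logic once and running it in both a batch and an online path can be sketched as follows. This is plain Python, not the Summingbird API: `count_events` plays the role of the single shared program, the two dictionaries stand in for the batch and online key-value result stores, and `merge_counts` is a hypothetical client-side merge that relies on counts being addable.

```python
from collections import Counter


def count_events(events):
    """Shared logic: count events per key (used by both paths)."""
    return dict(Counter(events))


def merge_counts(batch_view, realtime_view):
    """Client-side merge: add the online counts onto the batch counts."""
    merged = dict(batch_view)
    for key, value in realtime_view.items():
        merged[key] = merged.get(key, 0) + value
    return merged


historical = ["login", "click", "click", "logout"]  # already covered by the batch job
recent = ["click", "login"]                         # seen only by the online topology

batch_store = count_events(historical)     # stand-in for the batch result store
online_store = count_events(recent)        # stand-in for the online result store
answer = merge_counts(batch_store, online_store)
```

Because addition is associative, the merged answer matches what a single pass over all events would give, which is what makes the two result stores safe to combine at query time.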
  • 176. 176 Technology Challenges: The Road Ahead. Auto scaling the system in the presence of unpredictability. Auto tuning of real time analytics jobs/queries. Exploiting faster networks for efficiently moving data
  • 177. 177 Applications: The Road Ahead. Real-time personalization: preferences, time, location and social. Wearable computing: screen size fragmentation. Analytics (image, video, touch): pattern recognition, anomaly detection
  • 178. 178 Any Questions? WHAT, WHY, WHERE, WHEN, WHO, HOW