©MapR Technologies - Confidential

Real-time Learning with Hadoop
The “λ + ε” architecture
  
§  Contact:
–  tdunning@maprtech.com
–  @ted_dunning
§  Slides and such (available late tonight):
–  http://slideshare.net/tdunning
§  Hash tags: #mapr #storm
  
The Challenge
§  Hadoop is great for processing vats of data
–  But sucks for real-time (by design!)
§  Storm is great for real-time processing
–  But lacks any way to deal with batch processing
§  It sounds like there isn’t a solution
–  Neither fashionable solution handles everything
  
This is not a problem.
It’s an opportunity!
  
Hadoop is Not Very Real-time
[Figure: timeline t ending at "now". Labels: "Fully processed", "Latest full period", "Unprocessed Data", "Hadoop job takes this long for this data".]
  
Real-time and Long-time together
[Figure: timeline t ending at "now". Labels: "Hadoop works great back here", "Storm works here", "Blended view".]
  
An Example
  
The Same Problem
  
What Does the Lambda Architecture Do?
§  The idea is that we want to compute a function of all history up to time n
    $f(x_1 \ldots x_n)$
§  In order to get real-time response, we divide this into two parts
    $f(x_1 \ldots x_m \ldots x_n) = f(x_1 \ldots x_m) + f(x_{m+1} \ldots x_n)$
§  Where the addition may not really be addition
§  The idea is that if we lose the history from m+1 until n, things get better soon enough
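The decomposition on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: `f` is taken to be a per-key event count, the "+" is taken to be a merge of two count tables, and the names `f`, `combine`, `history`, and `m` are made up for the example rather than taken from the talk.

```python
from collections import Counter

def f(events):
    """Count events per key; stands in for the function of history."""
    return Counter(events)

def combine(long_time, real_time):
    """The '+' of the decomposition: merge two partial results."""
    return long_time + real_time

history = ["a", "b", "a", "c", "b", "a"]   # x_1 ... x_n (made-up events)
m = 4                                       # boundary between the two parts

batch_part = f(history[:m])     # f(x_1 ... x_m), the long-time part
recent_part = f(history[m:])    # f(x_{m+1} ... x_n), the real-time part
blended = combine(batch_part, recent_part)

# If the recent part is lost, the result is stale but still correct for x_1 ... x_m.
stale = combine(batch_part, Counter())
```

Here `blended` matches a count over the whole history, and losing `recent_part` only costs the most recent events until the next batch run catches up.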
  
Can We Do Better?
§  Can we minimize or avoid failure transients?
§  Can we guarantee precise boundaries?
§  Can we synchronize computations accurately?
  
Alternative without Lambda
[Figure: architecture diagram. Labels: "Search Engine", "NoSql de Jour", "Consumer", "Real-time", "Long-time", "?".]
  
Problems
§  Simply dumping into a NoSQL engine doesn’t quite work
§  Insert rate is limited
§  No load isolation
–  Big retrospective jobs kill real-time
§  Low scan performance
–  HBase pretty good, but not stellar
§  Difficult to set boundaries
–  where does real-time end and long-time begin?
  
Almost a Solution
§  Lambda architecture talks about function of long-time state
–  Real-time approximate accelerator adjusts previous result to current state
§  Sounds good, but …
–  How does the real-time accelerator combine with long-time?
–  What algorithms can do this?
–  How can we avoid gaps and overlaps and other errors?
§  Needs more work
§  We need a “λ + ε” architecture!
  
A Simple Example
§  Let’s start with the simplest case … counting
§  Counting = addition
–  Addition is associative
–  Addition is on-line
–  We can generalize these results to all associative, on-line functions
–  But let’s start simple
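To make "associative, on-line" concrete, here is a small sketch of a function that is more than plain addition. The state layout `(count, total, maximum)` and the names `update`, `merge`, and `summarize` are assumptions chosen for illustration; the talk itself only discusses counting at this point.

```python
from functools import reduce

# State for a running summary: (count, total, maximum).
EMPTY = (0, 0.0, float("-inf"))

def update(state, x):
    """On-line step: fold one new observation into the state."""
    n, total, biggest = state
    return (n + 1, total + x, max(biggest, x))

def merge(a, b):
    """Associative step: combine two states built from disjoint data."""
    return (a[0] + b[0], a[1] + b[1], max(a[2], b[2]))

def summarize(xs):
    """Build a state by applying the on-line update to every observation."""
    return reduce(update, xs, EMPTY)

data = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
left, middle, right = summarize(data[:3]), summarize(data[3:5]), summarize(data[5:])

# Grouping does not matter, so partial results can be combined in any order of nesting.
grouped_left = merge(merge(left, middle), right)
grouped_right = merge(left, merge(middle, right))
```

Because `merge` is associative, it should not matter where the data is cut into pieces, which is the property the rest of the design leans on.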
  
Rough Design – Data Flow
[Figure: data-flow diagram. Components: Data Sources, Catcher Cluster, ProtoSpout, Query Event Spout, Counter Bolt, Logger Bolt, Raw Logs, Semi Agg, Snap, Hadoop Aggregator, Long agg.]
  
Closer Look – Catcher Protocol
[Figure: Data Sources and Catcher Cluster.]
The data sources and catchers communicate with a very simple protocol.

Hello() => list of catchers
Log(topic,message) => (OK|FAIL, redirect-to-catcher)
  
Closer Look – Catcher Queues
[Figure: Catcher Cluster and Topic Files.]
The catchers forward log requests to the correct catcher and return that host in the reply to allow the client to avoid the extra hop.

Each topic file is appended by exactly one catcher.

Topic files are kept in shared file storage.
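The catcher protocol and forwarding rule can be sketched as an in-memory simulation. Everything below is hypothetical scaffolding: the `Catcher` and `Client` classes, the placement rule in `owner_of`, and the catcher names are assumptions for illustration, and the FAIL path and real networking are left out.

```python
class Catcher:
    """In-memory stand-in for one catcher node (hypothetical, for illustration)."""

    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster      # shared dict: catcher name -> Catcher
        self.files = {}             # topic -> messages appended by this catcher
        cluster[name] = self

    def owner_of(self, topic):
        # Assumed placement rule: a stable function of the topic name, so all
        # catchers agree on the single node that appends to a topic file.
        names = sorted(self.cluster)
        return names[sum(topic.encode()) % len(names)]

    def hello(self):
        """Hello() => list of catchers."""
        return sorted(self.cluster)

    def log(self, topic, message):
        """Log(topic, message) => (status, redirect-to-catcher)."""
        owner = self.cluster[self.owner_of(topic)]
        owner.files.setdefault(topic, []).append(message)   # forward if needed
        return ("OK", owner.name)


class Client:
    """A data source that remembers redirects to avoid the extra hop."""

    def __init__(self, cluster, first_contact):
        self.cluster = cluster
        self.known = cluster[first_contact].hello()
        self.route = {}             # topic -> catcher learned from replies

    def log(self, topic, message):
        target = self.route.get(topic, self.known[0])
        status, redirect = self.cluster[target].log(topic, message)
        self.route[topic] = redirect
        return status, target


cluster = {}
for name in ("catcher-a", "catcher-b", "catcher-c"):
    Catcher(name, cluster)

client = Client(cluster, "catcher-a")
first = client.log("clicks", "event 1")     # may need forwarding
second = client.log("clicks", "event 2")    # goes straight to the owner
```

The second call is addressed directly to whichever catcher owns the topic, and only that catcher ever appends to the topic's file.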
  
Closer Look – ProtoSpout
[Figure: Topic Files and ProtoSpout.]
The ProtoSpout tails the topic files, parses log records into tuples and injects them into the Storm topology.

Last fully acked position stored in shared, transactionally correct file system.
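A rough sketch of the tail-and-checkpoint idea follows, using local files in place of the shared file system. The record format (tab-separated key and value, one record per line), the helper names, and the write-then-rename checkpoint are assumptions for illustration; this is not the ProtoSpout's actual code and it does not use the Storm API.

```python
import os
import tempfile

def read_offset(path):
    """Return the last fully acked byte position, or 0 if none is stored."""
    try:
        with open(path) as fh:
            return int(fh.read().strip() or 0)
    except FileNotFoundError:
        return 0

def write_offset(path, offset):
    """Store the position via write-then-rename so readers never see a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        fh.write(str(offset))
    os.replace(tmp, path)

def poll(topic_path, offset_path):
    """Yield ((key, value), next_offset) for complete records after the acked position."""
    with open(topic_path, "rb") as fh:
        fh.seek(read_offset(offset_path))
        while True:
            line = fh.readline()
            if not line.endswith(b"\n"):    # nothing new, or a half-written record
                break
            key, _, value = line.decode().rstrip("\n").partition("\t")
            yield (key, value), fh.tell()

workdir = tempfile.mkdtemp()
topic = os.path.join(workdir, "clicks.log")
offsets = os.path.join(workdir, "clicks.offset")

with open(topic, "ab") as fh:
    fh.write(b"page1\tview\npage2\tview\n")

first_pass = []
for tup, pos in poll(topic, offsets):
    first_pass.append(tup)
    write_offset(offsets, pos)      # stands in for "tuple fully acked"

with open(topic, "ab") as fh:
    fh.write(b"page1\tclick\n")

second_pass = [tup for tup, _ in poll(topic, offsets)]
```

After a restart, reading resumes from the stored position, so only records that were never fully acked are injected again.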
  
Closer Look – Counter Bolt
§  Critical design goals:
–  fast ack for all tuples
–  fast restart of counter
§  Ack happens when tuple hits the replay log (10’s of milliseconds, group commit)
§  Restart involves replaying semi-agg’s + replay log (very fast)
§  Replay log only lasts until next semi-aggregate goes out
[Figure: Counter Bolt diagram. Labels: "Incoming records", "Counter Bolt", "Replay Log", "Semi-aggregated records", "Real-time", "Long-time".]
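One possible reading of the replay-log design is sketched below. Plain Python lists stand in for the durable replay log and the semi-aggregate output, and the class and method names are invented for the example; group commit and the real Storm bolt interface are not modelled.

```python
from collections import Counter

class CounterBolt:
    """Toy counter with a replay log; lists stand in for durable files (assumption)."""

    def __init__(self, replay_log, semi_aggs):
        self.replay_log = replay_log        # records acked since the last semi-aggregate
        self.semi_aggs = semi_aggs          # partial counts already emitted
        self.counts = Counter(replay_log)   # restart: rebuild in-memory state from the log

    def execute(self, key):
        self.replay_log.append(key)         # durable write first ...
        self.counts[key] += 1
        return "ack"                        # ... then ack

    def flush(self):
        """Emit a semi-aggregate, after which the replay log is no longer needed."""
        if self.counts:
            self.semi_aggs.append(dict(self.counts))
        self.counts.clear()
        self.replay_log.clear()

def total(semi_aggs, bolt):
    """Counts visible to a reader: emitted semi-aggregates plus the bolt's live state."""
    out = Counter()
    for agg in semi_aggs:
        out.update(agg)
    out.update(bolt.counts)
    return out

log, aggs = [], []
bolt = CounterBolt(log, aggs)
for key in ["a", "b", "a"]:
    bolt.execute(key)
bolt.flush()
bolt.execute("a")
bolt.execute("c")

before_crash = total(aggs, bolt)
restarted = CounterBolt(log, aggs)      # simulate a crash: in-memory counts are lost
after_restart = total(aggs, restarted)
```

The replay log stays short because it is emptied at every flush, which is what keeps the restart cheap in this sketch.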
  
A Frozen Moment in Time
§  Snapshot defines the dividing line
§  All data in the snap is long-time, all after is real-time
§  Semi-agg strategy allows clean combination of both kinds of data
§  Data synchronized snap not needed (if the snap is really a snap)
[Figure: labels "Semi Agg", "Snap", "Hadoop Aggregator", "Long agg".]
  
Guarantees
§  Counter output volume is small-ish
–  the greater of k tuples per 100K inputs or k tuple/s
–  1 tuple/s/label/bolt for this exercise
§  Persistence layer must provide guarantees
–  distributed against node failure
–  must have either readable flush or closed-append
§  HDFS is distributed, but provides no guarantees and strange semantics
§  MapRfs is distributed, provides all necessary guarantees
  
Presentation Layer
§  Presentation must
–  read recent output of Logger bolt
–  read relevant output of Hadoop jobs
–  combine semi-aggregated records
§  User will see
–  counts that increment within 0-2 s of events
–  seamless and accurate meld of short and long-term data
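The combining step of the presentation layer might look roughly like this. The sketch assumes that each semi-aggregated record carries a sequence number and that the snapshot records the last sequence number it contains; neither detail is stated in the talk, and `blended_counts` is a made-up name.

```python
from collections import Counter

def blended_counts(long_agg, snapshot_seq, semi_aggs):
    """Combine the Hadoop result with semi-aggregates newer than the snapshot.

    long_agg     -- counts covering every semi-aggregate with seq <= snapshot_seq
    snapshot_seq -- sequence number of the last semi-aggregate inside the snapshot
    semi_aggs    -- iterable of (seq, counts) pairs from the real-time output
    """
    result = Counter(long_agg)
    for seq, counts in semi_aggs:
        if seq > snapshot_seq:          # skip records already folded into long_agg
            result.update(counts)
    return result

# Made-up data: four semi-aggregates, the first two already in the snapshot.
semi_aggs = [(1, {"a": 2}), (2, {"a": 1, "b": 4}), (3, {"b": 1}), (4, {"c": 5})]
long_agg = {"a": 3, "b": 4}
view = blended_counts(long_agg, 2, semi_aggs)
```

Because the snapshot gives an exact dividing line, each semi-aggregate is counted exactly once, with no gap and no overlap.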
  
The Basic Idea
§  Online algorithms generally have relatively small state (like counting)
§  Online algorithms generally have a simple update (like counting)
§  If we can do this with counting, we can do it with all kinds of algorithms
  
Summary – Part 1
§  Semi-agg strategy + snapshots allows correct real-time counts
–  because addition is on-line and associative
§  Other on-line associative operations include:
–  k-means clustering (see Dan Filimon’s talk at 16.)
–  count distinct (see hyper-log-log counters from streamlib or kmv from Brickhouse)
–  top-k values
–  top-k (count(*)) (see streamlib)
–  contextual Bayesian bandits (see part 2 of this talk)
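As an illustration of count distinct as an on-line associative operation, here is a from-scratch sketch of the general K-minimum-values (kmv) idea. It is not the streamlib or Brickhouse code; the sketch size `K`, the SHA-256 based hash, and all function names are assumptions made for the example.

```python
import hashlib

K = 64  # sketch size; an arbitrary choice for the example

def h(item):
    """Map an item to a pseudo-uniform value in (0, 1]."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2**64

def sketch(items, k=K):
    """K minimum values: keep the k smallest distinct hash values."""
    return sorted({h(x) for x in items})[:k]

def merge(a, b, k=K):
    """Associative combine: union two sketches and keep the k smallest values."""
    return sorted(set(a) | set(b))[:k]

def estimate(s, k=K):
    """Exact count if fewer than k values were seen, else (k - 1) / k-th minimum."""
    return len(s) if len(s) < k else (k - 1) / s[k - 1]

# Two overlapping id ranges stand in for long-time and real-time data.
batch = range(0, 6000)
recent = range(4000, 10000)
combined = merge(sketch(batch), sketch(recent))
approx_distinct = estimate(combined)
```

Merging the two sketches gives the same sketch as processing all the data in one pass, so the overlap between the ranges is not double counted. The estimate itself is approximate, and its accuracy improves with larger `K`.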
  
Example 2 – AB testing in real-time
§  I have 15 versions of my landing page
§  Each visitor is assigned to a version
–  Which version?
§  A conversion or sale or whatever can happen
–  How long to wait?
§  Some versions of the landing page are horrible
–  Don’t want to give them traffic
  
A Quick Diversion
§  You see a coin
–  What is the probability of heads?
–  Could it be larger or smaller than that?
§  I flip the coin and while it is in the air ask again
§  I catch the coin and ask again
§  I look at the coin (and you don’t) and ask again
§  Why does the answer change?
–  And did it ever have a single value?
  
A Philosophical Conclusion
§  Probability as expressed by humans is subjective and depends on information and experience
  
A Practical Application
  
I Dunno
[Figure: Prob(p) versus p, for p from 0 to 1.]

5 heads out of 10 throws
[Figure: Prob(p) versus p, for p from 0 to 1.]

2 heads out of 12 throws
[Figure: Prob(p) versus p, for p from 0 to 1, with the mean marked.]
Using any single number as a “best” estimate denies the uncertain nature of a distribution
Adding confidence bounds still loses most of the information in the distribution and prevents good modeling of the tails
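The three coin pictures can be reproduced numerically with a short sketch. It assumes a uniform prior on p, so that h heads in n throws gives a Beta(h + 1, n − h + 1) distribution; this matches the "+1" terms in the talk's R snippet but is otherwise an assumption, as are the function names and the Monte Carlo settings.

```python
import random

def posterior(heads, throws):
    """Beta(heads + 1, tails + 1): the distribution for p under a uniform prior."""
    return heads + 1, throws - heads + 1

def mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

def central_interval(alpha, beta, mass=0.9, draws=20000, seed=0):
    """Monte Carlo estimate of the central interval holding about `mass` of the distribution."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(draws))
    tail = (1 - mass) / 2
    return samples[int(tail * draws)], samples[int((1 - tail) * draws) - 1]

# "I dunno", 5 heads out of 10 throws, 2 heads out of 12 throws.
for heads, throws in [(0, 0), (5, 10), (2, 12)]:
    a, b = posterior(heads, throws)
    low, high = central_interval(a, b)
    print(f"{heads}/{throws}: mean={mean(a, b):.3f} interval=({low:.3f}, {high:.3f})")
```

For 2 heads out of 12 throws the mean is 3/14, but the interval around it is wide and lopsided, which is the information a single "best" number throws away.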
  
Bayesian Bandit
§  Compute distributions based on data
§  Sample p1 and p2 from these distributions
§  Put a coin in bandit 1 if p1 > p2
§  Else, put the coin in bandit 2
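The four steps above can be sketched for two bandits as follows. The Beta distributions with a uniform prior, the payoff rates in `true_rates`, the step count, and the function names are all assumptions made for the simulation, not values from the talk.

```python
import random

def choose(counts, rng):
    """Sample a rate for each bandit from its Beta distribution and pick the larger."""
    p1, p2 = (rng.betavariate(wins + 1, losses + 1) for wins, losses in counts)
    return 0 if p1 > p2 else 1

def play(true_rates, steps, seed=0):
    """Run the sample-then-choose loop against simulated bandits."""
    rng = random.Random(seed)
    counts = [[0, 0], [0, 0]]           # [wins, losses] per bandit
    for _ in range(steps):
        arm = choose(counts, rng)
        won = rng.random() < true_rates[arm]
        counts[arm][0 if won else 1] += 1   # learning is just counting
    return counts

# Made-up payoff rates: the second bandit is clearly better.
counts = play(true_rates=(0.1, 0.5), steps=2000)
pulls = [wins + losses for wins, losses in counts]
```

Early on both bandits get tried because both distributions are wide. As the counts grow, the samples for the weaker bandit rarely win, so it receives less and less traffic without ever being ruled out completely.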
  
And it works!
[Figure: regret (0 to 0.12) versus n (0 to 1100), comparing "ε-greedy, ε = 0.05" with "Bayesian Bandit with Gamma-Normal".]

Video Demo
  
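The Code
§  Select an alternative

    # Hypothetical wrapper: the slide shows only the function body.
    # k is an n x 2 matrix of counts per alternative; column 2 is treated
    # as successes and column 1 as failures.
    select = function(k) {
      n = dim(k)[1]
      p0 = rep(0, length.out=n)
      for (i in 1:n) {
        # Sample a plausible rate for alternative i from its Beta distribution
        p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
      }
      return (which(p0 == max(p0)))
    }

§  Select and learn

    # Hypothetical wrapper; test(i) is not defined on the slide and is assumed
    # to return the outcome column (1 or 2) for one trial of alternative i.
    learn = function(k, steps) {
      for (z in 1:steps) {
        i = select(k)
        j = test(i)
        k[i,j] = k[i,j]+1
      }
      return (k)
    }

§  But we already know how to count!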
  
The Basic Idea
§  We can encode a distribution by sampling
§  Sampling allows unification of exploration and exploitation
§  Can be extended to more general response models
§  Note that learning here = counting = on-line algorithm
  
Generalized Banditry
§  Suppose we have an infinite number of bandits
–  suppose they are each labeled by two real numbers x and y in [0,1]
–  also that expected payoff is a parameterized function of x and y
    $E[z] = f(x, y \mid \theta)$
–  now assume a distribution for θ that we can learn online
§  Selection works by sampling θ, then computing f
§  Learning works by propagating updates back to θ
–  If f is linear, this is very easy
§  Don’t just have to have two labels, could have labels and context
  
Caveats
§  Original Bayesian Bandit only requires real-time
§  Generalized Bandit may require access to long history for learning
–  Pseudo online learning may be easier than true online
§  Bandit variables can include content, time of day, day of week
§  Context variables can include user id, user features
§  Bandit × context variables provide the real power
  
§  Contact:
–  tdunning@maprtech.com
–  @ted_dunning
§  Slides and such (just don’t believe the metrics):
–  http://slideshare.net/tdunning
§  Hash tags: #mapr #storm
  
Thank You
  

3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data AnalyticsMapR Technologies
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsMapR Technologies
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR Technologies
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLMapR Technologies
 

Plus de MapR Technologies (20)

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscape
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIs
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
 

Dernier

Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 

Dernier (20)

Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 

Storm Users Group Real Time Hadoop

  • 1. ©MapR Technologies - Confidential
       Real-time Learning with Hadoop
       The “λ + ε” architecture
  • 2. § Contact:
         – tdunning@maprtech.com
         – @ted_dunning
       § Slides and such (available late tonight):
         – http://slideshare.net/tdunning
       § Hash tags: #mapr #storm
  • 3. The Challenge
       § Hadoop is great at processing vats of data
         – But sucks for real-time (by design!)
       § Storm is great for real-time processing
         – But lacks any way to deal with batch processing
       § It sounds like there isn’t a solution
         – Neither fashionable solution handles everything
  • 4. This is not a problem. It’s an opportunity!
  • 5. Hadoop is Not Very Real-time
       [Timeline diagram along t, ending at “now”. Regions labeled: Fully processed; Latest full period; Unprocessed Data. Annotation: Hadoop job takes this long for this data]
  • 6. Real-time and Long-time together
       [Timeline diagram along t, ending at “now”. Labels: Hadoop works great back here; Storm works here; Blended view]
  • 7. An Example
  • 8. The Same Problem
  • 9. What Does the Lambda Architecture Do?
       § The idea is that we want to compute a function of all history up to time n:
           $f(x_1 \ldots x_n)$
       § In order to get real-time response, we divide this into two parts:
           $f(x_1 \ldots x_m \ldots x_n) = f(x_1 \ldots x_m) + f(x_{m+1} \ldots x_n)$
       § Where the addition may not really be addition
       § The idea is that if we lose the history from m+1 until n, things get better soon enough
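The decomposition on slide 9 can be sketched with per-label counting, where the “+” really is addition. This is a minimal illustration, not code from the talk: the function names `f` and `combine`, the sample events, and the split point `m` are all assumptions made for the example.

```python
from collections import Counter

def f(events):
    """Count events per label; stands in for the function of history."""
    return Counter(events)

def combine(long_time, real_time):
    """The '+' of slide 9; for per-label counts it is plain addition."""
    return long_time + real_time

history = ["a", "b", "a", "c", "a", "b"]  # x_1 ... x_n (made-up data)
m = 4                                     # batch job has covered x_1 ... x_m

batch_part = f(history[:m])      # f(x_1 ... x_m), from the slow batch job
realtime_part = f(history[m:])   # f(x_{m+1} ... x_n), kept incrementally
blended = combine(batch_part, realtime_part)

# If the real-time part is lost, the answer falls back to the batch view
# and recovers once the next batch run has covered the missing events.
degraded = combine(batch_part, Counter())
```

For a function that is not literally additive, `combine` would be replaced by whatever merge operation that function supports.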
  • 10. Can We Do Better?
       § Can we minimize or avoid failure transients?
       § Can we guarantee precise boundaries?
       § Can we synchronize computations accurately?
  • 11. Alternative without Lambda
       [Diagram labels: Search Engine; NoSQL du Jour; Consumer; Real-time; Long-time; ?]
  • 12. Problems
       § Simply dumping into a NoSQL engine doesn’t quite work
       § Insert rate is limited
       § No load isolation
         – Big retrospective jobs kill real-time
       § Low scan performance
         – HBase pretty good, but not stellar
       § Difficult to set boundaries
         – where does real-time end and long-time begin?
  • 13. Almost a Solution
       § Lambda architecture talks about function of long-time state
         – Real-time approximate accelerator adjusts previous result to current state
       § Sounds good, but …
         – How does the real-time accelerator combine with long-time?
         – What algorithms can do this?
         – How can we avoid gaps and overlaps and other errors?
       § Needs more work
       § We need a “λ + ε” architecture!
  • 14. A Simple Example
       § Let’s start with the simplest case … counting
       § Counting = addition
         – Addition is associative
         – Addition is on-line
         – We can generalize these results to all associative, on-line functions
         – But let’s start simple
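The two properties named on slide 14 can be made concrete in a few lines. The sketch below is illustrative only; `update`, `merge`, and the chunk boundaries are names and values chosen for the example.

```python
from functools import reduce

def update(count, event):
    """On-line step: the state is one integer, updated once per event."""
    return count + 1

def merge(left, right):
    """Associative merge of two partial counts."""
    return left + right

events = list(range(10))
# Three partial aggregates, as if produced by different workers or time windows.
chunks = [events[0:3], events[3:7], events[7:10]]
partials = [reduce(update, chunk, 0) for chunk in chunks]

# Associativity: the grouping of the merges does not matter.
left_first = merge(merge(partials[0], partials[1]), partials[2])
right_first = merge(partials[0], merge(partials[1], partials[2]))
```

Because grouping does not matter, partial counts can be produced on separate machines or in separate time windows and merged later.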
  • 15. Rough Design – Data Flow
       [Data-flow diagram. Component labels: Data Sources; Catcher Cluster; Query Event Spout; ProtoSpout; Counter Bolt; Logger Bolt; Raw Logs; Semi Agg; Hadoop Aggregator; Snap; Long agg]
  • 16. Closer Look – Catcher Protocol
       [Diagram labels: Data Sources; Catcher Cluster]
       The data sources and catchers communicate with a very simple protocol.
           Hello() => list of catchers
           Log(topic, message) => (OK|FAIL, redirect-to-catcher)
  • 17. Closer Look – Catcher Queues
       [Diagram labels: Catcher Cluster; Topic File]
       The catchers forward log requests to the correct catcher and return that host in the reply to allow the client to avoid the extra hop.
       Each topic file is appended by exactly one catcher.
       Topic files are kept in shared file storage.
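The catcher protocol of slides 16 and 17 can be sketched in memory as follows. Only the two calls, Hello and Log, and the forward-and-redirect behavior come from the slides. Everything else is an assumption for illustration: the `Catcher` class, the rule in `owner_of` that assigns a topic to a catcher, and the use of Python lists in place of topic files on shared storage. The FAIL reply is not modeled.

```python
class Catcher:
    """Hypothetical in-memory stand-in for one catcher process."""

    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster    # shared dict: catcher name -> Catcher
        self.topic_files = {}     # topic -> list of messages (stands in for a file)
        cluster[name] = self

    def owner_of(self, topic):
        # Assumed ownership rule: a deterministic function of the topic name,
        # so that exactly one catcher appends to each topic file.
        names = sorted(self.cluster)
        return names[sum(topic.encode()) % len(names)]

    def hello(self):
        """Hello() => list of catchers."""
        return sorted(self.cluster)

    def log(self, topic, message):
        """Log(topic, message) => (status, catcher to use next time)."""
        owner = self.owner_of(topic)
        if owner == self.name:
            self.topic_files.setdefault(topic, []).append(message)
            return ("OK", self.name)
        # Forward to the owning catcher and tell the client where it went.
        status, _ = self.cluster[owner].log(topic, message)
        return (status, owner)


cluster = {}
a = Catcher("catcher-a", cluster)
b = Catcher("catcher-b", cluster)

status, host = a.log("clicks", "m1")           # may be forwarded
status2, host2 = cluster[host].log("clicks", "m2")  # client follows the redirect
```

After the first reply the client talks directly to `host`, which is the extra hop that the redirect is meant to avoid.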
  • 18. Closer Look – ProtoSpout
       [Diagram labels: Topic File; ProtoSpout]
       The ProtoSpout tails the topic files, parses log records into tuples and injects them into the Storm topology.
       Last fully acked position is stored in a shared, transactionally correct file system.
  • 19. Closer Look – Counter Bolt
       [Diagram labels: Incoming records; Counter Bolt; Replay Log; Semi-aggregated records; Real-time; Long-time]
       § Critical design goals:
         – fast ack for all tuples
         – fast restart of counter
       § Ack happens when tuple hits the replay log (tens of milliseconds, group commit)
       § Restart involves replaying semi-aggs + replay log (very fast)
       § Replay log only lasts until next semi-aggregate goes out
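The ack and restart behavior described on slide 19 can be sketched as below. This is a simplified, single-counter illustration under stated assumptions: the class and method names are hypothetical, two Python lists stand in for the durable replay log and the durable semi-aggregate output, and group commit is not modeled.

```python
class CounterBolt:
    """Hypothetical sketch: counts tuples, acks after logging, restarts by replay."""

    def __init__(self, semi_aggs, replay_log):
        self.semi_aggs = semi_aggs      # durable: semi-aggregates already emitted
        self.replay_log = replay_log    # durable: tuples seen since the last emit
        # Restart path: rebuild state from semi-aggs plus the replay log.
        self.pending = len(replay_log)
        self.count = sum(semi_aggs) + self.pending

    def on_tuple(self, tup):
        self.replay_log.append(tup)     # tuple is safe once it is in the log
        self.pending += 1
        self.count += 1
        return "ack"                    # so it can be acked right away

    def emit_semi_aggregate(self):
        # A real implementation would need these two steps to be atomic;
        # that detail is glossed over here.
        self.semi_aggs.append(self.pending)
        self.replay_log.clear()         # the log only lives until the next emit
        self.pending = 0


semi_aggs, replay_log = [], []
bolt = CounterBolt(semi_aggs, replay_log)
for t in range(5):
    bolt.on_tuple(t)
bolt.emit_semi_aggregate()
for t in range(3):
    bolt.on_tuple(t)

# Simulated crash: a fresh bolt is built from the durable state alone.
restarted = CounterBolt(semi_aggs, replay_log)
```

The restart is cheap because the replay log never holds more than the tuples received since the last semi-aggregate.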
  • 20. A Frozen Moment in Time
       [Diagram labels: Semi Agg; Hadoop Aggregator; Snap; Long agg]
       § Snapshot defines the dividing line
       § All data in the snap is long-time, all after is real-time
       § Semi-agg strategy allows clean combination of both kinds of data
       § Data synchronized snap not needed (if the snap is really a snap)
  • 21. Guarantees
       § Counter output volume is small-ish
         – the greater of k tuples per 100K inputs or k tuples/s
         – 1 tuple/s/label/bolt for this exercise
       § Persistence layer must provide guarantees
         – distributed against node failure
         – must have either readable flush or closed-append
       § HDFS is distributed, but provides no guarantees and strange semantics
       § MapRfs is distributed, provides all necessary guarantees
  • 22. Presentation Layer
       § Presentation must
         – read recent output of Logger bolt
         – read relevant output of Hadoop jobs
         – combine semi-aggregated records
       § User will see
         – counts that increment within 0-2 s of events
         – seamless and accurate meld of short and long-term data
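The combining step of the presentation layer, together with the snapshot boundary of slide 20, might look like the sketch below. The record layout `(position, label, count)`, the function name `blended_counts`, and the sample numbers are assumptions for the example; the slides do not specify a record format.

```python
from collections import Counter

def blended_counts(long_agg, snapshot_pos, semi_aggs):
    """Add semi-aggregates written after the snapshot to the Hadoop output.

    long_agg:     per-label counts covering everything up to the snapshot
    snapshot_pos: position of the snapshot in the semi-aggregate stream
    semi_aggs:    iterable of (position, label, count) records
    """
    total = Counter(long_agg)
    for pos, label, count in semi_aggs:
        if pos > snapshot_pos:   # at or before the snap is already in long_agg
            total[label] += count
    return total


long_agg = {"a": 100, "b": 40}   # made-up output of the Hadoop aggregator
snapshot_pos = 7
semi_aggs = [(6, "a", 3), (8, "a", 2), (9, "b", 1), (10, "c", 4)]
view = blended_counts(long_agg, snapshot_pos, semi_aggs)
```

The record at position 6 is skipped because the snapshot already contains it, which is how a precise boundary avoids both gaps and double counting.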
  • 23. The Basic Idea
       § Online algorithms generally have relatively small state (like counting)
       § Online algorithms generally have a simple update (like counting)
       § If we can do this with counting, we can do it with all kinds of algorithms
  • 24. Summary – Part 1
       § Semi-agg strategy + snapshots allows correct real-time counts
         – because addition is on-line and associative
       § Other on-line associative operations include:
         – k-means clustering (see Dan Filimon’s talk at 16.)
         – count distinct (see hyper-log-log counters from streamlib or kmv from Brickhouse)
         – top-k values
         – top-k (count(*)) (see streamlib)
         – contextual Bayesian bandits (see part 2 of this talk)
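As one example from the list above, top-k values can be maintained on-line and merged associatively. The sketch uses only Python's standard `heapq` module and is not taken from streamlib or Brickhouse; `K`, `update`, `merge`, and the sample values are chosen for the illustration.

```python
import heapq
from functools import reduce

K = 3  # how many of the largest values to keep

def update(state, value):
    """On-line step: the state never holds more than K values."""
    return heapq.nlargest(K, state + [value])

def merge(left, right):
    """Associative merge of two top-K states."""
    return heapq.nlargest(K, left + right)

values = [5, 1, 9, 3, 7, 8, 2]
long_time = reduce(update, values[:4], [])   # e.g. from the batch side
real_time = reduce(update, values[4:], [])   # e.g. from the streaming side
top = merge(long_time, real_time)
```

The merged state equals the top-K of the whole stream, so the same snapshot and semi-aggregate scheme used for counts applies.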
  • 25. Example 2 – AB testing in real-time
       § I have 15 versions of my landing page
       § Each visitor is assigned to a version
         – Which version?
       § A conversion or sale or whatever can happen
         – How long to wait?
       § Some versions of the landing page are horrible
         – Don’t want to give them traffic
  • 26. A Quick Diversion
       § You see a coin
         – What is the probability of heads?
         – Could it be larger or smaller than that?
       § I flip the coin and while it is in the air ask again
       § I catch the coin and ask again
       § I look at the coin (and you don’t) and ask again
       § Why does the answer change?
         – And did it ever have a single value?
  • 27. A Philosophical Conclusion
       § Probability as expressed by humans is subjective and depends on information and experience
  • 28. A Practical Application
  • 29. I Dunno
       [Plot: Prob(p) versus p, for p from 0 to 1]
  • 30. 5 heads out of 10 throws
       [Plot: Prob(p) versus p, for p from 0 to 1]
  • 31. 2 heads out of 12 throws
       [Plot: Prob(p) versus p, for p from 0 to 1, with the mean marked]
       Using any single number as a “best” estimate denies the uncertain nature of a distribution.
       Adding confidence bounds still loses most of the information in the distribution and prevents good modeling of the tails.
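The three plots can be reproduced from a short calculation. The sketch below assumes a uniform prior on p, which the slides do not state explicitly but which matches the `+1` terms in the `rbeta` call on slide 35. With h heads and t tails observed:

```latex
\begin{align}
\mathrm{Prob}(p \mid h, t) &\propto p^{h} (1 - p)^{t}
  \quad\Longrightarrow\quad p \mid h, t \sim \mathrm{Beta}(h + 1,\; t + 1) \\
\mathbb{E}[\,p \mid h, t\,] &= \frac{h + 1}{h + t + 2} \\
h = 0,\ t = 0 &:\quad \mathrm{Beta}(1, 1)\ \text{(flat)},\quad \text{mean} = \tfrac{1}{2} \\
h = 5,\ t = 5 &:\quad \mathrm{Beta}(6, 6),\quad \text{mean} = \tfrac{6}{12} = 0.5 \\
h = 2,\ t = 10 &:\quad \mathrm{Beta}(3, 11),\quad \text{mean} = \tfrac{3}{14} \approx 0.214
\end{align}
```

The only state needed is the pair (h, t), which is why learning here reduces to counting.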
  • 32. Bayesian Bandit
       § Compute distributions based on data
       § Sample p1 and p2 from these distributions
       § Put a coin in bandit 1 if p1 > p2
       § Else, put the coin in bandit 2
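The four steps above can be sketched as a small simulation. This is a hedged illustration rather than the talk's own code: it assumes Bernoulli payoffs with Beta posteriors under a uniform prior, and the payoff rates, step count, seed, and function names are made up for the example.

```python
import random

def select(counts, rng):
    """Sample one p per bandit from its Beta posterior and pick the largest.

    counts[i] is [failures, successes] for bandit i.
    """
    samples = [rng.betavariate(wins + 1, losses + 1) for losses, wins in counts]
    return max(range(len(counts)), key=samples.__getitem__)

def run(true_rates, steps, seed=0):
    """Play the bandits for `steps` coins and return the learned counts."""
    rng = random.Random(seed)
    counts = [[0, 0] for _ in true_rates]
    for _ in range(steps):
        i = select(counts, rng)                            # choose a bandit
        j = 1 if rng.random() < true_rates[i] else 0       # simulated payoff
        counts[i][j] += 1                                  # learning = counting
    return counts


counts = run([0.2, 0.8], steps=2000)   # bandit 2 is secretly much better
pulls = [sum(c) for c in counts]
```

Early on both bandits get tried because their posteriors are wide; as evidence accumulates, the samples for the weaker bandit rarely win and it stops receiving coins.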
  • 33. And it works!
       [Plot: regret versus n, for n from 0 to 1000, comparing ε-greedy (ε = 0.05) with Bayesian Bandit with Gamma-Normal]
  • 34. Video Demo
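  • 35. The Code
       § Select an alternative

           # k is an n x 2 matrix of counts, one row per alternative.
           # From the rbeta call, column 2 acts as successes and column 1 as failures.
           select = function(k) {
             n = dim(k)[1]
             p0 = rep(0, length.out=n)
             for (i in 1:n) {
               # sample a plausible conversion rate from the Beta posterior
               p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
             }
             return (which(p0 == max(p0)))
           }

       § Select and learn

           # The wrapper name "learn" and its arguments are assumed; the slide shows only the body.
           # test(i) is defined elsewhere and is assumed to return column index 1 or 2.
           learn = function(k, steps) {
             for (z in 1:steps) {
               i = select(k)
               j = test(i)
               k[i,j] = k[i,j]+1
             }
             return (k)
           }

       § But we already know how to count!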
  • 36. The Basic Idea
       § We can encode a distribution by sampling
       § Sampling allows unification of exploration and exploitation
       § Can be extended to more general response models
       § Note that learning here = counting = on-line algorithm
  • 37. Generalized Banditry
       § Suppose we have an infinite number of bandits
         – suppose they are each labeled by two real numbers x and y in [0,1]
         – also that expected payoff is a parameterized function of x and y:
             $E[z] = f(x, y \mid \theta)$
         – now assume a distribution for θ that we can learn online
       § Selection works by sampling θ, then computing f
       § Learning works by propagating updates back to θ
         – If f is linear, this is very easy
       § We are not limited to two labels; we could have labels and context
  • 38. Caveats
       § Original Bayesian Bandit only requires real-time
       § Generalized Bandit may require access to long history for learning
         – Pseudo online learning may be easier than true online
       § Bandit variables can include content, time of day, day of week
       § Context variables can include user id, user features
       § Bandit × context variables provide the real power
  • 39. § Contact:
         – tdunning@maprtech.com
         – @ted_dunning
       § Slides and such (just don’t believe the metrics):
         – http://slideshare.net/tdunning
       § Hash tags: #mapr #storm
  • 40. Thank You