IN-STREAM PROCESSING WITH KAFKA, STORM, CASSANDRA

Integral Ad Science
Niagara Team

Kiril Tsemekhman
Alexey Kharlamov
Rahul Ratnakar
Evelina Stepanova
Rafael Bagmanov
Konstantin Golikov
Anatoliy Vinogradov

Business goals
•  “Real-time” data availability
    •  Near-real-time update of fraud models
    •  Controlled data delay

•  Better hardware scaling
    •  Summarizing data as close to source as possible
    •  Better network utilization and reliability

Data flow
[Diagram. Labels: QLogs, Initial Events, DT Events, Filter, Score, Join, Aggregate, Evidence, Reporting Database, Reports]
•  1.6B records/day
•  Sustained 100K QPS
•  Peak 200K QPS

Data flow – Hadoop inefficiency hypothesis
•  Large batch architecture for offline processing
    •  Hadoop’s shuffle phase dumps data to disks
    •  Several times in some cases!!!

•  Active dataset fits into cluster memory
    •  Sessionization – 10s of GB
    •  Aggregation – 10s of GB

•  RAM is 1000s of times faster than HDD

In-stream processing – Benefits
•  Immediately available results
    •  Results are delivered with controlled delay (15 mins – 1 hour)
    •  Time-sensitive models (e.g. fraud) are updated in near real-time
    •  Data can be delivered to clients immediately

•  Efficient resource utilization
    •  Better scaling coefficient
    •  Smoother workload and bandwidth distribution
    •  Less resource overprovisioning

Non-Functional Requirements
•  Horizontal scalability
•  Limit on data loss (less than 0.1%)
•  Tolerance to single node failure
    •  Ops guys will sleep better at night
    •  It happens!!!
    •  Easy recovery

•  Maintenance
    •  No data loss on deployment
    •  Monitoring & alerting

Storm/Trident/Kafka/C* – Hybrid solution
[Diagram. Labels: Events (x3), QLogs, Kafka, Storm, Kafka, Cassandra, Exporter, Reporting DB, Reports]

Storm/Trident/Kafka/C* – Reliable processing
•  Storm & Trident transactions
    •  Data are processed by micro-batches (transactions)
    •  External storage used to keep state between transactions
    •  Automatic rollback to last checkpoint

•  Kafka – distributed queue manager
    •  Data feed replay for retry or recovery
    •  Load spikes are smoothed
    •  Cross-DC replication

•  Cassandra
    •  Key-value store for de-duplication
    •  Resilience based on replication

Our Storm distribution
•  Storm High-Availability
    •  Share Nimbus state through distributed cache

•  Metrics streaming to Graphite
•  Bug fixes
•  Packaging into RPM/DEB

Data Sources
[Diagram. Labels: Frontend Server, Server, Log Files, Tailer Agent, Checkpoint, Msg ... Msg, Mark]
•  Hard latency requirements
    •  10ms response
•  Read logs produced by front-end servers
•  Periodic checkpoints
•  Older data dropped

Message Jitter
[Diagram: messages from Server 1, Server 2 and Server 3 plotted over Time and merged into one Data feed]
•  Messages are arbitrarily reordered
•  Use jitter buffer and drop outliers
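
One way to read "use jitter buffer and drop outliers" is sketched below. It is a minimal illustration, not the team's implementation: the class name `JitterBuffer`, the `Msg` type and the fixed `window` parameter are assumptions made for this example. Messages are held in a priority queue ordered by timestamp, released once the newest timestamp seen is at least one window ahead of them, and dropped if they arrive after a newer message has already been released.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Reorders messages by timestamp within a fixed window and drops outliers (hypothetical sketch). */
final class JitterBuffer {
    static final class Msg {
        final long timestamp;
        final String payload;
        Msg(long timestamp, String payload) {
            this.timestamp = timestamp;
            this.payload = payload;
        }
    }

    private static final Comparator<Msg> BY_TIME = Comparator.comparingLong(m -> m.timestamp);

    private final long window;                           // how long to wait for stragglers (assumed parameter)
    private final PriorityQueue<Msg> pending = new PriorityQueue<Msg>(BY_TIME);
    private long maxSeen = Long.MIN_VALUE;               // newest timestamp observed so far
    private long lastReleased = Long.MIN_VALUE;          // timestamp of the last released message
    private int dropped = 0;

    JitterBuffer(long window) {
        this.window = window;
    }

    /** Accepts one message and returns the messages that are now safe to release, oldest first. */
    List<Msg> offer(Msg m) {
        List<Msg> ready = new ArrayList<>();
        if (m.timestamp < lastReleased) {
            dropped++;                                   // outlier: too late to be put back in order
            return ready;
        }
        pending.add(m);
        maxSeen = Math.max(maxSeen, m.timestamp);
        while (!pending.isEmpty() && pending.peek().timestamp <= maxSeen - window) {
            Msg next = pending.poll();
            lastReleased = next.timestamp;
            ready.add(next);
        }
        return ready;
    }

    int dropped() {
        return dropped;
    }
}
```

A larger window tolerates more reordering at the cost of added delay and memory; the value that suits a given feed is a tuning decision this sketch does not address.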
  	
  
Data flow
[Diagram repeated from the earlier “Data flow” slide: 1.6B records/day, sustained 100K QPS, peak 200K QPS]

Robot filtering
•  Blacklist/whitelist based
•  Simultaneous multi-pattern matching with Aho-Corasick algorithm

•  60x improvement over frequency-optimized String.indexOf

	
  	
  

	
  
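Join (Sessionization) – Algorithm
[Timeline diagram: events per impression inside a Join Window, across Transaction 1 and Transaction 2, along a Time axis
    Impression 1: Init, DT, DT, Timeout, DT
    Impression 2: Init, DT, DT, DT, Unload
    Impression 3: DT, DT, DT, Timeout
    Outcome labels: Emit, Emit, Drop]
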
Recovery strategy
[Diagram: Data Feed with Checkpoint N - 1 and Checkpoint N, Transaction N, Processing/Failure point, Rollback]
•  Read data in batches (transactions)
•  On success write checkpoint
•  On failure return to previous checkpoint
•  On catastrophic failure rewind data feed to a point before the problem started
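
The checkpoint and rollback steps above can be illustrated with a small in-memory sketch. Everything here is an assumption made for the example: the feed is a plain `List<String>`, the state is a running sum, and the class name `CheckpointedProcessor` is invented. In the deck's architecture the replayable feed is Kafka and the batching is done by Trident.

```java
import java.util.List;
import java.util.function.ToLongFunction;

/** Processes a replayable feed in batches and commits state only on success (hypothetical sketch). */
final class CheckpointedProcessor {
    private int offset = 0;   // checkpointed read position in the feed
    private long sum = 0;     // checkpointed state, here just a running total

    /** Processes one batch; returns true if it was committed, false if it was rolled back. */
    boolean processBatch(List<String> feed, int batchSize, ToLongFunction<String> handler) {
        int end = Math.min(feed.size(), offset + batchSize);
        long working = sum;                        // uncommitted copy of the state
        try {
            for (int i = offset; i < end; i++) {
                working += handler.applyAsLong(feed.get(i));
            }
        } catch (RuntimeException failure) {
            return false;                          // on failure: checkpoint untouched, batch can be retried
        }
        offset = end;                              // on success: write the new checkpoint
        sum = working;
        return true;
    }

    /** Catastrophic failure: rewind to an earlier checkpoint kept elsewhere. */
    void rewind(int earlierOffset, long earlierSum) {
        offset = earlierOffset;
        sum = earlierSum;
    }

    int offset() {
        return offset;
    }

    long sum() {
        return sum;
    }
}
```

Because the offset and the state are committed together, a retried batch starts from exactly the state that existed before the failed attempt, so a retry does not double count.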
  
Logical time
[Diagram: data feed of message timestamps 12:04, 12:01, 12:00, 12:05, 12:00, 11:59, 11:58, 12:00, 11:59, 12:00, with labels “Read point” and “Late Delivery”]
•  Wall-clock does not work
    •  Load spikes
    •  Recovery rewinds data feed to previous time

•  Logical clock
    •  Maximum timestamp seen by Bolt
    •  New messages with smaller timestamp are late
    •  No clock synchronization
    •  All bolts are in “weak synchrony”
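
The logical clock described above fits in a few lines. The sketch below assumes timestamps are plain `long` values (for instance minutes or milliseconds) and uses an invented class name; each bolt would hold its own instance, which is why no synchronization between bolts is needed.

```java
/** Per-bolt logical clock: time is the maximum message timestamp seen so far (hypothetical sketch). */
final class LogicalClock {
    private long now = Long.MIN_VALUE;

    /** Feeds one message timestamp to the clock; returns true if that message is late. */
    boolean observe(long timestamp) {
        if (timestamp < now) {
            return true;        // smaller than the maximum already seen: late delivery
        }
        now = timestamp;        // the clock only moves forward
        return false;
    }

    long now() {
        return now;
    }
}
```

Since the clock is driven by the data rather than by the machine, replaying an old part of the feed after a rewind produces the same notion of "now" as the first pass did.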
  
Join topology
[Topology diagram: one Spout and five Bolts. Stage annotations:
    •  Reads data from Kafka
    •  Parses logs and gets ASID
    •  Joins events with DT by ASID
    •  Stores joins for fast failover
 Tuple labels: tx1:event logs batch; tx1:ASID1:Event; tx1:ASID1:DT; tx1:ASID2:Event; ASID1: Event, DT; ASID2: Event
 State: ASID1 : Event, DTs; ASID2 : Event, DTs; ...]

Join (Sessionization) topology
•  Flow
    •  Microbatches of initial and ping events are read from Kafka
    •  Map ASID -> TTL, {Initial, ping_x} is stored in memory
    •  STATE is mirrored to Cassandra!!!

•  Failure recovery
    •  Lost state is recovered from Cassandra

•  Finalization – on each transaction:
    •  Process state is committed and messages are evicted to Kafka
    •  Evicted messages are deleted from Cassandra
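
The in-memory map `ASID -> TTL, {Initial, ping_x}` might look roughly like the sketch below. Several things are assumptions of this example rather than facts from the deck: the class and method names, the choice to start the TTL at the first event of a session, and the rule that sessions holding an initial event are emitted on timeout or unload while sessions without one are dropped (this rule is read off the algorithm diagram). Mirroring the state to Cassandra and emitting to Kafka are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** In-memory join of initial and ping events by ASID within a time window (hypothetical sketch). */
final class Sessionizer {
    static final class Session {
        final String asid;
        final long expiresAt;                        // TTL expressed as an absolute logical time
        boolean hasInitial;
        final List<String> pings = new ArrayList<>();
        Session(String asid, long expiresAt) {
            this.asid = asid;
            this.expiresAt = expiresAt;
        }
    }

    private final long window;
    private final Map<String, Session> open = new HashMap<>();

    Sessionizer(long window) {
        this.window = window;
    }

    void onInitial(String asid, long timestamp) {
        session(asid, timestamp).hasInitial = true;
    }

    void onPing(String asid, long timestamp, String ping) {
        session(asid, timestamp).pings.add(ping);
    }

    /** Unload closes the session early; returns it if it can be emitted, otherwise null. */
    Session onUnload(String asid) {
        Session s = open.remove(asid);
        return (s != null && s.hasInitial) ? s : null;
    }

    /** Evicts sessions whose TTL has passed; returns the ones to emit, the rest are dropped. */
    List<Session> expire(long now) {
        List<Session> emitted = new ArrayList<>();
        for (Iterator<Session> it = open.values().iterator(); it.hasNext();) {
            Session s = it.next();
            if (s.expiresAt <= now) {
                it.remove();
                if (s.hasInitial) {
                    emitted.add(s);
                }
            }
        }
        return emitted;
    }

    private Session session(String asid, long timestamp) {
        Session s = open.get(asid);
        if (s == null) {
            s = new Session(asid, timestamp + window);   // TTL starts at the first event seen
            open.put(asid, s);
        }
        return s;
    }
}
```

The `now` passed to `expire` would be a logical time derived from message timestamps rather than the wall clock.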
  
Aggregation
[Diagram: counter values per group (APC-1-1-1, APC-1-1-2, ..., APC-2-1-4) for Transaction 1, Transaction 2 and Transaction 3; column labels TX: 1, TX: 2, TX: 2R, TX: 3, with Failure and Retry markers. Values shown: 100, 57, 37, 14, 214, 32, 46, 37, 28, 214, 82]
•  Flow
    •  Group events by reporting dimensions
    •  Count events in groups
    •  Write counters to C*:
        •  Row key = PREFIX(GROUPID)
        •  Column key = SUFFIX(GROUPID) + TXID
        •  Value = COUNT(*)

•  Failure recovery
    •  Overwrite results of failed batch in Cassandra

•  Finalization
    •  Read data from Cassandra by parallel extractor
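
Putting the TXID into the column key is what makes "overwrite results of failed batch" safe: a retried batch writes to the same cells instead of adding to them. The sketch below imitates that layout with nested maps standing in for a Cassandra column family. The class name, the `:` separator and the rule that the prefix is everything before the first `-` of a group id such as `APC-1-1-1` are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Per-transaction counters keyed like the slide's row/column layout (hypothetical sketch). */
final class CounterStore {
    // rowKey -> (columnKey -> count); a stand-in for a Cassandra column family
    private final Map<String, Map<String, Long>> rows = new TreeMap<>();

    /** Counts events per group in one batch and writes each counter under the batch's TXID. */
    void writeBatch(long txId, List<String> groupIds) {
        Map<String, Long> counts = new HashMap<>();
        for (String g : groupIds) {
            counts.merge(g, 1L, Long::sum);
        }
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            // put() overwrites, so replaying the same TXID replaces the earlier value
            rows.computeIfAbsent(rowKey(e.getKey()), k -> new TreeMap<>())
                .put(columnPrefix(e.getKey()) + txId, e.getValue());
        }
    }

    /** Export side: sums the per-transaction columns of one group. */
    long total(String groupId) {
        Map<String, Long> row = rows.get(rowKey(groupId));
        if (row == null) {
            return 0L;
        }
        long total = 0L;
        String prefix = columnPrefix(groupId);
        for (Map.Entry<String, Long> e : row.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                total += e.getValue();
            }
        }
        return total;
    }

    // Assumed split rule: PREFIX is the part before the first '-', SUFFIX is the rest.
    private static String rowKey(String groupId) {
        return groupId.substring(0, groupId.indexOf('-'));
    }

    private static String columnPrefix(String groupId) {
        return groupId.substring(groupId.indexOf('-') + 1) + ":";
    }
}
```

The sketch assumes a retried batch covers the same groups as the failed attempt; if a retry could contain different records, columns left by the failed attempt for groups absent from the retry would need separate cleanup.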
  
	
  
Cassandra Data Export
•  Delayed data export to accommodate jitter
•  Parallel read from several points on the ring
•  On major incident recovery – re-export data
•  Dirty hack - aggregation topology should stream data to analytical database directly
[Diagram: ring of Cassandra Nodes with three “Read point” markers]
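
"Parallel read from several points on the ring" can be pictured as cutting the token ring into contiguous slices and giving each slice to its own reader. The sketch below only computes the slices. It assumes the Murmur3Partitioner token range of -2^63 to 2^63 - 1; the deck does not say which partitioner the cluster used, and the class name `RingSplitter` is invented.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

/** Splits the token ring into contiguous, inclusive [start, end] ranges (hypothetical sketch). */
final class RingSplitter {
    static List<long[]> split(int parts) {
        BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
        BigInteger ringSize = BigInteger.ONE.shiftLeft(64);      // number of tokens, assuming Murmur3 range
        BigInteger n = BigInteger.valueOf(parts);
        List<long[]> ranges = new ArrayList<>();
        for (int i = 0; i < parts; i++) {
            // BigInteger avoids overflow when multiplying the ring size by the slice index
            BigInteger start = min.add(ringSize.multiply(BigInteger.valueOf(i)).divide(n));
            BigInteger next = min.add(ringSize.multiply(BigInteger.valueOf(i + 1)).divide(n));
            ranges.add(new long[] {
                start.longValueExact(),
                next.subtract(BigInteger.ONE).longValueExact()
            });
        }
        return ranges;
    }
}
```

Each range could then drive one reader thread that restricts its query with the CQL `token()` function, for example `WHERE token(key) >= ? AND token(key) <= ?`, where `key` is a placeholder for the table's partition key.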
APC aggregation topology (snippet)

    // Read transformed events from Kafka and extract the grouping fields.
    Stream stream = topology.
        newStream("input-topic-reading-spout",
                  new OpaqueTridentKafkaSpout(config.groupingTridentKafkaConfig())).
        shuffle().
        each(fields("transformed-event"),
             new ExtractFields("niagara.storm.adpoc.group.ApcGroupingFieldsParser"),
             fields("grouping"));

    // Group by reporting dimensions and keep per-group counts in Cassandra.
    stream.groupBy(fields("bucket", "grouping")).name("aggregation").
        persistentAggregate(/* cassandra store */,
                            fields("transformed-event"),
                            new FirewallRecordCount(false),
                            fields("value"));

Performance environment
[Diagram. Labels: Amazon S3, Reader, Storm (x3), C* (x3), hi1.4xlarge (x7)]

Performance environment - VMs

                   Spec
Instance type      vCPU cores    Memory (GB)    Storage (GB)
6 hi1.4xlarge      16            60             2 x 1,024 SSD

•  Network 10Gbit/sec
•  SSD
    •  16000 IOPS
    •  Sequential write – 300MB/sec x 2

Aggregation: Apc + S05 + L04

Aggregation – Write performance

                       Storm                      Cassandra
Report    Size,        T,      CPU,    Net,       CPU,    Net,       Disk W,
name      K rows       K/s     %       MB/s       %       MB/s       MB/s
apc       29           220     83      127        10      1.5/1.5    1.3
l04       249          200     80      125        12      5/2.5      5
s05       7400         120     85      80         40      15/7       30
United    7678         110     87      70         50      15/7       36

•  Network saturation observed
•  Throughput 10x required value

Aggregation – Read performance

Report    Size, Krows    Threads    Time, sec    Write, KM/sec
S05       7400           10         95           60
S05       7400           3          150          80
APC       29             3          15           120
L04       249            3          25           120

•  Read of 1 hour worth of data takes 2.5 minutes
•  Moderate performance degradation observed

Aggregation – Long transactions
•  Pros
    •  Better aggregation ratio
    •  Faster report export

•  Cons
    •  Batch type of workload (spiky)
    •  Longer recovery

•  Test results
    •  Performance up to 200 Kmsg/sec
    •  Cassandra unstable due to node failures (GC)

Join Topology – Output Stream
[Bar chart: output stream in Krecord/sec (axis 0 to 30, also labelled KMsg/psec) for four configurations: “C: ON, U: OFF, S: ON”, “C: ON, U: ON, S: ON”, “C: OFF, U: ON, S: ON”, “C: OFF, U: ON, S: OFF”. Data labels: 28, 20, 12.5, 13]

Join Topology – In-memory state size
•  Stream frequency 9.7 Kmsg/lsec
•  Join window – 15 minutes

Unload    State, K/task    State size, Mb/task    Total heap, GB    Total state, GB
OFF       500              500                    60                24
ON        230              240                    36                11.5

Scalability – Test environment
[Diagram. Labels: Storm (x3), C* (x3), m2.4xlarge (x3), hi1.4xlarge (x6)]

Scalability – Test results
[Chart: Output Stream(vCPU), Kmsg/sec (axis 0 to 50) at 24, 48 and 72 vCPUs, series C:OFF and C:ON. Data labels: 43, 27, 16, 9, 6]

Join topology - Conclusion
•  Linear scaling limited by CPU
•  72GB RAM for 10Kmsg/sec 30 min window
•  6 hi1.4xlarge instances can process
    •  16 Kmsg/sec (lower bound)
    •  28 Kmsg/sec (upper bound)
•  Optimization
    •  Stateless join topology

Topology performance considerations
● Avoid text and regular expressions. POJO are friends
● Network bandwidth is important (saturated at 1Gb/s)
● Parsing is heavy (deserialization included)
● Track cases when tuple crosses worker boundary
● Keep spouts separate from parsers
● Lazy parsing (extract fields when you really have to)
● Less garbage with shorter lifetime
● Easier profiling
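
"Lazy parsing" can be as simple as carrying the raw record through the topology and splitting it only when a field is first requested. The sketch below assumes a tab-separated record and an invented `LazyEvent` class; the real log format is not given in the deck.

```java
/** Event that keeps the raw line and parses it only on first field access (hypothetical sketch). */
final class LazyEvent {
    private final String raw;
    private String[] fields;    // stays null until a field is actually needed

    LazyEvent(String raw) {
        this.raw = raw;
    }

    /** The unparsed record, for stages that only forward or store it. */
    String raw() {
        return raw;
    }

    /** Returns one field, splitting the record on tabs the first time any field is requested. */
    String field(int index) {
        if (fields == null) {
            fields = raw.split("\t", -1);   // assumed format: tab-separated columns
        }
        return fields[index];
    }

    boolean isParsed() {
        return fields != null;
    }
}
```

Records that are filtered out or merely forwarded never pay the parsing cost, and the short-lived field array is only allocated for records that need it.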
  
Cassandra performance
● Garbage Collector is main performance limit
    ○ Cassandra uses Concurrent-Mark-Sweep (CMS) by default
    ○ If CMS cannot free enough space, JVM will fall back to single-threaded serial collector (‘stop the world’)
    ○ With 12 GB heap size serial collector might take up to 60 seconds

Cassandra: conclusions
● Garbage collection
    ○ Spiky (batch type) workload is bad for C*
    ○ The smaller the cluster, the fewer heavy-write column families you can have
● Wide tables are better than narrow tables
    ○ 1 row with 10 columns is better than 10 rows with 1 column in terms of throughput

Questions?

Alexey Kharlamov
@aih1013
aharlamov@gmail.com
Kiril Tsemekhman
kiril@integralads.com

Rahul Ratnakar
rahul@integralads.com

Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC Storm User Group Meetup - 21st Nov 2013

High Throughput Analytics with Cassandra & Azure
High Throughput Analytics with Cassandra & AzureHigh Throughput Analytics with Cassandra & Azure
High Throughput Analytics with Cassandra & AzureDataStax Academy
 
Micro-batching: High-performance writes
Micro-batching: High-performance writesMicro-batching: High-performance writes
Micro-batching: High-performance writesInstaclustr
 
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC Storm User Group Meetup - 21st Nov 2013

  • 1. IN-STREAM PROCESSING WITH KAFKA, STORM, CASSANDRA • Integral Ad Science • Niagara Team • Kiril Tsemekhman • Alexey Kharlamov • Rahul Ratnakar • Evelina Stepanova • Rafael Bagmanov • Konstantin Golikov • Anatoliy Vinogradov
  • 2. Business goals • “Real-time” data availability • Near-real-time update of fraud models • Controlled data delay • Better hardware scaling • Summarizing data as close to source as possible • Better network utilization and reliability
  • 3. Data flow • 1.6B records/day • Sustained 100K QPS • Peak 200K QPS [Diagram labels: QLogs, Initial Events, DT Events, Filter, Join, Score, Aggregate, Evidence, Reporting Database, Reports]
  • 4. Data flow – Hadoop inefficiency hypothesis • Large batch architecture for offline processing • Hadoop’s shuffle phase dumps data to disks • Several times in some cases!!! • Active dataset fits into cluster memory • Sessionization – tens of GB • Aggregation – tens of GB • RAM is thousands of times faster than HDD
  • 5. In-stream processing – Benefits • Immediately available results • Results are delivered with controlled delay (15 min – 1 hour) • Time-sensitive models (e.g. fraud) are updated in near real time • Data can be delivered to clients immediately • Efficient resource utilization • Better scaling coefficient • Smoother workload and bandwidth distribution • Less resource overprovisioning
  • 6. Non-Functional Requirements • Horizontal scalability • Limit on data loss (less than 0.1%) • Tolerance to single node failure • Ops guys will sleep better at night • It happens!!! • Easy recovery • Maintenance • No data loss on deployment • Monitoring & alerting
  • 7. Storm/Trident/Kafka/C* – Hybrid solution [Diagram labels: Events, QLogs, Kafka, Storm, Kafka, Cassandra, Exporter, Reporting DB, Reports]
  • 8. Storm/Trident/Kafka/C* – Reliable processing • Storm & Trident transactions • Data are processed by micro-batches (transactions) • External storage used to keep state between transactions • Automatic rollback to last checkpoint • Kafka – distributed queue manager • Data feed replay for retry or recovery • Load spikes are smoothed • Cross-DC replication • Cassandra • Key-value store for de-duplication • Resilience based on replication
  • 9. Our Storm distribution • Storm High-Availability • Share Nimbus state through distributed cache • Metrics streaming to Graphite • Bug fixes • Packaging into RPM/DEB
  • 10.
  • 11. Data Sources [Diagram labels: Frontend Server, Server, Tailer Agent, Msg ... Msg, Mark, Log Files, Checkpoint] • Hard latency requirements • 10 ms response • Read logs produced by front-end servers • Periodic checkpoints • Older data dropped
  • 12. Message Jitter [Diagram labels: Time, Server 1, Server 2, Server 3, Data feed] • Messages are arbitrarily reordered • Use jitter buffer and drop outliers
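The jitter-buffer idea on slide 12 can be illustrated with a small sketch. It is not the team's implementation: the class name `JitterBuffer`, the fixed window, and the use of bare timestamps in place of full messages are all assumptions made for brevity. Messages are held in a min-heap until they are older than the newest timestamp seen minus the window; anything arriving even later than that is counted as an outlier and dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of a jitter buffer; timestamps stand in for messages.
final class JitterBuffer {
    private final long windowMillis;                      // how much reordering is tolerated
    private final PriorityQueue<Long> pending = new PriorityQueue<>();
    private long maxSeen = Long.MIN_VALUE;                // newest timestamp observed so far
    private long dropped = 0;

    JitterBuffer(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Accepts one timestamp and returns the timestamps that are now safe to emit, in order. */
    List<Long> offer(long ts) {
        List<Long> ready = new ArrayList<>();
        if (maxSeen != Long.MIN_VALUE && ts < maxSeen - windowMillis) {
            dropped++;                                    // outlier: too late to be reordered
            return ready;
        }
        pending.add(ts);
        maxSeen = Math.max(maxSeen, ts);
        long watermark = maxSeen - windowMillis;
        while (!pending.isEmpty() && pending.peek() <= watermark) {
            ready.add(pending.poll());                    // emitted in non-decreasing order
        }
        return ready;
    }

    long droppedCount() {
        return dropped;
    }
}
```

The window trades latency for completeness: a larger window drops fewer outliers but delays every message by that amount.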
  • 13. Data flow • 1.6B records/day • Sustained 100K QPS • Peak 200K QPS [Diagram labels: QLogs, Initial Events, DT Events, Filter, Join, Score, Aggregate, Evidence, Reporting Database, Reports]
  • 14. Robot filtering • Blacklist/whitelist based • Simultaneous multi-pattern matching with Aho-Corasick algorithm • 60x improvement over frequency optimized String.indexOf
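For readers unfamiliar with Aho-Corasick, the following is a minimal textbook sketch of the algorithm named on slide 14. It is not the code used in the talk; the class and method names are hypothetical, and a production version would likely use array-based transitions instead of hash maps. All patterns are matched in a single pass over the text, which is what makes it attractive compared with one `String.indexOf` call per blacklist entry.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch: classic Aho-Corasick automaton over char transitions.
final class AhoCorasick {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> outputs = new ArrayList<>();   // patterns ending at this node
        Node fail;                                        // longest proper suffix that is also a trie prefix
    }

    private final Node root = new Node();

    AhoCorasick(List<String> patterns) {
        // 1. Build the trie of all non-empty patterns.
        for (String p : patterns) {
            if (p.isEmpty()) continue;
            Node n = root;
            for (char c : p.toCharArray()) {
                n = n.next.computeIfAbsent(c, k -> new Node());
            }
            n.outputs.add(p);
        }
        // 2. Breadth-first pass to compute failure links and inherited outputs.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node cur = queue.remove();
            for (Map.Entry<Character, Node> e : cur.next.entrySet()) {
                Node child = e.getValue();
                Node f = cur.fail;
                while (f != root && !f.next.containsKey(e.getKey())) {
                    f = f.fail;
                }
                Node target = f.next.get(e.getKey());
                child.fail = (target != null) ? target : root;
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    /** Returns every pattern occurrence found in the text, in order of match end position. */
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Node n = root;
        for (char c : text.toCharArray()) {
            while (n != root && !n.next.containsKey(c)) {
                n = n.fail;
            }
            Node t = n.next.get(c);
            n = (t != null) ? t : root;
            hits.addAll(n.outputs);
        }
        return hits;
    }

    boolean matchesAny(String text) {
        return !findAll(text).isEmpty();
    }
}
```

A blacklist check then becomes a single `matchesAny(userAgent)` call against an automaton built once at startup; the user-agent field is only an example of what might be matched.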
  • 15. Join (Sessionization) – Algorithm [Diagram: three impressions on a time axis spanning Transaction 1 and Transaction 2, with a Join Window; Impression 1: Init, DT, DT, Timeout, Emit; Impression 2: Init, DT, DT, DT, Unload, Emit; Impression 3: DT, DT, DT, Timeout, Drop]
  • 16. Recovery strategy [Diagram labels: Data Feed, Checkpoint N - 1, Checkpoint N, Transaction N, Processing/Failure point, Rollback] • Read data in batches (transaction) • On success write checkpoint • On failure return to previous checkpoint • On catastrophic failure rewind data feed to a point before the problem started
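The recovery bullets on slide 16 can be sketched as a simple loop. This is an illustration only, not the Storm/Trident mechanism itself: `CheckpointedReader`, the in-memory list standing in for the replayable Kafka feed, and the `maxRetries` parameter are all assumptions. The sketch also assumes the processor is all-or-nothing for a batch, so replaying a failed batch is safe.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: batch processing with checkpoint-on-success and replay-on-failure.
final class CheckpointedReader {
    private int checkpoint = 0;   // offset of the first record not yet committed

    /** Processes the feed in batches and returns how many batch attempts had to be retried. */
    int run(List<String> feed, int batchSize, Consumer<List<String>> processor, int maxRetries) {
        int retries = 0;
        while (checkpoint < feed.size()) {
            int end = Math.min(checkpoint + batchSize, feed.size());
            try {
                processor.accept(feed.subList(checkpoint, end));
                checkpoint = end;                 // success: write the new checkpoint
            } catch (RuntimeException ex) {
                // failure: the checkpoint is unchanged, so the same batch is read again
                if (++retries > maxRetries) {
                    throw ex;
                }
            }
        }
        return retries;
    }

    /** Catastrophic failure: rewind the feed to a point before the problem started. */
    void rewindTo(int offset) {
        checkpoint = offset;
    }

    int checkpoint() {
        return checkpoint;
    }
}
```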
  • 17. Logical time [Diagram: message timestamps 12:04, 12:01, 12:00, 12:05, 12:00, 11:59, 11:58, 12:00, 11:59, 12:00; labels: Late Delivery, Read point] • Wall-clock does not work • Load spikes • Recovery rewinds data feed to previous time • Logical clock • Maximum timestamp seen by Bolt • New messages with smaller timestamp are late • No clock synchronization • All bolts are in “weak synchrony”
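The logical clock described on slide 17 is small enough to show directly. A minimal sketch, assuming millisecond timestamps and a hypothetical class name `LogicalClock`: each bolt keeps the maximum event timestamp it has seen and treats any message with a smaller timestamp as late.

```java
// Hypothetical sketch of a per-bolt logical clock driven by event timestamps.
final class LogicalClock {
    private long maxSeenMillis = Long.MIN_VALUE;

    /**
     * Feeds one event timestamp into the clock.
     *
     * @return true if the event is late, i.e. its timestamp is smaller than the maximum seen so far
     */
    boolean observe(long eventMillis) {
        if (eventMillis < maxSeenMillis) {
            return true;                  // late delivery: the clock never moves backwards
        }
        maxSeenMillis = eventMillis;      // the clock advances with the data, not the wall clock
        return false;
    }

    /** Current logical time: the largest event timestamp observed so far. */
    long now() {
        return maxSeenMillis;
    }
}
```

Because the clock is derived from the data, it keeps working when a recovery rewinds the feed or a load spike delays processing, which is where a wall clock would give wrong answers.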
  • 18. Join topology • Reads data from Kafka • Parses logs and gets ASID • Joins events with DT by ASID • Stores joins for fast failover [Diagram labels: Spout; Bolt; tx1:event logs batch; tx1:ASID1:Event; tx1:ASID1:DT; tx1:ASID2:Event; ASID1: Event, DT; ASID2: Event; ASID1: Event, DTs; ASID2: Event, DTs; ...]
  • 19. Join (Sessionization) topology • Flow • Microbatches of initial and ping events are read from Kafka • Map ASID -> TTL, {Initial, ping_x} is stored in memory • STATE is mirrored to Cassandra!!! • Failure recovery • Lost state is recovered from Cassandra • Finalization – on each transaction: • Process state is committed and messages are evicted to Kafka • Evicted messages are deleted from Cassandra
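The in-memory part of the flow on slide 19 (the map ASID -> TTL, {Initial, ping_x}) might look roughly like the sketch below. Everything here is an assumption for illustration: the class and method names, the fixed TTL set when the first event for an ASID arrives, and the string format of the emitted record. Mirroring to Cassandra and publishing to Kafka are left out. Sessions that expire without an initial event are dropped, matching the Emit and Drop outcomes on the algorithm slide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the in-memory sessionization state keyed by ASID.
final class SessionJoiner {
    static final class Session {
        String initial;                                   // null until the initial event arrives
        final List<String> pings = new ArrayList<>();
        long expiresAt;                                   // TTL, in logical-time milliseconds
    }

    private final long windowMillis;
    private final Map<String, Session> byAsid = new HashMap<>();

    SessionJoiner(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    void onInitial(String asid, String event, long now) {
        session(asid, now).initial = event;
    }

    void onPing(String asid, String ping, long now) {
        session(asid, now).pings.add(ping);
    }

    private Session session(String asid, long now) {
        return byAsid.computeIfAbsent(asid, k -> {
            Session s = new Session();
            s.expiresAt = now + windowMillis;             // assumption: window starts at first event
            return s;
        });
    }

    /** Evicts expired sessions; emits a joined record only if the initial event was seen. */
    List<String> evictExpired(long now) {
        List<String> emitted = new ArrayList<>();
        Iterator<Map.Entry<String, Session>> it = byAsid.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Session> e = it.next();
            Session s = e.getValue();
            if (s.expiresAt <= now) {
                if (s.initial != null) {
                    emitted.add(e.getKey() + ":" + s.initial + "+" + s.pings.size());
                }
                it.remove();                              // emitted or dropped, the state is released
            }
        }
        return emitted;
    }

    int size() {
        return byAsid.size();
    }
}
```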
  • 20. [Diagram: counter columns TX: 1, TX: 2, TX: 2R (retry after failure) and TX: 3 for row keys APC-1-1-1, APC-1-1-2, ..., APC-2-1-4, written by Transaction 1, Transaction 2 and Transaction 3]
  • 21. Aggregation • Flow • Group events by reporting dimensions • Count events in groups • Write counters to C*: • Row key = PREFIX(GROUPID) • Column key = SUFFIX(GROUPID) + TXID • Value = COUNT(*) • Failure recovery • Overwrite results of failed batch in Cassandra • Finalization • Read data from Cassandra by parallel extractor
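The key layout on slide 21 is what makes a retried batch safe: because the transaction ID is part of the column key, a replay writes to the same cells and overwrites the failed attempt instead of double counting. The sketch below illustrates this with an in-memory map standing in for Cassandra. The class name, the rule of splitting the group ID at its last `-`, and the sample numbers are assumptions for illustration only, and the sketch assumes a replayed batch covers the same groups as the failed attempt.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: rowKey -> (columnKey -> count), an in-memory stand-in for a column family.
final class CounterStore {
    private final Map<String, Map<String, Long>> rows = new HashMap<>();

    // Assumed split rule: everything before the last '-' is the prefix, the rest is the suffix.
    private static String prefix(String groupId) {
        int cut = groupId.lastIndexOf('-');
        return cut < 0 ? groupId : groupId.substring(0, cut);
    }

    private static String suffix(String groupId) {
        int cut = groupId.lastIndexOf('-');
        return cut < 0 ? "" : groupId.substring(cut + 1);
    }

    /** Writes one batch of per-group counts; calling it again with the same txId overwrites. */
    void writeBatch(long txId, Map<String, Long> countsByGroup) {
        for (Map.Entry<String, Long> e : countsByGroup.entrySet()) {
            String columnKey = suffix(e.getKey()) + ":" + txId;   // SUFFIX(GROUPID) + TXID
            rows.computeIfAbsent(prefix(e.getKey()), k -> new TreeMap<>())
                .put(columnKey, e.getValue());
        }
    }

    /** Export step: sums the counters of one group across all transactions. */
    long total(String groupId) {
        Map<String, Long> row = rows.getOrDefault(prefix(groupId), Collections.emptyMap());
        String columnPrefix = suffix(groupId) + ":";
        long sum = 0;
        for (Map.Entry<String, Long> cell : row.entrySet()) {
            if (cell.getKey().startsWith(columnPrefix)) {
                sum += cell.getValue();
            }
        }
        return sum;
    }
}
```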
  • 22. Cassandra Data Export • Delayed data export to accommodate jitter • Parallel read from several points on the ring • On major incident recovery – re-export data • Dirty hack - aggregation topology should stream data to analytical database directly [Diagram labels: Cassandra Nodes, Read point (three positions on the ring)]
  • 23. APC aggregation topology (snippet)
      // Read transformed events from Kafka and extract the grouping fields.
      Stream stream = topology
              .newStream("input-topic-reading-spout",
                      new OpaqueTridentKafkaSpout(config.groupingTridentKafkaConfig()))
              .shuffle()
              .each(fields("transformed-event"),
                      new ExtractFields("niagara.storm.adpoc.group.ApcGroupingFieldsParser"),
                      fields("grouping"));
      // Count events per (bucket, grouping) and persist the counters to Cassandra.
      stream.groupBy(fields("bucket", "grouping")).name("aggregation")
              .persistentAggregate(/* cassandra store */,
                      fields("transformed-event"),
                      new FirewallRecordCount(false),
                      fields("value"));
  • 24. Performance environment [Diagram labels: Amazon S3, Reader, Storm (x3), C* (x3), each on a hi1.4xlarge instance]
  • 25. Performance environment - VMs • Instance type: 6 x hi1.4xlarge • vCPU cores: 16 • Memory (GB): 60 • Storage (GB): 2 x 1,024 SSD • Network 10 Gbit/sec • SSD • 16,000 IOPS • Sequential write – 300 MB/sec x 2
  • 26. Aggregation: Apc + S05 + L04
  • 27. Aggregation – Write performance
      Report name   Size, Krows   T, K/s   Storm CPU, %   Storm Net, MB/s   Cassandra CPU, %   Cassandra Net, MB/s   Disk W, MB/s
      apc           29            220      83             127               10                 1.5/1.5               1.3
      l04           249           200      80             125               12                 5/2.5                 5
      s05           7400          120      85             80                40                 15/7                  30
      United        7678          110      87             70                50                 15/7                  36
    • Network saturation observed • Throughput 10x required value
  • 28. Aggregation – Read performance
      Report   Size, Krows   Threads   Time, sec   Write, KM/sec
      S05      7400          10        95          60
      S05      7400          3         150         80
      APC      29            3         15          120
      L04      249           3         25          120
    • Read of 1 hour worth of data takes 2.5 minutes • Moderate performance degradation observed
  • 29. Aggregation – Long transactions • Pros • Better aggregation ratio • Faster report export • Cons • Batch type of workload (spiky) • Longer recovery • Test results • Performance up to 200 Kmsg/sec • Cassandra unstable due to node failures (GC)
  • 30. Join Topology – Output Stream [Bar chart: output stream in Krecord/sec for the configurations C: ON, U: OFF, S: ON; C: ON, U: ON, S: ON; C: OFF, U: ON, S: ON; C: OFF, U: ON, S: OFF; bar labels 28, 20, 12.5 and 13]
  • 31. Join Topology – In-memory state size • Stream frequency 9.7 Kmsg/sec • Join window – 15 minutes
      Unload   State, K/task   State size, MB/task   Total heap, GB   Total state, GB
      OFF      500             500                   60               24
      ON       230             240                   36               11.5
  • 33. Scalability – Test results [Chart: output stream in Kmsg/sec versus vCPU count (24, 48, 72) for C: OFF and C: ON; data labels 43, 27, 16, 9 and 6]
  • 34. Join topology - Conclusion • Linear scaling limited by CPU • 72 GB RAM for 10 Kmsg/sec, 30 min window • 6 hi1.4xlarge instances can process • 16 Kmsg/sec (lower bound) • 28 Kmsg/sec (upper bound) • Optimization • Stateless join topology
  • 35. Topology performance considerations ● Avoid text and regular expressions. POJOs are your friends ● Network bandwidth is important (saturated at 1 Gb/s) ● Parsing is heavy (deserialization included) ● Track cases when a tuple crosses a worker boundary ● Keep spouts separate from parsers ● Lazy parsing (extract fields only when you really have to) ● Less garbage with shorter lifetime ● Easier profiling
  • 36. Cassandra performance ● Garbage Collector is the main performance limit ○ Cassandra uses Concurrent-Mark-Sweep (CMS) by default ○ If CMS cannot free enough space, the JVM will fall back to the single-threaded serial collector (‘stop the world’) ○ With a 12 GB heap, the serial collector might take up to 60 seconds
  • 37. Cassandra: conclusions ● Garbage collection ○ Spiky (batch type) workload is bad for C* ○ The smaller the cluster, the fewer heavy-write column families you can have ● Wide tables are better than narrow tables ○ 1 row with 10 columns is better than 10 rows with 1 column in terms of throughput
  • 38. Questions? • Alexey Kharlamov, @aih1013, aharlamov@gmail.com • Kiril Tsemekhman, kiril@integralads.com • Rahul Ratnakar, rahul@integralads.com