At Integral, we process heavy volumes of click-stream traffic: 50K QPS of ad impressions at peak, and close to 200K QPS across all browser calls. We build analytics on these streams of data. Two applications require quite significant computational effort: sessionization and fraud detection.
Sessionization means linking a series of requests from the same browser into a single record. There can be five or more requests, spread over 15-30 minutes, that we need to link to each other.
Fraud detection examines various signals in browser requests, together with substantial historical evidence data, to classify each ad impression as either legitimate or fraudulent.
We've been doing both (as well as all other analytics) in batch mode, once an hour at best. Both processes, and fraud detection in particular, are time-sensitive and much more meaningful if done in near-real-time.
This talk is about our experience migrating a once-per-day offline batch processing of impression data on Hadoop to in-memory stream processing with Kafka, Storm and Cassandra. We will touch upon our choices and our reasoning for selecting the products used for this solution.
Hadoop is no longer the only, or even always the preferred, option in the Big Data space. In-memory stream processing can be more effective for time-series data preparation and aggregation, and the ability to scale at a significantly lower cost means more customers, better accuracy and better business practices. Since only in-stream processing allows for low-latency data and insight delivery, it opens entirely new opportunities. However, transitioning a non-trivial data pipeline raises a number of questions previously hidden within the offline nature of batch processing: How do you join several data feeds? How do you implement failure recovery? In addition to handling terabytes of data per day, our streaming system has to be guided by the following considerations:
• Recovery time
• Time relativity and continuity
• Geographical distribution of data sources
• Limit on data loss
• Maintainability
The system produces complex cross-correlation analysis of several data feeds and aggregations for client analytics, with an input feed frequency of up to 100K msg/sec.
This presentation will benefit anyone interested in an alternative approach to big data analytics, especially the process of joining multiple streams in memory using Cassandra. It will also highlight optimization patterns that can be useful in similar situations.
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC Storm User Group Meetup - 21st Nov 2013
1. IN-STREAM PROCESSING WITH KAFKA, STORM, CASSANDRA
Integral Ad Science, Niagara Team: Kiril Tsemekhman, Alexey Kharlamov, Rahul Ratnakar, Evelina Stepanova, Rafael Bagmanov, Konstantin Golikov, Anatoliy Vinogradov
2. Business goals
• "Real-time" data availability
• Near-real-time update of fraud models
• Controlled data delay
• Better hardware scaling
• Summarizing data as close to the source as possible
• Better network utilization and reliability
3. Data flow
1.6B records/day; sustained 100K QPS, peak 200K QPS
[Data-flow diagram; stages: Join, Score, Filter, Aggregate; inputs: QLogs, DT Events, Evidence, Initial Events; output: Reporting Database serving Reports.]
4. Data flow – Hadoop inefficiency hypothesis
• Large batch architecture for offline processing
• Hadoop's shuffle phase dumps data to disk
  • Several times in some cases!!!
• Active dataset fits into cluster memory
  • Sessionization – 10s of GB
  • Aggregation – 10s of GB
• RAM is 1000s of times faster than HDD
5. In-stream processing – Benefits
• Immediately available results
  • Results are delivered with a controlled delay (15 min – 1 hour)
  • Time-sensitive models (e.g. fraud) are updated in near real-time
  • Data can be delivered to clients immediately
• Efficient resource utilization
  • Better scaling coefficient
  • Smoother workload and bandwidth distribution
  • Less resource overprovisioning
6. Non-Functional Requirements
• Horizontal scalability
• Limit on data loss (less than 0.1%)
• Tolerance to single-node failure
  • Ops guys will sleep better at night
  • It happens!!!
• Easy recovery
• Maintenance
  • No data loss on deployment
  • Monitoring & alerting
7. Storm/Trident/Kafka/C* – Hybrid solution
[Architecture diagram: event feeds and QLogs flow into Kafka; Storm consumes from Kafka and uses Cassandra for state; an Exporter moves results into the Reporting DB, which serves Reports.]
8. Storm/Trident/Kafka/C* – Reliable processing
• Storm & Trident transactions
  • Data are processed in micro-batches (transactions)
  • External storage is used to keep state between transactions
  • Automatic rollback to the last checkpoint
• Kafka – distributed queue manager
  • Data feed replay for retry or recovery
  • Load spikes are smoothed
  • Cross-DC replication
• Cassandra
  • Key-value store for de-duplication (see the sketch below)
  • Resilience based on replication
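To illustrate the de-duplication idea, here is a minimal sketch. The DedupStore interface and the in-memory stand-in are our own for the example; in the system described, the store would be a Cassandra column family rather than a map.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /** Sketch of key-value de-duplication: an event is processed only if its
     *  id has not been seen before. Names are illustrative; the real store in
     *  the deck is Cassandra, not this in-memory map. */
    interface DedupStore {
        /** Atomically records the id; returns true if it was not present before. */
        boolean putIfAbsent(String eventId);
    }

    final class InMemoryDedupStore implements DedupStore {
        private final Map<String, Boolean> seen = new ConcurrentHashMap<>();
        @Override public boolean putIfAbsent(String eventId) {
            return seen.putIfAbsent(eventId, Boolean.TRUE) == null;
        }
    }

    final class DedupFilter {
        private final DedupStore store;
        DedupFilter(DedupStore store) { this.store = store; }

        /** Returns true when the event should be processed (first occurrence). */
        boolean accept(String eventId) { return store.putIfAbsent(eventId); }

        public static void main(String[] args) {
            DedupFilter f = new DedupFilter(new InMemoryDedupStore());
            System.out.println(f.accept("evt-1")); // true  - first time
            System.out.println(f.accept("evt-1")); // false - duplicate dropped
        }
    }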
9. Our Storm distribution
• Storm High-Availability
  • Share Nimbus state through a distributed cache
• Metrics streaming to Graphite
• Bug fixes
• Packaging into RPM/DEB
11. Data Sources
[Diagram: a Tailer Agent on each frontend server reads the server's log files, emits messages, and records a checkpoint mark.]
• Hard latency requirements
  • 10 ms response
• Read logs produced by front-end servers
• Periodic checkpoints
• Older data dropped
(See the tailer sketch below.)
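A minimal sketch of the tailer idea, assuming a plain-text log and a checkpoint file holding the last committed byte offset; file names and the ship() stub are illustrative, not the deck's code:

    import java.io.*;
    import java.nio.file.*;

    /** Sketch of a log tailer with periodic checkpoints: on restart it resumes
     *  from the last committed byte offset instead of re-reading the file. */
    final class LogTailer {
        private final Path log;
        private final Path checkpoint;

        LogTailer(Path log, Path checkpoint) { this.log = log; this.checkpoint = checkpoint; }

        void run() throws IOException {
            long offset = loadCheckpoint();
            try (RandomAccessFile f = new RandomAccessFile(log.toFile(), "r")) {
                f.seek(offset);
                String line;
                long sinceCheckpoint = 0;
                while ((line = f.readLine()) != null) {
                    ship(line);                       // hand the message to the pipeline
                    if (++sinceCheckpoint >= 1000) {  // checkpoint periodically, not per message
                        saveCheckpoint(f.getFilePointer());
                        sinceCheckpoint = 0;
                    }
                }
                saveCheckpoint(f.getFilePointer());
            }
        }

        private long loadCheckpoint() throws IOException {
            return Files.exists(checkpoint)
                    ? Long.parseLong(Files.readString(checkpoint).trim())
                    : 0L;
        }

        private void saveCheckpoint(long offset) throws IOException {
            Files.writeString(checkpoint, Long.toString(offset)); // durable mark
        }

        private void ship(String message) { /* send to Kafka in the real system */ }
    }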
12. Message Jitter
[Diagram: interleaved data feeds from Server 1, Server 2 and Server 3 arriving out of order over time.]
• Messages are arbitrarily reordered
• Use a jitter buffer and drop outliers (see the sketch below)
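A minimal jitter-buffer sketch, assuming each message carries its source timestamp; the hold-back window and all names are our own for illustration:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.PriorityQueue;

    /** Sketch of a jitter buffer: messages are held back for a fixed window
     *  and released in timestamp order; anything arriving behind the release
     *  watermark is dropped as an outlier. */
    final class JitterBuffer {
        record Msg(long timestampMs, String payload) {}

        private final long windowMs;
        private final PriorityQueue<Msg> buffer =
                new PriorityQueue<>((a, b) -> Long.compare(a.timestampMs(), b.timestampMs()));
        private long maxSeenMs = Long.MIN_VALUE;   // newest timestamp observed
        private long watermarkMs = Long.MIN_VALUE; // everything < this was released

        JitterBuffer(long windowMs) { this.windowMs = windowMs; }

        /** Returns messages now safe to emit in order; drops late outliers. */
        List<Msg> offer(Msg m) {
            if (m.timestampMs() < watermarkMs) return List.of(); // outlier: drop
            buffer.add(m);
            maxSeenMs = Math.max(maxSeenMs, m.timestampMs());
            List<Msg> out = new ArrayList<>();
            while (!buffer.isEmpty()
                    && buffer.peek().timestampMs() <= maxSeenMs - windowMs) {
                Msg released = buffer.poll();
                watermarkMs = released.timestampMs();
                out.add(released);
            }
            return out;
        }
    }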
13. Data flow
1.6B records/day; sustained 100K QPS, peak 200K QPS
[The data-flow diagram from slide 3 again: QLogs, DT Events, Evidence and Initial Events through Join, Score, Filter and Aggregate into the Reporting Database and Reports.]
14. Robot filtering
• Blacklist/whitelist based
• Simultaneous multi-pattern matching with the Aho-Corasick algorithm (see the sketch below)
• 60x improvement over frequency-optimized String.indexOf
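A minimal self-contained Aho-Corasick sketch, matching many patterns against a string in a single pass (which is where the reported ~60x over repeated String.indexOf comes from); class and method names are our own, not the deck's code:

    import java.util.*;

    /** Minimal Aho-Corasick: a trie over all patterns plus failure links,
     *  so one scan of the text finds every pattern occurrence. */
    final class AhoCorasick {
        private static final class Node {
            final Map<Character, Node> next = new HashMap<>();
            Node fail;
            final List<String> hits = new ArrayList<>();
        }

        private final Node root = new Node();

        AhoCorasick(Collection<String> patterns) {
            // 1. Build the trie of all patterns.
            for (String p : patterns) {
                Node n = root;
                for (char c : p.toCharArray()) n = n.next.computeIfAbsent(c, k -> new Node());
                n.hits.add(p);
            }
            // 2. BFS to compute failure links (longest proper suffix in the trie).
            Deque<Node> queue = new ArrayDeque<>();
            for (Node child : root.next.values()) { child.fail = root; queue.add(child); }
            while (!queue.isEmpty()) {
                Node n = queue.poll();
                for (Map.Entry<Character, Node> e : n.next.entrySet()) {
                    char c = e.getKey(); Node child = e.getValue();
                    Node f = n.fail;
                    while (f != null && !f.next.containsKey(c)) f = f.fail;
                    child.fail = (f == null) ? root : f.next.get(c);
                    child.hits.addAll(child.fail.hits); // inherit matches ending here
                    queue.add(child);
                }
            }
        }

        /** Returns every pattern occurrence in a single scan of the text. */
        List<String> match(String text) {
            List<String> out = new ArrayList<>();
            Node n = root;
            for (int i = 0; i < text.length(); i++) {
                char c = text.charAt(i);
                while (n != root && !n.next.containsKey(c)) n = n.fail;
                n = n.next.getOrDefault(c, root);
                out.addAll(n.hits);
            }
            return out;
        }

        public static void main(String[] args) {
            AhoCorasick ac = new AhoCorasick(List.of("bot", "crawler", "spider"));
            System.out.println(ac.match("Mozilla/5.0 (compatible; Googlebot/2.1)")); // [bot]
        }
    }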
16. Recovery strategy
[Diagram: the data feed is read in transactions between checkpoints; on a failure during transaction N, processing rolls back from the failure point to checkpoint N-1.]
• Read data in batches (transactions)
• On success, write a checkpoint
• On failure, return to the previous checkpoint
• On catastrophic failure, rewind the data feed to a point before the problem started
(A minimal sketch of this loop follows.)
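A minimal sketch of the batch-with-checkpoint loop, assuming a replayable feed addressed by offset (as Kafka provides); the two interfaces are our own stand-ins:

    import java.util.List;

    /** Sketch of checkpointed batch processing over a replayable feed: read a
     *  batch, process it, and commit the new offset only on success; on
     *  failure, the next run replays from the last committed offset. */
    final class CheckpointedConsumer {
        interface Feed { List<String> read(long fromOffset, int maxRecords); }
        interface CheckpointStore { long load(); void save(long offset); }

        private final Feed feed;
        private final CheckpointStore checkpoints;

        CheckpointedConsumer(Feed feed, CheckpointStore checkpoints) {
            this.feed = feed;
            this.checkpoints = checkpoints;
        }

        void runOnce(int batchSize) {
            long offset = checkpoints.load();            // checkpoint N-1
            List<String> batch = feed.read(offset, batchSize);
            try {
                for (String record : batch) process(record);
                checkpoints.save(offset + batch.size()); // checkpoint N, only on success
            } catch (RuntimeException e) {
                // No checkpoint written: the next runOnce() replays the same batch.
                // A catastrophic failure would instead rewind to an earlier offset.
            }
        }

        private void process(String record) { /* topology work goes here */ }
    }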
17. Logical time
[Diagram: messages timestamped 11:58-12:05 arriving out of order across feeds; a read point marks the stream position, with late deliveries behind it.]
• Wall-clock time does not work
  • Load spikes
  • Recovery rewinds the data feed to a previous time
• Logical clock
  • Maximum timestamp seen by a Bolt
  • New messages with a smaller timestamp are late
• No clock synchronization
  • All bolts are in "weak synchrony"
(See the logical-clock sketch below.)
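In code, the logical clock is just a monotonically advancing maximum; a minimal sketch with names of our own:

    /** Sketch of the per-bolt logical clock: the clock is the maximum event
     *  timestamp seen so far, and a message with a smaller timestamp is
     *  classified as late. No wall-clock or clock synchronization needed. */
    final class LogicalClock {
        private long maxSeenTimestamp = Long.MIN_VALUE;

        /** Advances the clock; returns true if the message is late. */
        boolean observeIsLate(long eventTimestamp) {
            if (eventTimestamp >= maxSeenTimestamp) {
                maxSeenTimestamp = eventTimestamp; // clock only moves forward
                return false;
            }
            return true; // arrived behind the clock: late delivery
        }

        long now() { return maxSeenTimestamp; }

        public static void main(String[] args) {
            LogicalClock clock = new LogicalClock();
            System.out.println(clock.observeIsLate(1200)); // false - clock now 1200
            System.out.println(clock.observeIsLate(1205)); // false - clock now 1205
            System.out.println(clock.observeIsLate(1159)); // true  - behind the clock
        }
    }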
18. Join topology
[Topology diagram: a Spout reads event-log batches from Kafka (tx1: event logs batch); parser Bolts extract the ASID from the logs and emit tx1:ASID1:Event, tx1:ASID1:DT, tx1:ASID2:Event, ...; a join Bolt joins events with DT events by ASID (ASID1: Event, DTs; ASID2: Event, DTs; ...) and stores the joins for fast failover.]
19. Join (Sessionization) topology
• Flow
  • Micro-batches of initial and ping events are read from Kafka
  • A map ASID -> (TTL, {Initial, ping_x}) is stored in memory
  • STATE is mirrored to Cassandra!!!
• Failure recovery
  • Lost state is recovered from Cassandra
• Finalization – on each transaction:
  • Process state is committed and messages are evicted to Kafka
  • Evicted messages are deleted from Cassandra
(See the session-state sketch below.)
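A minimal sketch of that in-memory session map with TTL-based eviction; the mirror/evict/delete calls are stubs, and all names are our own:

    import java.util.*;

    /** Sketch of sessionization state: per-ASID sessions held in memory with
     *  a TTL, mirrored to external storage for recovery, and evicted
     *  downstream once the TTL expires. */
    final class SessionJoin {
        static final class Session {
            long expiresAtMs;
            final List<String> events = new ArrayList<>(); // Initial + ping_x events
        }

        private final Map<String, Session> sessions = new HashMap<>(); // ASID -> session
        private final long ttlMs;

        SessionJoin(long ttlMs) { this.ttlMs = ttlMs; }

        void onEvent(String asid, String event, long nowMs) {
            Session s = sessions.computeIfAbsent(asid, k -> new Session());
            s.events.add(event);
            s.expiresAtMs = nowMs + ttlMs; // each event extends the session window
            mirrorToCassandra(asid, s);    // so a restarted worker can recover state
        }

        /** Called once per transaction: flush expired sessions downstream. */
        void finalizeTransaction(long nowMs) {
            Iterator<Map.Entry<String, Session>> it = sessions.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Session> e = it.next();
                if (e.getValue().expiresAtMs <= nowMs) {
                    evictToKafka(e.getKey(), e.getValue()); // committed session record
                    deleteFromCassandra(e.getKey());        // mirror no longer needed
                    it.remove();
                }
            }
        }

        private void mirrorToCassandra(String asid, Session s) { /* stub */ }
        private void evictToKafka(String asid, Session s) { /* stub */ }
        private void deleteFromCassandra(String asid) { /* stub */ }
    }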
21. Aggregation
• Flow
  • Group events by reporting dimensions
  • Count events in groups
  • Write counters to C*:
    • Row key = PREFIX(GROUPID)
    • Column key = SUFFIX(GROUPID) + TXID
    • Value = COUNT(*)
• Failure recovery
  • Overwrite results of the failed batch in Cassandra
• Finalization
  • Read data from Cassandra by parallel extractor
(A sketch of the key scheme follows.)
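A sketch of why that key scheme makes retried batches idempotent: because the transaction id is part of the column key, a replayed batch overwrites its own counters instead of double-counting. The prefix length and the in-memory table stand-in are our own:

    import java.util.HashMap;
    import java.util.Map;

    /** Sketch of the counter key scheme: row key = PREFIX(GROUPID),
     *  column key = SUFFIX(GROUPID) + TXID, value = COUNT(*). Re-writing the
     *  same (row, column) for a replayed txid overwrites, never double-counts. */
    final class AggregationWriter {
        private static final int PREFIX_LEN = 4; // spreads groups across the ring

        // rowKey -> (columnKey -> count); stands in for a wide Cassandra row
        private final Map<String, Map<String, Long>> table = new HashMap<>();

        void writeCounter(String groupId, long txId, long count) {
            int split = Math.min(PREFIX_LEN, groupId.length());
            String rowKey = groupId.substring(0, split);
            String columnKey = groupId.substring(split) + ":" + txId;
            // Plain put: replaying transaction txId rewrites the same cell.
            table.computeIfAbsent(rowKey, k -> new HashMap<>()).put(columnKey, count);
        }

        public static void main(String[] args) {
            AggregationWriter w = new AggregationWriter();
            w.writeCounter("campaign42:US", 17, 1000);
            w.writeCounter("campaign42:US", 17, 1003); // retried batch: overwritten
            System.out.println(w.table);
        }
    }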
22. Cassandra Data Export
• Delayed data export to accommodate jitter
• Parallel read from several points on the ring
• On major incident recovery – re-export data
• Dirty hack: the aggregation topology should stream data to the analytical database directly
[Diagram: read points at several positions around the ring of Cassandra nodes.]
(A sketch of the parallel read follows.)
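A sketch of the parallel-extractor idea: split the export into ring slices and read them concurrently. The Slice type and the fetch stub are our own; a real implementation would issue range queries bounded by Cassandra token ranges:

    import java.util.*;
    import java.util.concurrent.*;

    /** Sketch of parallel export: the key space is split into slices that are
     *  read concurrently from different points on the ring. */
    final class ParallelExporter {
        record Slice(long startToken, long endToken) {}

        List<String> export(List<Slice> slices, int threads) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            try {
                List<Future<List<String>>> futures = new ArrayList<>();
                for (Slice s : slices) futures.add(pool.submit(() -> fetch(s)));
                List<String> rows = new ArrayList<>();
                for (Future<List<String>> f : futures) rows.addAll(f.get());
                return rows;
            } finally {
                pool.shutdown();
            }
        }

        private List<String> fetch(Slice s) {
            // Real code would run a range query bounded by the slice's tokens.
            return List.of();
        }
    }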
23. APC aggregation topology (snippet)

    Stream stream = topology.newStream("input-topic-reading-spout",
            new OpaqueTridentKafkaSpout(config.groupingTridentKafkaConfig()))
        .shuffle()
        .each(fields("transformed-event"),
              new ExtractFields("niagara.storm.adpoc.group.ApcGroupingFieldsParser"),
              fields("grouping"));

    stream.groupBy(fields("bucket", "grouping")).name("aggregation")
        .persistentAggregate(
            /* cassandra store */,
            fields("transformed-event"),
            new FirewallRecordCount(false),
            fields("value"));
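One note on the snippet (our reading, not stated on the slide): the opaque transactional Kafka spout is what ties this topology to the micro-batch semantics of the earlier slides. An opaque spout may replay a batch with different contents after a failure, and persistentAggregate compensates by keeping per-transaction state in the backing store, so replays reconcile rather than double-count.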
34. Join topology – Conclusion
• Linear scaling, limited by CPU
• 72 GB RAM for 10K msg/sec with a 30-minute window
• 6 hi1.4xlarge instances can process:
  • 16K msg/sec (lower bound)
  • 28K msg/sec (upper bound)
• Optimization
  • Stateless join topology
35. Topology performance considerations
• Avoid text and regular expressions; POJOs are your friends
• Network bandwidth is important (saturated at 1 Gb/s)
• Parsing is heavy (deserialization included)
• Track cases when a tuple crosses a worker boundary
• Keep spouts separate from parsers
• Lazy parsing (extract fields only when you really have to)
  • Less garbage with shorter lifetime
  • Easier profiling
(A lazy-parsing sketch follows.)
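A minimal lazy-parsing sketch: keep the raw log line and split it only when a field is actually requested. The tab-separated layout and all names are our own, not the deck's log format:

    /** Sketch of lazy parsing: the raw line is carried through the topology
     *  and parsed at most once, on first field access. Stages that only
     *  route the tuple never pay the parsing (and garbage) cost. */
    final class LazyLogRecord {
        private final String rawLine; // e.g. tab-separated log fields
        private String[] fields;      // parsed on first access only

        LazyLogRecord(String rawLine) { this.rawLine = rawLine; }

        /** Parses at most once, and only if some field is really needed. */
        String field(int index) {
            if (fields == null) fields = rawLine.split("\t", -1);
            return fields[index];
        }

        public static void main(String[] args) {
            LazyLogRecord r = new LazyLogRecord("2013-11-21T12:00:00\tASID1\tinitial");
            System.out.println(r.field(1)); // ASID1 - the split happens here
        }
    }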
36. Cassandra performance
• Garbage Collector is the main performance limit
  • Cassandra uses the Concurrent-Mark-Sweep (CMS) collector by default
  • If CMS cannot free enough space, the JVM will fall back to the single-threaded serial collector ("stop the world")
  • With a 12 GB heap, the serial collector might take up to 60 seconds
37. Cassandra: conclusions
• Garbage collection
  • Spiky (batch-type) workload is bad for C*
  • The smaller the cluster, the fewer heavy-write column families you can have
• Wide tables are better than narrow tables
  • 1 row with 10 columns is better than 10 rows with 1 column in terms of throughput