At Integral, we process heavy volumes of click-stream traffic: 50K QPS of ad impressions at peak, and close to 200K QPS across all browser calls. We build analytics on these streams of data. Two applications require quite significant computational effort: sessionization and fraud detection.
Sessionization means linking a series of requests from the same browser into a single record. There can be five or more requests, spread over 15-30 minutes, that we need to link to each other.
Fraud detection examines various signals in browser requests, together with substantial historical evidence data, to classify each ad impression as either legitimate or fraudulent.
We've been doing both (as well as all other analytics) in batch mode, once an hour at best. Both processes, and fraud detection in particular, are time-sensitive and much more meaningful if done in near-real-time.
This talk is about our experience migrating a once-per-day offline batch processing of impression data on Hadoop to in-memory stream processing with Kafka, Storm and Cassandra. We will touch upon our choices and our reasoning for selecting the products used for this solution.
Hadoop is no longer the only, or even always the preferred, option in the Big Data space. In-memory stream processing can be more effective for time-series data preparation and aggregation, and the ability to scale at a significantly lower cost means more customers, better accuracy and better business practices. Since only in-stream processing allows for low-latency data and insight delivery, it opens entirely new opportunities. However, transitioning a non-trivial data pipeline raises a number of questions previously hidden within the offline nature of batch processing: How do you join several data feeds? How do you implement failure recovery? In addition to handling terabytes of data per day, our streaming system has to be guided by the following considerations:
• Recovery time
• Time relativity and continuity
• Geographical distribution of data sources
• Limit on data loss
• Maintainability
The system produces complex cross-correlation analysis of several data feeds and aggregations for client analytics, with an input feed frequency of up to 100K msg/sec.
This presentation will benefit anyone interested in an alternative approach to big data analytics, especially the process of joining multiple streams in memory using Cassandra. It will also highlight optimization patterns that can be useful in similar situations.
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC Storm User Group Meetup - 21st Nov 2013
1. IN-STREAM PROCESSING WITH KAFKA, STORM, CASSANDRA
Integral Ad Science, Niagara Team: Kiril Tsemekhman, Alexey Kharlamov, Rahul Ratnakar, Evelina Stepanova, Rafael Bagmanov, Konstantin Golikov, Anatoliy Vinogradov
2. Business goals
• "Real-time" data availability
• Near-real-time update of fraud models
• Controlled data delay
• Better hardware scaling
• Summarizing data as close to the source as possible
• Better network utilization and reliability
3. Data flow
1.6B records/day; sustained 100K QPS, peak 200K QPS
[Data-flow diagram; stages: Join, Score, Filter, Aggregate; inputs: QLogs, DT Events, Evidence, Initial Events; output: Reporting Database serving Reports.]
4. Data flow – Hadoop inefficiency hypothesis
• Large batch architecture for offline processing
• Hadoop's shuffle phase dumps data to disk
  • Several times in some cases!!!
• Active dataset fits into cluster memory
  • Sessionization – 10s of GB
  • Aggregation – 10s of GB
• RAM is 1000s of times faster than HDD
5. In-stream processing – Benefits
• Immediately available results
  • Results are delivered with a controlled delay (15 min – 1 hour)
  • Time-sensitive models (e.g. fraud) are updated in near real-time
  • Data can be delivered to clients immediately
• Efficient resource utilization
  • Better scaling coefficient
  • Smoother workload and bandwidth distribution
  • Less resource overprovisioning
6. Non-Functional Requirements
• Horizontal scalability
• Limit on data loss (less than 0.1%)
• Tolerance to single-node failure
  • Ops guys will sleep better at night
  • It happens!!!
• Easy recovery
• Maintenance
  • No data loss on deployment
  • Monitoring & alerting
7. Storm/Trident/Kafka/C* – Hybrid solution
[Architecture diagram: event feeds and QLogs flow into Kafka; Storm consumes from Kafka and uses Cassandra for state; an Exporter moves results into the Reporting DB, which serves Reports.]
8. Storm/Trident/Kafka/C* – Reliable processing
• Storm & Trident transactions
  • Data are processed in micro-batches (transactions)
  • External storage is used to keep state between transactions
  • Automatic rollback to the last checkpoint
• Kafka – distributed queue manager
  • Data feed replay for retry or recovery
  • Load spikes are smoothed
  • Cross-DC replication
• Cassandra
  • Key-value store for de-duplication (see the sketch below)
  • Resilience based on replication
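To illustrate the de-duplication idea, here is a minimal sketch. The DedupStore interface and the in-memory stand-in are our own for the example; in the system described, the store would be a Cassandra column family rather than a map.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /** Sketch of key-value de-duplication: an event is processed only if its
     *  id has not been seen before. Names are illustrative; the real store in
     *  the deck is Cassandra, not this in-memory map. */
    interface DedupStore {
        /** Atomically records the id; returns true if it was not present before. */
        boolean putIfAbsent(String eventId);
    }

    final class InMemoryDedupStore implements DedupStore {
        private final Map<String, Boolean> seen = new ConcurrentHashMap<>();
        @Override public boolean putIfAbsent(String eventId) {
            return seen.putIfAbsent(eventId, Boolean.TRUE) == null;
        }
    }

    final class DedupFilter {
        private final DedupStore store;
        DedupFilter(DedupStore store) { this.store = store; }

        /** Returns true when the event should be processed (first occurrence). */
        boolean accept(String eventId) { return store.putIfAbsent(eventId); }

        public static void main(String[] args) {
            DedupFilter f = new DedupFilter(new InMemoryDedupStore());
            System.out.println(f.accept("evt-1")); // true  - first time
            System.out.println(f.accept("evt-1")); // false - duplicate dropped
        }
    }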
9. Our Storm distribution
• Storm High-Availability
  • Share Nimbus state through a distributed cache
• Metrics streaming to Graphite
• Bug fixes
• Packaging into RPM/DEB
11. Data Sources
[Diagram: a Tailer Agent on each frontend server reads the server's log files, emits messages, and records a checkpoint mark.]
• Hard latency requirements
  • 10 ms response
• Read logs produced by front-end servers
• Periodic checkpoints
• Older data dropped
(See the tailer sketch below.)
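A minimal sketch of the tailer idea, assuming a plain-text log and a checkpoint file holding the last committed byte offset; file names and the ship() stub are illustrative, not the deck's code:

    import java.io.*;
    import java.nio.file.*;

    /** Sketch of a log tailer with periodic checkpoints: on restart it resumes
     *  from the last committed byte offset instead of re-reading the file. */
    final class LogTailer {
        private final Path log;
        private final Path checkpoint;

        LogTailer(Path log, Path checkpoint) { this.log = log; this.checkpoint = checkpoint; }

        void run() throws IOException {
            long offset = loadCheckpoint();
            try (RandomAccessFile f = new RandomAccessFile(log.toFile(), "r")) {
                f.seek(offset);
                String line;
                long sinceCheckpoint = 0;
                while ((line = f.readLine()) != null) {
                    ship(line);                       // hand the message to the pipeline
                    if (++sinceCheckpoint >= 1000) {  // checkpoint periodically, not per message
                        saveCheckpoint(f.getFilePointer());
                        sinceCheckpoint = 0;
                    }
                }
                saveCheckpoint(f.getFilePointer());
            }
        }

        private long loadCheckpoint() throws IOException {
            return Files.exists(checkpoint)
                    ? Long.parseLong(Files.readString(checkpoint).trim())
                    : 0L;
        }

        private void saveCheckpoint(long offset) throws IOException {
            Files.writeString(checkpoint, Long.toString(offset)); // durable mark
        }

        private void ship(String message) { /* send to Kafka in the real system */ }
    }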
12. Message Jitter
[Diagram: interleaved data feeds from Server 1, Server 2 and Server 3 arriving out of order over time.]
• Messages are arbitrarily reordered
• Use a jitter buffer and drop outliers (see the sketch below)
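A minimal jitter-buffer sketch, assuming each message carries its source timestamp; the hold-back window and all names are our own for illustration:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.PriorityQueue;

    /** Sketch of a jitter buffer: messages are held back for a fixed window
     *  and released in timestamp order; anything arriving behind the release
     *  watermark is dropped as an outlier. */
    final class JitterBuffer {
        record Msg(long timestampMs, String payload) {}

        private final long windowMs;
        private final PriorityQueue<Msg> buffer =
                new PriorityQueue<>((a, b) -> Long.compare(a.timestampMs(), b.timestampMs()));
        private long maxSeenMs = Long.MIN_VALUE;   // newest timestamp observed
        private long watermarkMs = Long.MIN_VALUE; // everything < this was released

        JitterBuffer(long windowMs) { this.windowMs = windowMs; }

        /** Returns messages now safe to emit in order; drops late outliers. */
        List<Msg> offer(Msg m) {
            if (m.timestampMs() < watermarkMs) return List.of(); // outlier: drop
            buffer.add(m);
            maxSeenMs = Math.max(maxSeenMs, m.timestampMs());
            List<Msg> out = new ArrayList<>();
            while (!buffer.isEmpty()
                    && buffer.peek().timestampMs() <= maxSeenMs - windowMs) {
                Msg released = buffer.poll();
                watermarkMs = released.timestampMs();
                out.add(released);
            }
            return out;
        }
    }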
13. Data flow
1.6B records/day; sustained 100K QPS, peak 200K QPS
[The data-flow diagram from slide 3 again: QLogs, DT Events, Evidence and Initial Events through Join, Score, Filter and Aggregate into the Reporting Database and Reports.]
14. Robot filtering
• Blacklist/whitelist based
• Simultaneous multi-pattern matching with the Aho-Corasick algorithm (see the sketch below)
• 60x improvement over frequency-optimized String.indexOf
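A minimal self-contained Aho-Corasick sketch, matching many patterns against a string in a single pass (which is where the reported ~60x over repeated String.indexOf comes from); class and method names are our own, not the deck's code:

    import java.util.*;

    /** Minimal Aho-Corasick: a trie over all patterns plus failure links,
     *  so one scan of the text finds every pattern occurrence. */
    final class AhoCorasick {
        private static final class Node {
            final Map<Character, Node> next = new HashMap<>();
            Node fail;
            final List<String> hits = new ArrayList<>();
        }

        private final Node root = new Node();

        AhoCorasick(Collection<String> patterns) {
            // 1. Build the trie of all patterns.
            for (String p : patterns) {
                Node n = root;
                for (char c : p.toCharArray()) n = n.next.computeIfAbsent(c, k -> new Node());
                n.hits.add(p);
            }
            // 2. BFS to compute failure links (longest proper suffix in the trie).
            Deque<Node> queue = new ArrayDeque<>();
            for (Node child : root.next.values()) { child.fail = root; queue.add(child); }
            while (!queue.isEmpty()) {
                Node n = queue.poll();
                for (Map.Entry<Character, Node> e : n.next.entrySet()) {
                    char c = e.getKey(); Node child = e.getValue();
                    Node f = n.fail;
                    while (f != null && !f.next.containsKey(c)) f = f.fail;
                    child.fail = (f == null) ? root : f.next.get(c);
                    child.hits.addAll(child.fail.hits); // inherit matches ending here
                    queue.add(child);
                }
            }
        }

        /** Returns every pattern occurrence in a single scan of the text. */
        List<String> match(String text) {
            List<String> out = new ArrayList<>();
            Node n = root;
            for (int i = 0; i < text.length(); i++) {
                char c = text.charAt(i);
                while (n != root && !n.next.containsKey(c)) n = n.fail;
                n = n.next.getOrDefault(c, root);
                out.addAll(n.hits);
            }
            return out;
        }

        public static void main(String[] args) {
            AhoCorasick ac = new AhoCorasick(List.of("bot", "crawler", "spider"));
            System.out.println(ac.match("Mozilla/5.0 (compatible; Googlebot/2.1)")); // [bot]
        }
    }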
16. Recovery strategy
[Diagram: the data feed is read in transactions between checkpoints; on a failure during transaction N, processing rolls back from the failure point to checkpoint N-1.]
• Read data in batches (transactions)
• On success, write a checkpoint
• On failure, return to the previous checkpoint
• On catastrophic failure, rewind the data feed to a point before the problem started
(A minimal sketch of this loop follows.)
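A minimal sketch of the batch-with-checkpoint loop, assuming a replayable feed addressed by offset (as Kafka provides); the two interfaces are our own stand-ins:

    import java.util.List;

    /** Sketch of checkpointed batch processing over a replayable feed: read a
     *  batch, process it, and commit the new offset only on success; on
     *  failure, the next run replays from the last committed offset. */
    final class CheckpointedConsumer {
        interface Feed { List<String> read(long fromOffset, int maxRecords); }
        interface CheckpointStore { long load(); void save(long offset); }

        private final Feed feed;
        private final CheckpointStore checkpoints;

        CheckpointedConsumer(Feed feed, CheckpointStore checkpoints) {
            this.feed = feed;
            this.checkpoints = checkpoints;
        }

        void runOnce(int batchSize) {
            long offset = checkpoints.load();            // checkpoint N-1
            List<String> batch = feed.read(offset, batchSize);
            try {
                for (String record : batch) process(record);
                checkpoints.save(offset + batch.size()); // checkpoint N, only on success
            } catch (RuntimeException e) {
                // No checkpoint written: the next runOnce() replays the same batch.
                // A catastrophic failure would instead rewind to an earlier offset.
            }
        }

        private void process(String record) { /* topology work goes here */ }
    }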
17. Logical time
[Diagram: messages timestamped 11:58-12:05 arriving out of order across feeds; a read point marks the stream position, with late deliveries behind it.]
• Wall-clock time does not work
  • Load spikes
  • Recovery rewinds the data feed to a previous time
• Logical clock
  • Maximum timestamp seen by a Bolt
  • New messages with a smaller timestamp are late
• No clock synchronization
  • All bolts are in "weak synchrony"
(See the logical-clock sketch below.)
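In code, the logical clock is just a monotonically advancing maximum; a minimal sketch with names of our own:

    /** Sketch of the per-bolt logical clock: the clock is the maximum event
     *  timestamp seen so far, and a message with a smaller timestamp is
     *  classified as late. No wall-clock or clock synchronization needed. */
    final class LogicalClock {
        private long maxSeenTimestamp = Long.MIN_VALUE;

        /** Advances the clock; returns true if the message is late. */
        boolean observeIsLate(long eventTimestamp) {
            if (eventTimestamp >= maxSeenTimestamp) {
                maxSeenTimestamp = eventTimestamp; // clock only moves forward
                return false;
            }
            return true; // arrived behind the clock: late delivery
        }

        long now() { return maxSeenTimestamp; }

        public static void main(String[] args) {
            LogicalClock clock = new LogicalClock();
            System.out.println(clock.observeIsLate(1200)); // false - clock now 1200
            System.out.println(clock.observeIsLate(1205)); // false - clock now 1205
            System.out.println(clock.observeIsLate(1159)); // true  - behind the clock
        }
    }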
18. Join topology
[Topology diagram: a Spout reads event-log batches from Kafka (tx1: event logs batch); parser Bolts extract the ASID from the logs and emit tx1:ASID1:Event, tx1:ASID1:DT, tx1:ASID2:Event, ...; a join Bolt joins events with DT events by ASID (ASID1: Event, DTs; ASID2: Event, DTs; ...) and stores the joins for fast failover.]
19. Join (Sessionization) topology
• Flow
  • Micro-batches of initial and ping events are read from Kafka
  • A map ASID -> (TTL, {Initial, ping_x}) is stored in memory
  • STATE is mirrored to Cassandra!!!
• Failure recovery
  • Lost state is recovered from Cassandra
• Finalization – on each transaction:
  • Process state is committed and messages are evicted to Kafka
  • Evicted messages are deleted from Cassandra
(See the session-state sketch below.)
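A minimal sketch of that in-memory session map with TTL-based eviction; the mirror/evict/delete calls are stubs, and all names are our own:

    import java.util.*;

    /** Sketch of sessionization state: per-ASID sessions held in memory with
     *  a TTL, mirrored to external storage for recovery, and evicted
     *  downstream once the TTL expires. */
    final class SessionJoin {
        static final class Session {
            long expiresAtMs;
            final List<String> events = new ArrayList<>(); // Initial + ping_x events
        }

        private final Map<String, Session> sessions = new HashMap<>(); // ASID -> session
        private final long ttlMs;

        SessionJoin(long ttlMs) { this.ttlMs = ttlMs; }

        void onEvent(String asid, String event, long nowMs) {
            Session s = sessions.computeIfAbsent(asid, k -> new Session());
            s.events.add(event);
            s.expiresAtMs = nowMs + ttlMs; // each event extends the session window
            mirrorToCassandra(asid, s);    // so a restarted worker can recover state
        }

        /** Called once per transaction: flush expired sessions downstream. */
        void finalizeTransaction(long nowMs) {
            Iterator<Map.Entry<String, Session>> it = sessions.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Session> e = it.next();
                if (e.getValue().expiresAtMs <= nowMs) {
                    evictToKafka(e.getKey(), e.getValue()); // committed session record
                    deleteFromCassandra(e.getKey());        // mirror no longer needed
                    it.remove();
                }
            }
        }

        private void mirrorToCassandra(String asid, Session s) { /* stub */ }
        private void evictToKafka(String asid, Session s) { /* stub */ }
        private void deleteFromCassandra(String asid) { /* stub */ }
    }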
21. Aggregation
• Flow
  • Group events by reporting dimensions
  • Count events in groups
  • Write counters to C*:
    • Row key = PREFIX(GROUPID)
    • Column key = SUFFIX(GROUPID) + TXID
    • Value = COUNT(*)
• Failure recovery
  • Overwrite results of the failed batch in Cassandra
• Finalization
  • Read data from Cassandra by parallel extractor
(A sketch of the key scheme follows.)
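A sketch of why that key scheme makes retried batches idempotent: because the transaction id is part of the column key, a replayed batch overwrites its own counters instead of double-counting. The prefix length and the in-memory table stand-in are our own:

    import java.util.HashMap;
    import java.util.Map;

    /** Sketch of the counter key scheme: row key = PREFIX(GROUPID),
     *  column key = SUFFIX(GROUPID) + TXID, value = COUNT(*). Re-writing the
     *  same (row, column) for a replayed txid overwrites, never double-counts. */
    final class AggregationWriter {
        private static final int PREFIX_LEN = 4; // spreads groups across the ring

        // rowKey -> (columnKey -> count); stands in for a wide Cassandra row
        private final Map<String, Map<String, Long>> table = new HashMap<>();

        void writeCounter(String groupId, long txId, long count) {
            int split = Math.min(PREFIX_LEN, groupId.length());
            String rowKey = groupId.substring(0, split);
            String columnKey = groupId.substring(split) + ":" + txId;
            // Plain put: replaying transaction txId rewrites the same cell.
            table.computeIfAbsent(rowKey, k -> new HashMap<>()).put(columnKey, count);
        }

        public static void main(String[] args) {
            AggregationWriter w = new AggregationWriter();
            w.writeCounter("campaign42:US", 17, 1000);
            w.writeCounter("campaign42:US", 17, 1003); // retried batch: overwritten
            System.out.println(w.table);
        }
    }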
22. Cassandra Data Export
• Delayed data export to accommodate jitter
• Parallel read from several points on the ring
• On major incident recovery – re-export data
• Dirty hack: the aggregation topology should stream data to the analytical database directly
[Diagram: read points at several positions around the ring of Cassandra nodes.]
(A sketch of the parallel read follows.)
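A sketch of the parallel-extractor idea: split the export into ring slices and read them concurrently. The Slice type and the fetch stub are our own; a real implementation would issue range queries bounded by Cassandra token ranges:

    import java.util.*;
    import java.util.concurrent.*;

    /** Sketch of parallel export: the key space is split into slices that are
     *  read concurrently from different points on the ring. */
    final class ParallelExporter {
        record Slice(long startToken, long endToken) {}

        List<String> export(List<Slice> slices, int threads) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            try {
                List<Future<List<String>>> futures = new ArrayList<>();
                for (Slice s : slices) futures.add(pool.submit(() -> fetch(s)));
                List<String> rows = new ArrayList<>();
                for (Future<List<String>> f : futures) rows.addAll(f.get());
                return rows;
            } finally {
                pool.shutdown();
            }
        }

        private List<String> fetch(Slice s) {
            // Real code would run a range query bounded by the slice's tokens.
            return List.of();
        }
    }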
23. APC aggregation topology (snippet)

    Stream stream = topology.newStream("input-topic-reading-spout",
            new OpaqueTridentKafkaSpout(config.groupingTridentKafkaConfig()))
        .shuffle()
        .each(fields("transformed-event"),
              new ExtractFields("niagara.storm.adpoc.group.ApcGroupingFieldsParser"),
              fields("grouping"));

    stream.groupBy(fields("bucket", "grouping")).name("aggregation")
        .persistentAggregate(
            /* cassandra store */,
            fields("transformed-event"),
            new FirewallRecordCount(false),
            fields("value"));
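One note on the snippet (our reading, not stated on the slide): the opaque transactional Kafka spout is what ties this topology to the micro-batch semantics of the earlier slides. An opaque spout may replay a batch with different contents after a failure, and persistentAggregate compensates by keeping per-transaction state in the backing store, so replays reconcile rather than double-count.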
34. Join topology – Conclusion
• Linear scaling, limited by CPU
• 72 GB RAM for 10K msg/sec with a 30-minute window
• 6 hi1.4xlarge instances can process:
  • 16K msg/sec (lower bound)
  • 28K msg/sec (upper bound)
• Optimization
  • Stateless join topology
35. Topology performance considerations
• Avoid text and regular expressions; POJOs are your friends
• Network bandwidth is important (saturated at 1 Gb/s)
• Parsing is heavy (deserialization included)
• Track cases when a tuple crosses a worker boundary
• Keep spouts separate from parsers
• Lazy parsing (extract fields only when you really have to)
  • Less garbage with shorter lifetime
  • Easier profiling
(A lazy-parsing sketch follows.)
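A minimal lazy-parsing sketch: keep the raw log line and split it only when a field is actually requested. The tab-separated layout and all names are our own, not the deck's log format:

    /** Sketch of lazy parsing: the raw line is carried through the topology
     *  and parsed at most once, on first field access. Stages that only
     *  route the tuple never pay the parsing (and garbage) cost. */
    final class LazyLogRecord {
        private final String rawLine; // e.g. tab-separated log fields
        private String[] fields;      // parsed on first access only

        LazyLogRecord(String rawLine) { this.rawLine = rawLine; }

        /** Parses at most once, and only if some field is really needed. */
        String field(int index) {
            if (fields == null) fields = rawLine.split("\t", -1);
            return fields[index];
        }

        public static void main(String[] args) {
            LazyLogRecord r = new LazyLogRecord("2013-11-21T12:00:00\tASID1\tinitial");
            System.out.println(r.field(1)); // ASID1 - the split happens here
        }
    }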
36. Cassandra performance
• Garbage Collector is the main performance limit
  • Cassandra uses the Concurrent-Mark-Sweep (CMS) collector by default
  • If CMS cannot free enough space, the JVM will fall back to the single-threaded serial collector ("stop the world")
  • With a 12 GB heap, the serial collector might take up to 60 seconds
37. Cassandra: conclusions
• Garbage collection
  • Spiky (batch-type) workload is bad for C*
  • The smaller the cluster, the fewer heavy-write column families you can have
• Wide tables are better than narrow tables
  • 1 row with 10 columns is better than 10 rows with 1 column in terms of throughput