SlideShare une entreprise Scribd logo
1  sur  227
1
DESIGNING MODERN STREAMING
DATA APPLICATIONS
ARUN KEJARIWAL, KARTHIK RAMASAMY
@arun_kejariwal @karthikz
2
ABOUT US
3
4
Connected World
5
Ubiquity of Real-Time Data Streams
6
Data-Driven Decision Making
UNLOCKING INSIGHTS
7
GUIDE DECISION MAKING
DETERMINING TOP TRAFFIC INTENSIVE IP ADDRESSES
IDENTIFYING ARBITRAGE OPPORTUNITIES IN FINANCIAL
TRAINING
DETERMINING SESSIONS WHOSE DURATION >2X THAN
NORMAL
FORECASTING DETERIORATION IN HEALTH METRICS
# UNIQUE USERS/DAY OF A MOBILE APP
TARGETED ADVERTISING
ON-THE-FLY INSIGHTS
8
“PERISHABLE”
RECENCY IS CRITICAL
STREAMING/FAST DATA PROCESSING
9
✦ Events are analyzed and processed as they arrive
✦ Decisions are timely, contextual and based on fresh data
✦ Decision latency is eliminated
✦ Data in motion
Ingest/
Buffer
Analyze Act
MICROSERVICES
MODEL INFERENCEWORKFLOWS ANALYTICS
MONITORING
STREAMING APPLICATION PATTERNS
STREAM PROCESSING PATTERN
11
ComputeMessaging
Storage
Data	Inges6on Data	Processing
Results	StorageData	Storage
Data	
Serving
LANDSCAPE
PLATFORMS
APPLICATION DOMAINS
PRACTICAL TAKEAWAYS
ANALYTICS
BUSINESS IMPACT
OUTLINE
12
FOCUS AREAS
13
Landscape of Platforms for Streaming Data
PLATFORM
14
COMPONENTS
PROCESSING
VELOCITY
STORAGE
UNBOUNDED
VARIETY
MESSAGING
VOLUME
15
APACHE KAFKA RABBITMQ
AKKA METAQ
DATABUS ACTIVEMQ
APACHE FLUME KESTREL
SCRIBE APACHE PULSAR
Messaging
NEXT GENERATION MESSAGING
16
DURABILITYCONSISTENCY
EASE OF OPERATIONS ISOLATION
RESILIENCY
SCALABILITY
KEY PROPERTIES
b
b
b
b
b
b
INFINITE RETENTIONb
APACHE PULSAR
17
Flexible Messaging + Streaming System
backed by a durable log storage
APACHE PULSAR
18
Bookie Bookie Bookie
Broker Broker Broker
Producer Consumer
SERVING
Brokers can be added independently
Traffic can be shifted quickly across brokers
STORAGE
Bookies can be added independently
New bookies will ramp up traffic quickly
APACHE PULSAR - BROKER
19
✦ Broker is the only point of interaction for clients (producers and consumers)
✦ Brokers acquire ownership of group of topics and “serve” them
✦ Broker has no durable state
✦ Provides service discovery mechanism for client to connect to right broker
APACHE PULSAR - BROKER
20
APACHE PULSAR - CONSISTENCY
21
Bookie
Bookie
BookieBrokerProducer
APACHE PULSAR - DURABILITY
22
Bookie
Bookie
BookieBrokerProducer
Journal
Journal
Journal
fsync
fsync
fsync
APACHE PULSAR - ISOLATION
23
APACHE PULSAR - SEGMENT STORAGE
24
234…20212223…40414243…60616263…
Segment 1
Segment 3
Segment 2
Segment 2
Segment 1
Segment 3
Segment 4
Segment 3
Segment 2
Segment 1
Segment 4
Segment 4
APACHE PULSAR - RESILIENCY
25
1234…20212223…40414243…60616263…
Segment 1
Segment 3
Segment 2
Segment 2
Segment 1
Segment 3
Segment 4
Segment 3
Segment 2
Segment 1
Segment 4
Segment 4
APACHE PULSAR - SEAMLESS CLUSTER EXPANSION
26
1234…20212223…40414243…60616263…
Segment 1
Segment 3
Segment 2
Segment 2
Segment 1
Segment 3
Segment 4
Segment 3
Segment 2
Segment 1
Segment 4
Segment 4
Segment Y
Segment Z
Segment X
APACHE PULSAR - TIERED STORAGE
27
Low Cost Storage
1234…20212223…40414243…60616263…
Segment 1
Segment 3
Segment 2
Segment 2
Segment 1
Segment 3
Segment 4
Segment 3
Segment 2
Segment 1Segment 4 Segment 4
PARTITIONS VS SEGMENTS - WHY SHOULD YOU CARE?
28
PARTITIONS VS. SEGMENTS - WHY SHOULD YOU CARE?
29
✦ In Kafka, partitions are assigned to brokers “permanently”
✦ A single partition is stored entirely in a single node
✦ Retention is limited by a single node storage capacity
✦ Failure recovery and capacity expansion require expensive “rebalancing”
✦ Rebalancing has a big impact over the system, affecting regular traffic
UNIFIED MESSAGING MODEL - STREAMING
30
Pulsar topic/
partition
Producer 2
Producer 1
Consumer 1
Consumer 2
Subscription
A
M4
M3
M2
M1
M0
M4
M3
M2
M1
M0
X
Exclusive
UNIFIED MESSAGING MODEL - STREAMING
31
Pulsar topic/
partition
Producer 2
Producer 1
Consumer 1
Consumer 2
Subscription
B
M4
M3
M2
M1
M0
M4
M3
M2
M1
M0
Failover
In case of failure in
consumer 1
UNIFIED MESSAGING MODEL - QUEUING
32
Pulsar topic/
partition
Producer 2
Producer 1
Consumer 2
Consumer 3
Subscription
C
M4
M3
M2
M1
M0
Shared
Traffic is equally distributed
across consumers
Consumer 1
M4M3
M2M1M0
DISASTER RECOVERY
33
Topic	(T1) Topic	(T1)
Topic	(T1)
Subscrip6on	
(S1)
Subscrip6on	
(S1)
Producer		
(P1)
Consumer		
(C1)
Producer		
(P3)
Producer		
(P2)
Consumer		
(C2)
Data	Center	A Data	Center	B
Data	Center	C
Integrated in the
broker message flow
Simple configuration
to add/remove regions
Asynchronous (default)
and synchronous
replication
MULTITENANCY
34
Apache Pulsar Cluster
Product
Safety
ETL
Fraud
Detection
Topic-1
Account History
Topic-2
User Clustering
Topic-1
Risk Classification
MarketingCampaigns
ETL
Topic-1
Budgeted Spend
Topic-2
Demographic Classification
Topic-1
Location Resolution
Data
Serving
Microservice
Topic-1
Customer Authentication
10 TB
7 TB
5 TB
3 TB
2 TB
8 TB
4 TB
3 TB
✦ Authentication
✦ Authorization
✦ Software isolation
๏ Storage quotas, flow control, back pressure, rate limiting
✦ Hardware isolation
๏ Constrain some tenants on a subset of brokers/bookies
PULSAR CLIENTS
35
Apache Pulsar Cluster
Java
Python
Go
C++ C
PULSAR PRODUCER
36
PulsarClient client = PulsarClient.create(
“http://broker.usw.example.com:8080”);
Producer producer = client.createProducer(
“persistent://my-property/us-west/my-namespace/my-topic”);
// handles retries in case of failure
producer.send("my-message".getBytes());
// Async version:
producer.sendAsync("my-message".getBytes()).thenRun(() -> {
// Message was persisted
});
PULSAR CONSUMER
37
PulsarClient client = PulsarClient.create(
"http://broker.usw.example.com:8080");
Consumer consumer = client.subscribe(
"persistent://my-property/us-west/my-namespace/my-topic",
"my-subscription-name");
while (true) {
// Wait for a message
Message msg = consumer.receive();
System.out.println("Received message: " + msg.getData());
// Acknowledge the message so that it can be deleted by broker
consumer.acknowledge(msg);
}
SCHEMA REGISTRY
38
✦ Provides type safety to applications built on top of Pulsar
✦ Two approaches
๏ Client side - type safety enforcement up to the application
๏ Server side - system enforces type safety and ensures that producers and consumers remain synced
✦ Schema registry enables clients to upload data schemas on a topic basis.
✦ Schemas dictate which data types are recognized as valid for that topic
PULSAR SCHEMAS - HOW DO THEY WORK?
39
✦ Enforced at the topic level
✦ Pulsar schemas consists of
๏ Name - Name refers to the topic to which the schema is applied
๏ Payload - Binary representation of the schema
๏ Schema type - JSON, Protobuf and Avro
๏ User defined properties - Map of strings to strings (application specific - e.g git hash of the schema)
SCHEMA VERSIONING
40
PulsarClient client = PulsarClient.builder()
.serviceUrl(“http://broker.usw.example.com:6650")
.build()
Producer<SensorReading> producer = client.newProducer(JSONSchema.of(SensorReading.class))
.topic(“sensor-data”)
.sendTimeout(3, TimeUnit.SECONDS)
.create()
Scenario What happens
No schema exists for the topic Producer is created using the given schema
Schema already exists; producer
connects using the same schema
that’s already stored
Schema is transmitted to the broker, determines that it is already stored
Schema already exists; producer
connects using a new schema that is
compatible
Schema is transmitted, compatibility determined and stored as new schema
HOW TO PROCESS DATA MODELED AS STREAMS
41
✦ Consume data as it is produced (pub/sub)
✦ Light weight compute - transform and react to data as it arrives
✦ Heavy weight compute - continuous data processing
✦ Interactive query of stored streams
LIGHT WEIGHT COMPUTE
42
f(x)
Incoming	Messages Output	Messages
ABSTRACT VIEW OF COMPUTE REPRESENTATION
TRADITIONAL COMPUTE REPRESENTATION
43
DAG
%
%
%
%
%
Source 1
Source 2
Action
Action
Action
Sink 1
Sink 2
REALIZING COMPUTATION - EXPLICIT CODE
44
public static class SplitSentence extends BaseBasicBolt {
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word"));
}
@Override
public Map<String, Object> getComponentConfiguration() {
return null;
}
public void execute(Tuple tuple, BasicOutputCollector basicOutputCollector) {
String sentence = tuple.getStringByField("sentence");
String words[] = sentence.split(" ");
for (String w : words) {
basicOutputCollector.emit(new Values(w));
}
}
}
STITCHED BY PROGRAMMERS
REALIZING COMPUTATION - FUNCTIONAL
45
Builder.newBuilder()
.newSource(() -> StreamletUtils.randomFromList(SENTENCES))
.flatMap(sentence -> Arrays.asList(sentence.toLowerCase().split("s+")))
.reduceByKeyAndWindow(word -> word, word -> 1,
WindowConfig.TumblingCountWindow(50),
(x, y) -> x + y);
TRADITIONAL REAL TIME - SEPARATE SYSTEMS
46
Messaging Compute
TRADITIONAL REAL TIME SYSTEMS
47
DEVELOPER EXPERIENCE
✦ Powerful API but complicated
๏ Does everyone really need to learn functional programming?
✦ Configurable and scalable but management overhead
✦ Edge systems have resource and management constraints
TRADITIONAL REAL TIME SYSTEMS
48
OPERATIONAL EXPERIENCE
✦ Multiple systems to operate
๏ IoT deployments routinely have thousands of edge systems
✦ Semantic differences
๏ Mismatch and duplication between systems
๏ Creates developer and operator friction
LESSONS LEARNED - USE CASES
49
✦ Data transformations
✦ Data classification
✦ Data enrichment
✦ Data routing
✦ Data extraction and loading
✦ Real time aggregation
✦ Microservices
Significant set of processing tasks are exceedingly simple
EMERGENCE OF CLOUD - SERVERLESS
50
✦ Simple function API
✦ Functions are submitted to the system
✦ Runs per events
✦ Composition APIs to do complex things
✦ Wildly popular
SERVERLESS VS. STREAMING
51
✦ Both are event driven architectures
✦ Both can be used for analytics and data serving
✦ Both have composition APIs
๏ Configuration based for serverless
๏ DSL based for streaming
✦ Serverless typically does not guarantee ordering
✦ Serverless is pay per action
STREAM NATIVE COMPUTE USING FUNCTIONS
52
✦ Simplest possible API -function or a procedure
✦ Support for multi language
✦ Use of native API for each language
✦ Scale developers
✦ Use of message bus native concepts - input and output as topics
✦ Flexible runtime - simple standalone applications vs managed system applications
APPLYING INSIGHT GAINED FROM SERVERLESS
PULSAR FUNCTIONS
53
SDK LESS API
import java.util.function.Function;
public class ExclamationFunction implements Function<String, String> {
@Override
public String apply(String input) {
return input + "!";
}
}
PULSAR FUNCTIONS
54
SDK API
import org.apache.pulsar.functions.api.PulsarFunction;
import org.apache.pulsar.functions.api.Context;
public class ExclamationFunction implements PulsarFunction<String, String> {
@Override
public String process(String input, Context context) {
return input + "!";
}
}
PULSAR FUNCTIONS
55
✦ Simplest possible API -function or a procedure
✦ Support for multi language
✦ Use of native API for each language
✦ Scale developers
✦ Use of message bus native concepts - input and output as topics
✦ Flexible runtime - simple standalone applications vs managed system applications
PULSAR FUNCTIONS
56
✦ Function executed for every message of input topic
✦ Support for multiple topics as inputs
✦ Function output goes into output topic - can be void topic as well
✦ SerDe takes care of serialization/deserialization of messages
๏ Custom SerDe can be provided by the users
๏ Integration with schema registry
PROCESSING GUARANTEES
57
✦ ATMOST_ONCE
๏ Message acked to Pulsar as soon as we receive it
✦ ATLEAST_ONCE
๏ Message acked to Pulsar after the function completes
๏ Default behavior - don’t want people to loose data
✦ EFFECTIVELY_ONCE
๏ Uses Pulsar’s inbuilt effectively once semantics
✦ Controlled at runtime by user
DEPLOYING FUNCTIONS - BROKER
58
Broker 1
Worker
Function
wordcount-1
Function
transform-2
Broker 1
Worker
Function
transform-1
Function
dataroute-1
Broker 1
Worker
Function
wordcount-2
Function
transform-3
Node 1 Node 2 Node 3
DEPLOYING FUNCTIONS - WORKER NODES
59
Worker
Function
wordcount-1
Function
transform-2
Worker
Function
transform-1
Function
dataroute-1
Worker
Function
wordcount-2
Function
transform-3
Node 1 Node 2 Node 3
Broker 1 Broker 2 Broker 3
Node 4 Node 5 Node 6
DEPLOYING FUNCTIONS - KUBERNETES
60
Function
wordcount-1
Function
transform-1
Function
transform-3
Pod 1 Pod 2 Pod 3
Broker 1 Broker 2 Broker 3
Pod 7 Pod 8 Pod 9
Function
dataroute-1
Function
wordcount-2
Function
transform-2
Pod 4 Pod 5 Pod 6
BUILT-IN STATE MANAGEMENT IN FUNCTIONS
61
✦ Functions can store state in inbuilt storage
๏ Framework provides a simple library to store and
retrieve state
✦ Support server side operations like counters
✦ Simplified application development
๏ No need to standup an extra system
DISTRIBUTED STATE IN FUNCTIONS
62
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.PulsarFunction;
public class CounterFunction implements PulsarFunction<String, Void> {
@Override
public Void process(String input, Context context) throws Exception {
for (String word : input.split(".")) {
context.incrCounter(word, 1);
}
return null;
}
}
PULSAR - DATA IN AND OUT
63
✦ Users can write custom code using Pulsar producer and consumer API
✦ Challenges
๏ Where should the application to publish data or consume data from Pulsar?
๏ How should the application to publish data or consume data from Pulsar?
✦ Current systems have no organized and fault tolerant way to run applications that ingress and
egress data from and to external systems
PULSAR IO TO THE RESCUE
64
Apache Pulsar ClusterSource Sink
INTERACTIVE QUERYING OF STREAMS - PULSAR SQL
65
1234…20212223…40414243…60616263…
Segment 1
Segment 3
Segment 2
Segment 2
Segment 1
Segment 3
Segment 4
Segment 3
Segment 2
Segment 1
Segment 4
Segment 4
Segment
Reader
Segment
Reader
Segment
Reader
Segment
Reader
Coordinator
PULSAR IO - EXECUTION
66
Broker 1
Worker
Sink
Cassandra-1
Source
Kinesis-2
Broker 2
Worker
Source
Kinesis-1
Source
Twitter-1
Broker 3
Worker
Sink
Cassandra-2
Source
Kinesis-3
Node 1 Node 2 Node 3
Fault tolerance Parallelism Elasticity Load Balancing On-demand updates
EASE OF OPERATIONS
67
✦ Brokers don’t have durable state
๏ Easily replaceable
๏ Topics are immediately reassigned to healthy brokers
✦ Expanding capacity
๏ Simply add new broker node
๏ If other brokers are overloaded, traffic will be automatically assigned
✦ Load manager
๏ Monitor traffic load on all brokers (CPU, memory, network, topics)
๏ Initially place topics to least loaded brokers
๏ Reassign topics when a broker is overloaded
CONSUMER BACKLOG
68
✦ Metrics are available to make assessments
๏ When a problem started
๏ How big is backlog? messages? disk space?
๏ How fast is draining?
๏ What’s the ETA to catch up with publishers?
✦ Establish where is the bottleneck
๏ Application is not fast enough
๏ Disk read I/O
ENFORCING MULTITENANCY
69
✦ Ensure tenants don’t cause performance issues on other tenants
๏ Backlog quotas
๏ Soft isolation - flow control and throttling
✦ In cases when user behavior is triggering performance degradation
๏ Hard isolation as a last resource for quick reaction while proper fix is deployed
๏ Isolate tenant on a subset of brokers
๏ Can be also applied at the storage node level
PULSAR PERFORMANCE
70
PULSAR PERFORMANCE - LATENCY
71
APACHE PULSAR VS. APACHE KAFKA
72
Mul$-tenancy	
A	single	cluster	can	support	many	
tenants	and	use	cases
Seamless	Cluster	Expansion	
Expand	the	cluster	without	any	
down	$me
High	throughput	&	Low	Latency	
Can	reach	1.8	M	messages/s	in	a	
single	par$$on	and	publish	latency	
of	5ms	at	99pct
Durability	
Data	replicated	and	synced	to	disk
Geo-replica$on	
Out	of	box	support	for	geographically	
distributed	applica$ons
Unified	messaging	model	
Support	both	Topic	&	Queue	
seman$c	in	a	single	model
Tiered	Storage	
Hot/warm	data	for	real	$me	access	and	
cold	event	data	in	cheaper	storage
Pulsar	Func$ons	
Flexible	light	weight	compute
Highly	scalable	
Can	support	millions	of	topics,	makes	data	
modeling	easier
73
SPARK STREAMING APACHE APEX
APACHE FLINK SAMZA
APACHE BEAM
(MILLWHEEL, DATAFLOW)
ATHENAX
STYLUS APACHE STORM
STREAMALERT APACHE HERON
Processing
PROCESSING
74
TASK ISOLATIONLOW LATENCY
MULTIPLE API'S MULTIPLE SEMANTICS
DIVERSE WORKLOADS
FAULT TOLERANCE
KEY PROPERTIES
HERON TERMINOLOGY
75
Topology
Directed	acyclic	graph		
ver6ces	=	computa6on,	and		
edges	=	streams	of	data	tuples
Spouts
Sources	of	data	tuples	for	the	topology	
Examples	-	Pulsar/KaPa/MySQL/Postgres
Bolts
Process	incoming	tuples,	and	emit	outgoing	tuples	
Examples	-	filtering/aggrega6on/join/any	func6on
,
%
HERON TOPOLOGY
76
%
%
%
%
%
Spout 1
Spout 2
Bolt 1
Bolt 2
Bolt 3
Bolt 4
Bolt 5
HERON TOPOLOGY - PHYSICAL EXECUTION
77
%
%
%
%
%
Spout 1
Spout 2
Bolt 1
Bolt 2
Bolt 3
Bolt 4
Bolt 5
%%
%%
%%
%%
%%
HERON GROUPINGS
78
01
Shuffle Grouping
Random distribution of tuples
/
Fields Grouping
Group tuples by a field or
multiple fields
All Grouping
Replicates tuples to all tasks
02
.
03
-
04
Global Grouping
Send the entire stream to one
task
,
HERON TOPOLOGY - PHYSICAL EXECUTION
79
%
%
%
%
%
Spout 1
Spout 2
Bolt 1
Bolt 2
Bolt 3
Bolt 4
Bolt 5
%%
%%
%%
%%
%%
Shuffle Grouping
Shuffle Grouping
Fields Grouping
Fields Grouping
Fields Grouping
Fields Grouping
HERON TOPOLOGY - PHYSICAL EXECUTION
80
%
%
%
%
%
Spout 1
Spout 2
Bolt 1
Bolt 2
Bolt 3
Bolt 4
Bolt 5
%%
%%
%%
%%
%%
Shuffle Grouping
Shuffle Grouping
Fields Grouping
Fields Grouping
Fields Grouping
Fields Grouping
HERON ARCHITECTURE
81
Scheduler
Topology 1 Topology 2 Topology N
Topology
Submission
HERON ARCHITECTURE
82
Topology Master
ZK
Cluster
Stream
Manager
I1 I2 I3 I4
Stream
Manager
I1 I2 I3 I4
Logical Plan,
Physical Plan and
Execution State
Sync Physical Plan
DATA CONTAINER DATA CONTAINER
Metrics
Manager
Metrics
Manager
Metrics
Manager
Health
Manager
MASTER
CONTAINER
TOPOLOGY MASTER
83
Monitoring of containers Gateway for metrics Assigns role
STREAM MANAGER
84
Routes tuples Implements backpressure Processing Semantics
STREAM MANAGER
85
% %
S1 B2 B3
%
B4
STREAM MANAGER
86
S1 B2
B3
Stream
Manager
Stream
Manager
Stream
Manager
Stream
Manager
S1 B2
B3 B4
S1 B2
B3
S1 B2
B3 B4
B4
PHYSICAL EXECUTION
STREAM MANAGER BACK PRESSURE
87
Spout based back pressureTCP backpressure Stage by stage back pressure
HERON INSTANCE
88
Runs only one task (spout/bolt)
Exposes Heron API
Collects several metrics
API
G
HERON INSTANCE
89
Stream
Manager
Metrics
Manager
Gateway
Thread
Task Execution
Thread
data-in queue
data-out queue
metrics-out queue
HERON DEPLOYMENT
90
Topology 1
Topology 2
Topology N
Heron
Tracker
Heron
API Server
Heron
Web
ZK
Cluster
Aurora Services
Observability
HERON VISUALIZATION
91
HERON TOPOLOGY COMPLEXITY
92
HERON TOPOLOGY SCALE
93
CONTAINERS - 1 TO 600 INSTANCES - 10 TO 6000
HERON HAPPY FACTS :)
94
✦ No more pages during midnight for Heron team
๏ Backlog quotas
๏ Soft isolation - flow control and throttling
✦ Very rare incidents for Heron customer teams
✦ Easy to debug during incidents for quick turn
around
๏ Reduced resource utilization saving cost (~3x cost
savings)
95
Coffee Break
HERON DEVELOPER ISSUES
96
01 02
Container	resource	alloca$on
Parallelism	tuning
HERON OPERATIONAL ISSUES
97
01 02 03
Slow Hosts Network Issues Data Skew
/ .
-
04
Load Variations
,
05
SLA Violations
/
SLOW HOSTS
98
Memory Parity Errors Impending Disk Failures Lower GHZ
NETWORK ISSUES
99
Network Partitioning
G
Network Slowness
NETWORK SLOWNESS
100
01 02 03
Delays	processing
Data	is	accumula$ng
Timelines	of	results	
Is	affected
DATA SKEW
101
Multiple Keys
Several	keys	map	into	single	
instance	and	their	count	is	
high
Single Key
Single	 key	 maps	 into	 a	 instance	
and	its	count	is	high
H
C
LOAD VARIATIONS
102
Multiple Keys
Several	keys	map	into	single	
instance	and	their	count	is	
high
Single Key
Single	 key	 maps	 into	 a	 instance	
and	its	count	is	high
H
C
SELF REGULATING STREAMING SYSTEMS
103
Automate	Tuning
SLO		
Maintenance
Self	Regula$ng	
Streaming	Systems
Tuning
Manual,	 6me-consuming	 and	
error-prone	 task	 of	 tuning	 various	
systems	knobs	to	achieve	SLOs
SLO
Maintenance	 of	 SLOs	 in	 the	 face	 of	
unpredictable	load	varia6ons	and	hardware	
or	soWware	performance	degrada6on
Self	Regula$ng	Streaming	Systems
System	that	adjusts	itself	to	the	environmental	changes	
and	con6nue	to	produce	results
SELF REGULATING STREAMING SYSTEMS
104
Self tuning Self stabilizing Self healing
G !g
Several tuning knobs
Time consuming tuning phase
The system should take
as input an SLO and
automatically configure
the knobs.
The system should react
to external shocks and
a u t o m a t i c a l l y
reconfigure itself
Stream jobs are long running
Load variations are common
The system should
identify internal faults
and attempt to recover
from them
System performance affected
by hardware or software
delivering degraded quality of
service
ENTER DHALION
105
Dhalion periodically executes well-
specified policies that optimize
execution based on some objective.
We created policies that dynamically
provision resources in the presence of
load variations and auto-tune
streaming applications so that a
throughput SLO is met.
Dhalion is a policy based framework
integrated into Heron
DHALION POLICY FRAMEWORK
106
Symptom
Detector 1
Symptom
Detector 2
Symptom
Detector 3
Symptom
Detector N
....
Diagnoser 1
Diagnoser 2
Diagnoser M
....
Resolver
Invocation
D
iagnosis
1
Diagnosis 2
D
iagnosis
M
Symptom 1
Symptom 2
Symptom 3
Symptom N
Symptom
Detection
Diagnosis
Generation
Resolution
Resolver 1
Resolver 2
Resolver M
....
Resolver
Selection
Metrics
DYNAMIC RESOURCE PROVISIONING
107
Policy
This	policy	reacts	to	unexpected	
load	varia6ons	(workload	spikes)
Goal
Goal	is	to	scale	up	and	scale	down	
the	topology	resources	as	needed	-	
while	 keeping	 the	 topology	 in	 a	
steady	state	where	back	pressure	is	
not	observed
H
C
DYNAMIC RESOURCE PROVISIONING
108
Pending Tuples
Detector
Backpressure
Detector
Processing Rate
Skew Detector
Resource Over
provisioning
Diagnoser
Resource Under
Provisioning
Diagnoser
Data Skew
Diagnoser
Resolver
Invocation
Diagnosis
Symptoms
Symptom
Detection
Diagnosis
Generation
Resolution
Metrics
Slow Instances
Diagnoser
Bolt	Scale		
Down	Resolver
Bolt	Scale		
Up	Resolver
Data	Skew	
Resolver
Restart	
Instances	
Resolver
PROCESSING HERON
PULSAR
BOOKKEEPER
UNIFICATION
109
MESSAGING
STORAGE
STREAMLIO
STREAMLIO
110
UNIFIED ARCHITECTURE
Interactive
Querying
Storm API Streamlet SQL
Application
Builder
Pulsar
API
BK/
HDFS
API
Kubernetes
Metadata
Management
Operational
Monitoring
Chargeback
Security
Authentication
Quota
Management
Kafka
API
STREAMLIO ARCHITECTURE
111
KEY HIGHLIGHTS
DAG PROCESSING
GEO-REPLICATION
DURABILITY
CONSISTENCY FUNCTION PROCESSING
INFINITE RETENTION
SQL
112
ETL
DATA PROCESSING
113
CLEANSING AND TRANSFORMATION
Data format consistency
M:0:Male, F:1:Female
Missing Data
Imputation
Pre-computation
Sum, Max, Min, Distance
Timestamp
Hour of the day, Date
DATA ENRICHMENT
114
MULTIPLE DATA SOURCES
Operators
Associate
Aggregate
Filter
Properties
Scalable
Fault-tolerant
Secure
Challenges
Encryption
Merge
Union
Sort
ETL DATA PIPELINES - RETAIL
115
Scenario
Consumer retail company migrating multiple data
and analytics applications to public cloud
Challenges
Needed a solution to connect data to dashboards,
data warehouse, and applications
Solution
Cloud data pipeline built on Apache Pulsar to
support data movement, transformation, and
access
ETL DATA PIPELINES - RETAIL
116
Cloud data warehouse
Batch reporting
and analytics
Real-time dashboards and
alerts
Data sources and
applications
Messaging, queuing, data
transformations
Cloud data lake
117
Enterprise Event Bus
DATA SILOS
118
INTEGRATION
Disparate Data Sources
Multiple Data Types
Different Frequency
Data Fidelity
DATA FUSION
119
EXTRACTING INSIGHTS
Why?
Complementarity → Completeness
Redundancy → Accuracy
Cooperative → Dependability
Planning
Decision-Making
Applications
Smart Buildings
Remote sensing
Transportation
Multi-Target Tracking (MTT)
120
STREAMING JOINS
CHALLENGES
Bounded sliding window of tuples
Traditional join semantics
Hard to exploit many-core parallelism
Sequential operation model (store and process)
Linear data flow model (left-to-right / right-to-left)
Performance
๏ Deployment parameters
❖ Processing capacity, level of parallelism
๏ Time-varying input parameters
❖ Varying sources, Rate of tuples
SplitJoin
*	Guilsano	et	al.,	“Performance	Modeling	of	Stream	Joins”,	DEBS	2017.
ENTERPRISE EVENT BUS
121
Apache Pulsar Cluster
Application 2
Application 5
Application 1 Application 3
Application 4 Application 6
ENTERPRISE EVENT BUS - YAHOO!
122
Scenario
Need to collect and distribute user and data
events to distributed global applications at
Internet scale
Challenges
✦ Multiple technologies to handle messaging
needs
✦ Multiple, siloed messaging clusters
✦ Hard to meet scale and performance
✦ Complex, fragile environment
Solution
✦ Central event data bus using Apache Pulsar
✦ Consolidated multiple technologies and clusters into a
single solution
✦ Fully-replicated across 8 global datacenter
✦ Processing >100B messages / day, 2.3M topics
123
UNBOUNDED
UNORDERED
EVENT TIME
PROCESSING TIME
LOW LATENCY
LIMITED MEMORY
SINGLE PASS , INCREMENTAL
APPROXIMATE
APPROXIMATION
124
Guaranteed bounds on
approximation factors
Error greater than 𝜖 with
probability 𝛿
BOUNDS
PROBABILISTICDETERMINISTIC
APPROXIMATION
125
CHEBYSHEV
HOEFFDING
MARKOV
PROBABILISTIC BOUNDS
X: Random variable, μ: expectation, 𝜖>0
126
BIAS VS. VARIANCE
PROPERTIES
BIAS VARIANCE
How much the average of the
estimate differs from the true mean
Variance of the estimate
around its mean
127
BIAS VS. VARIANCE
TRADE-OFF
*	Illustra6on	borrowed	from	h`ps://web.stanford.edu/~has6e/ElemStatLearn/
*
DATA SKETCHES
128
129
SAMPLING
FILTERING
CARDINALITY
QUANTILE
FREQUENT
MOMENTS
SPECTRUM OF
DATA SKETCHES
SAMPLING
130
Volume
๏ Unbounded
Velocity
Types
๏ Uniform
๏ Non-uniform
❖ Class imbalance
Latency
SAMPLING
131
✦ Maintain dynamic sample
๏ A data stream is a continuous process
๏ Not known in advance how many points may elapse before an analyst may need to use a
representative sample
✦ Reservoir sampling[1]
๏ Probabilistic insertions and deletions on arrival of new stream points
๏ Probability of successive insertion of new points reduces with progression of the stream
❖ An unbiased sample contains a larger and larger fraction of points from the distant history of the
stream
✦ Practical perspective
๏ Data stream may evolve and hence, the majority of the points in the sample may represent the
stale history
[1]	J.	S.	Vi`er.	Random	Sampling	with	a	Reservoir.	ACM	Transac6ons	on	Mathema6cal	SoWware,	Vol.	11(1):37–57,	March	1985.
OBTAINING A REPRESENTATIVE SAMPLE
SAMPLING
132
✦ Sliding window approach*
(sample size k, window width n)
๏ Sequence-based
❖ Replace expired element with newly arrived element
‣ Disadvantage: highly periodic
❖ Chain-sample approach
‣ Select element ith with probability Min(i,n)/n
‣ Select uniformly at random an index from [i+1, i+n] of the element which will replace the ith item
‣ Maintain k independent chain samples
๏ Timestamp-based
❖ # elements in a moving window may vary over time
❖ Priority-sample approach
*	B.	Babcock.	Sampling	From	a	Moving	Window	Over	Streaming	Data.	In	Proceedings	of	SODA,	2002.
3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3
3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3
3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3
3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3
SAMPLING
133
COMPRESSED SENSING
Distributed
๏ Communication cost
๏ Non-uniform sampling rates
High dimensionality
๏ Sparsity
Applications
Sensor networks
Network traffic monitoring
Facial recognition
Mitigate latency impact
134
COMPRESSED SENSING
EARLY WORK
[Candès et al. 2006]
[Donoho 2006]
[Cormode and Muthukrishnan 2006]
[Freris et al. 2013]
Recursive approach
COMPRESSED SENSING
FOR STREAMING DATA
[Yan et al. 2015]
DISTRIBUTED OUTLIER
DETECTION
COMPRESSIVE
SAMPLING
[Candès and Wakin 2008]
RELATED WORKS
135
136
Document 2015
EVENTS
E-COMMERCE OPPORTUNITY
MODEL
137
INCREMENTAL
HANDLE NON-STATIONARITY, CONCEPT DRIFT
EVOLVE CONTINUOUSLY
PERSONALIZATION
138
THE HOLY GRAIL
Continuously evolving data streams
Time decay
Multi-armed bandit
Contextual
Deep learning
Privacy
INVENTORY MANAGEMENT
139
FORECASTING
Mining retail transactions
Seasonality
Virality
New product launches
140
Real-Time Trending
141
HERBERT SIMON
“A wealth of information
creates a poverty of
attention.”
142
CONTINUOUS (PREFERENCE) TOP-K ELEMENTS
*	Moura6dis	et	al.,	“Con8nuous	monitoring	of	top-k	queries	over	sliding	windows”,	SIGMOD	2006.	
#	Yu	et	al.,	“Processing	a	large	number	of	con8nuous	preference	top-k	queries”,	SIGMOD	2012.	
^	Zhang	et	al.,	“Reverse	k-ranks	query”,	VLDB	2014.
143
TWITTER
TRENDING
NEWS
TRENDING
144
Spotify, Apple Music, Pandora
AUDIO
145
YouTube, Vimeo, DailyMotion
TRENDING
VIDEO
Facebook Live, Periscope, Twitch
146
TRENDING
LIVE VIDEO
147
FAKE
ADDRESSING INTEGRITY
148
BIAS
VARIOUS SOURCES
Data bias
๏ Public image datasets
❖ Geographic representation
๏ Induces algorithmic bias[1]
❖ Online advertising
❖ Jobs recommendation
❖ Customer profiling
Anomalies
Data fidelity
๏ Bad actors
❖ Bots
❖ Data farms
*	Figure	borrowed	from	Baeza-Yates,	“Bias	on	the	web”,	2018.	
[1]	Hajian	et	al.,	“Algorithmic	bias:	From	Discrimina8on	Discovery	to	Fairness-aware	Data	mining”,	KDD	2016.
*
149
FACT CHECKING
CROWDSOURCING
REAL TIME TRENDING IN PULSAR & HERON
150
Streamlio (Apache Pulsar and Apache Heron)
Data
Source 2
clean-fn 2
Data
Source 1
Data
Source 3
clean-fn 1
trend-
topology 3
Trending
Application
T1
T2
T3
LATENCY
151
USER EXPERIENCE
THE TAIL[1]
152
CLASSIFICATION[2]
✦ Whether elements can only be added or can be both
inserted and deleted?
๏ Cash register model
๏ Turnstile model
✦ What operations are allowed on the elements?
๏ Comparison model
๏ Fixed-universe model
✦ Whether the algorithm is deterministic/randomized?
๏ Monte Carlo randomization
QUANTILE ESTIMATION
[1]	J.	Deam	and	L.	Barroso,	“The	tail	at	scale”,	2013.	
[2]	G.	Luo	et	al.,	“Quan8les	over	data	streams:	experimental	comparisons,	new	analyses,	and	further	improvements”,	2016.
QUANTILE ESTIMATION
153
[Arasu and Manku 2005]
SLIDING WINDOWS
[Cormode et al. 2005]
[Yi et al. 2009]
CONTINUOUS
MONITORING
[Shrivastava et al. 2004]
[Agarwal et al. 2012]
DISTRIBUTED DATA
[Cormode et al. 2006]
BIASED
FLAVORS
*	Wang	et	al.,	“Quan8les	over	Data	Streams:	An	Experimental	Study”,	SIGMOD	2013.
QUANTILE ESTIMATION
154
PICK
[Blum et al. 1973]
EXACT MEDIAN
[Munro and Patterson 1980]
Q-DIGEST
[Shrivastava et al. 2002]
Interest in this problem may be traced to the realm of sports and the design of
(traditionally, tennis) tournaments to select the first- and second-best players. In 1883,
Lewis Carroll published an article denouncing the unfair method by which the second-
best player is usually determined in a "knockout tournament" -- the loser of the final
match is often not the second-best! (Any of the players who lost only to the best player
may be second-best.) Around 1930, Hugo Steinhaus brought the problem into the realm
of algorithmic complexity by asking for the minimum number of matches required to
(correctly) select both the first- and second-best players from a field of n contestants. I
T-DIGEST
[Dunning and Ertl 2017]
p passes
space
Max signal
value
# Elements
Compression
Factor
Complete binary tree
Q-DIGEST[1]
155
✦ Groups	values	in	variable	size	buckets	of	almost		equal	weights	
๏ Unlike a traditional histogram, buckets can overlap
✦ Key features
๏ Detailed information about frequent values preserved
๏ Less frequent values lumped into larger buckets
✦ Using message of size m, answer within an error of
✦ Except root and leaf nodes, a node v ∈ q-digest iff
[1]	Shrivastava	et	al.,	“Medians	and	Beyond:	New	Aggrega8on	Techniques	for	Sensor	Networks”.	SenSys,	2004.
Q-DIGEST (CONTD.)
156
✦ Building a q-digest
✦ q-digests can be constructed in a
distributed fashion
๏ Merge q-digests
T-DIGEST[1]
157
[1]	T.	Dunning	and	O.	Ertl,	”Compu8ng	Extremely	Accurate	Quan8les	using	t-digests”,	2017.	h`ps://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf	
✦ Approximation of rank-based statistics
๏ Compute quantile q with an accuracy relative to max(q, 1-q)
๏ Compute hybrid statistics such as trimmed statistics
✦ Key features
๏ Robust with respect to highly skewed distributions
๏ Independent of the range of input values (unlike q-digest)
๏ Relative error is bounded
❖ Non-equal bin sizes
❖ Few samples contribute to the bins corresponding to the extreme quantiles
๏ Merging independent t-digests
❖ Reasonable accuracy
T-DIGEST (CONTD.)
158
✦ Group samples into sub-sequences
๏ Smaller sub-sequences near the ends
๏ Larger sub-sequences in the middle
✦ Scaling function
๏ Mapping k is monotonic
❖ k(0) = 1 and k(1) = δ
❖ k-size of each subsequence < 1 s
Notional
Index
Compression
parameter
Quantile
T-DIGEST (CONTD.)
159
✦ Estimating quantile via interpolation
๏ Sub-sequences contain centroid of the samples
๏ Estimate the boundaries of the sub-sequences
✦ Error
๏ Scales quadratically in # samples
๏ Small # samples in the sub-sequences near q=0 and q=1 improves accuracy
๏ Lower accuracy in the middle of the distribution
❖ Larger sub-sequences in the middle
✦ Two flavors
๏ Progressive merging (buffering based) and clustering variant s
160
RELATED WORKS
SUMMARY DATA
STRUCTURE
[Greenwald and Khanna 2001]
[Luo et al. 2016]
RANDOM
BQ-SUMMARY
[Zhang et al. 2006]
One-scan randomized
MR
[Cormode et al. 2006]
A SMALL SNAPSHOT
161
PERFORMANCE
162
MONITORING
CPM (Cost per 1000 Impressions), vCPM, CPCV
CPC (Cost per Click)
CPI (Cost per Install)
CPE (Cost per Engagement)
CPA (Cost per Action), CPO
PERFORMANCE
163
MONITORING
AD
FORMATS
POP UNDER
INTERSTITIAL
REDIRECTBANNERS
REWARDED
VIDEO
POP UP
SOCIAL
VIDEO
FRAUD
164
DETECTION
IMPRESSION
CLICK
INSTALL
ENGAGEMENT
FRAUD DETECTION USING APACHE PULSAR
165
Apache Pulsar
Data
Source 2
clean-fn 2
Data
Source 1
Data
Source 3
clean-fn 1 fraud-fn 1
Fraud
Detection
T1
T2
T3
fraud-fn 2
IP/DEVICE ID BLACKLISTING
166
CUCKOO FILTER
[Fan et al. CoNext 2014]
QUOTIENT FILTER
[Bender et al. Hot Storage 2011]
MORTON FILTER
[Breslow and Jayasena, VLDB 2018]
BLOOM FILTER
[Bloom, CACM 1970]
FILTERING
BLOOM FILTER
167
[1]	Illustra6on	borrowed	from	h`p://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf
[1]
BLOOM FILTER
168
✦		Natural generalization of hashing
✦ False positives are possible
✦ No false negatives
No deletions allowed
✦ For false positive rate ε, # hash functions = log2(1/ε)
where, n = # elements,
k = # hash functions
m = # bits in the array
VARIATIONS
169
BLOOM FILTER
[Mitzenmacher 2002]
COMPRESSED
[Fan et al. 2000]
COUNTING
[Cohen and Matias 2003]
SPECTRAL
[Chazelle et al. 2004]
BLOOMIER FILTER
[Guo et al. 2010]
BUFFERED
[Lall and Ogihara, 2007]
BITWISE
[Yoon 2010]
A2
[Canim et al. 2010]
DYNAMIC
[Einziger and Friedman, 2014]
BLOOM-g
[Debnath et al. 2011]
Bloomflash
[Naor and Yogev, 2013]
SLIDING
[Qiao et al. 2014]
Tinyset
QUOTIENT FILTER
170
✦		Supports insertion, deletion
✦ Merging/resizing feasible
✦ Lookups incur a single cache miss
✦ 20% bigger than Bloom Filter
✦ Stores p-bit fingerprints of elements
✦ Compact hash table
✦ P(hard collision) =
Employs quotienting: remainder (r least significant bits) is stored in the bucket indexed by the
quotient (q most significant bits)
where, ⍺ = n/m, m=2q
, n: # elements, m: # slots
VARIATIONS
171
QUOTIENT FILTER
[Bender et al. 2011]
CASCADE
[Pandey et al. 2017]
COUNTING RANK-AND-SELECT
✦		Multiset of integer, each of width p-bits
✦ l =log2( n/M) + O(1) in-flash QFs of exponentially increasing size, QF1, QF2, …, QFl
stored contiguously, M: RAM size
✦ Search: O(log(n/M)) block reads
✦ Insert: O((log( n/M))/B) amortized block writes/erases, B: natural block size of
the flash
[Dutta et al. 2014]
STREAMING
[Bungeroth, 2018]
CUCKOO FILTER
172
✦ Key Highlights
๏ Add and remove items dynamically
๏ For false positive rate ε < 3%, more space efficient than Bloom filter
๏ Higher performance than Bloom filter for many real workloads
๏ Asymptotically worse performance than Bloom filter
❖ Min fingerprint size α log (# entries in table)
✦ Overview
๏ Stores only a fingerprint of an item inserted
❖ Original key and value bits of each item not retrievable
๏ Set membership query for item x: search hash table for fingerprint of x
Cuckoo Hashing [1]
CUCKOO FILTER
173
[1]	R.	Pagh	and	F.	Rodler.	Cuckoo	hashing.	Journal	of	Algorithms,	51(2):122-144,	2004.	
[2]	Illustra6on	borrowed	from	“Fan	et	al.	Cuckoo	Filter:	Prac6cally	Be`er	Than	Bloom.	In	Proceedings	of	the	10th	ACM	Interna6onal	on	Conference	on	Emerging	Networking	Experiments	and	Technologies,	2014.”
[2]
Illustra6on	of	Cuckoo	hashing	[2]
✦ High space occupancy
✦ Practical implementations: multiple items/bucket
✦ Example uses: Software-based Ethernet switches
Cuckoo Filter [2]
✦ Uses a multi-way associative Cuckoo hash table
✦ Employs partial-key cuckoo hashing
๏ Store fingerprint of an item
๏ Relocate existing fingerprints to their alternative
locations
[2]
CUCKOO FILTER
174
✦ Deletion
๏ Item	must	have	been	previously	
inserted
✦ Partial-key cuckoo hashing
✦ Fingerprint hashing ensures uniform
distribution of items in the table
✦ Length of fingerprint << Size of h1 or h2
✦ Possible to have multiple entries of a
fingerprint in a bucket
Alternate	
bucket
Significantly	shorter	
than	h1	and	h2
CONCURRENT
CUCKOO HASHING
VARIATIONS
175
CUCKOO HASHING AND CUCKOO FILTER
[Li et al. 2014]
[Eppstein et al. 2017]
2-3 CUCKOO FILTER
HORTON TABLES
[Sun et al. 2017]
SMART CUCKOO
[Breslow et al. 2016]
MORTON FILTER
176
✦ Key Highlights
๏ Insertions heavily biased towards the hash function H1
❖ Fingerprint retrieval require fewer hardware cache accesses
๏ Overflow Tracking Array (OTA): simple bit vector
❖ Tracks when fingerprints cannot be placed using H1
❖ Most negative lookups only require accessing a single bucket
๏ Fullness counters
❖ Replace the empty slots (compression) and track the load of each logical bucket
❖ Reads and updates happen in-situ without explicit need for materialization
๏ Most insertions require accessing a single cache line
๏ Fewer fingerprint comparisons than a Cuckoo filter
MORTON FILTER
177
✦ Hashing
๏ Reduces TLB misses, DRAM row buffer misses and page faults
๏ Does not require total # buckets to be a power of 2
K’s fingerprint
Buckets per block
Total buckets (multiple of 2)
Bucket index where K’s
fingerprint is stored
Hash function
๏ Compute H’(K) and remap K’s fingerprint without knowing K itself
๏ For any key, candidate buckets mostly fall within the same page of memory
Power of 2
SPEND
178
DYNAMIC (RE)-ALLOCATION
SANs
Facebook, Google, Twitter, Yahoo, Snapchat
Ad Networks
Affiliates
Offer wall
TV
179
EXCHANGES
180
PROGRAMMATIC
DoubleClick
MoPub
OpenX
AppNexus
USER PROFILE
181
MULTI-DIMENSIONAL TARGETING
Platform
Device Type
OS Version
Apps Installed
Age, Gender
STATISTICAL ARBITRAGE
182
PREDICTION
Click Through Rate (CTR)
Click to Install (CTI)
Life Time Value (LTV)
Churn probability
REAL-TIME ANOMALY DETECTION
ANOMALY DETECTION
183
Temporal
Distribution based
A
N
O
M
A
L
Y
DIFFERENT FLAVORS
safaribooksonline.com/library/view/understanding-anomaly-detection/9781491983676/
“On the Runtime-Efficacy Trade-off of Anomaly Detection Techniques for Real-Time Streaming Data”, by Choudhary et al. 2017. (https://arxiv.org/abs/1710.04735)
ANOMALY DETECTION
184
ROOTED IN STUDIES IN PROBABILITY
1777
1713
Reconciling discrepant observations
Euler 1749
Irregularities in the motion of Saturn and Jupiter
Mayer 1750
Study of lunar libration
Boscovich 1755
Measurements of the mean ellipticity
of the Earth
ANOMALY DETECTION
185
LONG HISTORY
First formalization of an exact rule for rejection of anomalies
ANOMALY DETECTION
186
CHARACTERISTICS
MAGNITUDE
Severity
WIDTH
Actionability
FREQUENCY
Reliability
DIRECTION
Positive/Negative
Global
Local
ANOMALY DETECTION
187
WHY IT’S NON-TRIVIAL
SEASONALITY TREND BREAKOUTSTATIONARITYNOISE
ANOMALY DETECTION
188
ASSUMPTIONS
Normality, Stationarity
MOVING AVERAGES
Params: Width, Decay
RULE BASED
µ ± σ
COMMON APPROACHES
ANOMALY DETECTION
189
MAD[1]
Median Absolute
Deviation
MEDIAN
MCD[2]
Minimum Covariance
Determinant
MVEE[3,4]
Minimum Volume
Enclosing Ellipsoid
ROBUST MEASURES
[1]P.	J.	Rousseeuw	and	C.	Croux,	“Alterna6ves	to	the	Median	Absolute	Devia6on”,	1993.	
[2]	h`p://onlinelibrary.wiley.com/wol1/doi/10.1002/wics.61/abstract	
[3]	P.	J.	Rousseeuw	and	A.	M.	Leroy.,“Robust	Regression	and	Outlier	Detec6on”,	1987.	
[4]	M.	J.Todda	and	E.	A.	Yıldırım	,	“On	Khachiyan's	algorithm	for	the	computa6on	of	minimum-volume	enclosing	ellipsoids”,	2007.
ANOMALY DETECTION
190
CONTEXT
✦ Data types
๏ Text, Audio, Video
✦ Data veracity
๏ Wearables
๏ Smart cities, Connected
Home
๏ Internet of Things
LIVE DATA
✦ Multi-dimensional
✦ Mow memory footprint
✦ Accuracy-speed tradeoff
CHALLENGES
191
ANOMALY DETECTION
DIFFERENT APPLICATION DOMAINS
RADIO ANOMALY
DETECTION
MARKER
DISCOVERY
CROWDED SCENE
ANALYSIS
ANOMALOUS
HUMAN BEHAVIOR
HUMAN TRAJECTORY
PREDICTION
[Popoola and Wang 2012] [Alahi et al. 2016]
[O’Shea et al. 2016]
[Schegl et al. 2017] [Sabokrou et al. 2017]
192
Real-Time Security
THREAT DETECTION
193
POTENTIAL HIGH BUSINESS DOWNSIDE
DDos Attack
Network Intrusion
Harassement
Account hacking
Identity theft
Purchases
MONITORING
194
AUDIT LOGS
Answering the “What, When, Who, Why”
Downloading of unusual amount of data
Troubleshoot systems, networks
Real-time packet analysis
Challenge: Encryption
Suspicious access pattern
REAL TIME SECURITY USING APACHE PULSAR
195
Apache Pulsar
Data
Source 2
threat-fn 2
Data
Source 1
Data
Source 3
threat-fn 1
Query
Audit Logs
(SQL)
Streaming
threats &
anomalies
T1
T2
T3
Audit Log
Investigation
DEVICE IDS/IP ADDRESSES
196
FREQUENT/TOP-K ELEMENTS, HEAVY HITTERS
✦ Require large amount of memory
✦ Long tail - low business value
EXACT
✦ Types
๏ Counter-based
❖ Example: Frequent Algorithm [Demaine et al. 2002]
๏ Sketch-based
❖ Example: GroupTest [Cormode and Muthukrishnan, 2002]
✦ Parameters
๏ # counters, # update operations, error
APPROXIMATE
FLAVORS
197
FREQUENT/TOP-K ELEMENTS, HEAVY HITTERS
l 1 heavy hitter
๏ m can be known/unknown
๏ Deterministic
❖ [Misra and Gries 1982, Karp et al. 2003]
๏ Randomized
❖ [Estan and Varghese 2003, Kumar and Xu 2006]
[1]	Woodruff,	2016.	“New	Algorithms	for	Heavy	Hi`ers	in	Data	Streams”.ICDT.
Frequency	of	item	i
#	elements		
in	stream
Output
FLAVORS
198
FREQUENT/TOP-K ELEMENTS, HEAVY HITTERS
l 2 heavy hitter
๏ Example
❖ [Charikar et al. 2004]
๏ Applications
❖ Compressed sensing
❖ Numerical linear algebra
❖ Cascaded aggregates
Frequency	of	item	i
#	dis6nct	items	
Output
COUNT-MIN SKETCH[1]
199
✦ A two-dimensional array counts with w columns and d rows
✦ Each entry of the array is initially zero
✦ d hash functions are chosen uniformly at random from a pairwise independent family
✦ Update
๏ For a new element i, for each row j and k = hj(i), increment the kth column by one
✦ Point query where, sketch is the table
✦ Parameters
[1]	Cormode,	Graham;	S.	Muthukrishnan	(2005).	"An	Improved	Data	Stream	Summary:	The	Count-Min	Sketch	and	its	Applica6ons".	J.	Algorithms	55:	29–38.
),( δε
!
!
"
#
#
$
=
ε
e
w
!
!
"
#
#
$
=
δ
1
lnd
}1{}1{:,,1 wnhh d ……… →
COUNT-MIN SKETCH
200
✦ Count-Min sketch with conservative update (CU sketch)
๏ Update an item with frequency c
๏ Avoid unnecessary updating of counter values => Reduce over-estimation error
๏ Prone to over-estimation error on low-frequency items
✦ Lossy Conservative Update (LCU) - SWS
๏ Divide stream into windows
๏ At window boundaries, ∀ 1 ≤ i ≤ w, 1 ≤ j ≤ d, decrement sketch[i,j] if 0 < sketch[i,j] ≤
[1]	Cormode,	G.	2009.	Encyclopedia	entry	on	’Count-MinSketch’.	In	Encyclopedia	of	Database	Systems.	Springer.,	511–516.
VARIANTS
[1]
COUNTER TREE[1]
201
✦ Motivation
๏ Equi-bitwidth counters → inefficient space usage owing variance in counts
✦ Two-dimensional counter sharing
✦ m virtual counters (V[i], 0 ≤ i < m)
๏ Path from a leaf node to the root
๏ Counter at layer j in V[i] given by:
[1]	Chen	et	al.,	“Counter	Tree:	A	Scalable	Counter	Architecture	for	Per-FLow	Traffic	Measurement”,	TON	2017.	
#	leaf	nodes
Total	space	(#	bits)
Counter	bit	width
Tree	height
Degree	of	a	non-leaf	
node
COUNTER TREE (CONTD.)
202
✦ Counting range
✦ Virtual counter array Vf
๏ Vf[i] = V[hi(f)], 0 ≤ i < r,
✦ Recording
๏ Increment a counter in Vf, chosen uniformly
at random, by one
๏ May result in update of multiple counters
#	counters	in	
the	bo`om	
layer
COUNTER TREE (CONTD.)
203
✦ Estimation
๏ CT based Estimation (CTE)
❖ k = dh-1
❖ Xi is the value of the subtree rooted at C[v], where C[v] is the component counter of V[u] at the highest
level and
๏ CT based Maximum Likelihood Estimation (CTM)
Reduce positive bias (overestimation)
VARIATIONS
204
FREQUENT/TOP-K ELEMENTS, HEAVY HITTERS
LOSSY COUNTING
[Manku and Motwani, 2002]
[Homem and Carvalho, 2010]
FILTERED SPACE
SAVING
COUNT-SKETCH
[Dimitropoulos et al. 2008]
PROBABILISTIC
LOSSY COUNTING
[Charikar et al, 2002]
[Pitel and Fouquier, 2015]
COUNT-MIN-LOG
SPACE SAVING
ALGORITHM
[Metwally et al. 2005]
FDPM
[Yu etal., 2006]
[Sivaraman et al. 2017]
HASHPIPE
205
HEALTH CARE
206
WEARABLES
Smart watches, smart glasses, smart clothing,
implantables
Data veracity
๏ Hear rate (ECG and HRV)
๏ Brainwave (EEG)
๏ Muscle bio-signals (EMG)
๏ Blood pressure
๏ Calories burned
๏ Body temperature
๏ Steps walked
Other applications
๏ Blood alcohol content
๏ Athletic performance
MONITORING
207
OIL DRILLING
Early detection of leaks and/or spills
๏ Oil
๏ Flammable gases
๏ Hazardous chemicals
Equipment health
๏ Corrosion
Operational metrics
๏ Pressure
๏ Temperature
๏ Flow
๏ Air quality
QUALITY OF SERVICE
208
CUSTOMER FOCUS
WiFi, Video streaming, E-commerce
Metrics
๏ Availability
๏ Throughput
๏ Response time
๏ Document Complete
๏ Jitter
๏ Packet loss ratio
๏ Delivery success rate (messaging)
๏ Location accuracy
Common issues
๏ Anomalies, Breakouts
INDUSTRIAL IOT USING APACHE PULSAR
209
Apache Pulsar Cloud
Data
Source 2
clean-fn 2
Data
Source 1
Data
Source 3clean-fn 1
logic-fn 1
T1 T2
T3
logic-fn 2
Apache Pulsar
Edge
filter-fn 1 filter-fn 1
NUMBER OF DISTINCT DATA SOURCES
210
CARDINALITY ESTIMATION
✦ Hash values as strings
✦ Occurrence of particular patterns in the binary representation
✦ Example: Hyperloglog [Flajolet et al. 2008]
BIT-PATTERN OBSERVABLES
✦ Hash values as real numbers
✦ k-th smallest value
๏ Insensitive to distribution of repeated values
✦ Examples: MinCount [Giroire, 2000]
ORDER STATISTIC OBSERVABLES
CARDINALITY ESTIMATION
211
ANOTHER CLASSIFICATION
SKETCH-BASED
Scan the entire data set once
Hash the items
Create sketch
SAMPLING
Potentially large estimation error
UNIFORM HASHING
Employs a uniform hash function
LOGARITHMIC HASHING
Keeps track of the most uncommon
element observed so far
Example: Hyperloglog
INTERVAL BASED
How packed an interval is?
BUCKET BASED
P(bucket is non-empty)
HYPERLOGLOG
212
where
✦ Apply hash function h to every element in a multiset
✦ Cardinality of multiset is 2max(ϱ) where 0ϱ-11 is the bit pattern observed at the beginning
of a hash value
✦ Above suffers with high variance
๏ Employ stochastic averaging
๏ Partition input stream into m sub-streams Si using first p bits of hash values (m = 2p)
HYPERLOGLOG
213
OPTIMIZATIONS
✦ Use of 64-bit hash function
๏ Total memory requirement 5 * 2p -> 6 * 2p, where p is the precision
✦ Empirical bias correction
๏ Uses empirically determined data for cardinalities smaller than 5m and uses the unmodified
raw estimate otherwise
✦ Sparse representation
๏ For n≪m, store an integer obtained by concatenating the bit patterns for idx and ϱ(w)
๏ Use variable length encoding for integers that uses variable number of bytes to represent
integers
๏ Use difference encoding - store the difference between successive elements
✦ Other optimizations [1, 2]
[1]	h`p://druid.io/blog/2014/02/18/hyperloglog-op6miza6ons-for-real-world-systems.html	
[2]	h`p://an6rez.com/news/75
MIN-COUNT
VARIATIONS
214
CARDINALITY ESTIMATION
[Giroire, 2009]
Optimal statistical efficiency
[Ting, 2014]
DISCRETE MAX-COUNT
F0 AND L0 ESTIMATION
[Chen et al. 2011]
S-BITMAP
[Kane et al. 2010]
Optimal space complexity
WRAPPING UP
215
PLATFORM UNIFICATION APPLICATIONS ANALYTICS
QUESTIONS
216
STAY IN TOUCH
@arun_kejariwal
@karthikz
TWITTER
EMAIL
arun_kejariwal@acm.org
karthik@streaml.io
217
218
@arun_kejariwal @karthikz
READINGS
219
Data Streams: Algorithms and Applications”,
by M. Muthukrishnan
“Sketching as a Tool for Numerical Linear Algebra”,
by D. Woodruff
“Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches”,
by G. Cormode, M. Garofalakis and P. J. Haas
“Data Streams: Models and Algorithms”,
by C. Aggarwal.
“Graph Streaming Algorithms”,
by A. McGregor
“Data Sketching”, by G. Cormode.
READINGS
220
“The algebra of stream processing functions”, TCS 2001.
STREAM CALCULUS
“A conductive calculus of streams”, MSCS 2005.
“Temporal stream algebra”, 2012.
“Stream processing col-algebraically”, SCP 2013.
“An algebra for pattern matching, time aware aggregates and
partitions on relational data streams”, DEBS 2015.
READINGS
221
“Streaming Algorithms for Robust Distinct
Elements”, SIGMOD 2016.
“Pyramid Sketch”, VLDB 2017. “Bias-Aware Sketches”, VLDB 2017.
“Efficient Adaptive Detection of Complex
Event Patterns”, VLDB 2018.
“Cardinality Estimation: An Experimental
Survey”, VLDB 2018.
“Augmented Sketch: Faster and More Accurate
Stream Processing”, SIGMOD 2016.
“BPTree: an L2 heavy hitters algorithm using
constant memory”, PODS 2017.
“A comparative analysis of state-of-the-art SQL-
on-Hadoop systems for interactive analytics”,
BigData 2018.
“Unsupervised real-time anomaly detection for
streaming data”, Neurocomputing2017.
“Time Adaptive Sketches (Ada-Sketches) for
Summarizing Data Streams”, SIGMOD 2016.
222
READINGS
“Streaming Anomaly Detection Using Randomized
Matrix Sketching”, VLDB 2015.
“Graph Sketches: Sparsification, Spanners, and Subgraphs”,
PODS 2012.
“Stream Order and Order Statistics: Quantile Estimation
in Random-Order Streams”, SIAM JoC 2009.
“Querying and mining data streams: You only get one
look”, SIGMOD 2002.
“Models and Issues in Data Stream Systems”, PODS 2002.
“Clustering Data Streams”, FOCS 2000.
“Persistent Data Sketching”, VLDB 2015.
“SCREEN: Stream Data Cleaning under Speed
Constraints”, SIGMOD 2015.
“An optimal algorithm for the distinct elements
problem”, PODS 2010.
“Statistical Analysis of Sketch Estimators”, SIGMOD 2007.
OPEN SOURCE
223
Data Sketches (https://datasketches.github.io/)
Algebird (https://github.com/twitter/algebird)
StreamDM (http://huawei-noah.github.io/streamDM/)
stream (https://cran.r-project.org/web/packages/stream/)
FluRS (https://github.com/takuti/flurs)
RESOURCES
224
https://www.slideshare.net/arunkejariwal/correlation-analysis-on-live-data-streams (O’Reilly Strata)
https://www.slideshare.net/arunkejariwal/modern-realtime-streaming-architectures (O’Reilly Strata)
https://www.slideshare.net/arunkejariwal/live-anomaly-detection-80287265 (O’Reilly Strata)
https://www.slideshare.net/arunkejariwal/anomaly-detection-in-realtime-data-streams-using-heron
(O’Reilly Strata)
https://www.slideshare.net/arunkejariwal/real-time-analytics-algorithms-and-systems (VLDB 2015)
RESOURCES
225
http://www.cs.toronto.edu/~koudas/courses/csc2508/SQL-on-Hadoop-Final.pdf (VLDB 2015)
https://www.cse.ust.hk/vldb2002/program-info/tutorial-slides/T5garofalalis.pdf (VLDB 2002)
https://datasketches.github.io/docs/Research.html
https://people.cs.umass.edu/~mcgregor/slides/07-nipsslides.pdf (NIPS 2007)
http://dimacs.rutgers.edu/~graham/pubs/slides/streammining.pdf
RESOURCES
226
https://people.cs.umass.edu/~mcgregor/slides/10-jhu1.pdf
http://dmac.rutgers.edu/Workshops/WGUnifyingTheory/Slides/cormode.pdf
https://archive.siam.org/meetings/sdm13/gupta.pdf (SDM 2013)
http://www.outlier-analytics.org/odd13kdd/papers/slides_charu_aggarwal.pdf (ODD 2013)
https://www.siam.org/meetings/sdm08/TS2.ppt (SDM 2008)
RESOURCES
227
https://www.cc.gatech.edu/~jx/reprints/talks/sigm07_tutorial.pdf (SIGMETRICS 2007)
hanj.cs.illinois.edu/bk3/bk3_slides/12Outlier.ppt
https://gist.github.com/debasishg/8172796
https://www.euroscipy.org/2017/descriptions/19827.html

Contenu connexe

Tendances

[Kubecon 2017 Austin, TX] How We Built a Framework at Twitter to Solve Servic...
[Kubecon 2017 Austin, TX] How We Built a Framework at Twitter to Solve Servic...[Kubecon 2017 Austin, TX] How We Built a Framework at Twitter to Solve Servic...
[Kubecon 2017 Austin, TX] How We Built a Framework at Twitter to Solve Servic...Vinu Charanya
 
AI-Powered Streaming Analytics for Real-Time Customer Experience
AI-Powered Streaming Analytics for Real-Time Customer ExperienceAI-Powered Streaming Analytics for Real-Time Customer Experience
AI-Powered Streaming Analytics for Real-Time Customer ExperienceDatabricks
 
Streaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APIStreaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APICarol McDonald
 
CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016Mathieu Dumoulin
 
Real-time processing of large amounts of data
Real-time processing of large amounts of dataReal-time processing of large amounts of data
Real-time processing of large amounts of dataconfluent
 
Streaming Analytics for Financial Enterprises
Streaming Analytics for Financial EnterprisesStreaming Analytics for Financial Enterprises
Streaming Analytics for Financial EnterprisesDatabricks
 
Streamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache PulsarStreamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache PulsarStreamlio
 
Performance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsPerformance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsDatabricks
 
Special Purpose Quantum Annealing Quantum Computer v1.0
Special Purpose Quantum Annealing Quantum Computer v1.0Special Purpose Quantum Annealing Quantum Computer v1.0
Special Purpose Quantum Annealing Quantum Computer v1.0Aditya Yadav
 
How Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareHow Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareCarol McDonald
 
Time series-analysis-using-an-event-streaming-platform -_v3_final
Time series-analysis-using-an-event-streaming-platform -_v3_finalTime series-analysis-using-an-event-streaming-platform -_v3_final
Time series-analysis-using-an-event-streaming-platform -_v3_finalconfluent
 
A Microservice Architecture for Big Data Pipelines
A Microservice Architecture for Big Data PipelinesA Microservice Architecture for Big Data Pipelines
A Microservice Architecture for Big Data PipelinesDaniel Mescheder
 
Bridge to Cloud: Using Apache Kafka to Migrate to AWS
Bridge to Cloud: Using Apache Kafka to Migrate to AWSBridge to Cloud: Using Apache Kafka to Migrate to AWS
Bridge to Cloud: Using Apache Kafka to Migrate to AWSconfluent
 
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...Mathieu Dumoulin
 
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...Mathieu Dumoulin
 
Event Data Processing with Streamlio
Event Data Processing with StreamlioEvent Data Processing with Streamlio
Event Data Processing with StreamlioStreamlio
 
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...Karthik Ramasamy
 
Automatski - RSA-2048 Cryptography Cracked using Shor's Algorithm on a Quantu...
Automatski - RSA-2048 Cryptography Cracked using Shor's Algorithm on a Quantu...Automatski - RSA-2048 Cryptography Cracked using Shor's Algorithm on a Quantu...
Automatski - RSA-2048 Cryptography Cracked using Shor's Algorithm on a Quantu...Aditya Yadav
 
Advanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataAdvanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataCarol McDonald
 

Tendances (20)

[Kubecon 2017 Austin, TX] How We Built a Framework at Twitter to Solve Servic...
[Kubecon 2017 Austin, TX] How We Built a Framework at Twitter to Solve Servic...[Kubecon 2017 Austin, TX] How We Built a Framework at Twitter to Solve Servic...
[Kubecon 2017 Austin, TX] How We Built a Framework at Twitter to Solve Servic...
 
AI-Powered Streaming Analytics for Real-Time Customer Experience
AI-Powered Streaming Analytics for Real-Time Customer ExperienceAI-Powered Streaming Analytics for Real-Time Customer Experience
AI-Powered Streaming Analytics for Real-Time Customer Experience
 
Streaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APIStreaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka API
 
CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016
 
Real-time processing of large amounts of data
Real-time processing of large amounts of dataReal-time processing of large amounts of data
Real-time processing of large amounts of data
 
Streaming Analytics for Financial Enterprises
Streaming Analytics for Financial EnterprisesStreaming Analytics for Financial Enterprises
Streaming Analytics for Financial Enterprises
 
Streamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache PulsarStreamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache Pulsar
 
Performance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsPerformance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud Environments
 
Special Purpose Quantum Annealing Quantum Computer v1.0
Special Purpose Quantum Annealing Quantum Computer v1.0Special Purpose Quantum Annealing Quantum Computer v1.0
Special Purpose Quantum Annealing Quantum Computer v1.0
 
How Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareHow Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health Care
 
Time series-analysis-using-an-event-streaming-platform -_v3_final
Time series-analysis-using-an-event-streaming-platform -_v3_finalTime series-analysis-using-an-event-streaming-platform -_v3_final
Time series-analysis-using-an-event-streaming-platform -_v3_final
 
A Microservice Architecture for Big Data Pipelines
A Microservice Architecture for Big Data PipelinesA Microservice Architecture for Big Data Pipelines
A Microservice Architecture for Big Data Pipelines
 
GoTo Amsterdam 2013 Skinned
GoTo Amsterdam 2013 SkinnedGoTo Amsterdam 2013 Skinned
GoTo Amsterdam 2013 Skinned
 
Bridge to Cloud: Using Apache Kafka to Migrate to AWS
Bridge to Cloud: Using Apache Kafka to Migrate to AWSBridge to Cloud: Using Apache Kafka to Migrate to AWS
Bridge to Cloud: Using Apache Kafka to Migrate to AWS
 
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
 
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...
 
Event Data Processing with Streamlio
Event Data Processing with StreamlioEvent Data Processing with Streamlio
Event Data Processing with Streamlio
 
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...
 
Automatski - RSA-2048 Cryptography Cracked using Shor's Algorithm on a Quantu...
Automatski - RSA-2048 Cryptography Cracked using Shor's Algorithm on a Quantu...Automatski - RSA-2048 Cryptography Cracked using Shor's Algorithm on a Quantu...
Automatski - RSA-2048 Cryptography Cracked using Shor's Algorithm on a Quantu...
 
Advanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataAdvanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming Data
 

Similaire à Designing Modern Streaming Data Applications

Apache Pulsar Seattle - Meetup
Apache Pulsar Seattle - MeetupApache Pulsar Seattle - Meetup
Apache Pulsar Seattle - MeetupKarthik Ramasamy
 
Unifying Messaging, Queueing & Light Weight Compute Using Apache Pulsar
Unifying Messaging, Queueing & Light Weight Compute Using Apache PulsarUnifying Messaging, Queueing & Light Weight Compute Using Apache Pulsar
Unifying Messaging, Queueing & Light Weight Compute Using Apache PulsarKarthik Ramasamy
 
Incrementalism: An Industrial Strategy For Adopting Modern Automation
Incrementalism: An Industrial Strategy For Adopting Modern AutomationIncrementalism: An Industrial Strategy For Adopting Modern Automation
Incrementalism: An Industrial Strategy For Adopting Modern AutomationSean Chittenden
 
IBM IMPACT 2014 - AMC-1883 - Where's My Message - Analyze IBM WebSphere MQ Re...
IBM IMPACT 2014 - AMC-1883 - Where's My Message - Analyze IBM WebSphere MQ Re...IBM IMPACT 2014 - AMC-1883 - Where's My Message - Analyze IBM WebSphere MQ Re...
IBM IMPACT 2014 - AMC-1883 - Where's My Message - Analyze IBM WebSphere MQ Re...Peter Broadhurst
 
GDG Cloud Southlake #9 Secure Cloud Networking - Beyond Cloud Boundaries
GDG Cloud Southlake #9 Secure Cloud Networking - Beyond Cloud BoundariesGDG Cloud Southlake #9 Secure Cloud Networking - Beyond Cloud Boundaries
GDG Cloud Southlake #9 Secure Cloud Networking - Beyond Cloud BoundariesJames Anderson
 
IRJET- Security Efficiency of Transfering the Data for Wireless Sensor Ne...
IRJET-  	  Security Efficiency of Transfering the Data for Wireless Sensor Ne...IRJET-  	  Security Efficiency of Transfering the Data for Wireless Sensor Ne...
IRJET- Security Efficiency of Transfering the Data for Wireless Sensor Ne...IRJET Journal
 
Implementing Domain Events with Kafka
Implementing Domain Events with KafkaImplementing Domain Events with Kafka
Implementing Domain Events with KafkaAndrei Rugina
 
BDVe Webinar Series - Toreador Intro - Designing Big Data pipelines (Paolo Ce...
BDVe Webinar Series - Toreador Intro - Designing Big Data pipelines (Paolo Ce...BDVe Webinar Series - Toreador Intro - Designing Big Data pipelines (Paolo Ce...
BDVe Webinar Series - Toreador Intro - Designing Big Data pipelines (Paolo Ce...Big Data Value Association
 
Bee brief-intro-q42016
Bee brief-intro-q42016Bee brief-intro-q42016
Bee brief-intro-q42016wahyu prayudo
 
Microservices meetup April 2017
Microservices meetup April 2017Microservices meetup April 2017
Microservices meetup April 2017SignalFx
 
[Velocity Conf 2017 NY] How Twitter built a framework to improve infrastructu...
[Velocity Conf 2017 NY] How Twitter built a framework to improve infrastructu...[Velocity Conf 2017 NY] How Twitter built a framework to improve infrastructu...
[Velocity Conf 2017 NY] How Twitter built a framework to improve infrastructu...Vinu Charanya
 
CoolDC'16: Seeing into a Public Cloud: Monitoring the Massachusetts Open Cloud
CoolDC'16: Seeing into a Public Cloud: Monitoring the Massachusetts Open CloudCoolDC'16: Seeing into a Public Cloud: Monitoring the Massachusetts Open Cloud
CoolDC'16: Seeing into a Public Cloud: Monitoring the Massachusetts Open CloudAta Turk
 
Big data reactive streams and OSGi - M Rulli
Big data reactive streams and OSGi - M RulliBig data reactive streams and OSGi - M Rulli
Big data reactive streams and OSGi - M Rullimfrancis
 
The evolution of data center network fabrics
The evolution of data center network fabricsThe evolution of data center network fabrics
The evolution of data center network fabricsCisco Canada
 
Creating Data Fabric for #IOT with Apache Pulsar
Creating Data Fabric for #IOT with Apache PulsarCreating Data Fabric for #IOT with Apache Pulsar
Creating Data Fabric for #IOT with Apache PulsarKarthik Ramasamy
 
The Very Very Latest in Database Development - Oracle Open World 2012
The Very Very Latest in Database Development - Oracle Open World 2012The Very Very Latest in Database Development - Oracle Open World 2012
The Very Very Latest in Database Development - Oracle Open World 2012Lucas Jellema
 

Similaire à Designing Modern Streaming Data Applications (20)

Apache Pulsar Seattle - Meetup
Apache Pulsar Seattle - MeetupApache Pulsar Seattle - Meetup
Apache Pulsar Seattle - Meetup
 
Unifying Messaging, Queueing & Light Weight Compute Using Apache Pulsar
Unifying Messaging, Queueing & Light Weight Compute Using Apache PulsarUnifying Messaging, Queueing & Light Weight Compute Using Apache Pulsar
Unifying Messaging, Queueing & Light Weight Compute Using Apache Pulsar
 
Report
ReportReport
Report
 
Incrementalism: An Industrial Strategy For Adopting Modern Automation
Incrementalism: An Industrial Strategy For Adopting Modern AutomationIncrementalism: An Industrial Strategy For Adopting Modern Automation
Incrementalism: An Industrial Strategy For Adopting Modern Automation
 
IBM IMPACT 2014 - AMC-1883 - Where's My Message - Analyze IBM WebSphere MQ Re...
IBM IMPACT 2014 - AMC-1883 - Where's My Message - Analyze IBM WebSphere MQ Re...IBM IMPACT 2014 - AMC-1883 - Where's My Message - Analyze IBM WebSphere MQ Re...
IBM IMPACT 2014 - AMC-1883 - Where's My Message - Analyze IBM WebSphere MQ Re...
 
GDG Cloud Southlake #9 Secure Cloud Networking - Beyond Cloud Boundaries
GDG Cloud Southlake #9 Secure Cloud Networking - Beyond Cloud BoundariesGDG Cloud Southlake #9 Secure Cloud Networking - Beyond Cloud Boundaries
GDG Cloud Southlake #9 Secure Cloud Networking - Beyond Cloud Boundaries
 
126 Golconde
126 Golconde126 Golconde
126 Golconde
 
IRJET- Security Efficiency of Transfering the Data for Wireless Sensor Ne...
IRJET-  	  Security Efficiency of Transfering the Data for Wireless Sensor Ne...IRJET-  	  Security Efficiency of Transfering the Data for Wireless Sensor Ne...
IRJET- Security Efficiency of Transfering the Data for Wireless Sensor Ne...
 
Implementing Domain Events with Kafka
Implementing Domain Events with KafkaImplementing Domain Events with Kafka
Implementing Domain Events with Kafka
 
BDVe Webinar Series - Toreador Intro - Designing Big Data pipelines (Paolo Ce...
BDVe Webinar Series - Toreador Intro - Designing Big Data pipelines (Paolo Ce...BDVe Webinar Series - Toreador Intro - Designing Big Data pipelines (Paolo Ce...
BDVe Webinar Series - Toreador Intro - Designing Big Data pipelines (Paolo Ce...
 
Bee brief-intro-q42016
Bee brief-intro-q42016Bee brief-intro-q42016
Bee brief-intro-q42016
 
Microservices meetup April 2017
Microservices meetup April 2017Microservices meetup April 2017
Microservices meetup April 2017
 
[Velocity Conf 2017 NY] How Twitter built a framework to improve infrastructu...
[Velocity Conf 2017 NY] How Twitter built a framework to improve infrastructu...[Velocity Conf 2017 NY] How Twitter built a framework to improve infrastructu...
[Velocity Conf 2017 NY] How Twitter built a framework to improve infrastructu...
 
CoolDC'16: Seeing into a Public Cloud: Monitoring the Massachusetts Open Cloud
CoolDC'16: Seeing into a Public Cloud: Monitoring the Massachusetts Open CloudCoolDC'16: Seeing into a Public Cloud: Monitoring the Massachusetts Open Cloud
CoolDC'16: Seeing into a Public Cloud: Monitoring the Massachusetts Open Cloud
 
Big data reactive streams and OSGi - M Rulli
Big data reactive streams and OSGi - M RulliBig data reactive streams and OSGi - M Rulli
Big data reactive streams and OSGi - M Rulli
 
The evolution of data center network fabrics
The evolution of data center network fabricsThe evolution of data center network fabrics
The evolution of data center network fabrics
 
Creating Data Fabric for #IOT with Apache Pulsar
Creating Data Fabric for #IOT with Apache PulsarCreating Data Fabric for #IOT with Apache Pulsar
Creating Data Fabric for #IOT with Apache Pulsar
 
The Very Very Latest in Database Development - Oracle Open World 2012
The Very Very Latest in Database Development - Oracle Open World 2012The Very Very Latest in Database Development - Oracle Open World 2012
The Very Very Latest in Database Development - Oracle Open World 2012
 
The Very Very Latest In Database Development - Lucas Jellema - Oracle OpenWor...
The Very Very Latest In Database Development - Lucas Jellema - Oracle OpenWor...The Very Very Latest In Database Development - Lucas Jellema - Oracle OpenWor...
The Very Very Latest In Database Development - Lucas Jellema - Oracle OpenWor...
 
GemFire In-Memory Data Grid
GemFire In-Memory Data GridGemFire In-Memory Data Grid
GemFire In-Memory Data Grid
 

Plus de Arun Kejariwal

Anomaly Detection At The Edge
Anomaly Detection At The EdgeAnomaly Detection At The Edge
Anomaly Detection At The EdgeArun Kejariwal
 
Sequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time SeriesSequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time SeriesArun Kejariwal
 
Sequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time SeriesSequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time SeriesArun Kejariwal
 
Model Serving via Pulsar Functions
Model Serving via Pulsar FunctionsModel Serving via Pulsar Functions
Model Serving via Pulsar FunctionsArun Kejariwal
 
Correlation Analysis on Live Data Streams
Correlation Analysis on Live Data StreamsCorrelation Analysis on Live Data Streams
Correlation Analysis on Live Data StreamsArun Kejariwal
 
Deep Learning for Time Series Data
Deep Learning for Time Series DataDeep Learning for Time Series Data
Deep Learning for Time Series DataArun Kejariwal
 
Correlation Analysis on Live Data Streams
Correlation Analysis on Live Data StreamsCorrelation Analysis on Live Data Streams
Correlation Analysis on Live Data StreamsArun Kejariwal
 
Live Anomaly Detection
Live Anomaly DetectionLive Anomaly Detection
Live Anomaly DetectionArun Kejariwal
 
Anomaly detection in real-time data streams using Heron
Anomaly detection in real-time data streams using HeronAnomaly detection in real-time data streams using Heron
Anomaly detection in real-time data streams using HeronArun Kejariwal
 
Data Data Everywhere: Not An Insight to Take Action Upon
Data Data Everywhere: Not An Insight to Take Action UponData Data Everywhere: Not An Insight to Take Action Upon
Data Data Everywhere: Not An Insight to Take Action UponArun Kejariwal
 
Real Time Analytics: Algorithms and Systems
Real Time Analytics: Algorithms and SystemsReal Time Analytics: Algorithms and Systems
Real Time Analytics: Algorithms and SystemsArun Kejariwal
 
Finding bad apples early: Minimizing performance impact
Finding bad apples early: Minimizing performance impactFinding bad apples early: Minimizing performance impact
Finding bad apples early: Minimizing performance impactArun Kejariwal
 
Statistical Learning Based Anomaly Detection @ Twitter
Statistical Learning Based Anomaly Detection @ TwitterStatistical Learning Based Anomaly Detection @ Twitter
Statistical Learning Based Anomaly Detection @ TwitterArun Kejariwal
 
Days In Green (DIG): Forecasting the life of a healthy service
Days In Green (DIG): Forecasting the life of a healthy serviceDays In Green (DIG): Forecasting the life of a healthy service
Days In Green (DIG): Forecasting the life of a healthy serviceArun Kejariwal
 
Gimme More! Supporting User Growth in a Performant and Efficient Fashion
Gimme More! Supporting User Growth in a Performant and Efficient FashionGimme More! Supporting User Growth in a Performant and Efficient Fashion
Gimme More! Supporting User Growth in a Performant and Efficient FashionArun Kejariwal
 
A Systematic Approach to Capacity Planning in the Real World
A Systematic Approach to Capacity Planning in the Real WorldA Systematic Approach to Capacity Planning in the Real World
A Systematic Approach to Capacity Planning in the Real WorldArun Kejariwal
 
Isolating Events from the Fail Whale
Isolating Events from the Fail WhaleIsolating Events from the Fail Whale
Isolating Events from the Fail WhaleArun Kejariwal
 
Techniques for Minimizing Cloud Footprint
Techniques for Minimizing Cloud FootprintTechniques for Minimizing Cloud Footprint
Techniques for Minimizing Cloud FootprintArun Kejariwal
 
A Tool for Practical Garbage Collection Analysis In the Cloud
A Tool for Practical Garbage Collection Analysis In the CloudA Tool for Practical Garbage Collection Analysis In the Cloud
A Tool for Practical Garbage Collection Analysis In the CloudArun Kejariwal
 

Plus de Arun Kejariwal (20)

Anomaly Detection At The Edge
Anomaly Detection At The EdgeAnomaly Detection At The Edge
Anomaly Detection At The Edge
 
Sequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time SeriesSequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time Series
 
Sequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time SeriesSequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time Series
 
Model Serving via Pulsar Functions
Model Serving via Pulsar FunctionsModel Serving via Pulsar Functions
Model Serving via Pulsar Functions
 
Correlation Analysis on Live Data Streams
Correlation Analysis on Live Data StreamsCorrelation Analysis on Live Data Streams
Correlation Analysis on Live Data Streams
 
Deep Learning for Time Series Data
Deep Learning for Time Series DataDeep Learning for Time Series Data
Deep Learning for Time Series Data
 
Correlation Analysis on Live Data Streams
Correlation Analysis on Live Data StreamsCorrelation Analysis on Live Data Streams
Correlation Analysis on Live Data Streams
 
Live Anomaly Detection
Live Anomaly DetectionLive Anomaly Detection
Live Anomaly Detection
 
Anomaly detection in real-time data streams using Heron
Anomaly detection in real-time data streams using HeronAnomaly detection in real-time data streams using Heron
Anomaly detection in real-time data streams using Heron
 
Data Data Everywhere: Not An Insight to Take Action Upon
Data Data Everywhere: Not An Insight to Take Action UponData Data Everywhere: Not An Insight to Take Action Upon
Data Data Everywhere: Not An Insight to Take Action Upon
 
Real Time Analytics: Algorithms and Systems
Real Time Analytics: Algorithms and SystemsReal Time Analytics: Algorithms and Systems
Real Time Analytics: Algorithms and Systems
 
Finding bad apples early: Minimizing performance impact
Finding bad apples early: Minimizing performance impactFinding bad apples early: Minimizing performance impact
Finding bad apples early: Minimizing performance impact
 
Velocity 2015-final
Velocity 2015-finalVelocity 2015-final
Velocity 2015-final
 
Statistical Learning Based Anomaly Detection @ Twitter
Statistical Learning Based Anomaly Detection @ TwitterStatistical Learning Based Anomaly Detection @ Twitter
Statistical Learning Based Anomaly Detection @ Twitter
 
Days In Green (DIG): Forecasting the life of a healthy service
Days In Green (DIG): Forecasting the life of a healthy serviceDays In Green (DIG): Forecasting the life of a healthy service
Days In Green (DIG): Forecasting the life of a healthy service
 
Gimme More! Supporting User Growth in a Performant and Efficient Fashion
Gimme More! Supporting User Growth in a Performant and Efficient FashionGimme More! Supporting User Growth in a Performant and Efficient Fashion
Gimme More! Supporting User Growth in a Performant and Efficient Fashion
 
A Systematic Approach to Capacity Planning in the Real World
A Systematic Approach to Capacity Planning in the Real WorldA Systematic Approach to Capacity Planning in the Real World
A Systematic Approach to Capacity Planning in the Real World
 
Isolating Events from the Fail Whale
Isolating Events from the Fail WhaleIsolating Events from the Fail Whale
Isolating Events from the Fail Whale
 
Techniques for Minimizing Cloud Footprint
Techniques for Minimizing Cloud FootprintTechniques for Minimizing Cloud Footprint
Techniques for Minimizing Cloud Footprint
 
A Tool for Practical Garbage Collection Analysis In the Cloud
A Tool for Practical Garbage Collection Analysis In the CloudA Tool for Practical Garbage Collection Analysis In the Cloud
A Tool for Practical Garbage Collection Analysis In the Cloud
 

Dernier

Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfAarwolf Industries LLC
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Nikki Chapple
 
Français Patch Tuesday - Avril
Français Patch Tuesday - AvrilFrançais Patch Tuesday - Avril
Français Patch Tuesday - AvrilIvanti
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...Karmanjay Verma
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentMahmoud Rabie
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Jeffrey Haguewood
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
WomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyoneWomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyoneUiPathCommunity
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Mark Simos
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialJoão Esperancinha
 

Dernier (20)

Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
Français Patch Tuesday - Avril
Français Patch Tuesday - AvrilFrançais Patch Tuesday - Avril
Français Patch Tuesday - Avril
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career Development
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
WomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyoneWomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyone
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorial
 

Designing Modern Streaming Data Applications

  • 1. 1 DESIGNING MODERN STREAMING DATA APPLICATIONS ARUN KEJARIWAL, KARTHIK RAMASAMY @arun_kejariwal @karthikz
  • 3. 3
  • 5. 5 Ubiquity of Real-Time Data Streams
  • 7. UNLOCKING INSIGHTS 7 GUIDE DECISION MAKING DETERMINING TOP TRAFFIC INTENSIVE IP ADDRESSES IDENTIFYING ARBITRAGE OPPORTUNITIES IN FINANCIAL TRAINING DETERMINING SESSIONS WHOSE DURATION >2X THAN NORMAL FORECASTING DETERIORATION IN HEALTH METRICS # UNIQUE USERS/DAY OF A MOBILE APP TARGETED ADVERTISING
  • 9. STREAMING/FAST DATA PROCESSING 9 ✦ Events are analyzed and processed as they arrive ✦ Decisions are timely, contextual and based on fresh data ✦ Decision latency is eliminated ✦ Data in motion Ingest/ Buffer Analyze Act
  • 11. STREAM PROCESSING PATTERN 11 ComputeMessaging Storage Data Inges6on Data Processing Results StorageData Storage Data Serving
  • 13. 13 Landscape of Platforms for Streaming Data
  • 15. 15 APACHE KAFKA RABBITMQ AKKA METAQ DATABUS ACTIVEMQ APACHE FLUME KESTREL SCRIBE APACHE PULSAR Messaging
  • 16. NEXT GENERATION MESSAGING 16 DURABILITYCONSISTENCY EASE OF OPERATIONS ISOLATION RESILIENCY SCALABILITY KEY PROPERTIES b b b b b b INFINITE RETENTIONb
  • 17. APACHE PULSAR 17 Flexible Messaging + Streaming System backed by a durable log storage
  • 18. APACHE PULSAR 18 Bookie Bookie Bookie Broker Broker Broker Producer Consumer SERVING Brokers can be added independently Traffic can be shifted quickly across brokers STORAGE Bookies can be added independently New bookies will ramp up traffic quickly
  • 19. APACHE PULSAR - BROKER 19 ✦ Broker is the only point of interaction for clients (producers and consumers) ✦ Brokers acquire ownership of group of topics and “serve” them ✦ Broker has no durable state ✦ Provides service discovery mechanism for client to connect to right broker
  • 20. APACHE PULSAR - BROKER 20
  • 21. APACHE PULSAR - CONSISTENCY 21 Bookie Bookie BookieBrokerProducer
  • 22. APACHE PULSAR - DURABILITY 22 Bookie Bookie BookieBrokerProducer Journal Journal Journal fsync fsync fsync
  • 23. APACHE PULSAR - ISOLATION 23
  • 24. APACHE PULSAR - SEGMENT STORAGE 24 234…20212223…40414243…60616263… Segment 1 Segment 3 Segment 2 Segment 2 Segment 1 Segment 3 Segment 4 Segment 3 Segment 2 Segment 1 Segment 4 Segment 4
  • 25. APACHE PULSAR - RESILIENCY 25 1234…20212223…40414243…60616263… Segment 1 Segment 3 Segment 2 Segment 2 Segment 1 Segment 3 Segment 4 Segment 3 Segment 2 Segment 1 Segment 4 Segment 4
  • 26. APACHE PULSAR - SEAMLESS CLUSTER EXPANSION 26 1234…20212223…40414243…60616263… Segment 1 Segment 3 Segment 2 Segment 2 Segment 1 Segment 3 Segment 4 Segment 3 Segment 2 Segment 1 Segment 4 Segment 4 Segment Y Segment Z Segment X
  • 27. APACHE PULSAR - TIERED STORAGE 27 Low Cost Storage 1234…20212223…40414243…60616263… Segment 1 Segment 3 Segment 2 Segment 2 Segment 1 Segment 3 Segment 4 Segment 3 Segment 2 Segment 1Segment 4 Segment 4
  • 28. PARTITIONS VS SEGMENTS - WHY SHOULD YOU CARE? 28
  • 29. PARTITIONS VS. SEGMENTS - WHY SHOULD YOU CARE? 29 ✦ In Kafka, partitions are assigned to brokers “permanently” ✦ A single partition is stored entirely in a single node ✦ Retention is limited by a single node storage capacity ✦ Failure recovery and capacity expansion require expensive “rebalancing” ✦ Rebalancing has a big impact over the system, affecting regular traffic
  • 30. UNIFIED MESSAGING MODEL - STREAMING 30 Pulsar topic/ partition Producer 2 Producer 1 Consumer 1 Consumer 2 Subscription A M4 M3 M2 M1 M0 M4 M3 M2 M1 M0 X Exclusive
  • 31. UNIFIED MESSAGING MODEL - STREAMING 31 Pulsar topic/ partition Producer 2 Producer 1 Consumer 1 Consumer 2 Subscription B M4 M3 M2 M1 M0 M4 M3 M2 M1 M0 Failover In case of failure in consumer 1
  • 32. UNIFIED MESSAGING MODEL - QUEUING 32 Pulsar topic/ partition Producer 2 Producer 1 Consumer 2 Consumer 3 Subscription C M4 M3 M2 M1 M0 Shared Traffic is equally distributed across consumers Consumer 1 M4M3 M2M1M0
  • 33. DISASTER RECOVERY 33 Topic (T1) Topic (T1) Topic (T1) Subscrip6on (S1) Subscrip6on (S1) Producer (P1) Consumer (C1) Producer (P3) Producer (P2) Consumer (C2) Data Center A Data Center B Data Center C Integrated in the broker message flow Simple configuration to add/remove regions Asynchronous (default) and synchronous replication
  • 34. MULTITENANCY 34 Apache Pulsar Cluster Product Safety ETL Fraud Detection Topic-1 Account History Topic-2 User Clustering Topic-1 Risk Classification MarketingCampaigns ETL Topic-1 Budgeted Spend Topic-2 Demographic Classification Topic-1 Location Resolution Data Serving Microservice Topic-1 Customer Authentication 10 TB 7 TB 5 TB 3 TB 2 TB 8 TB 4 TB 3 TB ✦ Authentication ✦ Authorization ✦ Software isolation ๏ Storage quotas, flow control, back pressure, rate limiting ✦ Hardware isolation ๏ Constrain some tenants on a subset of brokers/bookies
  • 35. PULSAR CLIENTS 35 Apache Pulsar Cluster Java Python Go C++ C
  • 36. PULSAR PRODUCER 36 PulsarClient client = PulsarClient.create( “http://broker.usw.example.com:8080”); Producer producer = client.createProducer( “persistent://my-property/us-west/my-namespace/my-topic”); // handles retries in case of failure producer.send("my-message".getBytes()); // Async version: producer.sendAsync("my-message".getBytes()).thenRun(() -> { // Message was persisted });
  • 37. PULSAR CONSUMER 37 PulsarClient client = PulsarClient.create( "http://broker.usw.example.com:8080"); Consumer consumer = client.subscribe( "persistent://my-property/us-west/my-namespace/my-topic", "my-subscription-name"); while (true) { // Wait for a message Message msg = consumer.receive(); System.out.println("Received message: " + msg.getData()); // Acknowledge the message so that it can be deleted by broker consumer.acknowledge(msg); }
  • 38. SCHEMA REGISTRY 38 ✦ Provides type safety to applications built on top of Pulsar ✦ Two approaches ๏ Client side - type safety enforcement up to the application ๏ Server side - system enforces type safety and ensures that producers and consumers remain synced ✦ Schema registry enables clients to upload data schemas on a topic basis. ✦ Schemas dictate which data types are recognized as valid for that topic
  • 39. PULSAR SCHEMAS - HOW DO THEY WORK? 39 ✦ Enforced at the topic level ✦ Pulsar schemas consists of ๏ Name - Name refers to the topic to which the schema is applied ๏ Payload - Binary representation of the schema ๏ Schema type - JSON, Protobuf and Avro ๏ User defined properties - Map of strings to strings (application specific - e.g git hash of the schema)
  • 40. SCHEMA VERSIONING 40 PulsarClient client = PulsarClient.builder() .serviceUrl(“http://broker.usw.example.com:6650") .build() Producer<SensorReading> producer = client.newProducer(JSONSchema.of(SensorReading.class)) .topic(“sensor-data”) .sendTimeout(3, TimeUnit.SECONDS) .create() Scenario What happens No schema exists for the topic Producer is created using the given schema Schema already exists; producer connects using the same schema that’s already stored Schema is transmitted to the broker, determines that it is already stored Schema already exists; producer connects using a new schema that is compatible Schema is transmitted, compatibility determined and stored as new schema
  • 41. HOW TO PROCESS DATA MODELED AS STREAMS 41 ✦ Consume data as it is produced (pub/sub) ✦ Light weight compute - transform and react to data as it arrives ✦ Heavy weight compute - continuous data processing ✦ Interactive query of stored streams
  • 42. LIGHT WEIGHT COMPUTE 42 f(x) Incoming Messages Output Messages ABSTRACT VIEW OF COMPUTE REPRESENTATION
  • 43. TRADITIONAL COMPUTE REPRESENTATION 43 DAG % % % % % Source 1 Source 2 Action Action Action Sink 1 Sink 2
  • 44. REALIZING COMPUTATION - EXPLICIT CODE 44 public static class SplitSentence extends BaseBasicBolt { @Override public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("word")); } @Override public Map<String, Object> getComponentConfiguration() { return null; } public void execute(Tuple tuple, BasicOutputCollector basicOutputCollector) { String sentence = tuple.getStringByField("sentence"); String words[] = sentence.split(" "); for (String w : words) { basicOutputCollector.emit(new Values(w)); } } } STITCHED BY PROGRAMMERS
  • 45. REALIZING COMPUTATION - FUNCTIONAL 45 Builder.newBuilder() .newSource(() -> StreamletUtils.randomFromList(SENTENCES)) .flatMap(sentence -> Arrays.asList(sentence.toLowerCase().split("s+"))) .reduceByKeyAndWindow(word -> word, word -> 1, WindowConfig.TumblingCountWindow(50), (x, y) -> x + y);
  • 46. TRADITIONAL REAL TIME - SEPARATE SYSTEMS 46 Messaging Compute
  • 47. TRADITIONAL REAL TIME SYSTEMS 47 DEVELOPER EXPERIENCE ✦ Powerful API but complicated ๏ Does everyone really need to learn functional programming? ✦ Configurable and scalable but management overhead ✦ Edge systems have resource and management constraints
  • 48. TRADITIONAL REAL TIME SYSTEMS 48 OPERATIONAL EXPERIENCE ✦ Multiple systems to operate ๏ IoT deployments routinely have thousands of edge systems ✦ Semantic differences ๏ Mismatch and duplication between systems ๏ Creates developer and operator friction
  • 49. LESSONS LEARNED - USE CASES 49 ✦ Data transformations ✦ Data classification ✦ Data enrichment ✦ Data routing ✦ Data extraction and loading ✦ Real time aggregation ✦ Microservices Significant set of processing tasks are exceedingly simple
  • 50. EMERGENCE OF CLOUD - SERVERLESS 50 ✦ Simple function API ✦ Functions are submitted to the system ✦ Runs per events ✦ Composition APIs to do complex things ✦ Wildly popular
  • 51. SERVERLESS VS. STREAMING 51 ✦ Both are event driven architectures ✦ Both can be used for analytics and data serving ✦ Both have composition APIs ๏ Configuration based for serverless ๏ DSL based for streaming ✦ Serverless typically does not guarantee ordering ✦ Serverless is pay per action
  • 52. STREAM NATIVE COMPUTE USING FUNCTIONS 52 ✦ Simplest possible API -function or a procedure ✦ Support for multi language ✦ Use of native API for each language ✦ Scale developers ✦ Use of message bus native concepts - input and output as topics ✦ Flexible runtime - simple standalone applications vs managed system applications APPLYING INSIGHT GAINED FROM SERVERLESS
  • 53. PULSAR FUNCTIONS 53 SDK LESS API import java.util.function.Function; public class ExclamationFunction implements Function<String, String> { @Override public String apply(String input) { return input + "!"; } }
  • 54. PULSAR FUNCTIONS 54 SDK API import org.apache.pulsar.functions.api.PulsarFunction; import org.apache.pulsar.functions.api.Context; public class ExclamationFunction implements PulsarFunction<String, String> { @Override public String process(String input, Context context) { return input + "!"; } }
  • 55. PULSAR FUNCTIONS 55 ✦ Simplest possible API -function or a procedure ✦ Support for multi language ✦ Use of native API for each language ✦ Scale developers ✦ Use of message bus native concepts - input and output as topics ✦ Flexible runtime - simple standalone applications vs managed system applications
  • 56. PULSAR FUNCTIONS 56 ✦ Function executed for every message of input topic ✦ Support for multiple topics as inputs ✦ Function output goes into output topic - can be void topic as well ✦ SerDe takes care of serialization/deserialization of messages ๏ Custom SerDe can be provided by the users ๏ Integration with schema registry
  • 57. PROCESSING GUARANTEES 57 ✦ ATMOST_ONCE ๏ Message acked to Pulsar as soon as we receive it ✦ ATLEAST_ONCE ๏ Message acked to Pulsar after the function completes ๏ Default behavior - don’t want people to loose data ✦ EFFECTIVELY_ONCE ๏ Uses Pulsar’s inbuilt effectively once semantics ✦ Controlled at runtime by user
  • 58. DEPLOYING FUNCTIONS - BROKER 58 Broker 1 Worker Function wordcount-1 Function transform-2 Broker 1 Worker Function transform-1 Function dataroute-1 Broker 1 Worker Function wordcount-2 Function transform-3 Node 1 Node 2 Node 3
  • 59. DEPLOYING FUNCTIONS - WORKER NODES 59 Worker Function wordcount-1 Function transform-2 Worker Function transform-1 Function dataroute-1 Worker Function wordcount-2 Function transform-3 Node 1 Node 2 Node 3 Broker 1 Broker 2 Broker 3 Node 4 Node 5 Node 6
  • 60. DEPLOYING FUNCTIONS - KUBERNETES 60 Function wordcount-1 Function transform-1 Function transform-3 Pod 1 Pod 2 Pod 3 Broker 1 Broker 2 Broker 3 Pod 7 Pod 8 Pod 9 Function dataroute-1 Function wordcount-2 Function transform-2 Pod 4 Pod 5 Pod 6
  • 61. BUILT-IN STATE MANAGEMENT IN FUNCTIONS 61 ✦ Functions can store state in inbuilt storage ๏ Framework provides a simple library to store and retrieve state ✦ Support server side operations like counters ✦ Simplified application development ๏ No need to standup an extra system
  • 62. DISTRIBUTED STATE IN FUNCTIONS 62 import org.apache.pulsar.functions.api.Context; import org.apache.pulsar.functions.api.PulsarFunction; public class CounterFunction implements PulsarFunction<String, Void> { @Override public Void process(String input, Context context) throws Exception { for (String word : input.split(".")) { context.incrCounter(word, 1); } return null; } }
  • 63. PULSAR - DATA IN AND OUT 63 ✦ Users can write custom code using Pulsar producer and consumer API ✦ Challenges ๏ Where should the application to publish data or consume data from Pulsar? ๏ How should the application to publish data or consume data from Pulsar? ✦ Current systems have no organized and fault tolerant way to run applications that ingress and egress data from and to external systems
  • 64. PULSAR IO TO THE RESCUE 64 Apache Pulsar ClusterSource Sink
  • 65. INTERACTIVE QUERYING OF STREAMS - PULSAR SQL 65 1234…20212223…40414243…60616263… Segment 1 Segment 3 Segment 2 Segment 2 Segment 1 Segment 3 Segment 4 Segment 3 Segment 2 Segment 1 Segment 4 Segment 4 Segment Reader Segment Reader Segment Reader Segment Reader Coordinator
  • 66. PULSAR IO - EXECUTION 66 Broker 1 Worker Sink Cassandra-1 Source Kinesis-2 Broker 2 Worker Source Kinesis-1 Source Twitter-1 Broker 3 Worker Sink Cassandra-2 Source Kinesis-3 Node 1 Node 2 Node 3 Fault tolerance Parallelism Elasticity Load Balancing On-demand updates
  • 67. EASE OF OPERATIONS 67 ✦ Brokers don’t have durable state ๏ Easily replaceable ๏ Topics are immediately reassigned to healthy brokers ✦ Expanding capacity ๏ Simply add new broker node ๏ If other brokers are overloaded, traffic will be automatically assigned ✦ Load manager ๏ Monitor traffic load on all brokers (CPU, memory, network, topics) ๏ Initially place topics to least loaded brokers ๏ Reassign topics when a broker is overloaded
  • 68. CONSUMER BACKLOG 68 ✦ Metrics are available to make assessments ๏ When a problem started ๏ How big is backlog? messages? disk space? ๏ How fast is draining? ๏ What’s the ETA to catch up with publishers? ✦ Establish where is the bottleneck ๏ Application is not fast enough ๏ Disk read I/O
  • 69. ENFORCING MULTITENANCY 69 ✦ Ensure tenants don’t cause performance issues on other tenants ๏ Backlog quotas ๏ Soft isolation - flow control and throttling ✦ In cases when user behavior is triggering performance degradation ๏ Hard isolation as a last resource for quick reaction while proper fix is deployed ๏ Isolate tenant on a subset of brokers ๏ Can be also applied at the storage node level
  • 71. PULSAR PERFORMANCE - LATENCY 71
  • 72. APACHE PULSAR VS. APACHE KAFKA 72 Mul$-tenancy A single cluster can support many tenants and use cases Seamless Cluster Expansion Expand the cluster without any down $me High throughput & Low Latency Can reach 1.8 M messages/s in a single par$$on and publish latency of 5ms at 99pct Durability Data replicated and synced to disk Geo-replica$on Out of box support for geographically distributed applica$ons Unified messaging model Support both Topic & Queue seman$c in a single model Tiered Storage Hot/warm data for real $me access and cold event data in cheaper storage Pulsar Func$ons Flexible light weight compute Highly scalable Can support millions of topics, makes data modeling easier
  • 73. 73 SPARK STREAMING APACHE APEX APACHE FLINK SAMZA APACHE BEAM (MILLWHEEL, DATAFLOW) ATHENAX STYLUS APACHE STORM STREAMALERT APACHE HERON Processing
  • 74. PROCESSING 74 TASK ISOLATIONLOW LATENCY MULTIPLE API'S MULTIPLE SEMANTICS DIVERSE WORKLOADS FAULT TOLERANCE KEY PROPERTIES
  • 76. HERON TOPOLOGY 76 % % % % % Spout 1 Spout 2 Bolt 1 Bolt 2 Bolt 3 Bolt 4 Bolt 5
  • 77. HERON TOPOLOGY - PHYSICAL EXECUTION 77 % % % % % Spout 1 Spout 2 Bolt 1 Bolt 2 Bolt 3 Bolt 4 Bolt 5 %% %% %% %% %%
  • 78. HERON GROUPINGS 78 01 Shuffle Grouping Random distribution of tuples / Fields Grouping Group tuples by a field or multiple fields All Grouping Replicates tuples to all tasks 02 . 03 - 04 Global Grouping Send the entire stream to one task ,
  • 79. HERON TOPOLOGY - PHYSICAL EXECUTION 79 % % % % % Spout 1 Spout 2 Bolt 1 Bolt 2 Bolt 3 Bolt 4 Bolt 5 %% %% %% %% %% Shuffle Grouping Shuffle Grouping Fields Grouping Fields Grouping Fields Grouping Fields Grouping
  • 80. HERON TOPOLOGY - PHYSICAL EXECUTION 80 % % % % % Spout 1 Spout 2 Bolt 1 Bolt 2 Bolt 3 Bolt 4 Bolt 5 %% %% %% %% %% Shuffle Grouping Shuffle Grouping Fields Grouping Fields Grouping Fields Grouping Fields Grouping
  • 81. HERON ARCHITECTURE 81 Scheduler Topology 1 Topology 2 Topology N Topology Submission
  • 82. HERON ARCHITECTURE 82 Topology Master ZK Cluster Stream Manager I1 I2 I3 I4 Stream Manager I1 I2 I3 I4 Logical Plan, Physical Plan and Execution State Sync Physical Plan DATA CONTAINER DATA CONTAINER Metrics Manager Metrics Manager Metrics Manager Health Manager MASTER CONTAINER
  • 83. TOPOLOGY MASTER 83 Monitoring of containers Gateway for metrics Assigns role
  • 84. STREAM MANAGER 84 Routes tuples Implements backpressure Processing Semantics
  • 87. STREAM MANAGER BACK PRESSURE 87 Spout based back pressureTCP backpressure Stage by stage back pressure
  • 88. HERON INSTANCE 88 Runs only one task (spout/bolt) Exposes Heron API Collects several metrics API G
  • 90. HERON DEPLOYMENT 90 Topology 1 Topology 2 Topology N Heron Tracker Heron API Server Heron Web ZK Cluster Aurora Services Observability
  • 93. HERON TOPOLOGY SCALE 93 CONTAINERS - 1 TO 600 INSTANCES - 10 TO 6000
  • 94. HERON HAPPY FACTS :) 94 ✦ No more pages during midnight for Heron team ๏ Backlog quotas ๏ Soft isolation - flow control and throttling ✦ Very rare incidents for Heron customer teams ✦ Easy to debug during incidents for quick turn around ๏ Reduced resource utilization saving cost (~3x cost savings)
  • 96. HERON DEVELOPER ISSUES 96 01 02 Container resource alloca$on Parallelism tuning
  • 97. HERON OPERATIONAL ISSUES 97 01 02 03 Slow Hosts Network Issues Data Skew / . - 04 Load Variations , 05 SLA Violations /
  • 98. SLOW HOSTS 98 Memory Parity Errors Impending Disk Failures Lower GHZ
  • 100. NETWORK SLOWNESS 100 01 02 03 Delays processing Data is accumula$ng Timelines of results Is affected
  • 101. DATA SKEW 101 Multiple Keys Several keys map into single instance and their count is high Single Key Single key maps into a instance and its count is high H C
  • 103. SELF REGULATING STREAMING SYSTEMS 103 Automate Tuning SLO Maintenance Self Regula$ng Streaming Systems Tuning Manual, 6me-consuming and error-prone task of tuning various systems knobs to achieve SLOs SLO Maintenance of SLOs in the face of unpredictable load varia6ons and hardware or soWware performance degrada6on Self Regula$ng Streaming Systems System that adjusts itself to the environmental changes and con6nue to produce results
  • 104. SELF REGULATING STREAMING SYSTEMS 104 Self tuning Self stabilizing Self healing G !g Several tuning knobs Time consuming tuning phase The system should take as input an SLO and automatically configure the knobs. The system should react to external shocks and a u t o m a t i c a l l y reconfigure itself Stream jobs are long running Load variations are common The system should identify internal faults and attempt to recover from them System performance affected by hardware or software delivering degraded quality of service
  • 105. ENTER DHALION 105 Dhalion periodically executes well- specified policies that optimize execution based on some objective. We created policies that dynamically provision resources in the presence of load variations and auto-tune streaming applications so that a throughput SLO is met. Dhalion is a policy based framework integrated into Heron
  • 106. DHALION POLICY FRAMEWORK 106 Symptom Detector 1 Symptom Detector 2 Symptom Detector 3 Symptom Detector N .... Diagnoser 1 Diagnoser 2 Diagnoser M .... Resolver Invocation D iagnosis 1 Diagnosis 2 D iagnosis M Symptom 1 Symptom 2 Symptom 3 Symptom N Symptom Detection Diagnosis Generation Resolution Resolver 1 Resolver 2 Resolver M .... Resolver Selection Metrics
  • 108. DYNAMIC RESOURCE PROVISIONING 108 Pending Tuples Detector Backpressure Detector Processing Rate Skew Detector Resource Over provisioning Diagnoser Resource Under Provisioning Diagnoser Data Skew Diagnoser Resolver Invocation Diagnosis Symptoms Symptom Detection Diagnosis Generation Resolution Metrics Slow Instances Diagnoser Bolt Scale Down Resolver Bolt Scale Up Resolver Data Skew Resolver Restart Instances Resolver
  • 110. STREAMLIO 110 UNIFIED ARCHITECTURE Interactive Querying Storm API Streamlet SQL Application Builder Pulsar API BK/ HDFS API Kubernetes Metadata Management Operational Monitoring Chargeback Security Authentication Quota Management Kafka API
  • 111. STREAMLIO ARCHITECTURE 111 KEY HIGHLIGHTS DAG PROCESSING GEO-REPLICATION DURABILITY CONSISTENCY FUNCTION PROCESSING INFINITE RETENTION SQL
  • 113. DATA PROCESSING 113 CLEANSING AND TRANSFORMATION Data format consistency M:0:Male, F:1:Female Missing Data Imputation Pre-computation Sum, Max, Min, Distance Timestamp Hour of the day, Date
  • 114. DATA ENRICHMENT 114 MULTIPLE DATA SOURCES Operators Associate Aggregate Filter Properties Scalable Fault-tolerant Secure Challenges Encryption Merge Union Sort
  • 115. ETL DATA PIPELINES - RETAIL 115 Scenario Consumer retail company migrating multiple data and analytics applications to public cloud Challenges Needed a solution to connect data to dashboards, data warehouse, and applications Solution Cloud data pipeline built on Apache Pulsar to support data movement, transformation, and access
  • 116. ETL DATA PIPELINES - RETAIL 116 Cloud data warehouse Batch reporting and analytics Real-time dashboards and alerts Data sources and applications Messaging, queuing, data transformations Cloud data lake
  • 118. DATA SILOS 118 INTEGRATION Disparate Data Sources Multiple Data Types Different Frequency Data Fidelity
  • 119. DATA FUSION 119 EXTRACTING INSIGHTS Why? Complementarity → Completeness Redundancy → Accuracy Cooperative → Dependability Planning Decision-Making Applications Smart Buildings Remote sensing Transportation Multi-Target Tracking (MTT)
  • 120. 120 STREAMING JOINS CHALLENGES Bounded sliding window of tuples Traditional join semantics Hard to exploit many-core parallelism Sequential operation model (store and process) Linear data flow model (left-to-right / right-to-left) Performance ๏ Deployment parameters ❖ Processing capacity, level of parallelism ๏ Time-varying input parameters ❖ Varying sources, Rate of tuples SplitJoin * Guilsano et al., “Performance Modeling of Stream Joins”, DEBS 2017.
  • 121. ENTERPRISE EVENT BUS 121 Apache Pulsar Cluster Application 2 Application 5 Application 1 Application 3 Application 4 Application 6
  • 122. ENTERPRISE EVENT BUS - YAHOO! 122 Scenario Need to collect and distribute user and data events to distributed global applications at Internet scale Challenges ✦ Multiple technologies to handle messaging needs ✦ Multiple, siloed messaging clusters ✦ Hard to meet scale and performance ✦ Complex, fragile environment Solution ✦ Central event data bus using Apache Pulsar ✦ Consolidated multiple technologies and clusters into a single solution ✦ Fully-replicated across 8 global datacenter ✦ Processing >100B messages / day, 2.3M topics
  • 123. 123 UNBOUNDED UNORDERED EVENT TIME PROCESSING TIME LOW LATENCY LIMITED MEMORY SINGLE PASS , INCREMENTAL APPROXIMATE
  • 124. APPROXIMATION 124 Guaranteed bounds on approximation factors Error greater than 𝜖 with probability 𝛿 BOUNDS PROBABILISTICDETERMINISTIC
  • 126. 126 BIAS VS. VARIANCE PROPERTIES BIAS VARIANCE How much the average of the estimate differs from the true mean Variance of the estimate around its mean
  • 130. SAMPLING 130 Volume ๏ Unbounded Velocity Types ๏ Uniform ๏ Non-uniform ❖ Class imbalance Latency
  • 131. SAMPLING 131 ✦ Maintain dynamic sample ๏ A data stream is a continuous process ๏ Not known in advance how many points may elapse before an analyst may need to use a representative sample ✦ Reservoir sampling[1] ๏ Probabilistic insertions and deletions on arrival of new stream points ๏ Probability of successive insertion of new points reduces with progression of the stream ❖ An unbiased sample contains a larger and larger fraction of points from the distant history of the stream ✦ Practical perspective ๏ Data stream may evolve and hence, the majority of the points in the sample may represent the stale history [1] J. S. Vi`er. Random Sampling with a Reservoir. ACM Transac6ons on Mathema6cal SoWware, Vol. 11(1):37–57, March 1985. OBTAINING A REPRESENTATIVE SAMPLE
  • 132. SAMPLING 132 ✦ Sliding window approach* (sample size k, window width n) ๏ Sequence-based ❖ Replace expired element with newly arrived element ‣ Disadvantage: highly periodic ❖ Chain-sample approach ‣ Select element ith with probability Min(i,n)/n ‣ Select uniformly at random an index from [i+1, i+n] of the element which will replace the ith item ‣ Maintain k independent chain samples ๏ Timestamp-based ❖ # elements in a moving window may vary over time ❖ Priority-sample approach * B. Babcock. Sampling From a Moving Window Over Streaming Data. In Proceedings of SODA, 2002. 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3
  • 133. SAMPLING 133 COMPRESSED SENSING Distributed ๏ Communication cost ๏ Non-uniform sampling rates High dimensionality ๏ Sparsity Applications Sensor networks Network traffic monitoring Facial recognition Mitigate latency impact
  • 134. 134 COMPRESSED SENSING EARLY WORK [Candès et al. 2006] [Donoho 2006] [Cormode and Muthukrishnan 2006] [Freris et al. 2013] Recursive approach COMPRESSED SENSING FOR STREAMING DATA [Yan et al. 2015] DISTRIBUTED OUTLIER DETECTION COMPRESSIVE SAMPLING [Candès and Wakin 2008] RELATED WORKS
  • 135. 135
  • 138. PERSONALIZATION 138 THE HOLY GRAIL Continuously evolving data streams Time decay Multi-armed bandit Contextual Deep learning Privacy
  • 139. INVENTORY MANAGEMENT 139 FORECASTING Mining retail transactions Seasonality Virality New product launches
  • 141. 141 HERBERT SIMON “A wealth of information creates a poverty of attention.”
  • 142. 142 CONTINUOUS (PREFERENCE) TOP-K ELEMENTS * Moura6dis et al., “Con8nuous monitoring of top-k queries over sliding windows”, SIGMOD 2006. # Yu et al., “Processing a large number of con8nuous preference top-k queries”, SIGMOD 2012. ^ Zhang et al., “Reverse k-ranks query”, VLDB 2014.
  • 146. Facebook Live, Periscope, Twitch 146 TRENDING LIVE VIDEO
  • 148. 148 BIAS VARIOUS SOURCES Data bias ๏ Public image datasets ❖ Geographic representation ๏ Induces algorithmic bias[1] ❖ Online advertising ❖ Jobs recommendation ❖ Customer profiling Anomalies Data fidelity ๏ Bad actors ❖ Bots ❖ Data farms * Figure borrowed from Baeza-Yates, “Bias on the web”, 2018. [1] Hajian et al., “Algorithmic bias: From Discrimina8on Discovery to Fairness-aware Data mining”, KDD 2016. *
  • 150. REAL TIME TRENDING IN PULSAR & HERON 150 Streamlio (Apache Pulsar and Apache Heron) Data Source 2 clean-fn 2 Data Source 1 Data Source 3 clean-fn 1 trend- topology 3 Trending Application T1 T2 T3
  • 152. THE TAIL[1] 152 CLASSIFICATION[2] ✦ Whether elements can only be added or can be both inserted and deleted? ๏ Cash register model ๏ Turnstile model ✦ What operations are allowed on the elements? ๏ Comparison model ๏ Fixed-universe model ✦ Whether the algorithm is deterministic/randomized? ๏ Monte Carlo randomization QUANTILE ESTIMATION [1] J. Deam and L. Barroso, “The tail at scale”, 2013. [2] G. Luo et al., “Quan8les over data streams: experimental comparisons, new analyses, and further improvements”, 2016.
  • 153. QUANTILE ESTIMATION 153 [Arasu and Manku 2005] SLIDING WINDOWS [Cormode et al. 2005] [Yi et al. 2009] CONTINUOUS MONITORING [Shrivastava et al. 2004] [Agarwal et al. 2012] DISTRIBUTED DATA [Cormode et al. 2006] BIASED FLAVORS * Wang et al., “Quan8les over Data Streams: An Experimental Study”, SIGMOD 2013.
  • 154. QUANTILE ESTIMATION 154 PICK [Blum et al. 1973] EXACT MEDIAN [Munro and Patterson 1980] Q-DIGEST [Shrivastava et al. 2002] Interest in this problem may be traced to the realm of sports and the design of (traditionally, tennis) tournaments to select the first- and second-best players. In 1883, Lewis Carroll published an article denouncing the unfair method by which the second- best player is usually determined in a "knockout tournament" -- the loser of the final match is often not the second-best! (Any of the players who lost only to the best player may be second-best.) Around 1930, Hugo Steinhaus brought the problem into the realm of algorithmic complexity by asking for the minimum number of matches required to (correctly) select both the first- and second-best players from a field of n contestants. I T-DIGEST [Dunning and Ertl 2017] p passes space
  • 155. Max signal value # Elements Compression Factor Complete binary tree Q-DIGEST[1] 155 ✦ Groups values in variable size buckets of almost equal weights ๏ Unlike a traditional histogram, buckets can overlap ✦ Key features ๏ Detailed information about frequent values preserved ๏ Less frequent values lumped into larger buckets ✦ Using message of size m, answer within an error of ✦ Except root and leaf nodes, a node v ∈ q-digest iff [1] Shrivastava et al., “Medians and Beyond: New Aggrega8on Techniques for Sensor Networks”. SenSys, 2004.
  • 156. Q-DIGEST (CONTD.) 156 ✦ Building a q-digest ✦ q-digests can be constructed in a distributed fashion ๏ Merge q-digests
  • 157. T-DIGEST[1] 157 [1] T. Dunning and O. Ertl, ”Compu8ng Extremely Accurate Quan8les using t-digests”, 2017. h`ps://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf ✦ Approximation of rank-based statistics ๏ Compute quantile q with an accuracy relative to max(q, 1-q) ๏ Compute hybrid statistics such as trimmed statistics ✦ Key features ๏ Robust with respect to highly skewed distributions ๏ Independent of the range of input values (unlike q-digest) ๏ Relative error is bounded ❖ Non-equal bin sizes ❖ Few samples contribute to the bins corresponding to the extreme quantiles ๏ Merging independent t-digests ❖ Reasonable accuracy
  • 158. T-DIGEST (CONTD.) 158 ✦ Group samples into sub-sequences ๏ Smaller sub-sequences near the ends ๏ Larger sub-sequences in the middle ✦ Scaling function ๏ Mapping k is monotonic ❖ k(0) = 1 and k(1) = δ ❖ k-size of each subsequence < 1 s Notional Index Compression parameter Quantile
  • 159. T-DIGEST (CONTD.) 159 ✦ Estimating quantile via interpolation ๏ Sub-sequences contain centroid of the samples ๏ Estimate the boundaries of the sub-sequences ✦ Error ๏ Scales quadratically in # samples ๏ Small # samples in the sub-sequences near q=0 and q=1 improves accuracy ๏ Lower accuracy in the middle of the distribution ❖ Larger sub-sequences in the middle ✦ Two flavors ๏ Progressive merging (buffering based) and clustering variant s
  • 160. 160 RELATED WORKS SUMMARY DATA STRUCTURE [Greenwald and Khanna 2001] [Luo et al. 2016] RANDOM BQ-SUMMARY [Zhang et al. 2006] One-scan randomized MR [Cormode et al. 2006] A SMALL SNAPSHOT
  • 161. 161
  • 162. PERFORMANCE 162 MONITORING CPM (Cost per 1000 Impressions), vCPM, CPCV CPC (Cost per Click) CPI (Cost per Install) CPE (Cost per Engagement) CPA (Cost per Action), CPO
  • 165. FRAUD DETECTION USING APACHE PULSAR 165 Apache Pulsar Data Source 2 clean-fn 2 Data Source 1 Data Source 3 clean-fn 1 fraud-fn 1 Fraud Detection T1 T2 T3 fraud-fn 2
  • 166. IP/DEVICE ID BLACKLISTING 166 CUCKOO FILTER [Fan et al. CoNext 2014] QUOTIENT FILTER [Bender et al. Hot Storage 2011] MORTON FILTER [Breslow and Jayasena, VLDB 2018] BLOOM FILTER [Bloom, CACM 1970] FILTERING
  • 168. BLOOM FILTER 168 ✦ Natural generalization of hashing ✦ False positives are possible ✦ No false negatives No deletions allowed ✦ For false positive rate ε, # hash functions = log2(1/ε) where, n = # elements, k = # hash functions m = # bits in the array
  • 169. VARIATIONS 169 BLOOM FILTER [Mitzenmacher 2002] COMPRESSED [Fan et al. 2000] COUNTING [Cohen and Matias 2003] SPECTRAL [Chazelle et al. 2004] BLOOMIER FILTER [Guo et al. 2010] BUFFERED [Lall and Ogihara, 2007] BITWISE [Yoon 2010] A2 [Canim et al. 2010] DYNAMIC [Einziger and Friedman, 2014] BLOOM-g [Debnath et al. 2011] Bloomflash [Naor and Yogev, 2013] SLIDING [Qiao et al. 2014] Tinyset
  • 170. QUOTIENT FILTER 170 ✦ Supports insertion, deletion ✦ Merging/resizing feasible ✦ Lookups incur a single cache miss ✦ 20% bigger than Bloom Filter ✦ Stores p-bit fingerprints of elements ✦ Compact hash table ✦ P(hard collision) = Employs quotienting: remainder (r least significant bits) is stored in the bucket indexed by the quotient (q most significant bits) where, ⍺ = n/m, m=2q , n: # elements, m: # slots
  • 171. VARIATIONS 171 QUOTIENT FILTER [Bender et al. 2011] CASCADE [Pandey et al. 2017] COUNTING RANK-AND-SELECT ✦ Multiset of integer, each of width p-bits ✦ l =log2( n/M) + O(1) in-flash QFs of exponentially increasing size, QF1, QF2, …, QFl stored contiguously, M: RAM size ✦ Search: O(log(n/M)) block reads ✦ Insert: O((log( n/M))/B) amortized block writes/erases, B: natural block size of the flash [Dutta et al. 2014] STREAMING [Bungeroth, 2018]
  • 172. CUCKOO FILTER 172 ✦ Key Highlights ๏ Add and remove items dynamically ๏ For false positive rate ε < 3%, more space efficient than Bloom filter ๏ Higher performance than Bloom filter for many real workloads ๏ Asymptotically worse performance than Bloom filter ❖ Min fingerprint size α log (# entries in table) ✦ Overview ๏ Stores only a fingerprint of an item inserted ❖ Original key and value bits of each item not retrievable ๏ Set membership query for item x: search hash table for fingerprint of x
  • 173. Cuckoo Hashing [1] CUCKOO FILTER 173 [1] R. Pagh and F. Rodler. Cuckoo hashing. Journal of Algorithms, 51(2):122-144, 2004. [2] Illustra6on borrowed from “Fan et al. Cuckoo Filter: Prac6cally Be`er Than Bloom. In Proceedings of the 10th ACM Interna6onal on Conference on Emerging Networking Experiments and Technologies, 2014.” [2] Illustra6on of Cuckoo hashing [2] ✦ High space occupancy ✦ Practical implementations: multiple items/bucket ✦ Example uses: Software-based Ethernet switches Cuckoo Filter [2] ✦ Uses a multi-way associative Cuckoo hash table ✦ Employs partial-key cuckoo hashing ๏ Store fingerprint of an item ๏ Relocate existing fingerprints to their alternative locations [2]
  • 174. CUCKOO FILTER 174 ✦ Deletion ๏ Item must have been previously inserted ✦ Partial-key cuckoo hashing ✦ Fingerprint hashing ensures uniform distribution of items in the table ✦ Length of fingerprint << Size of h1 or h2 ✦ Possible to have multiple entries of a fingerprint in a bucket Alternate bucket Significantly shorter than h1 and h2
  • 175. CONCURRENT CUCKOO HASHING VARIATIONS 175 CUCKOO HASHING AND CUCKOO FILTER [Li et al. 2014] [Eppstein et al. 2017] 2-3 CUCKOO FILTER HORTON TABLES [Sun et al. 2017] SMART CUCKOO [Breslow et al. 2016]
  • 176. MORTON FILTER 176 ✦ Key Highlights ๏ Insertions heavily biased towards the hash function H1 ❖ Fingerprint retrieval require fewer hardware cache accesses ๏ Overflow Tracking Array (OTA): simple bit vector ❖ Tracks when fingerprints cannot be placed using H1 ❖ Most negative lookups only require accessing a single bucket ๏ Fullness counters ❖ Replace the empty slots (compression) and track the load of each logical bucket ❖ Reads and updates happen in-situ without explicit need for materialization ๏ Most insertions require accessing a single cache line ๏ Fewer fingerprint comparisons than a Cuckoo filter
  • 177. MORTON FILTER 177 ✦ Hashing ๏ Reduces TLB misses, DRAM row buffer misses and page faults ๏ Does not require total # buckets to be a power of 2 K’s fingerprint Buckets per block Total buckets (multiple of 2) Bucket index where K’s fingerprint is stored Hash function ๏ Compute H’(K) and remap K’s fingerprint without knowing K itself ๏ For any key, candidate buckets mostly fall within the same page of memory Power of 2
  • 178. SPEND 178 DYNAMIC (RE)-ALLOCATION SANs Facebook, Google, Twitter, Yahoo, Snapchat Ad Networks Affiliates Offer wall TV
  • 179. 179
  • 181. USER PROFILE 181 MULTI-DIMENSIONAL TARGETING Platform Device Type OS Version Apps Installed Age, Gender
  • 182. STATISTICAL ARBITRAGE 182 PREDICTION Click Through Rate (CTR) Click to Install (CTI) Life Time Value (LTV) Churn probability REAL-TIME ANOMALY DETECTION
  • 183. ANOMALY DETECTION 183 Temporal Distribution based A N O M A L Y DIFFERENT FLAVORS safaribooksonline.com/library/view/understanding-anomaly-detection/9781491983676/ “On the Runtime-Efficacy Trade-off of Anomaly Detection Techniques for Real-Time Streaming Data”, by Choudhary et al. 2017. (https://arxiv.org/abs/1710.04735)
  • 184. ANOMALY DETECTION 184 ROOTED IN STUDIES IN PROBABILITY 1777 1713 Reconciling discrepant observations Euler 1749 Irregularities in the motion of Saturn and Jupiter Mayer 1750 Study of lunar libration Boscovich 1755 Measurements of the mean ellipticity of the Earth
  • 185. ANOMALY DETECTION 185 LONG HISTORY First formalization of an exact rule for rejection of anomalies
  • 187. ANOMALY DETECTION 187 WHY IT’S NON-TRIVIAL SEASONALITY TREND BREAKOUTSTATIONARITYNOISE
  • 188. ANOMALY DETECTION 188 ASSUMPTIONS Normality, Stationarity MOVING AVERAGES Params: Width, Decay RULE BASED µ ± σ COMMON APPROACHES
  • 189. ANOMALY DETECTION 189 MAD[1] Median Absolute Deviation MEDIAN MCD[2] Minimum Covariance Determinant MVEE[3,4] Minimum Volume Enclosing Ellipsoid ROBUST MEASURES [1]P. J. Rousseeuw and C. Croux, “Alterna6ves to the Median Absolute Devia6on”, 1993. [2] h`p://onlinelibrary.wiley.com/wol1/doi/10.1002/wics.61/abstract [3] P. J. Rousseeuw and A. M. Leroy.,“Robust Regression and Outlier Detec6on”, 1987. [4] M. J.Todda and E. A. Yıldırım , “On Khachiyan's algorithm for the computa6on of minimum-volume enclosing ellipsoids”, 2007.
  • 190. ANOMALY DETECTION 190 CONTEXT ✦ Data types ๏ Text, Audio, Video ✦ Data veracity ๏ Wearables ๏ Smart cities, Connected Home ๏ Internet of Things LIVE DATA ✦ Multi-dimensional ✦ Mow memory footprint ✦ Accuracy-speed tradeoff CHALLENGES
  • 191. 191 ANOMALY DETECTION DIFFERENT APPLICATION DOMAINS RADIO ANOMALY DETECTION MARKER DISCOVERY CROWDED SCENE ANALYSIS ANOMALOUS HUMAN BEHAVIOR HUMAN TRAJECTORY PREDICTION [Popoola and Wang 2012] [Alahi et al. 2016] [O’Shea et al. 2016] [Schegl et al. 2017] [Sabokrou et al. 2017]
  • 193. THREAT DETECTION 193 POTENTIAL HIGH BUSINESS DOWNSIDE DDos Attack Network Intrusion Harassement Account hacking Identity theft Purchases
  • 194. MONITORING 194 AUDIT LOGS Answering the “What, When, Who, Why” Downloading of unusual amount of data Troubleshoot systems, networks Real-time packet analysis Challenge: Encryption Suspicious access pattern
  • 195. REAL TIME SECURITY USING APACHE PULSAR 195 Apache Pulsar Data Source 2 threat-fn 2 Data Source 1 Data Source 3 threat-fn 1 Query Audit Logs (SQL) Streaming threats & anomalies T1 T2 T3 Audit Log Investigation
  • 196. DEVICE IDS/IP ADDRESSES 196 FREQUENT/TOP-K ELEMENTS, HEAVY HITTERS ✦ Require large amount of memory ✦ Long tail - low business value EXACT ✦ Types ๏ Counter-based ❖ Example: Frequent Algorithm [Demaine et al. 2002] ๏ Sketch-based ❖ Example: GroupTest [Cormode and Muthukrishnan, 2002] ✦ Parameters ๏ # counters, # update operations, error APPROXIMATE
  • 197. FLAVORS 197 FREQUENT/TOP-K ELEMENTS, HEAVY HITTERS l 1 heavy hitter ๏ m can be known/unknown ๏ Deterministic ❖ [Misra and Gries 1982, Karp et al. 2003] ๏ Randomized ❖ [Estan and Varghese 2003, Kumar and Xu 2006] [1] Woodruff, 2016. “New Algorithms for Heavy Hi`ers in Data Streams”.ICDT. Frequency of item i # elements in stream Output
  • 198. FLAVORS 198 FREQUENT/TOP-K ELEMENTS, HEAVY HITTERS l 2 heavy hitter ๏ Example ❖ [Charikar et al. 2004] ๏ Applications ❖ Compressed sensing ❖ Numerical linear algebra ❖ Cascaded aggregates Frequency of item i # dis6nct items Output
  • 199. COUNT-MIN SKETCH[1] 199 ✦ A two-dimensional array counts with w columns and d rows ✦ Each entry of the array is initially zero ✦ d hash functions are chosen uniformly at random from a pairwise independent family ✦ Update ๏ For a new element i, for each row j and k = hj(i), increment the kth column by one ✦ Point query where, sketch is the table ✦ Parameters [1] Cormode, Graham; S. Muthukrishnan (2005). "An Improved Data Stream Summary: The Count-Min Sketch and its Applica6ons". J. Algorithms 55: 29–38. ),( δε ! ! " # # $ = ε e w ! ! " # # $ = δ 1 lnd }1{}1{:,,1 wnhh d ……… →
  • 200. COUNT-MIN SKETCH 200 ✦ Count-Min sketch with conservative update (CU sketch) ๏ Update an item with frequency c ๏ Avoid unnecessary updating of counter values => Reduce over-estimation error ๏ Prone to over-estimation error on low-frequency items ✦ Lossy Conservative Update (LCU) - SWS ๏ Divide stream into windows ๏ At window boundaries, ∀ 1 ≤ i ≤ w, 1 ≤ j ≤ d, decrement sketch[i,j] if 0 < sketch[i,j] ≤ [1] Cormode, G. 2009. Encyclopedia entry on ’Count-MinSketch’. In Encyclopedia of Database Systems. Springer., 511–516. VARIANTS [1]
  • 201. COUNTER TREE[1] 201 ✦ Motivation ๏ Equi-bitwidth counters → inefficient space usage owing variance in counts ✦ Two-dimensional counter sharing ✦ m virtual counters (V[i], 0 ≤ i < m) ๏ Path from a leaf node to the root ๏ Counter at layer j in V[i] given by: [1] Chen et al., “Counter Tree: A Scalable Counter Architecture for Per-FLow Traffic Measurement”, TON 2017. # leaf nodes Total space (# bits) Counter bit width Tree height Degree of a non-leaf node
  • 202. COUNTER TREE (CONTD.) 202 ✦ Counting range ✦ Virtual counter array Vf ๏ Vf[i] = V[hi(f)], 0 ≤ i < r, ✦ Recording ๏ Increment a counter in Vf, chosen uniformly at random, by one ๏ May result in update of multiple counters # counters in the bo`om layer
  • 203. COUNTER TREE (CONTD.) 203 ✦ Estimation ๏ CT based Estimation (CTE) ❖ k = dh-1 ❖ Xi is the value of the subtree rooted at C[v], where C[v] is the component counter of V[u] at the highest level and ๏ CT based Maximum Likelihood Estimation (CTM) Reduce positive bias (overestimation)
  • 204. VARIATIONS 204 FREQUENT/TOP-K ELEMENTS, HEAVY HITTERS LOSSY COUNTING [Manku and Motwani, 2002] [Homem and Carvalho, 2010] FILTERED SPACE SAVING COUNT-SKETCH [Dimitropoulos et al. 2008] PROBABILISTIC LOSSY COUNTING [Charikar et al, 2002] [Pitel and Fouquier, 2015] COUNT-MIN-LOG SPACE SAVING ALGORITHM [Metwally et al. 2005] FDPM [Yu etal., 2006] [Sivaraman et al. 2017] HASHPIPE
  • 205. 205
  • 206. HEALTH CARE 206 WEARABLES Smart watches, smart glasses, smart clothing, implantables Data veracity ๏ Hear rate (ECG and HRV) ๏ Brainwave (EEG) ๏ Muscle bio-signals (EMG) ๏ Blood pressure ๏ Calories burned ๏ Body temperature ๏ Steps walked Other applications ๏ Blood alcohol content ๏ Athletic performance
  • 207. MONITORING 207 OIL DRILLING Early detection of leaks and/or spills ๏ Oil ๏ Flammable gases ๏ Hazardous chemicals Equipment health ๏ Corrosion Operational metrics ๏ Pressure ๏ Temperature ๏ Flow ๏ Air quality
  • 208. QUALITY OF SERVICE 208 CUSTOMER FOCUS WiFi, Video streaming, E-commerce Metrics ๏ Availability ๏ Throughput ๏ Response time ๏ Document Complete ๏ Jitter ๏ Packet loss ratio ๏ Delivery success rate (messaging) ๏ Location accuracy Common issues ๏ Anomalies, Breakouts
  • 209. INDUSTRIAL IOT USING APACHE PULSAR 209 Apache Pulsar Cloud Data Source 2 clean-fn 2 Data Source 1 Data Source 3clean-fn 1 logic-fn 1 T1 T2 T3 logic-fn 2 Apache Pulsar Edge filter-fn 1 filter-fn 1
  • 210. NUMBER OF DISTINCT DATA SOURCES 210 CARDINALITY ESTIMATION ✦ Hash values as strings ✦ Occurrence of particular patterns in the binary representation ✦ Example: Hyperloglog [Flajolet et al. 2008] BIT-PATTERN OBSERVABLES ✦ Hash values as real numbers ✦ k-th smallest value ๏ Insensitive to distribution of repeated values ✦ Examples: MinCount [Giroire, 2000] ORDER STATISTIC OBSERVABLES
  • 211. CARDINALITY ESTIMATION 211 ANOTHER CLASSIFICATION SKETCH-BASED Scan the entire data set once Hash the items Create sketch SAMPLING Potentially large estimation error UNIFORM HASHING Employs a uniform hash function LOGARITHMIC HASHING Keeps track of the most uncommon element observed so far Example: Hyperloglog INTERVAL BASED How packed an interval is? BUCKET BASED P(bucket is non-empty)
  • 212. HYPERLOGLOG 212 where ✦ Apply hash function h to every element in a multiset ✦ Cardinality of multiset is 2max(ϱ) where 0ϱ-11 is the bit pattern observed at the beginning of a hash value ✦ Above suffers with high variance ๏ Employ stochastic averaging ๏ Partition input stream into m sub-streams Si using first p bits of hash values (m = 2p)
  • 213. HYPERLOGLOG 213 OPTIMIZATIONS ✦ Use of 64-bit hash function ๏ Total memory requirement 5 * 2p -> 6 * 2p, where p is the precision ✦ Empirical bias correction ๏ Uses empirically determined data for cardinalities smaller than 5m and uses the unmodified raw estimate otherwise ✦ Sparse representation ๏ For n≪m, store an integer obtained by concatenating the bit patterns for idx and ϱ(w) ๏ Use variable length encoding for integers that uses variable number of bytes to represent integers ๏ Use difference encoding - store the difference between successive elements ✦ Other optimizations [1, 2] [1] h`p://druid.io/blog/2014/02/18/hyperloglog-op6miza6ons-for-real-world-systems.html [2] h`p://an6rez.com/news/75
  • 214. MIN-COUNT VARIATIONS 214 CARDINALITY ESTIMATION [Giroire, 2009] Optimal statistical efficiency [Ting, 2014] DISCRETE MAX-COUNT F0 AND L0 ESTIMATION [Chen et al. 2011] S-BITMAP [Kane et al. 2010] Optimal space complexity
  • 215. WRAPPING UP 215 PLATFORM UNIFICATION APPLICATIONS ANALYTICS
  • 219. READINGS 219 Data Streams: Algorithms and Applications”, by M. Muthukrishnan “Sketching as a Tool for Numerical Linear Algebra”, by D. Woodruff “Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches”, by G. Cormode, M. Garofalakis and P. J. Haas “Data Streams: Models and Algorithms”, by C. Aggarwal. “Graph Streaming Algorithms”, by A. McGregor “Data Sketching”, by G. Cormode.
  • 220. READINGS 220 “The algebra of stream processing functions”, TCS 2001. STREAM CALCULUS “A conductive calculus of streams”, MSCS 2005. “Temporal stream algebra”, 2012. “Stream processing col-algebraically”, SCP 2013. “An algebra for pattern matching, time aware aggregates and partitions on relational data streams”, DEBS 2015.
  • 221. READINGS 221 “Streaming Algorithms for Robust Distinct Elements”, SIGMOD 2016. “Pyramid Sketch”, VLDB 2017. “Bias-Aware Sketches”, VLDB 2017. “Efficient Adaptive Detection of Complex Event Patterns”, VLDB 2018. “Cardinality Estimation: An Experimental Survey”, VLDB 2018. “Augmented Sketch: Faster and More Accurate Stream Processing”, SIGMOD 2016. “BPTree: an L2 heavy hitters algorithm using constant memory”, PODS 2017. “A comparative analysis of state-of-the-art SQL- on-Hadoop systems for interactive analytics”, BigData 2018. “Unsupervised real-time anomaly detection for streaming data”, Neurocomputing2017. “Time Adaptive Sketches (Ada-Sketches) for Summarizing Data Streams”, SIGMOD 2016.
  • 222. 222 READINGS “Streaming Anomaly Detection Using Randomized Matrix Sketching”, VLDB 2015. “Graph Sketches: Sparsification, Spanners, and Subgraphs”, PODS 2012. “Stream Order and Order Statistics: Quantile Estimation in Random-Order Streams”, SIAM JoC 2009. “Querying and mining data streams: You only get one look”, SIGMOD 2002. “Models and Issues in Data Stream Systems”, PODS 2002. “Clustering Data Streams”, FOCS 2000. “Persistent Data Sketching”, VLDB 2015. “SCREEN: Stream Data Cleaning under Speed Constraints”, SIGMOD 2015. “An optimal algorithm for the distinct elements problem”, PODS 2010. “Statistical Analysis of Sketch Estimators”, SIGMOD 2007.
  • 223. OPEN SOURCE 223 Data Sketches (https://datasketches.github.io/) Algebird (https://github.com/twitter/algebird) StreamDM (http://huawei-noah.github.io/streamDM/) stream (https://cran.r-project.org/web/packages/stream/) FluRS (https://github.com/takuti/flurs)
  • 224. RESOURCES 224 https://www.slideshare.net/arunkejariwal/correlation-analysis-on-live-data-streams (O’Reilly Strata) https://www.slideshare.net/arunkejariwal/modern-realtime-streaming-architectures (O’Reilly Strata) https://www.slideshare.net/arunkejariwal/live-anomaly-detection-80287265 (O’Reilly Strata) https://www.slideshare.net/arunkejariwal/anomaly-detection-in-realtime-data-streams-using-heron (O’Reilly Strata) https://www.slideshare.net/arunkejariwal/real-time-analytics-algorithms-and-systems (VLDB 2015)
  • 225. RESOURCES 225 http://www.cs.toronto.edu/~koudas/courses/csc2508/SQL-on-Hadoop-Final.pdf (VLDB 2015) https://www.cse.ust.hk/vldb2002/program-info/tutorial-slides/T5garofalalis.pdf (VLDB 2002) https://datasketches.github.io/docs/Research.html https://people.cs.umass.edu/~mcgregor/slides/07-nipsslides.pdf (NIPS 2007) http://dimacs.rutgers.edu/~graham/pubs/slides/streammining.pdf