2. (Batch) Analytics
Scientists have been doing this for 25 years, with
MPI (1991) on special hardware.
It took off with Google’s MapReduce
paper (2004); Apache Hadoop, Hive, and a
whole ecosystem followed.
It was successful, so here we are!
But processing takes time.
3. Value of Some Insights Degrades Fast!
For some use cases (e.g. stock markets, traffic, surveillance, patient
monitoring), the value of insights degrades very quickly with time.
- E.g. stock markets and speed of light
We need technology that can produce
outputs fast
- Static queries that need very fast output
(alerts, realtime control)
- Dynamic and interactive queries (data
exploration)
4. History
Realtime analytics is not new either!
- Active Databases (2000+)
- Stream processing (Aurora, Borealis (2005+)
and later Storm)
- Distributed streaming operators (a
database research topic around 2005)
- CEP vendor roadmap (from
http://www.complexevents.com/2014/12/03/cep-tooling-market-survey-2014/)
7. I. Stream Processing
Program a set of processors and wire them up; data flows through
the graph.
A middleware framework handles data flow, distribution, and fault
tolerance (e.g. Apache Storm, Samza)
Processors may be on the same machine or on multiple machines
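The "processors wired into a graph" idea can be sketched in a few lines of Python. This is only an in-process illustration, not how Storm or Samza are programmed: the event shape (`roomNo`, `temp`) and the function names are assumptions made for the example, and a real framework would add distribution and fault tolerance.

```python
from typing import Callable, Iterable, Iterator

Event = dict  # assumed event shape: {"roomNo": int, "temp": float}

def source(events: Iterable[Event]) -> Iterator[Event]:
    """Entry point of the graph: emits events one at a time."""
    for event in events:
        yield event

def filter_processor(stream: Iterator[Event],
                     predicate: Callable[[Event], bool]) -> Iterator[Event]:
    """Processor that forwards only the events matching the predicate."""
    for event in stream:
        if predicate(event):
            yield event

def map_processor(stream: Iterator[Event],
                  fn: Callable[[Event], Event]) -> Iterator[Event]:
    """Processor that transforms each event."""
    for event in stream:
        yield fn(event)

def run_pipeline(events: Iterable[Event]) -> list:
    """Wire source -> filter -> map and drain the graph."""
    hot = filter_processor(source(events), lambda e: e["temp"] > 30.0)
    alerts = map_processor(
        hot, lambda e: {"roomNo": e["roomNo"], "action": "alert"})
    return list(alerts)
```

Each processor only sees the stream handed to it, which is what lets a framework place processors on different machines.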
9. III. Micro Batch
Process data in small batches, and
then combine the batch results into the final result
(e.g. Spark)
Works for simple aggregates, but
tricky to do for complex
operations (e.g. event sequences)
This can be done with MapReduce as well if
the deadlines are not too tight.
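A minimal sketch of the micro-batch idea for a simple aggregate (an average), assuming numeric events and a fixed batch size; the function names are illustrative and this is not Spark's API. Each batch produces a partial result (count, sum), and the partials are combined at the end.

```python
from itertools import islice

def micro_batches(events, batch_size):
    """Split an event stream into consecutive small batches."""
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def summarize(batch):
    """Partial aggregate for one batch: (count, sum)."""
    return len(batch), sum(batch)

def combine(partials):
    """Merge partial aggregates into the final average."""
    total_count = sum(count for count, _ in partials)
    total_sum = sum(batch_sum for _, batch_sum in partials)
    return total_sum / total_count if total_count else None

def micro_batch_average(events, batch_size=3):
    partials = [summarize(b) for b in micro_batches(events, batch_size)]
    return combine(partials)
```

The average works because (count, sum) pairs can be merged in any order. An event sequence that spans two batches has no such simple merge step, which is why complex operations are tricky in this model.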
10. IV. OLAP Style In Memory Computing
Usually done to support interactive
queries
Index data to make it
readily accessible so you can respond
to queries fast (e.g. Apache Drill)
Tools like Druid, VoltDB and SAP
Hana can do this with all data in
memory to make things really fast.
11. Realtime Analytics Patterns
Simple counting (e.g. failure count)
Counting with windows (e.g. failure count every hour)
Preprocessing: filtering, transformations (e.g. data cleanup)
Alerts, thresholds (e.g. alarm on high temperature)
Data Correlation, Detect missing events, detecting erroneous data
(e.g. detecting failed sensors)
Joining event streams (e.g. detect a hit on soccer ball)
Merge with data in a database, collect, update data conditionally
12. Realtime Analytics Patterns (contd.)
Detecting Event Sequence Patterns (e.g. small transaction followed
by large transaction)
Tracking: follow some related entity’s state in space, time, etc. (e.g.
location of airline baggage, vehicles, wildlife)
Detect trends: rise, turn, fall, outliers, and complex trends like triple
bottom (e.g. algorithmic trading, SLA, load balancing)
Learning a Model (e.g. Predictive maintenance)
Predicting next value and corrective actions (e.g. automated car)
13. Apache Hive
A SQL-like data processing language
Since many understand SQL, Hive
made large-scale Big Data processing
accessible to many
Expressive, short, and sweet.
Defines core operations that cover 90%
of problems
Lets experts dig in when they like!
15. CEP = SQL for Realtime Analytics
Easy to follow from SQL
Expressive, short, and sweet.
Defines core operations that cover 90% of
problems
Lets experts dig in when they like!
Let's look at the core operations.
16. Operators: Filters
Assume a temperature stream
Here weather:convertFtoC() is a
user-defined function. Such functions
are used to extend the language.
define stream TempStream (ts long, roomNo int, temp double);
from TempStream[weather:convertFtoC(temp) > 30.0
and roomNo != 2043]
select roomNo, temp
insert into HotRoomsStream;
Use cases:
- Alerts, thresholds (e.g. alarm on
high temperature)
- Preprocessing: filtering,
transformations (e.g. data cleanup)
17. Operators: Windows and Aggregation
Supports many window types
- Batch windows, sliding windows, custom windows
Use cases
- Simple counting (e.g. failure count)
- Counting with windows (e.g. failure count every hour)
from TempStream#window.time(1 min)
select roomNo, avg(temp) as avgTemp
insert into HotRoomsStream;
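To make the semantics of a sliding time window concrete, here is a small Python sketch of what the one-minute average above computes. It is an illustration only, not the CEP engine's implementation: it assumes events arrive in timestamp order, timestamps are in milliseconds, and the class name is hypothetical.

```python
from collections import deque

class SlidingTimeWindowAvg:
    """Running average over the events of the last `window_ms` milliseconds."""

    def __init__(self, window_ms):
        self.window_ms = window_ms
        self.events = deque()  # (ts, temp) pairs, oldest first
        self.total = 0.0

    def add(self, ts, temp):
        """Add an event, expire old ones, and return the current average."""
        self.events.append((ts, temp))
        self.total += temp
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= ts - self.window_ms:
            _, old_temp = self.events.popleft()
            self.total -= old_temp
        return self.total / len(self.events)
```

Keeping a running total means each new event costs only the work of expiring old events, rather than re-summing the whole window.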
18. Operators: Patterns
Models a followed-by relation: e.g.
event A followed by event B
Very powerful tool for tracking
and detecting patterns
from every (a1 = TempStream)
-> a2 = TempStream[temp > a1.temp + 5]
within 1 day
select a2.ts as ts, a2.temp - a1.temp as diff
insert into HotDayAlertStream;
Use cases
- Detecting Event Sequence Patterns
- Tracking
- Detect trends
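One way to read the followed-by query above is sketched below in Python. This is an interpretation under stated assumptions, not the engine's matching algorithm: every event starts a new pending match, a pending match completes on the first later event that is more than 5 degrees hotter, and it expires after one day. Events are assumed to be `(ts, temp)` pairs in timestamp order with millisecond timestamps.

```python
DAY_MS = 24 * 60 * 60 * 1000

def detect_rises(events, min_rise=5.0, within_ms=DAY_MS):
    """Return (ts, diff) for each earlier event followed by a hotter one."""
    pending = []  # earlier events still waiting for a match
    matches = []  # (timestamp of the later event, temperature difference)
    for ts, temp in events:
        still_pending = []
        for a1_ts, a1_temp in pending:
            if ts - a1_ts > within_ms:
                continue  # the "within" deadline has passed; drop it
            if temp > a1_temp + min_rise:
                matches.append((ts, temp - a1_temp))
            else:
                still_pending.append((a1_ts, a1_temp))
        # The current event also starts a new pending match ("every").
        still_pending.append((ts, temp))
        pending = still_pending
    return matches
```

The list of pending partial matches is the state a pattern operator has to keep, which is why the `within` clause matters: it bounds how long that state lives.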
19. Operators: Joins
Join two data streams based on a condition and windows
Use cases
- Data correlation, detecting missing events, detecting erroneous data
- Joining event streams
from TempStream[temp > 30.0]#window.time(1 min) as T
join RegulatorStream[isOn == false]#window.length(1) as R
on T.roomNo == R.roomNo
select T.roomNo, R.deviceID, 'start' as action
insert into RegulatorActionStream;
20. Operators: Access Data from the Disk
Event tables allow users to map a database to a window and join a
data stream with the window
Use cases
- Merge with data in a database, collect, update data conditionally
define stream TempStream (ts long, temp double);
define table HistTempTable (day long, avgT double);
from TempStream#window.length(1) join HistTempTable
on getDayOfYear(ts) == HistTempTable.day and temp > avgT
select ts, temp
insert into PurchaseUserStream;
22. Predictive Analytics
Build models and use them with
WSO2 CEP, BAM and ESB using the
upcoming WSO2 Machine Learner
product (2015 Q2)
Build models using R, export them as
PMML, and use them within WSO2 CEP
Call R Scripts from CEP queries
Regression and Anomaly Detection
Operators in CEP
23. Case Study: Realtime Soccer Analysis
Watch at: https://www.youtube.com/watch?v=nRI6buQ0NOM
26. Idea 1: Network of CEP Nodes
For scaling, we arrange CEP
processing nodes in a graph, as in
stream processing.
The graph can be implemented
using a stream processing engine
like Apache Storm
27. Idea II: Compile SQL like Queries to a
Network of CEP Nodes
from TempStream[temp > 33]
insert into HighTempStream;
from HighTempStream#window.time(1 hour)
select max(temp) as max
insert into HourlyMaxTempStream;
28. How do We partition the Data to scale
up the Analysis?
Let's follow MapReduce
MapReduce does not scale by itself; it asks users to break
the problem into many small independent problems.
29. Idea III: Let the Users specify Parallelism
The language includes parallel constructs:
partitions, pipelines, distributed
operators
Assign each partition to a different
node, and partition the data accordingly
define partition on TempStream.region {
from TempStream[temp > 33]
insert into HighTempStream;
}
from HighTempStream#window.time(1 hour)
select max(temp) as max
insert into HourlyMaxTempStream;
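The effect of partitioning on `region` can be sketched in Python as follows. This is only an illustration of the routing idea, with assumed names and event shape (`region`, `temp`); the node count and the CRC-based hash are choices made for the example, not something the deck specifies. Events with the same region always land on the same node, the filter runs per node, and the filtered results are merged for the final aggregate.

```python
import zlib
from collections import defaultdict

def node_for(key_value, num_nodes):
    """Pick a node from the partition key using a stable hash."""
    return zlib.crc32(str(key_value).encode("utf-8")) % num_nodes

def partition(events, num_nodes):
    """Group events by the node responsible for their region."""
    nodes = defaultdict(list)
    for event in events:
        nodes[node_for(event["region"], num_nodes)].append(event)
    return nodes

def high_temp(events, threshold=33):
    """Per-node filter, mirroring the query inside the partition."""
    return [e for e in events if e["temp"] > threshold]

def max_high_temp(events, num_nodes=4):
    """Run the filter on every node, then merge and take the maximum."""
    high = []
    for node_events in partition(events, num_nodes).values():
        high.extend(high_temp(node_events))
    return max((e["temp"] for e in high), default=None)
```

A stable hash is used instead of Python's built-in `hash()` because string hashes vary between processes, and every machine has to agree on where a key goes.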
30. Handling Ordering
When data is processed in
parallel, output might be generated
out of order.
Without a global time, we
cannot reliably trigger windows and other
time-sensitive constructs
Solution: the current time needs to
be propagated through the graph
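A hedged sketch of what propagating time through the graph can look like for one downstream window operator. The design here is an assumption for illustration, not the deck's stated implementation: each upstream node reports its current time (with events or with separate time ticks), the operator treats the minimum over all upstream clocks as "now", and it closes a window only once "now" has passed the window's end. Class and method names are hypothetical.

```python
class TimePropagatingWindow:
    """Tumbling max-window that fires based on propagated upstream time."""

    def __init__(self, upstream_ids, window_ms):
        self.clocks = {u: None for u in upstream_ids}  # last time per upstream
        self.window_ms = window_ms
        self.buckets = {}  # window start -> values seen in that window

    def on_event(self, upstream, ts, value):
        """Store the value in its window, then advance that upstream's clock."""
        start = ts - ts % self.window_ms
        self.buckets.setdefault(start, []).append(value)
        return self.on_time(upstream, ts)

    def on_time(self, upstream, ts):
        """Advance one upstream clock; return (window_start, max) for closed windows."""
        previous = self.clocks[upstream]
        self.clocks[upstream] = ts if previous is None else max(previous, ts)
        if any(clock is None for clock in self.clocks.values()):
            return []  # some upstream has not reported a time yet
        now = min(self.clocks.values())
        closed = []
        for start in sorted(self.buckets):
            if start + self.window_ms <= now:
                closed.append((start, max(self.buckets.pop(start))))
        return closed
```

Because the slowest upstream sets the pace, a window never fires while a lagging partition could still deliver events for it. The sketch does not handle events that arrive for a window that has already closed.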
33. CEP = SQL for Realtime Analytics
Easy to follow from SQL
Expressive, short, sweet and fast!!
Defines core operations that cover 90% of
problems
Lets experts dig in when they like!
And it Scales!!