Slide deck from the InfoFarm Real-Time Big Data Seminar. Main topics: Apache Kafka, Apache Spark, Apache Storm, and integration and visualisation with Elasticsearch and Kibana.
4. Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
Agenda
• Typical Big Data Landscape
• The need for Real Time Big Data
• Real Time Data Ingestion
• Tools for Real Time Big Data
– Apache Spark
– Apache Storm
– Search
• Q&A
• Lunch
A Typical Big Data Landscape
• Data Silo
• Batch environment
• Periodic analytics/statistics
• Data Source for new systems
The need for Real Time Big Data
• Obtaining analytical results faster
– Processing faster than once a day
• Load evens out over the day
• Past/Present/Future
– Alert for certain events
– Updating Prediction models on-the-fly
• Allow faster feedback to end users
– See results of your actions right away
Perfect fits for Real Time Processing
• Anomaly Detection
– Abnormal readings of sensors
– Abnormal amounts of log files
– Fraud detection
• Real Time updates to Recommender models
– Fast new recommendations in e-commerce
– Support for trending items
– Fast responses to events happening right now
• Real Time updates of clustering models
• Improving classification based on current events
• Can be run side-by-side with traditional historical models
Apache Kafka - Overview
• Producers write messages to Kafka topics
• Consumers process messages from a topic
• Kafka runs on a cluster of servers, where each server is called a broker
Apache Kafka - Topics
• Topics are split up into different partitions
• Partitions are replicated across the cluster
• Order of messages is guaranteed within a partition
• Messages are stored for a period of time
• Producers decide which partition they write to
• Consumers keep the offset of which messages they have read
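The last two bullets can be sketched in a few lines. This is an illustrative simulation with made-up names, not the Kafka client API: the producer side maps a message key onto a partition, and the consumer side keeps its own per-partition read offset.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only (not the Kafka client API): key-based partition
// choice on the producer side, and consumer-side offset bookkeeping.
public class KafkaSketch {
    // Same key always maps to the same partition, which is also why
    // ordering is only guaranteed within a single partition.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Per-partition offset of the next message this consumer will read;
    // the consumer tracks this itself.
    static Map<Integer, Long> offsets = new HashMap<>();

    static long read(int partition) {
        long offset = offsets.getOrDefault(partition, 0L);
        offsets.put(partition, offset + 1); // advance after reading
        return offset;
    }

    public static void main(String[] args) {
        int p = partitionFor("user-42", 6);
        System.out.println("key user-42 -> partition " + p);
        System.out.println("reads at offsets " + read(p) + " and " + read(p));
    }
}
```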
Spouts
• Source of streams into the topology
• Can be reliable or unreliable
• Support for:
– Kafka
– Kestrel
– RabbitMQ
– JMS
– Amazon Kinesis
– Build your own (e.g. Twitter)
Bolts
• Where all the processing happens
• Filtering, functions, aggregations, joins, database updates, …
• Bolts subscribe to the streams of other components (spouts or other bolts)
• Must ack every tuple they process
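The acking requirement is what gives Storm its reliability guarantee. The class below is a minimal simulation with invented names, not the Storm API: the spout keeps each emitted tuple pending until a bolt acks it, and a failed tuple is queued for replay, which yields at-least-once processing.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Sketch (not the Storm API) of why every tuple must be acked.
public class AckSketch {
    final Map<Long, String> pending = new HashMap<>(); // tupleId -> payload
    final Queue<String> replay = new ArrayDeque<>();   // tuples to re-emit

    void emit(long id, String payload) { pending.put(id, payload); }

    void ack(long id) { pending.remove(id); }          // fully processed

    void fail(long id) {                               // schedule a replay
        String payload = pending.remove(id);
        if (payload != null) replay.add(payload);
    }
}
```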
Parallelism
• Spouts & Bolts
actually run as
multiple instances
on different
machines
• Making sure that
the correct
messages goes to
the correct instance
is up to the
developer
Stream Groupings
• Defines how a stream should be partitioned among the bolt's tasks
• Some examples:
– Round Robin
– Based on key
– All
– Specific instance
– …
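Two of the groupings listed above can be sketched directly. The names are illustrative, not the Storm API: a key-based ("fields") grouping routes equal keys to the same bolt task, which is what makes per-key aggregation work, while round robin spreads tuples evenly but with no key affinity.

```java
// Illustrative sketch of two stream groupings (not the Storm API).
public class GroupingSketch {
    // key-based ("fields") grouping: same key -> same task
    static int byKey(String key, int numTasks) {
        return Math.floorMod(key.hashCode(), numTasks);
    }

    // round robin: even load, but a given key can land on any task
    static int counter = 0;
    static int roundRobin(int numTasks) {
        return counter++ % numTasks;
    }
}
```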
Storm Ups and Downs
• Really real time
• Very Powerful
• Built for performance
• Very low level (comparable to MapReduce)
• Trivial tasks can become hard (sorting, joins, …)
Windowing
• You can group multiple batches together into a sliding window
• E.g. all the events from the last 60 seconds
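The 60-second example can be sketched as a plain filter over event timestamps. This is an illustrative stand-alone snippet, not the Spark Streaming windowing API: keep only the events whose timestamp falls within the last N seconds of "now".

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a sliding window over event timestamps in milliseconds
// (illustrative only, not the Spark Streaming API).
public class WindowSketch {
    static List<Long> lastSeconds(List<Long> timestamps, long nowMillis,
                                  long windowSeconds) {
        long cutoff = nowMillis - windowSeconds * 1000;
        List<Long> window = new ArrayList<>();
        for (long t : timestamps)
            if (t >= cutoff) window.add(t); // event is inside the window
        return window;
    }
}
```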
Spark Streaming Strengths
• Works just like regular Spark processing: just replace SparkContext with StreamingContext
• Full integration with other Spark libraries (Spark SQL, Spark MLlib, …)
• Ease of development
• Scalable, fault-tolerant, …
Spark Streaming Example
Data output bottlenecks
• Pig & Hive are quite slow
• No visual feedback from results
• Specific calculations (cubing) of metrics
– Reporting tools cannot handle the dimensions of the data
Elasticsearch
• Document store (ideal for denormalized data)
• Distributed
• Highly Available
• Open Source
• Real Time (Inserts & Searches)
Hive Integration
• Writing to Elasticsearch from Hive
CREATE EXTERNAL TABLE artists (
    id BIGINT,
    name STRING,
    links STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists');

-- insert data to Elasticsearch from another table called 'source'
INSERT OVERWRITE TABLE artists
    SELECT NULL, s.name, named_struct('url', s.url, 'picture', s.picture)
    FROM source s;
Hive Integration
• Reading from Elasticsearch in Hive
CREATE EXTERNAL TABLE artists (
    id BIGINT,
    name STRING,
    links STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
              'es.query' = '?q=me*');

-- stream data from Elasticsearch
SELECT * FROM artists;
Pig Integration
• Writing to Elasticsearch from Pig
-- load data from HDFS into Pig using a schema
A = LOAD 'src/test/resources/artists.dat' USING PigStorage()
    AS (id:long, name, url:chararray, picture:chararray);

-- transform data
B = FOREACH A GENERATE name, TOTUPLE(url, picture) AS links;

-- save the result to Elasticsearch
STORE B INTO 'radio/artists' USING org.elasticsearch.hadoop.pig.EsStorage();
Pig Integration
• Reading from Elasticsearch in Pig
-- execute Elasticsearch query and load data into Pig
A = LOAD 'radio/artists' USING org.elasticsearch.hadoop.pig.EsStorage('es.query=?q=me*');
DUMP A;
Spark Integration
• Writing to Elasticsearch from Spark
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.elasticsearch.spark._

val conf = ...
val sc = new SparkContext(conf)
// create RDD here
rdd.saveToEs("spark/docs")
Spark Integration
• Reading from Elasticsearch in Spark
...
import org.elasticsearch.spark._
...
val conf = ...
val sc = new SparkContext(conf)
sc.esRDD("radio/artists", "?q=me*")
Storm Integration
• Writing to Elasticsearch from Storm
import org.elasticsearch.storm.EsBolt;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 10);
builder.setBolt("es-bolt", new EsBolt("storm/docs"), 5)
       .shuffleGrouping("spout");
Storm Integration
• Reading from Elasticsearch in Storm
import org.elasticsearch.storm.EsSpout;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("es-spout", new EsSpout("storm/docs", "?q=me*"), 5);
builder.setBolt("bolt", new PrinterBolt()).shuffleGrouping("es-spout");
Kibana
• Visualization tool on top of Elasticsearch
• Allows ad-hoc querying & graphing
• Support for real time updates
• Create your own dashboards