Slide deck from the InfoFarm Real-Time Big Data Seminar. Main topics: Apache Kafka, Apache Spark, Apache Storm, and integration and visualisation with Elasticsearch and Kibana.
4. Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
Agenda
• Typical Big Data Landscape
• The need for Real Time Big Data
• Real Time Data Ingestion
• Tools for Real Time Big Data
– Apache Spark
– Apache Storm
– Search
• Q&A
• Lunch
A Typical Big Data Landscape
• Data Silo
• Batch environment
• Periodic analytics/statistics
• Data Source for new systems
The need for Real Time Big Data
• Obtaining analytical results faster
– Processing faster than once a day
• Load evens out over the day
• Past/Present/Future
– Alert for certain events
– Updating Prediction models on-the-fly
• Allow faster feedback to end users
– See results of your actions right away
Perfect fits for Real Time Processing
• Anomaly Detection
– Abnormal readings of sensors
– Abnormal amounts of log files
– Fraud detection
• Real Time updates to Recommender models
– Fast new recommendations in e-commerce
– Support for trending items
– Fast responses to events happening right now
• Real Time updates of clustering models
• Improving classification based on current events
• Can be run side-by-side with traditional historical models
Apache Kafka - Overview
• Producers write messages to Kafka topics
• Consumers process messages from a topic
• Kafka runs on a cluster of servers, where each server is called a broker
Apache Kafka - Topics
• Topics are split up into different partitions
• Partitions are replicated across the cluster
• Order of messages is guaranteed within a partition
• Messages are stored for a period of time
• Producers decide which partition they write to
• Consumers keep the offset of which messages they have read
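The last two bullets can be sketched in a few lines. This is an illustrative simulation with made-up names, not the Kafka client API: the producer side maps a message key onto a partition, and the consumer side keeps its own per-partition read offset.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only (not the Kafka client API): key-based partition
// choice on the producer side, and consumer-side offset bookkeeping.
public class KafkaSketch {
    // Same key always maps to the same partition, which is also why
    // ordering is only guaranteed within a single partition.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Per-partition offset of the next message this consumer will read;
    // the consumer tracks this itself.
    static Map<Integer, Long> offsets = new HashMap<>();

    static long read(int partition) {
        long offset = offsets.getOrDefault(partition, 0L);
        offsets.put(partition, offset + 1); // advance after reading
        return offset;
    }

    public static void main(String[] args) {
        int p = partitionFor("user-42", 6);
        System.out.println("key user-42 -> partition " + p);
        System.out.println("reads at offsets " + read(p) + " and " + read(p));
    }
}
```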
Spouts
• Source of streams into the topology
• Can be reliable or unreliable
• Support for:
– Kafka
– Kestrel
– RabbitMQ
– JMS
– Amazon Kinesis
– Build your own (e.g. Twitter)
Bolts
• Where all the processing happens
• Filtering, functions, aggregations, joins, database updates, …
• Bolts subscribe to the streams of other components (spouts or other bolts)
• Must ack every tuple they process
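The acking requirement is what gives Storm its reliability guarantee. The class below is a minimal simulation with invented names, not the Storm API: the spout keeps each emitted tuple pending until a bolt acks it, and a failed tuple is queued for replay, which yields at-least-once processing.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Sketch (not the Storm API) of why every tuple must be acked.
public class AckSketch {
    final Map<Long, String> pending = new HashMap<>(); // tupleId -> payload
    final Queue<String> replay = new ArrayDeque<>();   // tuples to re-emit

    void emit(long id, String payload) { pending.put(id, payload); }

    void ack(long id) { pending.remove(id); }          // fully processed

    void fail(long id) {                               // schedule a replay
        String payload = pending.remove(id);
        if (payload != null) replay.add(payload);
    }
}
```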
Parallelism
• Spouts & Bolts
actually run as
multiple instances
on different
machines
• Making sure that
the correct
messages goes to
the correct instance
is up to the
developer
Stream Groupings
• Defines how a stream should be partitioned among the bolt's tasks
• Some examples:
– Round Robin
– Based on key
– All
– Specific instance
– …
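Two of the groupings listed above can be sketched directly. The names are illustrative, not the Storm API: a key-based ("fields") grouping routes equal keys to the same bolt task, which is what makes per-key aggregation work, while round robin spreads tuples evenly but with no key affinity.

```java
// Illustrative sketch of two stream groupings (not the Storm API).
public class GroupingSketch {
    // key-based ("fields") grouping: same key -> same task
    static int byKey(String key, int numTasks) {
        return Math.floorMod(key.hashCode(), numTasks);
    }

    // round robin: even load, but a given key can land on any task
    static int counter = 0;
    static int roundRobin(int numTasks) {
        return counter++ % numTasks;
    }
}
```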
Storm Ups and Downs
• Really real time
• Very Powerful
• Built for performance
• Very low level (comparable to MapReduce)
• Trivial tasks can become hard (sorting, joins, …)
Windowing
• You can group multiple batches together into a sliding window
• E.g. all the events from the last 60 seconds
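The 60-second example can be sketched as a plain filter over event timestamps. This is an illustrative stand-alone snippet, not the Spark Streaming windowing API: keep only the events whose timestamp falls within the last N seconds of "now".

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a sliding window over event timestamps in milliseconds
// (illustrative only, not the Spark Streaming API).
public class WindowSketch {
    static List<Long> lastSeconds(List<Long> timestamps, long nowMillis,
                                  long windowSeconds) {
        long cutoff = nowMillis - windowSeconds * 1000;
        List<Long> window = new ArrayList<>();
        for (long t : timestamps)
            if (t >= cutoff) window.add(t); // event is inside the window
        return window;
    }
}
```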
Spark Streaming Strengths
• Works just like regular Spark processing: just replace SparkContext with StreamingContext
• Full integration with other Spark libraries (Spark SQL, Spark MLlib, …)
• Ease of development
• Scalable, fault-tolerant, …
Spark Streaming Example
Data output bottlenecks
• Pig & Hive are quite slow
• No visual feedback from results
• Specific calculations (cubing) of metrics
– Reporting tools cannot handle the dimensions of the data
Elasticsearch
• Document store (ideal for denormalized data)
• Distributed
• Highly Available
• Open Source
• Real Time (Inserts & Searches)
Hive Integration
• Writing to Elasticsearch from Hive
CREATE EXTERNAL TABLE artists (
    id BIGINT,
    name STRING,
    links STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists');

-- insert data to Elasticsearch from another table called 'source'
INSERT OVERWRITE TABLE artists
    SELECT NULL, s.name, named_struct('url', s.url, 'picture', s.picture)
    FROM source s;
Hive Integration
• Reading from Elasticsearch in Hive
CREATE EXTERNAL TABLE artists (
    id BIGINT,
    name STRING,
    links STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists',
              'es.query' = '?q=me*');

-- stream data from Elasticsearch
SELECT * FROM artists;
Pig Integration
• Writing to Elasticsearch from Pig
-- load data from HDFS into Pig using a schema
A = LOAD 'src/test/resources/artists.dat' USING PigStorage()
    AS (id:long, name, url:chararray, picture:chararray);

-- transform data
B = FOREACH A GENERATE name, TOTUPLE(url, picture) AS links;

-- save the result to Elasticsearch
STORE B INTO 'radio/artists' USING org.elasticsearch.hadoop.pig.EsStorage();
Pig Integration
• Reading from Elasticsearch in Pig
-- execute Elasticsearch query and load data into Pig
A = LOAD 'radio/artists' USING org.elasticsearch.hadoop.pig.EsStorage('es.query=?q=me*');
DUMP A;
Spark Integration
• Writing to Elasticsearch from Spark
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.elasticsearch.spark._

val conf = ...
val sc = new SparkContext(conf)
// create RDD here
rdd.saveToEs("spark/docs")
Spark Integration
• Reading from Elasticsearch in Spark
...
import org.elasticsearch.spark._
...
val conf = ...
val sc = new SparkContext(conf)
sc.esRDD("radio/artists", "?q=me*")
Storm Integration
• Writing to Elasticsearch from Storm
import org.elasticsearch.storm.EsBolt;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 10);
builder.setBolt("es-bolt", new EsBolt("storm/docs"), 5)
       .shuffleGrouping("spout");
Storm Integration
• Reading from Elasticsearch in Storm
import org.elasticsearch.storm.EsSpout;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("es-spout", new EsSpout("storm/docs", "?q=me*"), 5);
builder.setBolt("bolt", new PrinterBolt()).shuffleGrouping("es-spout");
Kibana
• Visualization tool on top of Elasticsearch
• Allows ad-hoc querying & graphing
• Support for real time updates
• Create your own dashboards