a real-time architecture using Hadoop and Storm at Devoxx

A real-time architecture using
Hadoop and Storm.
Nathan Bijnens

#DV13-#rtbigdata

@nathan_gs

Speaker

Nathan Bijnens
DataCrunchers
@nathan_gs

#DV13-#rtbigdata

@nathan_gs

Our Vision

Volume
Big Data

test

#DV13-#rtbigdata

@nathan_gs

Big Data

Velocity
test

#DV13-#rtbigdata

@nathan_gs

Our Vision

Volum
e

Variety
test

#DV13-#rtbigdata

@nathan_gs

Computing Trends
Past

Current

Computation (CPUs)
Expensive

Computation Cheap
(Many Core Computers)

Disk Storage Expensive

Disk Storage Cheap
(Cheap Commodity Disks)

DRAM Expensive

DRAM / SSD
Getting Cheap

Coordination Easy
(Latches Don’t Often Hit)

Coordination Hard
(Latches Stall a Lot, etc)

Source: Immutability Changes Everything - Pat Helland, RICON2012

#DV13-#rtbigdata

@nathan_gs

Credits
Nathan Marz

•
•
•
•
•

Ex-Backtype & Twitter
Startup in Stealthmode
Storm
Cascalog
ElephantDB
manning.com/marz

#DV13-#rtbigdata

@nathan_gs

A Data System

#DV13-#rtbigdata

@nathan_gs

Data is more than Information

Not all information is equal.
Some information is derived from
other pieces of information.

#DV13-#rtbigdata

@nathan_gs

Data is more than Information
Eventually you will reach the most
‘raw’ form of information.
This is the information you hold true, simple
because it exists.
Let’s call this ‘data’, very similar to ‘event’.

#DV13-#rtbigdata

@nathan_gs

Events - Before

Events used to manipulate
the master data.

#DV13-#rtbigdata

@nathan_gs

Events - After

Today, events are the master
data.

#DV13-#rtbigdata

@nathan_gs

Data System

Let’s store everything.

#DV13-#rtbigdata

@nathan_gs

Events

Data is Immutable

#DV13-#rtbigdata

@nathan_gs

Events

Data is Time Based

#DV13-#rtbigdata

@nathan_gs

Capturing change traditionally

Person

Location

Person

Location

Nathan

Antwerp

Nathan

Ghent

Geert

Dendermonde

Geert

Dendermonde

John

Ghent

John

Ghent

#DV13-#rtbigdata

@nathan_gs

Capturing change

Person

Location

Timestamp

Person

Location

Time

Nathan

Antwerp

2005-01-01

Nathan

Antwerp

2005-01-01

Geert

Dendermonde

2011-10-08

Geert

Dendermonde

2011-10-08

John

Ghent

2010-05-02

John

Ghent

2010-05-02

Nathan

Ghent

2013-02-03

#DV13-#rtbigdata

@nathan_gs

Query

The data you query is often
transformed, aggregated, ...
Rarely used in it’s original form.

#DV13-#rtbigdata

@nathan_gs

Query

Query = function ( all data )

#DV13-#rtbigdata

@nathan_gs

Number of people living in each city.

Person

Location

Time

Location

Count

Nathan

Antwerp

2005-01-01

Ghent

2

Geert

Dendermond
e

2011-10-08

Dendermonde

1

John

Ghent

2010-05-02

Nathan

Ghent

2013-02-03

#DV13-#rtbigdata

@nathan_gs

Query

All Data

#DV13-#rtbigdata

Query

@nathan_gs

Query: Precompute

All Data

#DV13-#rtbigdata

Precomputed
View

Query

@nathan_gs

Layered Architecture
Batch Layer

Speed Layer

Serving Layer

#DV13-#rtbigdata

@nathan_gs

Layered Architecture

Query

Cassandra

Incoming Data
Hadoop

#DV13-#rtbigdata

Elephant
DB

@nathan_gs

Batch Layer

#DV13-#rtbigdata

@nathan_gs

Batch Layer

Incoming Data
Hadoop

#DV13-#rtbigdata

Elephant
DB

@nathan_gs

Batch Layer

Unrestrained computation.

#DV13-#rtbigdata

@nathan_gs

Batch Layer

No need to De-Normalize.

#DV13-#rtbigdata

@nathan_gs

Batch Layer

Horizontal scalable.

#DV13-#rtbigdata

@nathan_gs

Batch Layer

High Latency.
Let’s pretend temporarily that update latency
doesn’t matter.

#DV13-#rtbigdata

@nathan_gs

Batch Layer

Functional computation, based on
immutable inputs, is idempotent.

#DV13-#rtbigdata

@nathan_gs

Batch Layer

Stores master copy of data
set...
append only.

#DV13-#rtbigdata

@nathan_gs

Batch: View generation
View #1

Master Dataset

MapReduc
e

View #2

View #3

#DV13-#rtbigdata

@nathan_gs

MapReduce

MAP

1. Take a large data set and divide it into subsets
…

2. Perform the same function on all subsets

REDUC
E

DoWork()

DoWork()

DoWork()

…

3. Combine the output from all subsets

#DV13-#rtbigdata

…

Output

@nathan_gs

MapReduce

#DV13-#rtbigdata

@nathan_gs

Serialization & Schema

Catch errors as quickly as they happen.
Validation on write vs on read.

#DV13-#rtbigdata

@nathan_gs


CSV is actually a serialization language
that is just poorly defined.

#DV13-#rtbigdata

@nathan_gs


• Use a format with a schema.
•
•
•

Thrift
Avro
Protobuffers

• Added bonus: it’s faster & uses less space.

#DV13-#rtbigdata

@nathan_gs

Batch View Database

Read only database.
No random writes required.

#DV13-#rtbigdata

@nathan_gs

Batch View Database

Every iteration produces the
Views from scratch.

#DV13-#rtbigdata

@nathan_gs

Batch View Database

• ElephantDB
• Splout
• Voldemort
•…

#DV13-#rtbigdata

@nathan_gs

Batch Layer
We are not done yet…

Just a few hours of data.

Data absorbed into Batch Views

#DV13-#rtbigdata

Now

Time

Not yet
absorbed.

@nathan_gs

Speed Layer

#DV13-#rtbigdata

@nathan_gs

Overview
Cassandra

Incoming Data
Hadoop

#DV13-#rtbigdata

Elephant
DB

@nathan_gs

Speed Layer

Stream processing.

#DV13-#rtbigdata

@nathan_gs

Speed Layer

Continuous computation.

#DV13-#rtbigdata

@nathan_gs

Speed Layer

Transactional.

#DV13-#rtbigdata

@nathan_gs

Speed Layer

Storing a limited window of
data.
Compensating for the last few hours of data.

#DV13-#rtbigdata

@nathan_gs

Speed Layer

All the complexity is isolated in the
Speed layer.
If anything goes wrong, it’s auto-corrected.

#DV13-#rtbigdata

@nathan_gs

CAP
You have a choice between:
Availability

•

•

Queries are eventual consistent.

• Consistency
•

Consistency

Queries are consistent.

#DV13-#rtbigdata

Partition
Tolerance

Availability

@nathan_gs

Eventual accuracy

Some algorithms are hard to
implement in real time. For those
cases we could estimate the results.

#DV13-#rtbigdata

@nathan_gs

Speed Layer
Real
Time
View 1

Incoming Data
Real
Time
View 2

#DV13-#rtbigdata

@nathan_gs

Storm

• Message passing.
• Distributed processing.
• Horizontally scalable.
• Incremental algorithms.
• Fast.
• Data in motion.
#DV13-#rtbigdata

@nathan_gs

Storm

#DV13-#rtbigdata

@nathan_gs

Storm

• Tuple
• Stream

#DV13-#rtbigdata

@nathan_gs

Storm

• Spout
• Bolt

#DV13-#rtbigdata

@nathan_gs

Storm

• Grouping

#DV13-#rtbigdata

@nathan_gs

Data Ingestion

• Kafka
• Flume
• Scribe
• *MQ
•…

#DV13-#rtbigdata

@nathan_gs

Speed Layer Views

• The views are stored in Read & Write database.
•
•
•
•
•
•

Cassandra
Hbase
Redis
MySQL
ElasticSearch
…

• Much more complex than a read only view.
#DV13-#rtbigdata

@nathan_gs

Serving Layer

#DV13-#rtbigdata

@nathan_gs

Overview

Query

Cassandra

Incoming Data
Hadoop

#DV13-#rtbigdata

Elephant
DB

@nathan_gs

Serving Layer

Random reads

#DV13-#rtbigdata

@nathan_gs

Serving Layer

This layer queries the Batch & Real
Time views and merges it.

#DV13-#rtbigdata

@nathan_gs

Serving Layer
Batch
Views

Merge
Real
Time
Views

#DV13-#rtbigdata

@nathan_gs

Serving Layer

How to query an Average?

#DV13-#rtbigdata

@nathan_gs

Overview

#DV13-#rtbigdata

@nathan_gs

CQRS

Source: martinfowler.com/bliki/CQRS.html – Martin Fowler

#DV13-#rtbigdata

@nathan_gs

Lambda Architecture

#DV13-#rtbigdata

@nathan_gs

Lambda Architecture

Can discard any view, batch and real
time, and just recreate everything from
the master data.

#DV13-#rtbigdata

@nathan_gs

Lambda Architecture

Mistakes are corrected via recomputation.
Write bad data? Remove the data & recompute.
Bug in view generation? Just recompute the view.

#DV13-#rtbigdata

@nathan_gs

Lambda Architecture

Data storage is highly optimized.

#DV13-#rtbigdata

@nathan_gs

Lambda Architecture

Immutability changes everything.

#DV13-#rtbigdata

@nathan_gs

Questions?

Questions?
@nathan_gs & #DV13
slideshare.net/nathan_gs
nathan@datacrunchers.eu

#DV13-#rtbigdata

@nathan_gs

DataCrunchers
We enable companies in envisioning, defining and
implementing a data strategy.
A one-stop-shop for all your Big Data needs.
The first Big Data Consultancy agency in Belgium.

#DV13-#rtbigdata

@nathan_gs

Thank you

Thank you
@nathan_gs
nathan@datacrunchers.eu

#DV13-#rtbigdata

@nathan_gs

a real-time architecture using Hadoop and Storm at Devoxx

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

En vedette

En vedette (18)

Similaire à a real-time architecture using Hadoop and Storm at Devoxx

Similaire à a real-time architecture using Hadoop and Storm at Devoxx (20)

Plus de Nathan Bijnens

Plus de Nathan Bijnens (10)

Dernier

Dernier (20)

a real-time architecture using Hadoop and Storm at Devoxx

Notes de l'éditeur