Splice machine-bloor-webinar-data-lakes

Getting Started with Hadoop: Operational Data Lake

Rich Reimer
VP, Product Management
rreimer@splicemachine.com

The Big Squeeze

Data growing much faster than IT budgets

Source: 2013 IBM Briefing Book
Source: Gartner, Worldwide IT Spending Forecast, 3Q13 Update

Traditional RDBMS Giants Overwhelmed…

Scale-up becoming cost-prohibitive

Splice Machine | Proprietary & Confidential

Scale-Out: The Future of Databases

Dramatic improvement in price/performance

Scale Up (increase server size) vs. Scale Out (more small servers)

What is a Data Lake?

•  Scale-out technology based on Hadoop
•  Data stored in native formats

Schema on Ingest vs. Schema on Read

§  Even “schemaless” MongoDB requires “schema”
-  10 Things You Should Know About Running MongoDB At Scale
•  By Asya Kamsky, Principal Solutions Architect at MongoDB
•  Item #1 – “have a good schema and indexing strategy”

•  Schema on Read if you only use data a few times a year
•  Structured data should always remain structured
•  Add schema if data used regularly

[Diagram: Schema on Ingest vs. Schema on Read; labels: Data Stream, Application]
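
The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample records, the `purchases` table, and both function names are hypothetical, and SQLite stands in for whatever SQL store holds the ingested data. Schema on read re-interprets the raw records on every query, while schema on ingest validates and types them once at load time.

```python
import json
import sqlite3

# Hypothetical raw events kept in their native format (one JSON document per line).
RAW_EVENTS = [
    '{"customer_id": 1, "amount": "19.99"}',
    '{"customer_id": 2, "amount": "5.00"}',
    '{"customer_id": "n/a", "amount": "oops"}',  # malformed record
]


def total_schema_on_read(lines):
    """Interpret the raw lines at query time; every query repeats the parsing."""
    total, skipped = 0.0, 0
    for line in lines:
        record = json.loads(line)
        try:
            int(record["customer_id"])
            total += float(record["amount"])
        except (KeyError, ValueError):
            skipped += 1  # bad records surface only when a query touches them
    return total, skipped


def load_schema_on_ingest(lines):
    """Validate and type each record once, at load time, into a SQL table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE purchases (customer_id INTEGER NOT NULL, amount REAL NOT NULL)"
    )
    rejected = []
    for line in lines:
        record = json.loads(line)
        try:
            row = (int(record["customer_id"]), float(record["amount"]))
        except (KeyError, ValueError):
            rejected.append(line)  # bad records are caught at ingest
            continue
        conn.execute("INSERT INTO purchases VALUES (?, ?)", row)
    conn.commit()
    return conn, rejected
```

With data that is queried regularly, the ingest-time version pays the parsing and validation cost once; with data touched a few times a year, the read-time version avoids building a table that is rarely used.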

Who Are We?

THE ONLY HADOOP RDBMS

Replace your old RDBMS with a scale-out SQL database

•  Affordable, Scale-Out
•  ACID Transactions
•  No Application Rewrites
•  10x Better Price/Perf

Reference Architecture: Operational Data Lake

Offload real-time reporting and analytics from expensive OLTP and DW systems

[Diagram labels: OLTP Systems (ERP, CRM, Supply Chain, HR, …); Stream or Batch Updates; Operational Data Lake; Operational Reports & Analytics; Ad Hoc Analytics; Real-Time, Event-Driven Apps; ETL; Data Warehouse; Datamart; Executive Business Reports]

Streamlining the Structured Data Pipeline in Hadoop

Traditional Hadoop Pipeline (diagram labels): Source Systems (ERP, CRM, …), Sqoop, Apply Inferred Schema, Stored as flat files, SQL Query Engines, BI Tools

vs.

Streamlined Hadoop Pipeline (diagram labels): Source Systems (ERP, CRM, …), Existing ETL Tool, Stored in same schema, BI Tools

Advantages

•  Reduced operational costs with less complexity
•  Reduced processing time and errors with fewer translations
•  Real-time updates for data cleansing
•  Better SQL support

Streamlining and Hardening the ETL Processing Pipeline

Gracefully handle data quality issues and failed queries without full data reloads

Issue: Handle Data Quality Issues (e.g., duplicates)
   Hadoop Issues: Hours to correct
      ✗  Run slow MapReduce job to de-dupe
      ✗  Reload entire data set (hours)
   Splice Machine Solution: Seconds to correct
      ✓  Insert fails due to constraint violation
      ✓  Rollback flawed updates if necessary
      ✓  Reject, replace, or merge duplicates with incremental update (ms to sec)

Issue: Update/Delete Data
   Hadoop Issues: Hours to correct
      ✗  Reload entire data set (hours)
      ✗  Writers block readers
   Splice Machine Solution: Seconds to correct
      ✓  Correct data and do incremental update (ms to sec)
      ✓  Consistent view of data even with many concurrent updates
      ✓  Writers don’t block readers

Issue: ETL Failure
   Hadoop Issues: Hours to correct
      ✗  Reload entire data set (hours)
      ✗  Miss ETL window, leading to either delayed reports or stale data
   Splice Machine Solution: Seconds to correct
      ✓  Rollback failed step
      ✓  Retry failed step and continue

Issue: Fast Query Speeds
   Hadoop Issues:
      ✗  Results typically no faster than seconds because data stored in random formats
      ✗  MapReduce
   Splice Machine Solution:
      ✓  Results possible in milliseconds because data stored in highly optimized format
      ✓  No MapReduce
Complementing Existing Hadoop-Based Data Lakes

Optimizing storage and querying of structured data as part of ELT or Hadoop query engines

[Diagram labels: OLTP Systems (ERP, CRM, Supply Chain, HR, …); Structured Data; Unstructured Data; HCATALOG; Pig]

1. SCHEMA ON INGEST: Streamlined, structured-to-structured integration
2. SCHEMA BEFORE READ: Repository for structured data or metadata from ELT process on unstructured data
3. SCHEMA ON READ: Ad-hoc Hadoop queries across structured and unstructured data

Case Study: Operational Data Lake

Overview

•  Computer technology corporation
•  Update database technology for:
   -  ODS layer replacement
   -  ETL processing and analysis of Omniture data
   -  Real-time OLTP for Global Tech Support app

Challenges

•  Oracle and Teradata too expensive to scale
•  Many Oracle queries couldn’t complete
•  Can only hold 7 days worth of data in Oracle
•  Missing ETL window with current Hadoop data lake

Solution Diagram (400TB)

[Diagram labels: OLTP Systems (ERP, CRM, Supply Chain)]

Benefits

•  75% less cost with commodity scale out
•  Incremental ETL processing: gracefully handle data quality issues
•  5x-10x faster, completing queries on which Oracle failed

Reference Architecture: Unified Customer Profile

Improve marketing ROI with deeper customer intelligence and better cross-channel coordination

[Diagram labels: Unified Customer Profile (aka DMP); Social Feeds; Web/eCommerce Clickstreams; Website; Datamart; Stream or Batch Updates; 1st Party/CRM Data; 3rd Party Data (e.g., Acxiom); Ad Perf. Data (e.g., DoubleClick); Email Mktg Data; Call Center Data; POS Data; Demand Side Platform (DSP); Ad Exchange; Email Marketing App; Operational Reports for Campaign Performance; Ad Hoc Audience Segmentation; BI Tools]

Campaign Management: Harte-Hanks

Overview

•  Digital marketing services provider
•  Unified Customer Profile
•  Real-time campaign management
•  Complex OLTP and OLAP environment

Challenges

•  Oracle RAC too expensive to scale
•  Queries too slow – even up to ½ hour
•  Getting worse – expect 30-50% data growth
•  Looked for 9 months for a cost-effective solution

Solution Diagram

[Diagram labels: Cross-Channel Campaigns; Real-Time Personalization; Real-Time Actions]

Initial Results

•  ¼ cost with commodity scale out
•  3-7x faster through parallelized queries
•  10-20x price/perf with no application, BI or ETL rewrites

Proven Building Blocks: Hadoop and Derby

APACHE DERBY

§  ANSI SQL-99 RDBMS
§  Java-based
§  ODBC/JDBC Compliant

APACHE HBASE/HDFS

§  Auto-sharding
§  Real-time updates
§  Fault-tolerance
§  Scalability to 100s of PBs
§  Data replication

Typical Database Workloads

Workload types: Operational Applications; Operational Reporting & Analytics; Ad-Hoc Analytics; Enterprise Data Warehouses

Typical Databases

•  MySQL
•  Oracle
•  MongoDB
•  MySQL
•  Oracle
•  Greenplum
•  Paraccel
•  Netezza
•  Teradata
•  Oracle
•  Sybase IQ

Use Cases

•  OLTP - ERP, CRM
•  Websites
•  Operational Datastores
•  Exploratory Analytics
•  Data Mining
•  Enterprise Reporting

Typical Users

•  Customers
•  Operational Employees
•  Operational Employees
•  Analysts
•  Data Scientists
•  Managers
•  Executives

Workload Strengths

•  High concurrency of small reads/writes
•  Range queries
•  Parameterized reports against real-time data
•  Range queries
•  Complex queries requiring full table scans
•  Parameterized reports against historical data

Use Cases

•  Internet of Things
•  Operational Data Lake
•  Digital Marketing
•  Personalized Medicine
•  Fraud Detection

Operational Data Lake: Great On-Ramp to Big Data

§  Clear Business Value Now
-  Replace obsolete Operational Data Stores (ODSs)
-  Existing use cases – not just a science project
-  Hadoop RDBMS – inexpensive to store data

§  Incremental On-Ramp to Big Data
-  Start with structured data and then expand to unstructured
-  Add schema when needed
