Splice machine-bloor-webinar-data-lakes
1. Getting Started with Hadoop: Operational Data Lake
Rich Reimer, VP, Product Management
rreimer@splicemachine.com
2. The Big Squeeze
Data growing much faster than IT budgets
Source: 2013 IBM Briefing Book
Source: Gartner, Worldwide IT Spending Forecast, 3Q13 Update
4. Scale-Out: The Future of Databases
Dramatic improvement in price/performance
Scale Up (increase server size) vs. Scale Out (more small servers)
5. What is a Data Lake?
• Scale-out technology based on Hadoop
• Data stored in native formats
6. Schema on Ingest vs. Schema on Read
§ Even “schemaless” MongoDB requires “schema”
- 10 Things You Should Know About Running MongoDB At Scale
• By Asya Kamsky, Principal Solutions Architect at MongoDB
• Item #1 – “have a good schema and indexing strategy”
Diagram labels: Schema on Ingest; Schema on Read; Data Stream; Application
• Schema on Read if you only use data a few times a year
• Structured data should always remain structured
• Add schema if data used regularly
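The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SCHEMA` fields, the JSON-lines input, and the function names are hypothetical and do not come from the slides or from any product API.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
SCHEMA = {"customer_id": int, "email": str}

def validate(record):
    """Return the record if it matches SCHEMA, otherwise raise ValueError."""
    if not isinstance(record, dict):
        raise ValueError("record is not an object")
    for field, expected in SCHEMA.items():
        if not isinstance(record.get(field), expected):
            raise ValueError(f"bad or missing field: {field}")
    return record

def ingest_with_schema(raw_lines):
    """Schema on ingest: parse and validate once, when the data is written."""
    accepted, rejected = [], []
    for line in raw_lines:
        try:
            accepted.append(validate(json.loads(line)))
        except ValueError:  # also covers json.JSONDecodeError
            rejected.append(line)
    return accepted, rejected

def read_with_schema(raw_lines):
    """Schema on read: raw lines stay as stored; every query re-parses them."""
    for line in raw_lines:
        try:
            yield validate(json.loads(line))
        except ValueError:
            continue  # malformed rows are skipped again on each read

raw = [
    '{"customer_id": 1, "email": "a@example.com"}',
    '{"customer_id": "2", "email": "b@example.com"}',  # wrong type
    "not json",
]
accepted, rejected = ingest_with_schema(raw)
```

With schema on ingest, the two bad rows are caught once and can be fixed at the source; with schema on read, the same parsing and filtering cost is paid on every query, which is why the slide reserves it for data used only a few times a year.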
7. Who Are We?
THE ONLY HADOOP RDBMS
Replace your old RDBMS with a scale-out SQL database
Affordable, Scale-Out
ACID Transactions
No Application Rewrites
10x Better Price/Perf
8. Reference Architecture: Operational Data Lake
Offload real-time reporting and analytics from expensive OLTP and DW systems
Diagram labels:
OLTP Systems: ERP, CRM, Supply Chain, HR, …
Stream or Batch Updates
Operational Data Lake
Operational Reports & Analytics
Real-Time, Event-Driven Apps
Ad Hoc Analytics
ETL
Data Warehouse
Datamart
Executive Business Reports
9. Streamlining the Structured Data Pipeline in Hadoop
Traditional Hadoop Pipeline: Source Systems (ERP, CRM, …); Sqoop; Apply Inferred Schema; Stored as flat files; SQL Query Engines; BI Tools
vs.
Streamlined Hadoop Pipeline: Source Systems (ERP, CRM, …); Existing ETL Tool; Stored in same schema; BI Tools
Advantages
• Reduced operational costs with less complexity
• Reduced processing time and errors with fewer translations
• Real-time updates for data cleansing
• Better SQL support
10. Streamlining and Hardening the ETL Processing Pipeline
Gracefully handle data quality issues and failed queries without full data reloads

Issue: Handle Data Quality Issues (e.g., duplicates)
Hadoop Issues: Hours to correct
✗ Run slow MapReduce job to de-dupe
✗ Reload entire data set (hours)
Splice Machine Solution: Seconds to correct
✓ Insert fails due to constraint violation
✓ Rollback flawed updates if necessary
✓ Reject, replace, or merge duplicates with incremental update (ms to sec)

Issue: Update/Delete Data
Hadoop Issues: Hours to correct
✗ Reload entire data set (hours)
✗ Writers block readers
Splice Machine Solution: Seconds to correct
✓ Correct data and do incremental update (ms to sec)
✓ Consistent view of data even with many concurrent updates
✓ Writers don’t block readers

Issue: ETL Failure
Hadoop Issues: Hours to correct
✗ Reload entire data set (hours)
✗ Miss ETL window, leading to either delayed reports or stale data
Splice Machine Solution: Seconds to correct
✓ Rollback failed step
✓ Retry failed step and continue

Issue: Fast Query Speeds
Hadoop Issues:
✗ Results typically no faster than seconds because data stored in random formats
✗ MapReduce
Splice Machine Solution:
✓ Results possible in milliseconds because data stored in highly optimized format
✓ No MapReduce
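The pattern in the comparison above (a constraint rejects duplicates, a failed step rolls back, and a retry merges incrementally) can be sketched with any transactional SQL database. The example below is a minimal sketch that uses Python's built-in `sqlite3` module purely as a stand-in; it is not Splice Machine's API, and the `customers` table, `load_step`, and `merge_step` are hypothetical names chosen for illustration.

```python
import sqlite3

# In-memory SQLite stands in for any ACID SQL database (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.execute("INSERT INTO customers VALUES (1, 'old@example.com')")
conn.commit()

def load_step(conn, rows):
    """Run one ETL step atomically; return False if it was rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate key: the whole step is undone, no reload needed

def merge_step(conn, rows):
    """Retry the step, replacing rows whose key already exists."""
    with conn:
        conn.executemany("INSERT OR REPLACE INTO customers VALUES (?, ?)", rows)

def count_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]

batch = [(2, "b@example.com"), (1, "new@example.com")]  # id 1 is a duplicate key

loaded = load_step(conn, batch)          # fails on the primary-key constraint
rows_after_failure = count_rows(conn)    # the partial insert of id 2 was rolled back

merge_step(conn, batch)                  # incremental retry that merges the duplicate
rows_after_merge = count_rows(conn)
email_for_1 = conn.execute("SELECT email FROM customers WHERE id = 1").fetchone()[0]
```

The point of the sketch is that the failed batch leaves the table exactly as it was, so recovery is a retry of one step rather than a reload of the full data set.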
11. Complementing Existing Hadoop-Based Data Lakes
Optimizing storage and querying of structured data as part of ELT or Hadoop query engines
Diagram labels:
OLTP Systems: ERP, CRM, Supply Chain, HR, …
Structured Data
Unstructured Data
SCHEMA ON INGEST: Streamlined, structured-to-structured integration
SCHEMA BEFORE READ: Repository for structured data or metadata from ELT process on unstructured data
HCATALOG
Pig
SCHEMA ON READ: Ad-hoc Hadoop queries across structured and unstructured data
12. Case Study: Operational Data Lake
Overview
• Computer technology corporation
• Update database technology for:
  - ODS layer replacement
  - ETL processing and analysis of Omniture data
  - Real-time OLTP for Global Tech Support app
Challenges
• Oracle and Teradata too expensive to scale
• Many Oracle queries couldn’t complete
• Can only hold 7 days’ worth of data in Oracle
• Missing ETL window with current Hadoop data lake
Solution Diagram
(400TB)
OLTP Systems: ERP, CRM, Supply Chain
Benefits
• 75% less cost with commodity scale out
• Incremental ETL processing gracefully handles data quality issues
• 5x-10x faster, completing queries on which Oracle failed
13. Reference Architecture: Unified Customer Profile
Improve marketing ROI with deeper customer intelligence and better cross-channel coordination
Diagram labels:
Unified Customer Profile (aka DMP)
Operational Reports for Campaign Performance
Social Feeds
Web/eCommerce Clickstreams
Website
Datamart
Stream or Batch Updates
BI Tools
Demand Side Platform (DSP)
Ad Exchange
1st Party/CRM Data
3rd Party Data (e.g., Acxiom)
Ad Perf. Data (e.g., DoubleClick)
Email Mktg Data
Call Center Data
POS Data
Email Marketing App
Ad Hoc Audience Segmentation
BI Tools
14. Campaign Management: Harte-Hanks
Overview
• Digital marketing services provider
• Unified Customer Profile
• Real-time campaign management
• Complex OLTP and OLAP environment
Challenges
• Oracle RAC too expensive to scale
• Queries too slow – even up to ½ hour
• Getting worse – expect 30-50% data growth
• Looked for 9 months for a cost-effective solution
Solution Diagram
Initial Results
• ¼ cost with commodity scale out
• 3-7x faster through parallelized queries
• 10-20x price/perf with no application, BI or ETL rewrites
Diagram labels: Cross-Channel Campaigns; Real-Time Personalization; Real-Time Actions
15. Proven Building Blocks: Hadoop and Derby
APACHE DERBY
§ ANSI SQL-99 RDBMS
§ Java-based
§ ODBC/JDBC Compliant
APACHE HBASE/HDFS
§ Auto-sharding
§ Real-time updates
§ Fault-tolerance
§ Scalability to 100s of PBs
§ Data replication
16. Typical Database Workloads
Workload types: Operational Applications; Operational Reporting & Analytics; Ad-Hoc Analytics; Enterprise Data Warehouses
Typical Databases
• MySQL
• Oracle
• MongoDB
• MySQL
• Oracle
• Greenplum
• Paraccel
• Netezza
• Teradata
• Oracle
• Sybase IQ
Use Cases
• OLTP - ERP, CRM
• Websites
• Operational Datastores
• Exploratory Analytics
• Data Mining
• Enterprise Reporting
Typical Users
• Customers
• Operational Employees
• Operational Employees
• Analysts
• Data Scientists
• Managers
• Executives
Workload Strengths
• High concurrency of small reads/writes
• Range queries
• Parameterized reports against real-time data
• Range queries
• Complex queries requiring full table scans
• Parameterized reports against historical data
17. Use Cases
Internet of Things
Operational Data Lake
Digital Marketing
Personalized Medicine
Fraud Detection
Splice Machine | Proprietary & Confidential
18. Operational Data Lake: Great On-Ramp to Big Data
§ Clear Business Value Now
§ Replace obsolete Operational Data Stores (ODSs)
§ Existing use cases – not just a science project
§ Hadoop RDBMS – inexpensive to store data
§ Incremental On-Ramp to Big Data
§ Start with structured data and then expand to unstructured
§ Add schema when needed
19. Getting Started with Hadoop: Operational Data Lake
Rich Reimer, VP, Product Management
rreimer@splicemachine.com