3. Joey D’Antoni
Joey has over 15 years of experience with a wide variety of data platforms, in
both Fortune 50 companies and smaller organizations
He is a frequent speaker on database administration, big data, and career
management
He is the co-president of the Philadelphia SQL Server Users Group
He wants you to make sure you can restore your data
4. Agenda
• Data Warehouses—how did we get here?
• Big Data—Hadoop and more
• Modern Analytic Tools
• Building Our New Architecture
5. Data Warehouses—A History
• Data Warehousing had its origins in
the 1970s—A.C. Nielsen provided
clients with data marts
• In 1988—Barry Devlin and Paul
Murphy (IBM) published "An
Architecture for a Business and
Information System"
• In 1996—Ralph Kimball published
"The Data Warehouse Toolkit," which
showcased dimensional (OLAP-style)
modeling
6. Data Warehouse Models
• Star Schema
• Advantage is that the DW is easier
to use
• Facts and dimensions allow queries
to perform faster
• Loading and ETL become more
complicated
• Structure changes are very
expensive
Dimensional Model
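The star schema above can be sketched in a few lines. This is a minimal, illustrative example (the table and column names are made up, not from the slides): one fact table of sales surrounded by product and date dimensions, queried with the simple join-and-aggregate pattern the slide describes.

```python
import sqlite3

# A minimal star schema: one fact table surrounded by dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales  (product_key INTEGER, date_key INTEGER, amount REAL);

INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware');
INSERT INTO dim_date    VALUES (20140101, 2014, 1), (20140201, 2014, 2);
INSERT INTO fact_sales  VALUES (1, 20140101, 100.0), (2, 20140101, 50.0), (1, 20140201, 75.0);
""")

# Queries are easy to write: join the fact to its dimensions and aggregate.
rows = con.execute("""
SELECT d.year, p.category, SUM(f.amount)
FROM fact_sales f
JOIN dim_product p ON p.product_key = f.product_key
JOIN dim_date    d ON d.date_key    = f.date_key
GROUP BY d.year, p.category
""").fetchall()
print(rows)  # [(2014, 'Hardware', 225.0)]
```

The flip side, as the slide notes, is the load path: the ETL that populates surrogate keys and conformed dimensions is where the complexity moves.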
7. Data Warehouse Model
• Tables are grouped by subject area
(consumer, finance, products)
• Tables are linked by joins
• Very easy to add information into
the database
• Queries are harder to write, and
joins can be very expensive
performance-wise
Normalization
10. Extract, Transform, Load (ETL) Process
(Diagram: "Some Database" feeding "Some Process Your Business Doesn't
Care About"—a tongue-in-cheek view of ETL)
Credit—Buck Woody, Microsoft
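Joking aside, the extract-transform-load pattern itself is simple. Here is a minimal sketch (source data, field names, and cleaning rules are all invented for illustration): extract rows from a source, coerce and normalize them, drop rows that fail validation, and load the survivors into a target.

```python
# Illustrative source data, standing in for an OLTP database or flat files.
source_rows = [
    {"id": 1, "amount": "100.50", "region": "east"},
    {"id": 2, "amount": "n/a",    "region": "WEST"},   # dirty row
    {"id": 3, "amount": "75.25",  "region": "West"},
]

def extract():
    # In a real pipeline this would read from the source system.
    return source_rows

def transform(rows):
    out = []
    for r in rows:
        try:
            amount = float(r["amount"])            # coerce types
        except ValueError:
            continue                               # drop rows that fail validation
        out.append({"id": r["id"], "amount": amount,
                    "region": r["region"].title()})  # normalize values
    return out

target = []
def load(rows):
    target.extend(rows)        # stand-in for an insert into the warehouse

load(transform(extract()))
print(target)
```

Real ETL adds surrogate-key lookups, slowly changing dimensions, and error handling, but the extract/transform/load shape stays the same.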
11. Performance and Scalability
Given the volume of data,
DW queries can be very slow
We use techniques like data
compression to make them faster
CPU used to be the bottleneck—
now it tends to be storage
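Why compression helps when storage is the bottleneck: warehouse columns are often low-cardinality and highly repetitive, so they compress dramatically, trading cheap CPU for scarce storage bandwidth. A quick illustration with made-up data:

```python
import zlib

# A low-cardinality "column": the same category value repeated many times,
# as is typical in warehouse fact and dimension data.
column = ("Hardware\n" * 10000).encode()
compressed = zlib.compress(column)

# The compressed form is a small fraction of the original, so far fewer
# bytes have to cross the storage subsystem to answer a scan.
print(len(column), len(compressed))
```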
12. Costs
Data Warehouses need large
servers
Database systems are licensed
by the size of the server (per core)
Data Warehouses need a whole
lot of fast storage
Large volumes of fast storage
(SANs) are expensive
15. Common Technical Themes
There are a lot of "big data" solutions, but most of
them have a lot in common
• Built in HA/DR through multiple copies of the data
• Designed for analytics processing more than OLTP
• Derived from Open Source solutions
• Designed around local storage and commodity hardware
16. Components Of Modern Architecture
Hadoop
• (and its ecosystem)
EDW
Analytics Engine
Visualization Engine
17. Big Data Workflow for Combined Data and Analytics
(Diagram: stages Acquire → Organize → Analyze → Decide)
• Data (structured, semi-structured, unstructured): master and
reference data, transactions, machine-generated data (logs), web, and
text, image, audio, and video
• Acquire: DBMS (OLTP), files, NoSQL (key-value data stores), HDFS
• Organize: ETL/ELT, change data capture, real-time message-based
ingestion, Hadoop MapReduce, ODS, data warehouse
• Analyze: streaming (CEP engine), in-database analytics, in-memory
analytics
• Analytics: reporting and dashboards, alerting and recommendations,
EPM and social apps, text analytics and search, advanced analytics,
interactive discovery
• Hardware: big data cluster, high-speed network, RDBMS cluster
Source—Gartner, Credit Suisse, 8/12
20. Costs—Big Data versus Data Warehouse
(Chart: Hadoop versus Data Warehouse costs for server, storage,
licensing, and total, on a scale of $0 to $350,000)
• For the same cost you can build a
15-node Hadoop cluster
• The Hadoop cluster would have
3840 GB of RAM versus the 1024 GB
in the DW server
22. Hadoop
Hadoop is the leading Big Data platform
(ecosystem)
Created by Doug Cutting and Mike Cafarella,
and developed heavily at Yahoo
• Scales horizontally (2-socket x86 servers in
massive clusters)
• Uses big, slow, local storage
• Extremely fault-tolerant
• In a nutshell—it's a distributed file system (3
copies of data in the cluster) and a programming
framework called MapReduce
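The 3-copy replication has a direct capacity consequence: raw cluster disk must be roughly three times the logical data size, before working space. A quick back-of-the-envelope calculation (the numbers are illustrative):

```python
# HDFS defaults to keeping 3 copies of every block for fault tolerance,
# so raw cluster capacity must be ~3x the logical data size.
logical_tb = 100            # data you actually want to store (illustrative)
replication = 3             # HDFS default replication factor
raw_tb = logical_tb * replication
print(raw_tb)               # 300 TB of raw disk needed
```

This is why Hadoop leans on big, slow, cheap local disks rather than SAN storage: tripling expensive storage would erase the cost advantage.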
23. Introducing Hadoop
(Diagram: a six-host cluster)
• Host 1: NameNode
• Host 2: Secondary NameNode
• Hosts 3–6: DataNodes
24. How MapReduce Works
• Automatic parallelism
• Fault tolerance
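The classic MapReduce example is word count. The sketch below runs all three phases locally in plain Python to show the shape of the model; in a real Hadoop job the framework would distribute the map tasks across DataNodes and shuffle intermediate pairs to reducers, which is where the automatic parallelism and fault tolerance come from.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Emit a (word, 1) pair for every word in one input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group intermediate pairs by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine all values seen for one key.
    return (key, sum(values))

lines = ["big data big clusters", "big data"]
pairs = chain.from_iterable(map_phase(l) for l in lines)  # parallel in Hadoop
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)   # {'big': 3, 'data': 2, 'clusters': 1}
```

Because each map call sees only its own line and each reduce call only its own key, the framework can rerun any failed task on another node without affecting the result.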
31. Hadoop Vendors
Hadoop distributions:
• Apache: completely open source software for distributed clusters
and map/reduce
• Cloudera: industry-leading commercial distribution with good
management tools
• Hortonworks: open source distribution, Apache compatible
• MapR: multiple enhancements to Apache Hadoop (rewrite of HDFS),
high performance, enterprise ready
• Pivotal HD: EMC spinoff with strong financial backing; a full
high-performance RDBMS (with BI connectors) on top of Hadoop
32. Cloud vs On-Premises
Cloud
• Short-term use
• Rapid scale
• Test use cases
• Pay as you go
• Internet data sources
On-Premises
• Large, long-term implementations
• Well-known workloads
• Shared clusters
• Large initial investment
42. Session Evaluations
Submit by 5pm Friday May 9 to WIN prizes
Your feedback is important and valuable.
Ways to access:
• Go to passbac2014/evals
• Download the PASS Event App from your app store and search:
PASS BAC 2014
• Follow the QR code link displayed on session signage throughout
the conference venue and in the program guide
43. Thank You
Thank you for attending this session and
the PASS Business Analytics Conference 2014
May 7-9, 2014 | San Jose, CA