3. Joey D’Antoni
Joey has over 15 years of experience with a wide variety of data platforms, in
both Fortune 50 companies and smaller organizations
He is a frequent speaker on database administration, big data, and career
management
He is the co-president of the Philadelphia SQL Server Users Group
He wants you to make sure you can restore your data
4. Agenda
• Data Warehouses—how did we get here?
• Big Data—Hadoop and more
• Modern Analytic Tools
• Building Our New Architecture
5. Data Warehouses—A History
• Data Warehousing had its origins in
the 1970s—A.C. Nielsen provided
clients with data marts
• In 1988—Barry Devlin and Paul
Murphy (IBM) published "An
Architecture for a Business and
Information System"
• In 1996—Ralph Kimball published
"The Data Warehouse Toolkit," which
showcased dimensional (OLAP-style)
modeling
6. Data Warehouse Models
• Star Schema
• Advantage is that the DW is easier
to use
• Facts and dimensions allow queries
to perform faster
• Loading and ETL become more
complicated
• Structure changes are very
expensive
Dimensional Model
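The star schema above can be sketched in a few lines. This is a minimal, illustrative example (the table and column names are made up, not from the slides): one fact table of sales surrounded by product and date dimensions, queried with the simple join-and-aggregate pattern the slide describes.

```python
import sqlite3

# A minimal star schema: one fact table surrounded by dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales  (product_key INTEGER, date_key INTEGER, amount REAL);

INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware');
INSERT INTO dim_date    VALUES (20140101, 2014, 1), (20140201, 2014, 2);
INSERT INTO fact_sales  VALUES (1, 20140101, 100.0), (2, 20140101, 50.0), (1, 20140201, 75.0);
""")

# Queries are easy to write: join the fact to its dimensions and aggregate.
rows = con.execute("""
SELECT d.year, p.category, SUM(f.amount)
FROM fact_sales f
JOIN dim_product p ON p.product_key = f.product_key
JOIN dim_date    d ON d.date_key    = f.date_key
GROUP BY d.year, p.category
""").fetchall()
print(rows)  # [(2014, 'Hardware', 225.0)]
```

The flip side, as the slide notes, is the load path: the ETL that populates surrogate keys and conformed dimensions is where the complexity moves.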
7. Data Warehouse Model
• Tables are grouped by subject area
(consumer, finance, products)
• Tables are linked by joins
• Very easy to add information into
the database
• Queries are harder to write, and
joins can be very expensive
performance-wise
Normalization
10. Extract, Transform, Load (ETL) Process
(Diagram: "Some Database" feeding "Some Process Your Business Doesn't
Care About"—a tongue-in-cheek view of ETL)
Credit—Buck Woody, Microsoft
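Joking aside, the extract-transform-load pattern itself is simple. Here is a minimal sketch (source data, field names, and cleaning rules are all invented for illustration): extract rows from a source, coerce and normalize them, drop rows that fail validation, and load the survivors into a target.

```python
# Illustrative source data, standing in for an OLTP database or flat files.
source_rows = [
    {"id": 1, "amount": "100.50", "region": "east"},
    {"id": 2, "amount": "n/a",    "region": "WEST"},   # dirty row
    {"id": 3, "amount": "75.25",  "region": "West"},
]

def extract():
    # In a real pipeline this would read from the source system.
    return source_rows

def transform(rows):
    out = []
    for r in rows:
        try:
            amount = float(r["amount"])            # coerce types
        except ValueError:
            continue                               # drop rows that fail validation
        out.append({"id": r["id"], "amount": amount,
                    "region": r["region"].title()})  # normalize values
    return out

target = []
def load(rows):
    target.extend(rows)        # stand-in for an insert into the warehouse

load(transform(extract()))
print(target)
```

Real ETL adds surrogate-key lookups, slowly changing dimensions, and error handling, but the extract/transform/load shape stays the same.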
11. Performance and Scalability
Given the volume of data,
DW queries can be very slow
We use techniques like data
compression to make them faster
CPU used to be the bottleneck—
now it tends to be storage
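Why compression helps when storage is the bottleneck: warehouse columns are often low-cardinality and highly repetitive, so they compress dramatically, trading cheap CPU for scarce storage bandwidth. A quick illustration with made-up data:

```python
import zlib

# A low-cardinality "column": the same category value repeated many times,
# as is typical in warehouse fact and dimension data.
column = ("Hardware\n" * 10000).encode()
compressed = zlib.compress(column)

# The compressed form is a small fraction of the original, so far fewer
# bytes have to cross the storage subsystem to answer a scan.
print(len(column), len(compressed))
```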
12. Costs
Data Warehouses need large
servers
Database systems are licensed
by the size of the server (per core)
Data Warehouses need a whole
lot of fast storage
Large volumes of fast storage
(SANs) are expensive
15. Common Technical Themes
There are a lot of "big data" solutions, but most of
them have a lot in common
• Built in HA/DR through multiple copies of the data
• Designed for analytics processing more than OLTP
• Derived from Open Source solutions
• Designed around local storage and commodity hardware
16. Components Of Modern Architecture
Hadoop
• (and its ecosystem)
EDW
Analytics Engine
Visualization Engine
17. Big Data Workflow for Combined Data and Analytics
(Diagram: stages Acquire → Organize → Analyze → Decide)
• Data (structured, semi-structured, unstructured): master and
reference data, transactions, machine-generated data (logs), web, and
text, image, audio, and video
• Acquire: DBMS (OLTP), files, NoSQL (key-value data stores), HDFS
• Organize: ETL/ELT, change data capture, real-time message-based
ingestion, Hadoop MapReduce, ODS, data warehouse
• Analyze: streaming (CEP engine), in-database analytics, in-memory
analytics
• Analytics: reporting and dashboards, alerting and recommendations,
EPM and social apps, text analytics and search, advanced analytics,
interactive discovery
• Hardware: big data cluster, high-speed network, RDBMS cluster
Source—Gartner, Credit Suisse, 8/12
20. Costs—Big Data versus Data Warehouse
(Chart: Hadoop versus Data Warehouse costs for server, storage,
licensing, and total, on a scale of $0 to $350,000)
• For the same cost you can build a
15-node Hadoop cluster
• The Hadoop cluster would have
3840 GB of RAM versus the 1024 GB
in the DW server
22. Hadoop
Hadoop is the leading Big Data platform
(ecosystem)
Created by Doug Cutting and Mike Cafarella,
and developed heavily at Yahoo
• Scales horizontally (2-socket x86 servers in
massive clusters)
• Uses big, slow, local storage
• Extremely fault-tolerant
• In a nutshell—it's a distributed file system (3
copies of data in the cluster) and a programming
framework called MapReduce
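The 3-copy replication has a direct capacity consequence: raw cluster disk must be roughly three times the logical data size, before working space. A quick back-of-the-envelope calculation (the numbers are illustrative):

```python
# HDFS defaults to keeping 3 copies of every block for fault tolerance,
# so raw cluster capacity must be ~3x the logical data size.
logical_tb = 100            # data you actually want to store (illustrative)
replication = 3             # HDFS default replication factor
raw_tb = logical_tb * replication
print(raw_tb)               # 300 TB of raw disk needed
```

This is why Hadoop leans on big, slow, cheap local disks rather than SAN storage: tripling expensive storage would erase the cost advantage.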
23. Introducing Hadoop
(Diagram: a six-host cluster)
• Host 1: NameNode
• Host 2: Secondary NameNode
• Hosts 3–6: DataNodes
24. How MapReduce Works
• Automatic parallelism
• Fault tolerance
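The classic MapReduce example is word count. The sketch below runs all three phases locally in plain Python to show the shape of the model; in a real Hadoop job the framework would distribute the map tasks across DataNodes and shuffle intermediate pairs to reducers, which is where the automatic parallelism and fault tolerance come from.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Emit a (word, 1) pair for every word in one input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group intermediate pairs by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine all values seen for one key.
    return (key, sum(values))

lines = ["big data big clusters", "big data"]
pairs = chain.from_iterable(map_phase(l) for l in lines)  # parallel in Hadoop
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)   # {'big': 3, 'data': 2, 'clusters': 1}
```

Because each map call sees only its own line and each reduce call only its own key, the framework can rerun any failed task on another node without affecting the result.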
31. Hadoop Vendors
Hadoop distributions:
• Apache: completely open source software for distributed clusters
and map/reduce
• Cloudera: industry-leading commercial distribution with good
management tools
• Hortonworks: open source distribution, Apache compatible
• MapR: multiple enhancements to Apache Hadoop (rewrite of HDFS),
high performance, enterprise ready
• Pivotal HD: EMC spinoff with strong financial backing; a full
high-performance RDBMS (with BI connectors) on top of Hadoop
32. Cloud vs On-Premises
Cloud
• Short-term use
• Rapid scale
• Test use cases
• Pay as you go
• Internet data sources
On-Premises
• Large, long-term implementations
• Well-known workloads
• Shared clusters
• Large initial investment
42. Session Evaluations
Submit by 5pm Friday May 9 to WIN prizes
Your feedback is important and valuable.
Ways to access:
• Go to passbac2014/evals
• Download the PASS Event App from your app store and search:
PASS BAC 2014
• Follow the QR code link displayed on session signage throughout
the conference venue and in the program guide
43. Thank You
Thank you for attending this session and
the PASS Business Analytics Conference 2014
May 7-9, 2014 | San Jose, CA