2. www.luxoft.com
Agenda
1 Big Data – what is it
2 Hadoop vs RDBMS – pros and cons
3 Hadoop & Enterprise architecture
4 Hadoop as ETL engine
5 Case Studies
Current state
Big data is an all-encompassing term for any collection of data sets so large and
complex that it becomes difficult to process using traditional data processing
applications.
Limitations & Problems
Big data is difficult to work with using
most relational databases, requiring
instead massively parallel software
running on tens, hundreds, or even
thousands of servers
eBay.com uses two data warehouses at 7.5 petabytes
Walmart handles more than 1 million customer
transactions every hour
Facebook handles 50 billion photos from its user base
In 2012, the Obama administration announced the Big
Data Research and Development Initiative
CORE HADOOP - MapReduce
In 2004, Google published a paper on a process called MapReduce
DISTRIBUTED COMPUTING FRAMEWORK
Process large jobs in parallel across many nodes and combine the results
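The idea of processing a job in parallel and combining the results can be sketched as a word count. This is a single-process illustration in plain Python, not the Hadoop API: the function names (map_phase, shuffle, reduce_phase, run_job) are hypothetical, and on a real cluster the map and reduce calls would run on different nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    # Each line could be mapped on a different node; here they are mapped in turn.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data big cluster", "data node"]))
```

Because each map call sees only its own line and each reduce call sees only one key, both phases can be spread across many machines without coordination beyond the shuffle step.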
Hadoop Structure
HDFS is a distributed file system designed to run on commodity hardware
HBase stores data rows in labelled tables (a sortable key and an arbitrary number of columns)
Hive provides data summarization, query, and analysis (SQL-like interface)
Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs
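The HBase data model mentioned above (a sortable key plus an arbitrary number of columns per row) can be pictured with a small in-memory sketch. This is a toy model for illustration only, not the HBase client API; the class SketchTable and the "family:qualifier" column names used with it are assumptions.

```python
from bisect import bisect_left


class SketchTable:
    """Toy table: rows kept sorted by key, each row holding free-form columns."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> {column name: value}

    def put(self, row_key, column, value):
        """Store one cell; rows may have any set of columns."""
        if row_key not in self._rows:
            self._keys.insert(bisect_left(self._keys, row_key), row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Return a copy of all columns stored for one row key."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start, stop):
        """Yield (key, columns) for start <= key < stop, in key order."""
        i = bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < stop:
            key = self._keys[i]
            yield key, dict(self._rows[key])
            i += 1


if __name__ == "__main__":
    table = SketchTable()
    table.put("user#002", "info:name", "Bob")
    table.put("user#001", "info:name", "Ann")
    table.put("user#001", "info:city", "Kyiv")
    for key, columns in table.scan("user#", "user#~"):
        print(key, columns)
```

Keeping keys sorted is what makes range scans cheap, which is why row-key design matters so much in this kind of store.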
Hadoop vs RDBMS
Hadoop:
Schema-less model
Human query optimization
Ability to create complex dataflows with multiple inputs and outputs
Parallelization of many analytic functions
RDBMS:
Performance for relational data
Machine query optimization
Mature workload management
High-concurrency interactive query processing
How might this change in the future
Query optimization improvements in Hive
– Statistics, better join ordering, more join types, etc.
Startup time improvements
– Simpler query plans to pass out
Runtime performance improvements
Case Study 1
Hadoop as ETL Data Quality tool
BENEFITS
Reduced TCO (commodity hardware usage)
Traceability of all the data quality issues
Hadoop becomes a data cleansing tool.
PROBLEM
Traditional tools show poor performance in exception handling
and data cleansing.
SOLUTION
Hadoop transforms the data into a single format and
processes it using data cleansing workflows.
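The approach in this case study (convert every source to one format, run cleansing rules, keep a trace of each quality issue) can be sketched as follows. This is a minimal single-machine illustration, not the project's actual workflow: the source names ("crm", "billing"), the field names, and the two rules are all hypothetical.

```python
from datetime import datetime


def to_common_format(record, source):
    """Map a source-specific record onto one shared layout (hypothetical fields)."""
    if source == "crm":
        return {"id": record["customer_id"], "email": record["mail"], "joined": record["since"]}
    if source == "billing":
        return {"id": record["id"], "email": record["email"], "joined": record["created"]}
    raise ValueError(f"unknown source: {source}")


def cleanse(record):
    """Apply cleansing rules; return the cleaned record and the issues found."""
    issues = []
    cleaned = dict(record)

    # Rule 1: emails are trimmed and lower-cased; values without "@" are rejected.
    email = (cleaned["email"] or "").strip().lower()
    if "@" not in email:
        issues.append("invalid_email")
        email = None
    cleaned["email"] = email

    # Rule 2: dates must be in YYYY-MM-DD form; anything else is rejected.
    try:
        cleaned["joined"] = datetime.strptime(cleaned["joined"], "%Y-%m-%d").date().isoformat()
    except (TypeError, ValueError):
        issues.append("invalid_date")
        cleaned["joined"] = None

    return cleaned, issues


def run_workflow(inputs):
    """inputs: iterable of (source, record). Returns clean rows and an issue log."""
    clean_rows, issue_log = [], []
    for source, raw in inputs:
        cleaned, issues = cleanse(to_common_format(raw, source))
        clean_rows.append(cleaned)
        for issue in issues:
            # Every issue is logged with its origin, which gives the traceability.
            issue_log.append({"source": source, "id": cleaned["id"], "issue": issue})
    return clean_rows, issue_log
```

Since each record is cleansed independently, the same per-record functions could be run as the map step of a distributed job, with the issue log written as a second output.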
Case Study 2
Know Your Customer PoC
Business Challenge
• Knowing the actual customer
reaction to products is essential
for business growth, but it is
difficult to get valuable insights.
Social media is the place where
customers really share their
opinions.
SOLUTION
A Hadoop-based analysis tool that
provides the ability to:
• Find events in client
streams and identify the needed
reaction
• Propose a product to a client
based on their interests
Case Study 3
Enterprise ETL & Hadoop Integration
Goals:
MapReduce ETL job development without coding
Build, re-use, and check impact analysis with enhanced metadata capabilities
A Windows-based graphical development environment
Comprehensive built-in transformations
A library of Use Case Accelerators to fast-track Hadoop productivity
Summary
Big Data:
Cutting edge of DI technologies
State-of-the-art design approaches
A bit more than simple development: it is something of an art, the art of data management