Are You Ready? Stepping Up To The Big Data Challenge In 2016 - Learn why Testing is pivotal to the success of your Big Data Strategy.
According to a new report by analyst firm IDG, 70% of enterprises have either deployed or are planning to deploy big data projects and programs this year due to the increase in the amount of data they need to manage.
The growing variety of new data sources is pushing organizations to look for streamlined ways to manage complexities and get the most out of their data-related investments. The companies that do this correctly are realizing the power of big data for business expansion and growth.
Learn why testing your enterprise's data is pivotal for success with big data and Hadoop. Learn how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data - all with one data testing tool.
Testing Big Data: Automated Testing of Hadoop with QuerySurge
1. Testing Big Data:
Automated ETL Testing of Hadoop
Bill Hayduk
CEO/President
RTTS
Jeff Bocarsly, Ph.D.
Chief Architect
QuerySurge Division, RTTS
built by
QuerySurge™
Automate your
Data Warehouse & Big Data Testing
and Reap the Benefits
2. Today’s Agenda
• About Big Data and Hadoop
• Data Warehouse refresher
• Hadoop and DWH Use Case
• How to test Big Data
• Demo of QuerySurge & Hadoop
AGENDA
Topic: Testing Big Data:
Automated ETL Testing of Hadoop
Host: RTTS
Date: Thursday, January 30, 2014
Time: 1:00 pm, Eastern Standard Time (New York, GMT-05:00)
Session number: 630 771 732
3. About RTTS
RTTS is the leading provider of software & data quality for critical business systems
FACTS
Founded: 1996
Locations: New York (HQ), Atlanta, Philadelphia, Phoenix
Strategic Partners: IBM, Microsoft, HP, Oracle, Teradata, HortonWorks, Cloudera, Amazon
Software: QuerySurge
4. Facebook handles 300 million photos a day and about 105 terabytes of data every 30 minutes.
- TechCrunch
The big data market will grow from $3.2 billion in 2010 to $32.4 billion in 2017.
- Research Firm IDC
65% of…advanced analytics will have Hadoop embedded (in them) by 2015.
- Gartner
5. The Executive Office and Big Data
CxOs are using Business Intelligence & Analytics to make critical business decisions – with the assumption that the underlying data is fine.
“The average organization loses $8.2 million annually through poor Data Quality.”
- Gartner
[Diagram: Data Architecture, including ETL and Business Intelligence (BI) software, with potential problem areas marked]
6. About Big Data
Big data – defined as too much volume, velocity and variety to work on normal database architectures.
Size
Defined as 5 petabytes or more
1 petabyte = 1,000 terabytes
1,000 terabytes = 1,000,000 gigabytes
1,000,000 gigabytes = 1,000,000,000 megabytes
7. Big Data Impact
Walmart handles more than 1 million customer transactions every hour.
• data imported into databases that contain > 2.5 petabytes of data
• the equivalent of 167 times the information contained in all the books in the US Library of Congress.
Others:
Facebook handles 40 billion photos from its user base.
Google processes 1 Terabyte per hour
Twitter processes 85 million tweets per day
eBay processes 80 Terabytes per day
8. Big Data Solutions
Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times.
Technologies include:
• massively parallel processing (MPP) databases
• data warehouses
• data mining grids
• distributed file systems
• distributed databases
• cloud computing platforms
• the Internet, and
• scalable storage systems
9. What is Hadoop?
Hadoop is an open source project that develops software for scalable, distributed computing.
• easily deals with the complexities of high volume, velocity and variety of data
• is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
• scales up from single servers to 1,000’s of machines, each offering local computation and storage
• detects and handles failures at the application layer
10. Key Attributes of Hadoop
• Redundant and reliable
• Extremely powerful
• Easy to program distributed apps
• Runs on commodity hardware
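To make "easy to program distributed apps" a little more concrete, here is a minimal single-process sketch of the map / shuffle / reduce pattern commonly used with Hadoop. It is plain Python, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are illustrative assumptions, and nothing here actually runs on a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    # Hypothetical driver: on a real cluster, each line (or block) could be
    # mapped on a different machine; here everything runs in one process.
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    sample = ["big data needs testing", "testing big data"]
    print(word_count(sample))
```

The point of the pattern is that the programmer writes only the small map and reduce functions; distributing the work and recovering from failed machines is left to the framework.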
11. Top Vendors
“Spending on Hadoop software and subscriptions will increase to approximately $677 million by the end of 2017, with overall big data market anticipated to reach the $50 billion mark.”
- Wikibon
13. Basic Hadoop Architecture (continued)
Cluster
Add more machines for scaling – from 1 to 100 to 1,000
Job Tracker accepts jobs, assigns tasks, identifies failed machines
Name Node
Coordination for HDFS. Inserts and extraction are communicated through the Name Node.
[Diagram: Name Node plus twelve cluster machines, each running a Task Tracker and a Data Node]
16. about Data Warehouses…
Data Warehouse
• typically a relational database that is designed for query and analysis rather than for transaction processing
• a place where historical data is stored for archival, analysis & security purposes.
• contains either raw or formatted data
• combines data from multiple sources:
o sales
o salaries
o operational data
o human resource data
o inventory data
o web logs
o social networks
o internet text and docs
o other
17. Data Warehouse: the ETL process
ETL: Extract, Transform, Load
Why ETL?
The data warehouse needs to be loaded regularly (daily/weekly) so that it can serve its purpose of facilitating business analysis.
Extract – data is extracted from one or more OLTP systems and copied into the warehouse
Transform – removing inconsistencies, assembling to a common format, adding missing fields, summarizing detailed data and deriving new fields to store calculated data
Load – mapping the data and loading it into the DW
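The three ETL steps can be sketched in a few lines. This is a minimal illustration only, assuming in-memory SQLite databases stand in for the OLTP source and the warehouse; the table names (`orders`, `fact_orders`), the columns and the "large order" rule are hypothetical and not taken from the webinar.

```python
import sqlite3


def extract(source):
    """Extract: read raw rows from the (hypothetical) OLTP orders table."""
    return source.execute("SELECT id, customer, amount FROM orders").fetchall()


def transform(rows):
    """Transform: use a common name format, fill missing amounts, derive a new field."""
    result = []
    for order_id, customer, amount in rows:
        amount = 0.0 if amount is None else float(amount)
        is_large = int(amount >= 100.0)  # derived field; the threshold is an assumption
        result.append((order_id, customer.strip().upper(), amount, is_large))
    return result


def load(warehouse, rows):
    """Load: map the transformed rows onto the warehouse table."""
    warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", rows)
    warehouse.commit()


def run_etl():
    # Stand-in OLTP source with one untidy name and one missing amount.
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
    source.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, " alice ", 120.0), (2, "Bob", None)],
    )
    # Stand-in data warehouse.
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute(
        "CREATE TABLE fact_orders (id INTEGER, customer TEXT, amount REAL, is_large INTEGER)"
    )
    load(warehouse, transform(extract(source)))
    return warehouse.execute("SELECT * FROM fact_orders ORDER BY id").fetchall()


if __name__ == "__main__":
    print(run_etl())
```

Every rule in `transform` is a place where data can silently go wrong, which is why the rest of the deck argues for testing the data on both sides of the ETL step.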
18. Data Warehouse: the Marketplace
“The data warehousing market will see a compound annual growth rate of 11.5% …to reach a total of $13.2 billion in revenue.”
- consulting specialist The 451 Group
Data Warehouse size
Small data warehouses: < 5 TB
Midsize data warehouses: 5 TB - 20 TB
Large data warehouses: >20 TB
- Analyst firm Gartner
Leaders in Data Warehouse Data Management Systems
- Analyst firm Gartner’s ‘Magic Quadrant for Data Warehouse Database Management Systems’
22. Use Case #1: Data Warehouse & Hadoop
USE CASE 1***
Use Hadoop as a landing zone for big data & raw data
1) bring all raw, big data into Hadoop
2) perform some pre-processing of this data
3) determine which data goes to Data Warehouse
4) Extract, transform and load (ETL) pertinent data into Data Warehouse
***Source: Vijay Ramaiah, IBM product manager, datanami magazine, June 10, 2013
23. Use Case #1: Data Warehouse & Hadoop
Recommended functional test strategy: Test every entry point in the system (feeds, databases, internal messaging, front-end transactions).
The goal: provide rapid localization of data issues between points
[Diagram: Source Data, Source Hadoop, ETL Process, Target DWH, ETL and Business Intelligence software, with test entry points marked]
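One way to picture "rapid localization of data issues between points" is to fingerprint the data at each entry point and report the first hop where the fingerprints stop matching. The sketch below is a simplified assumption, not the method shown in the webinar: it supposes the same rows should be visible at every point (in practice, expected transformations would be applied before comparing), and the names `fingerprint` and `first_divergence` and the stage data are hypothetical.

```python
import hashlib


def fingerprint(rows):
    """Return (row count, order-independent digest) for a collection of rows."""
    digests = sorted(hashlib.sha256(repr(row).encode("utf-8")).hexdigest() for row in rows)
    combined = hashlib.sha256("".join(digests).encode("utf-8")).hexdigest()
    return len(digests), combined


def first_divergence(stages):
    """Given ordered (name, rows) stages, return the first adjacent pair whose data differs."""
    for (prev_name, prev_rows), (next_name, next_rows) in zip(stages, stages[1:]):
        if fingerprint(prev_rows) != fingerprint(next_rows):
            return prev_name, next_name
    return None  # every hop matched


if __name__ == "__main__":
    stages = [
        ("source", [(1, "a"), (2, "b"), (3, "c")]),
        ("hadoop", [(3, "c"), (1, "a"), (2, "b")]),  # same rows, different order
        ("dwh", [(1, "a"), (2, "b")]),               # one row lost in the ETL hop
    ]
    print(first_divergence(stages))
```

With a check at every entry point, a failure points to one specific hop rather than to the pipeline as a whole.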
24. Use Case #2:
MongoDB, Hadoop, DWH &
[Diagram: Source Data, Ingestion, Relational DB & Data Warehousing, and BI, Analytics & Reporting, with five test entry points marked]
25. Testing Big Data: 3 Big Issues
- We need to verify more data and to do it faster
- We need to automate the testing effort
- We need to be able to test across different platforms
We need a testing tool!
27. What is QuerySurge™?
the collaborative Big Data Testing solution that finds bad data & provides a holistic view of your data’s health
28. the QuerySurge advantage
Automate the entire testing cycle
Automate kickoff, tests, comparison, auto-emailed results
Create Tests easily with no SQL programming
ensures minimal time & effort to create tests / obtain results
Test across different platforms
Hadoop, data warehouses, NoSQL, database, flat file, XML
Collaborate with team
Data Health dashboard, shared tests & auto-emailed reports
Verify more data & do it quickly
verifies up to 100% of all data up to 1,000 x faster
Integrate for Continuous Delivery
Integrates with most Build, ETL & QA management software
31. QuerySurge™ Modules
Fast and Easy.
No programming needed.
• Perform 80% of all data tests - no SQL coding needed
• Opens up testing to novices & non-technical team members
• Speeds up testing for skilled SQL coders
• Provides a huge Return-On-Investment
32. QuerySurge™ Modules
Design Library
• Create Query Pairs (source & target SQLs)
• Great for team members skilled with SQL
Scheduling
• Build groups of Query Pairs
• Schedule Test Runs
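The idea of a Query Pair, one SQL statement against the source and one against the target with the results compared, can be sketched as follows. This is an illustration of the concept only and not QuerySurge's implementation or API: it assumes in-memory SQLite databases, and the function `run_query_pair` and the tables `sales` and `fact_sales` are hypothetical.

```python
import sqlite3


def run_query_pair(source_conn, source_sql, target_conn, target_sql):
    """Run both queries; return (rows missing from target, rows unexpected in target).

    Note: comparing as sets ignores duplicate rows, which a real tool would need to handle.
    """
    source_rows = set(source_conn.execute(source_sql).fetchall())
    target_rows = set(target_conn.execute(target_sql).fetchall())
    return sorted(source_rows - target_rows), sorted(target_rows - source_rows)


def demo():
    # Stand-in source system.
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
    source.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 10.0), (2, 20.0), (3, 30.0)])
    # Stand-in target warehouse with one altered and one missing row.
    target = sqlite3.connect(":memory:")
    target.execute("CREATE TABLE fact_sales (sale_id INTEGER, amount REAL)")
    target.executemany("INSERT INTO fact_sales VALUES (?, ?)", [(1, 10.0), (2, 25.0)])
    return run_query_pair(
        source, "SELECT id, amount FROM sales",
        target, "SELECT sale_id, amount FROM fact_sales",
    )


if __name__ == "__main__":
    missing, unexpected = demo()
    print("missing from target:", missing)
    print("unexpected in target:", unexpected)
```

Because the comparison covers every returned row rather than a sample, a pair like this checks the full data set selected by the two queries.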
33. QuerySurge™ Modules
Deep-Dive Reporting
Examine and automatically email test results
Run Dashboard
View real-time execution
Analyze real-time results
34. QuerySurge™ Modules
• view data reliability & pass rate
• add, move, filter, zoom-in on any data widget & underlying data
• verify build success or failure
35. Free Trials & Training
(1) Trial in the Cloud of QuerySurge™, including self-learning tutorial that works with sample data for 3 days
(2) Downloaded Trial of QuerySurge™, including self-learning tutorial with sample data or your data for 15 days
For more information on our Trials, please visit:
www.querysurge.com/compare-trial-options
Big Data Testing Courses
Filled with examples and labs, this hands-on training teaches concepts and HQL techniques used in Big Data testing.
For more information on our Big Data Testing classes, please visit:
http://www.rttsweb.com/training/courses/big-data-testing-courses
36. a last word about Hadoop…
Big Data and Hadoop are on the verge of revolutionizing enterprise data management architectures.
- DeZyre
To see the video of this webinar please visit:
http://www.querysurge.com/solutions/testing-big-data/big-data-testing-for-hadoop
Editor's notes
Largest known cluster is 4500 nodes
Designing and maintaining the ETL process is often considered one of the most difficult and resource-intensive portions of a data warehouse project.
Many data warehousing projects use ETL tools to manage this process. Other data warehouse builders create their own ETL tools and processes, either inside or outside the database.
Besides the support of extraction, transformation, and loading, there are some other tasks that are important for a successful ETL implementation as part of the daily operations of the data warehouse and its support for further enhancements.
Informatica’s software is the premier product used for ETL, but it was not mentioned in Gartner’s report because Informatica does not have DW software.
QuerySurge provides insight into the health of your data throughout your organization through BI dashboards and reporting at your fingertips. It is a collaborative tool that can be used by distributed teams across your organization, and it provides a sharable, holistic view of your data’s health and of your organization’s data management maturity.
Your distributed team from around the world can use any of these web browsers: Internet Explorer, Chrome, Firefox and Safari.
Installs on operating systems: Windows & Linux.
QS connects to any JDBC-compliant data source, even if it is not listed here.