Benchmarking 
Hadoop & Big Data benchmarking 
Dr. ir. ing. Bart Vandewoestyne 
Sizing Servers Lab, Howest, Kortrijk 
IWT TETRA User Group Meeting - November 28, 2014 
Outline 
1 Intro: Hadoop essentials 
2 Cloudera demo 
3 Benchmarks 
Micro Benchmarks 
BigBench 
4 Conclusions 
Intro: Hadoop essentials 
Hadoop 
Hadoop is VMware, but the other way around. 
Hadoop 1.0 
Source: Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2, Hortonworks, 2014.
MapReduce and HDFS are the 
core components, while other 
components are built around the 
core. 
Hadoop 2.0 
Source: Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2, Hortonworks, 2014.
YARN adds a more general 
interface to run non-MapReduce 
jobs within the Hadoop 
framework. 
HDFS 
Hadoop Distributed File System 
Source: http://www.cac.cornell.edu/vw/MapReduce/dfs.aspx 
MapReduce 
MapReduce = Programming Model 
WordCount example: 
Source: Optimizing Hadoop for MapReduce, Khaled Tannir 
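The WordCount figure itself did not survive, so the following is a minimal single-process sketch of the programming model in Python, not Hadoop code. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions; on a real cluster the framework performs the shuffle and distributes the map and reduce calls over the nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog", "the fox"]))
```

The point of the model is that `map_phase` and `reduce_phase` only see one record or one key at a time, which is what lets the framework run them in parallel.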
Hadoop distributions 
Cloudera demo 
HDFS 
NameNode and DataNodes 
Hosts and their roles 
NameNode WebUI 
NameNode WebUI address 
http://sandy-quad-1.sslab.lan:50070/ 
Replication factor 
HDFS Blocks 
Hue: file upload
Hadoop jobs: counters/metrics 
Benchmarks 
Why benchmark? 
My three reasons for using benchmarks: 
1 Evaluating the effect of a hardware/software upgrade:
OS, Java VM, ...
Hadoop, Cloudera CDH, Pig, Hive, Impala, ...
2 Debugging:
Compare with other clusters or published results.
3 Performance tuning:
E.g. the Cloudera CDH default config is defensive, not optimal.
Micro Benchmarks 
Hadoop: Available tests 
hadoop jar /some/path/to/hadoop-*test*.jar 
TestDFSIO 
Read and write test for HDFS. 
Helpful for
getting an idea of how fast your cluster is in terms of I/O,
stress testing HDFS,
discovering network performance bottlenecks,
shaking out the hardware, OS and Hadoop setup of your cluster
machines (particularly the NameNode and the DataNodes).
TestDFSIO: write test 
Generate 10 files of size 1 GB for a total of 10 GB:
$ hadoop jar hadoop-*test*.jar \
TestDFSIO -write -nrFiles 10 -fileSize 1000
TestDFSIO is designed to use 1 map task per file
(1:1 mapping from files to map tasks).
TestDFSIO: write test output 
Typical output of write test 
----- TestDFSIO ----- : write 
Date & time: Mon Oct 06 10:21:28 CEST 2014
Number of files: 10 
Total MBytes processed: 10000.0 
Throughput mb/sec: 12.874702111579893 
Average IO rate mb/sec: 13.013071060180664 
IO rate std deviation: 1.4416050051562712 
Test exec time sec: 114.346 
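When comparing many runs, it can help to turn such a report into a dictionary. The sketch below is an assumption about how one might do that in Python; `parse_testdfsio` and `SAMPLE` are hypothetical names, and the parser only relies on the `key: value` layout shown above.

```python
SAMPLE = """----- TestDFSIO ----- : write
Date & time: Mon Oct 06 10:21:28 CEST 2014
Number of files: 10
Total MBytes processed: 10000.0
Throughput mb/sec: 12.874702111579893
Average IO rate mb/sec: 13.013071060180664
IO rate std deviation: 1.4416050051562712
Test exec time sec: 114.346
"""


def parse_testdfsio(report):
    """Turn the 'key: value' lines of a TestDFSIO report into a dict."""
    result = {}
    for line in report.splitlines():
        # Skip the banner line and anything that is not 'key: value'.
        if line.startswith("-----") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        try:
            result[key] = float(value)
        except ValueError:
            result[key] = value  # e.g. the date, kept as text
    return result


if __name__ == "__main__":
    run = parse_testdfsio(SAMPLE)
    # Summed time of all map tasks, implied by the reported throughput.
    print(run["Total MBytes processed"] / run["Throughput mb/sec"])
```

For the run above, the last line works out to roughly 777 seconds of summed map-task time, which is much more than the 114 seconds of wall-clock time because the 10 map tasks run concurrently.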
Interpreting TestDFSIO results 
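Definition (Throughput)
\[
\mathrm{Throughput}(N) = \frac{\sum_{i=1}^{N} \mathrm{filesize}_i}{\sum_{i=1}^{N} \mathrm{time}_i}
\]
Definition (Average IO rate)
\[
\text{Average IO rate}(N) = \frac{\sum_{i=1}^{N} \mathrm{rate}_i}{N} = \frac{\sum_{i=1}^{N} \frac{\mathrm{filesize}_i}{\mathrm{time}_i}}{N}
\]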
Here, N is the number of map tasks. 
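To see how the two definitions differ, here is a small worked sketch in Python. The function names and the two-task example data are made up for illustration; only the formulas come from the definitions above.

```python
def throughput(sizes_mb, times_s):
    """Total data divided by the summed time of all map tasks."""
    return sum(sizes_mb) / sum(times_s)


def average_io_rate(sizes_mb, times_s):
    """Arithmetic mean of the per-task rates filesize_i / time_i."""
    rates = [size / time for size, time in zip(sizes_mb, times_s)]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    # Hypothetical run: two 1000 MB files, one fast task and one slow task.
    sizes = [1000.0, 1000.0]
    times = [10.0, 40.0]
    print(throughput(sizes, times))       # 2000 / 50 = 40.0
    print(average_io_rate(sizes, times))  # (100 + 25) / 2 = 62.5
```

With equal file sizes, throughput is the harmonic mean of the per-task rates and the average IO rate is their arithmetic mean, so the average IO rate is never smaller than the throughput. A large gap between the two indicates that some map tasks were much slower than others.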
TestDFSIO: read test 
Read 10 input files, each of size 1 GB:
$ hadoop jar hadoop-*test*.jar \
TestDFSIO -read -nrFiles 10 -fileSize 1000
TestDFSIO: read test output 
Typical output of read test 
----- TestDFSIO ----- : read 
Date & time: Mon Oct 06 10:56:15 CEST 2014
Number of files: 10 
Total MBytes processed: 10000.0 
Throughput mb/sec: 402.4306813151435 
Average IO rate mb/sec: 492.8257751464844 
IO rate std deviation: 196.51233829270575 
Test exec time sec: 33.206 
Influence of HDFS replication factor
When interpreting TestDFSIO results, keep in mind: 
The HDFS replication factor plays an important role! 
A higher replication factor leads to slower writes. 
For three identical TestDFSIO write runs (units are MB/s):

                   HDFS replication factor
                   1          2         3
Throughput         190        25        13
Average IO-rate    190 ± 10   25 ± 3    13 ± 1
TeraSort 
Goal 
Sort 1TB of data (or any other amount of data) as fast as possible. 
Probably the best-known Hadoop benchmark.
Combines testing the HDFS and MapReduce layers of a
Hadoop cluster.
Typical areas where TeraSort is helpful
Iron out your Hadoop configuration after your cluster has passed a
convincing TestDFSIO benchmark first.
Determine whether your MapReduce-related parameters are
set to proper values.
TeraSort: workflow
TeraGen -> /user/bart/terasort-input -> TeraSort -> /user/bart/terasort-output -> TeraValidate -> /user/bart/terasort-validate
hadoop jar hadoop-mapreduce-examples.jar \
teragen 10000000000 /user/bart/input
(about 4 hours on our 4-node cluster)
hadoop jar hadoop-mapreduce-examples.jar \
terasort /user/bart/input /user/bart/output
(about 5 hours on our 4-node cluster)
hadoop jar hadoop-mapreduce-examples.jar \
teravalidate /user/bart/output /user/bart/validate
If something went wrong, TeraValidate's output contains the
problem report.
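The first argument to teragen is a number of rows, not a number of bytes. TeraGen writes fixed-size 100-byte rows, so the 10000000000 rows used above correspond to 1 TB. A small Python sketch of that arithmetic follows; the helper name `teragen_rows` is an illustrative assumption.

```python
ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_rows(target_bytes):
    """Number of rows to pass to teragen for a given data size in bytes."""
    return target_bytes // ROW_BYTES


if __name__ == "__main__":
    print(teragen_rows(10**12))  # 1 TB  -> 10000000000 rows
    print(teragen_rows(10**9))   # 1 GB  -> 10000000 rows
```

Starting with a smaller size such as 1 GB is a cheap way to check the whole TeraGen, TeraSort, TeraValidate chain before committing to a multi-hour run.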
TeraSort: duration 
TeraSort: counters 
NNBench 
Goal 
Load test the NameNode hardware and software. 
Generates a lot of HDFS-related requests with normally very 
small payloads. 
Purpose: put a high HDFS management stress on the 
NameNode. 
Can simulate requests for creating, reading, renaming and
deleting files on HDFS.
NNBench: example 
Create 1000 files using 12 maps and 6 reducers:
$ hadoop jar hadoop-*test*.jar nnbench \
-operation create_write \
-maps 12 \
-reduces 6 \
-blockSize 1 \
-bytesToWrite 0 \
-numberOfFiles 1000 \
-replicationFactorPerFile 3 \
-readFileAfterOpen true \
-baseDir /user/bart/NNBench-`hostname -s`
MRBench 
Goal 
Loop a small job a number of times. 
checks whether small job runs are responsive and running
efficiently on the cluster
complementary to TeraSort
puts its focus on the MapReduce layer 
impact on the HDFS layer is very limited 
MRBench: example 
Run a loop of 50 small test jobs: 
$ hadoop jar hadoop-*test*.jar \
mrbench -baseDir /user/bart/MRBench \
-numRuns 50
Example output:
DataLines  Maps  Reduces  AvgTime (milliseconds)
1          2     1        28822
-> The average finish time of the executed jobs was about 28.8 seconds.
BigBench 
Source: http://mhpersonaltrainer.mhpersonaltrainer.com/mhpersonaltrainer/56616/index 
BigBench 
Big Data benchmark based on TPC-DS. 
Focus is mostly on MapReduce engines. 
Collaboration between industry and academia. 
https://github.com/intel-hadoop/Big-Bench/ 
History 
Launched at First Workshop on Big Data Benchmarking 
(May 8-9, 2012). 
Full kit at Fifth Workshop on Big Data Benchmarking 
(August 5-6, 2014). 
BigBench data model 
Source: BigBench: Towards an Industry Standard Benchmark for Big Data Analytics, Ghazal et al., 2013. 
BigBench: Data Model - 3 V's 
Variety 
BigBench data is 
structured, 
semi-structured, 
unstructured. 
Velocity 
Periodic refreshes for all data. 
Different velocity for different areas:
V_structured < V_unstructured < V_semistructured
Volume 
TPC-DS: discrete scale factors 
(100, 300, 1000, 3000, 10000, 30000 and 100000).
BigBench: continuous scale factor. 
BigBench: Workload 
Workload queries 
30 queries 
Speci

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 

Hadoop Benchmarking Guide

  • 1. Benchmarking Hadoop & Big Data benchmarking Dr. ir. ing. Bart Vandewoestyne Sizing Servers Lab, Howest, Kortrijk IWT TETRA User Group Meeting - November 28, 2014 1 / 62
  • 2. Benchmarking Outline 1 Intro: Hadoop essentials 2 Cloudera demo 3 Benchmarks Micro Benchmarks BigBench 4 Conclusions 2 / 62
  • 3. Benchmarking Intro: Hadoop essentials Outline 1 Intro: Hadoop essentials 2 Cloudera demo 3 Benchmarks Micro Benchmarks BigBench 4 Conclusions 3 / 62
  • 4. Benchmarking Intro: Hadoop essentials Hadoop Hadoop is VMware, but the other way around. 4 / 62
  • 5. Benchmarking Intro: Hadoop essentials Hadoop 1.0 Source: Apache Hadoop YARN : moving beyond MapReduce and batch processing with Apache Hadoop 2, Hortonworks, 2014) MapReduce and HDFS are the core components, while other components are built around the core. 5 / 62
  • 6. Benchmarking Intro: Hadoop essentials Hadoop 2.0 Source: Apache Hadoop YARN : moving beyond MapReduce and batch processing with Apache Hadoop 2, Hortonworks, 2014) YARN adds a more general interface to run non-MapReduce jobs within the Hadoop framework. 6 / 62
  • 7. Benchmarking Intro: Hadoop essentials HDFS Hadoop Distributed File System Source: http://www.cac.cornell.edu/vw/MapReduce/dfs.aspx 7 / 62
  • 8. Benchmarking Intro: Hadoop essentials MapReduce MapReduce = Programming Model WordCount example: Source: Optimizing Hadoop for MapReduce, Khaled Tannir 8 / 62
  • 9. Benchmarking Intro: Hadoop essentials Hadoop distributions 9 / 62
  • 10. Benchmarking Cloudera demo Outline 1 Intro: Hadoop essentials 2 Cloudera demo 3 Benchmarks Micro Benchmarks BigBench 4 Conclusions 10 / 62
  • 12. Benchmarking Cloudera demo NameNode and DataNodes 12 / 62
  • 13. Benchmarking Cloudera demo Hosts and their roles 13 / 62
  • 14. Benchmarking Cloudera demo NameNode WebUI NameNode WebUI address http://sandy-quad-1.sslab.lan:50070/ 14 / 62
  • 15. Benchmarking Cloudera demo Replication factor 15 / 62
  • 16. Benchmarking Cloudera demo HDFS Blocks 16 / 62
  • 18. Benchmarking Cloudera demo File upload 17 / 62
  • 19. Benchmarking Cloudera demo Hadoop jobs: counters/metrics 18 / 62
  • 20. Benchmarking Benchmarks Outline 1 Intro: Hadoop essentials 2 Cloudera demo 3 Benchmarks Micro Benchmarks BigBench 4 Conclusions 19 / 62
  • 21. Benchmarking Benchmarks Why benchmark? My three reasons for using benchmarks: 1 Evaluating the effect of a hardware/software upgrade: OS, Java VM, ... Hadoop, Cloudera CDH, Pig, Hive, Impala, ... 2 Debugging: Compare with other clusters or published results. 3 Performance tuning: E.g. the Cloudera CDH default config is defensive, not optimal. 20 / 62
  • 23. Benchmarking Benchmarks Micro Benchmarks Outline 1 Intro: Hadoop essentials 2 Cloudera demo 3 Benchmarks Micro Benchmarks BigBench 4 Conclusions 21 / 62
  • 24. Benchmarking Benchmarks Micro Benchmarks Hadoop: Available tests hadoop jar /some/path/to/hadoop-*test*.jar 22 / 62
  • 25. Benchmarking Benchmarks Micro Benchmarks TestDFSIO Read and write test for HDFS. Helpful for getting an idea of how fast your cluster is in terms of I/O, stress testing HDFS, discover network performance bottlenecks, shake out the hardware, OS and Hadoop setup of your cluster machines (particularly the NameNode and the DataNodes). 23 / 62
  • 26. Benchmarking Benchmarks Micro Benchmarks TestDFSIO: write test Generate 10 files of size 1 GB for a total of 10 GB: $ hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 10 -fileSize 1000 TestDFSIO is designed to use 1 map task per file (1:1 mapping from files to map tasks) 24 / 62
  • 30. Benchmarking Benchmarks Micro Benchmarks TestDFSIO: write test output Typical output of write test ----- TestDFSIO ----- : write Date time: Mon Oct 06 10:21:28 CEST 2014 Number of files: 10 Total MBytes processed: 10000.0 Throughput mb/sec: 12.874702111579893 Average IO rate mb/sec: 13.013071060180664 IO rate std deviation: 1.4416050051562712 Test exec time sec: 114.346 25 / 62
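The report above is a plain list of "key: value" lines, so it is easy to post-process. The following is a minimal sketch, assuming the report text has already been extracted from the job log without any logger prefixes; the function name `parse_testdfsio` and the derived "aggregate" figure are illustrative and not part of Hadoop.

```python
# Report text copied from the slide; real logs may carry extra prefixes.
REPORT = """\
----- TestDFSIO ----- : write
Date time: Mon Oct 06 10:21:28 CEST 2014
Number of files: 10
Total MBytes processed: 10000.0
Throughput mb/sec: 12.874702111579893
Average IO rate mb/sec: 13.013071060180664
IO rate std deviation: 1.4416050051562712
Test exec time sec: 114.346
"""


def parse_testdfsio(report):
    """Turn the 'key: value' lines of a TestDFSIO report into a dict."""
    result = {}
    for raw in report.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("-----"):
            # Header line, e.g. "----- TestDFSIO ----- : write"
            result["operation"] = line.rsplit(":", 1)[-1].strip()
            continue
        key, sep, value = line.partition(": ")
        if not sep:
            continue
        value = value.strip()
        try:
            result[key.strip()] = float(value)
        except ValueError:
            # Non-numeric fields such as the date stay as strings.
            result[key.strip()] = value
    return result


stats = parse_testdfsio(REPORT)
# Total data divided by wall-clock job time (hypothetical helper metric).
aggregate = stats["Total MBytes processed"] / stats["Test exec time sec"]
print(stats["operation"], stats["Throughput mb/sec"], round(aggregate, 2))
```

The aggregate figure divides the total data by the wall-clock time of the whole job, so it includes job start-up overhead and should be read only as a rough cluster-wide number next to the per-task throughput that TestDFSIO reports.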
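  • 31. Benchmarking Benchmarks Micro Benchmarks Interpreting TestDFSIO results Definition (Throughput) $\mathrm{Throughput}(N) = \frac{\sum_{i=0}^{N} \mathrm{filesize}_i}{\sum_{i=0}^{N} \mathrm{time}_i}$ Definition (Average IO rate) $\text{Average IO rate}(N) = \frac{\sum_{i=0}^{N} \mathrm{rate}_i}{N} = \frac{\sum_{i=0}^{N} \frac{\mathrm{filesize}_i}{\mathrm{time}_i}}{N}$ Here, N is the number of map tasks. 26 / 62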
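The two definitions above differ only in where the division happens, which a small worked example makes concrete. This is a sketch with made-up per-task numbers (three map tasks, each writing 1000 MB in a different amount of time); the function names are illustrative.

```python
def throughput(sizes_mb, times_s):
    """Total data divided by total task time (the 'Throughput' definition)."""
    return sum(sizes_mb) / sum(times_s)


def average_io_rate(sizes_mb, times_s):
    """Mean of the per-task rates (the 'Average IO rate' definition)."""
    rates = [size / time for size, time in zip(sizes_mb, times_s)]
    return sum(rates) / len(rates)


# Hypothetical per-task measurements, not taken from the slides.
sizes_mb = [1000.0, 1000.0, 1000.0]
times_s = [50.0, 100.0, 200.0]

tp = throughput(sizes_mb, times_s)        # 3000 / 350
avg = average_io_rate(sizes_mb, times_s)  # (20 + 10 + 5) / 3
print(f"throughput = {tp:.3f} MB/s, average IO rate = {avg:.3f} MB/s")
```

When all files have the same size, throughput is the harmonic mean of the per-task rates and the average IO rate is their arithmetic mean, so the average IO rate is never the smaller of the two. A large gap between them indicates that some map tasks were much slower than others.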
  • 36. Benchmarking Benchmarks Micro Benchmarks TestDFSIO: read test Read 10 input files, each of size 1 GB: $ hadoop jar hadoop-*test*.jar TestDFSIO -read -nrFiles 10 -fileSize 1000 27 / 62
  • 38. Benchmarking Benchmarks Micro Benchmarks TestDFSIO: read test output Typical output of read test ----- TestDFSIO ----- : read Date time: Mon Oct 06 10:56:15 CEST 2014 Number of files: 10 Total MBytes processed: 10000.0 Throughput mb/sec: 402.4306813151435 Average IO rate mb/sec: 492.8257751464844 IO rate std deviation: 196.51233829270575 Test exec time sec: 33.206 28 / 62
  • 39. Benchmarking Benchmarks Micro Benchmarks Influence of HDFS replication factor When interpreting TestDFSIO results, keep in mind: The HDFS replication factor plays an important role! A higher replication factor leads to slower writes. For three identical TestDFSIO write runs (units are MB/s): replication factor 1: throughput 190, average IO rate 190 ± 10; replication factor 2: throughput 25, average IO rate 25 ± 3; replication factor 3: throughput 13, average IO rate 13 ± 1. 29 / 62
  • 40. Benchmarking Benchmarks Micro Benchmarks TeraSort Goal Sort 1 TB of data (or any other amount of data) as fast as possible. Probably the best-known Hadoop benchmark. Combines testing the HDFS and MapReduce layers of a Hadoop cluster. Typical areas where TeraSort is helpful Iron out your Hadoop configuration after your cluster has passed a convincing TestDFSIO benchmark first. Determine whether your MapReduce-related parameters are set to proper values. 30 / 62
  • 43. Benchmarking Benchmarks Micro Benchmarks TeraSort: workflow TeraGen → /user/bart/terasort-input → TeraSort → /user/bart/terasort-output → TeraValidate → /user/bart/terasort-validate 31 / 62
  • 46. Benchmarking Benchmarks Micro Benchmarks TeraSort: workflow $ hadoop jar hadoop-mapreduce-examples.jar teragen 10000000000 /user/bart/input (4 hours on our 4-node cluster) $ hadoop jar hadoop-mapreduce-examples.jar terasort /user/bart/input /user/bart/output (5 hours on our 4-node cluster) $ hadoop jar hadoop-mapreduce-examples.jar teravalidate /user/bart/output /user/bart/validate If something went wrong, TeraValidate's output contains the problem report. 34 / 62
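The row count passed to teragen above can be turned into a data size and a rough processing rate. This is a back-of-the-envelope sketch, assuming the standard TeraGen row size of 100 bytes and treating the quoted "4 hours" and "5 hours" as exact wall-clock times, which they probably are not; the helper name is illustrative.

```python
ROW_BYTES = 100          # TeraGen writes 100-byte rows
ROWS = 10_000_000_000    # row count passed to teragen on the slide

total_bytes = ROWS * ROW_BYTES  # 10^12 bytes, i.e. 1 TB


def rate_mb_per_s(num_bytes, hours):
    """Average rate in MB/s (1 MB = 10^6 bytes) over a run of the given length."""
    return num_bytes / 1e6 / (hours * 3600)


teragen_rate = rate_mb_per_s(total_bytes, 4)   # generation phase
terasort_rate = rate_mb_per_s(total_bytes, 5)  # sort phase
print(f"{total_bytes / 1e12:.0f} TB, "
      f"teragen ~{teragen_rate:.1f} MB/s, terasort ~{terasort_rate:.1f} MB/s")
```

Under these assumptions the 4-node cluster generated data at roughly 69 MB/s and sorted it at roughly 56 MB/s in aggregate, which is a convenient single number to compare before and after a configuration change.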
  • 47. Benchmarking Benchmarks Micro Benchmarks TeraSort: duration 35 / 62
  • 48. Benchmarking Benchmarks Micro Benchmarks TeraSort: counters 36 / 62
  • 49. Benchmarking Benchmarks Micro Benchmarks NNBench Goal Load test the NameNode hardware and software. Generates a lot of HDFS-related requests with normally very small payloads. Purpose: put a high HDFS management stress on the NameNode. Can simulate requests for creating, reading, renaming and deleting files on HDFS. 37 / 62
  • 51. Benchmarking Benchmarks Micro Benchmarks NNBench: example Create 1000 files using 12 maps and 6 reducers: $ hadoop jar hadoop-*test*.jar nnbench -operation create_write -maps 12 -reduces 6 -blockSize 1 -bytesToWrite 0 -numberOfFiles 1000 -replicationFactorPerFile 3 -readFileAfterOpen true -baseDir /user/bart/NNBench-`hostname -s` 38 / 62
  • 53. Benchmarking Benchmarks Micro Benchmarks MRBench Goal Loop a small job a number of times. Checks whether small job runs are responsive and running efficiently on the cluster. Complementary to TeraSort. Puts its focus on the MapReduce layer. Impact on the HDFS layer is very limited. 39 / 62
  • 55. Benchmarking Benchmarks Micro Benchmarks MRBench: example Run a loop of 50 small test jobs: $ hadoop jar hadoop-*test*.jar mrbench -baseDir /user/bart/MRBench -numRuns 50 Example output: DataLines Maps Reduces AvgTime (milliseconds) 1 2 1 28822 → average finish time of executed jobs was about 28.8 seconds. 41 / 62
  • 57. Benchmarking Benchmarks BigBench Outline 1 Intro: Hadoop essentials 2 Cloudera demo 3 Benchmarks Micro Benchmarks BigBench 4 Conclusions 42 / 62
  • 58. Benchmarking Benchmarks BigBench BigBench Source: http://mhpersonaltrainer.mhpersonaltrainer.com/mhpersonaltrainer/56616/index 43 / 62
  • 59. Benchmarking Benchmarks BigBench BigBench Big Data benchmark based on TPC-DS. Focus is mostly on MapReduce engines. Collaboration between industry and academia. https://github.com/intel-hadoop/Big-Bench/ History Launched at First Workshop on Big Data Benchmarking (May 8-9, 2012). Full kit at Fifth Workshop on Big Data Benchmarking (August 5-6, 2014). 44 / 62
  • 60. Benchmarking Benchmarks BigBench BigBench data model Source: BigBench: Towards an Industry Standard Benchmark for Big Data Analytics, Ghazal et al., 2013. 45 / 62
  • 61. Benchmarking Benchmarks BigBench BigBench: Data Model - 3 V's Variety: BigBench data is structured, semi-structured and unstructured. Velocity: Periodic refreshes for all data. Different velocity for different areas: V_structured, V_unstructured, V_semistructured. Volume: TPC-DS: discrete scale factors (100, 300, 1000, 3000, 10000, 30000 and 100000). BigBench: continuous scale factor. 46 / 62
  • 62. Benchmarking Benchmarks BigBench BigBench: Workload Workload queries: 30 queries. Specified in English (sort of). No required syntax (first implementation in Aster SQL-MR). Kit implemented in Hive, Hadoop MR, Mahout, OpenNLP. Business functions (McKinsey): Marketing, Merchandising, Operations, Supply chain, Reporting (customers and products). 47 / 62
  • 65. Benchmarking Benchmarks BigBench BigBench: Workload - Technical Aspects Data sources (number of queries, percentage): Structured 18 (60 %), Semi-structured 7 (23 %), Unstructured 5 (17 %). Analytic techniques (number of queries, percentage): Statistics analysis 6 (20 %), Data mining 17 (57 %), Reporting 8 (27 %). 48 / 62
  • 67. Benchmarking Benchmarks BigBench BigBench: Workload - Technical Aspects Query types (number of queries, percentage): Pure HiveQL 14 (46 %), Mahout 5 (17 %), OpenNLP 5 (17 %), Custom MR 6 (20 %). Note that your implementation may vary! 50 / 62
  • 68. Benchmarking Benchmarks BigBench BigBench: Benchmark Process Source: http://www.tele-task.de/archive/video/flash/24896/ 51 / 62
  • 69. Benchmarking Benchmarks BigBench BigBench: Metric Number of queries run: $30 \cdot (2 \cdot S + 1)$ Measured times: $T_L$: loading process; $T_P$: power test; $T_{TT1}$: first throughput test; $T_{TDM}$: data maintenance task; $T_{TT2}$: second throughput test. Definition (BigBench queries per hour) $\mathrm{BBQpH} = \frac{30 \cdot 3 \cdot S \cdot 3600}{S \cdot T_L + S \cdot T_P + T_{TT1} + S \cdot T_{TDM} + T_{TT2}}$ Similar to TPC-DS metric. 52 / 62
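The BBQpH definition above can be evaluated directly once the five times are measured. This is a minimal sketch, assuming all times are in seconds and that S is the number of concurrent query streams (the slides do not define S explicitly); the function name and the sample timings are hypothetical.

```python
def bbqph(s, t_load, t_power, t_tt1, t_dm, t_tt2):
    """BigBench queries per hour, following the metric definition above.

    s is assumed to be the number of query streams; all times are seconds.
    """
    numerator = 30 * 3 * s * 3600
    denominator = s * t_load + s * t_power + t_tt1 + s * t_dm + t_tt2
    return numerator / denominator


def queries_run(s):
    """Number of query executions in one benchmark run: 30 * (2 * S + 1)."""
    return 30 * (2 * s + 1)


# Hypothetical timings, not measurements from the talk.
score = bbqph(s=2, t_load=1000, t_power=3000, t_tt1=5000, t_dm=500, t_tt2=5000)
print(queries_run(2), round(score, 2))
```

Because the load, power and data-maintenance times are each weighted by S in the denominator, a slow data load is penalised more heavily as the number of streams grows.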
  • 72. Benchmarking Benchmarks BigBench BigBench: results 53 / 62
  • 73. Benchmarking Benchmarks BigBench BigBench: monitoring 54 / 62
  • 74. Benchmarking Benchmarks BigBench BigBench: monitoring 55 / 62
  • 75. Benchmarking Benchmarks BigBench BigBench: monitoring 56 / 62
  • 76. Benchmarking Benchmarks BigBench BigBench: monitoring 57 / 62
  • 77. Benchmarking Benchmarks BigBench BigBench: in progress 58 / 62 Source: The Hortonworks Blog
  • 78. Benchmarking Conclusions Outline 1 Intro: Hadoop essentials 2 Cloudera demo 3 Benchmarks Micro Benchmarks BigBench 4 Conclusions 59 / 62
  • 80. Benchmarking Conclusions Conclusions Use Hadoop distributions! Hadoop cluster administration → Cloudera Manager. Micro-benchmarks ↔ BigBench. Your best benchmark is your own application! 61 / 62
  • 81. Benchmarking Conclusions Questions? Source: https://gigaom.com/2011/12/19/my-hadoop-is-bigger-than-yours/ 62 / 62