3. 3Pivotal Confidential–Internal Use Only
About Me (https://www.linkedin.com/in/vgogate)
Since 2008, active in Hadoop infrastructure and ecosystem components
with lead Hadoop technology based companies
– Yahoo, Netflix, Hortonworks, EMC-Greenplum/Pivotal
Founder and PMC member/committer of the Apache Ambari project
Contributed Apache “Hadoop Vaidya” – Performance diagnostics for M/R
Prior to Hadoop,
– IBM Almaden Research (2000-2008), CS Software & Storage systems.
– In early days (1993) of my career, worked with a team that built first Indian
super computer, PARAM (Transputer based MPP system) at Center for
Development of Advance Computing (CDAC, Pune)
4. 4Pivotal Confidential–Internal Use Only
Agenda
Introduction to Hadoop 2.0
– HDFS, YARN, Map/Reduce (M/R)
Hadoop Cluster
– Hardware selection & Capacity Planning
Key Hadoop Configuration Parameters
– Operating System, Local FS, HDFS, YARN
Performance Tuning of M/R Applications
– Hadoop Vaidya demo (MAPREDUCE-3202)
11. 11Pivotal Confidential–Internal Use Only
Hadoop Cluster: Hardware selection
Hadoop runs on cluster of commodity machines
– Does NOT mean unreliable & low-cost hardware
– 2008:
▪ Single/Dual socket, 1+ GHz, 4-8GB RAM, 2-4 Cores, 2-4 1TB SATA drives, 1GbE NIC
– 2014+:
▪ Dual socket 2+ GHz, 48-64GB RAM, 12-16 Cores, 12-16 1/3TB SATA drives, 2-4 bonded NICs
Hadoop cluster comprises separate h/w profiles for,
– Master nodes
▪ More reliable & available configuration
▪ Low storage/memory requirements compared to worker/data nodes (except HDFS Name node)
– Worker nodes
▪ Not necessarily less reliable although not configured for high availability (e.g. JBOD not RAID)
▪ Resource requirements on storage, memory, CPU & network b/w are based on workload profile
See Appliance/Reference architectures from EMC/IBM/Cisco/HP/Dell etc
12. 12Pivotal Confidential–Internal Use Only
Cluster capacity planning
Initial cluster capacity is commonly based on HDFS data size & growth projection
– Keep data compression in mind
Cluster size
– HDFS space per node = (Raw disk space per node – 20/25% non-DFS local storage) / 3 (RF)
– Cluster nodes = Total HDFS space / HDFS space per node
Start with balanced individual node configuration in terms of CPU/Memory/Number of disks
– Keep provision for growth as you learn more about the workload
– Guidelines for individual worker node configuration
▪ Latest generation processor(s) with 12/16 cores total, is a reasonable start
▪ 4 to 6 GB memory per core
▪ 1/1.5 disks per core
▪ 1 to 3 TB SATA disks per core
▪ 1GbE NIC
Derive resource requirements service roles (typically Name node)
– NN -> 4-6 cores, default 2GB + 1 GB roughly every 100TB of raw disk space, 1M Objects
– RM/NN/DN -> 2GB RAM, 1 core (thumb rule)
13. 13Pivotal Confidential–Internal Use Only
Ongoing Capacity Planning
Workload profiling
– Make sure applications running on the cluster are well tuned to utilize cluster
resources
– Gather average CPU/IO/Mem/Disk utilization stats across worker nodes
– Identify the resource bottlenecks for your workload and provision accordingly
▪ E,g, bump up more cores per node if CPU is bottleneck compared to other resources such as storage, I/O
bandwidth
▪ E.g. If memory is a bottleneck (i.e. low utilization on CPU and I/O), then add more memory per node to let
more tasks run per node.
– HDFS storage growth rate & current capacity utilization
Latency sensitive applications
– Do planned project on-boarding
▪ Estimate resource requirement ahead of time (Capacity Calculator)
– Use Resource Scheduler
▪ Schedule the jobs to maximize the resource utilization over the time
15. 15Pivotal Confidential–Internal Use Only
Key Hadoop Cluster Configuration – OS
Operating System (RHEL/CentOS 6.1+)
– Mount disk volumes with NOATIME (speed up reads)
– Disable transparent Huge page compaction
▪ # echo never > /sys/kernel/mm/redhat_transparent_hugepages/defrag
– Turn off caching on disk controller
– vm.swappiness = 0
– vm.overcommit_memory = 1
– vm.overcommit_ratio = 100
– net.core.somaxconn=1024 (default socket listen queue size 128)
– Choice of Linux I/O scheduler
Local File System
– Ext3 (reliable and recommended) vs Ext4 vs XFS
– Default max open file descriptors per user (default 1K may change to 32K)
– Reduce FS reserve blocks space (default 5% -> 0% on non-os partitions)
BIOS
– Disable BIOS power saving options may boost node performance
16. 16Pivotal Confidential–Internal Use Only
Key Hadoop Cluster Configuration - HDFS
Use multiple disk mount points
– dfs.datanode.data.dir (use all attached disks to data node)
– dfs.namenode.name.dir (NN metadata redundancy on disk)
DFS Block size
– 128MB (you can override it while writing new files with different block size)
Local file system buffer
– io.file.buffer.size = 131072 (128KB)
– Io.sort.factor = 50 to 100 (number merge streams while sorting file)
NN/DN concurrency
– dfs.namenode.handler.count (100)
– dfs.datanode.max.transfer.threads (4096)
Datanode Failed volumes tolerated
– dfs.datanode.failed.volumes.tolerated
18. 18Pivotal Confidential–Internal Use Only
Key Hadoop Cluster Configuration - YARN
Use multiple disk mount points on the worker node
– yarn.nodemanager.local-dirs
– yarn.nodemanager.log-dirs
Memory allocated for node manager containers
– yarn.nodemanager.resource.memory-mb
▪ Total memory on worker node allocated for all the containers running in parallel
– yarn.scheduler.minimum-allocation-mb
▪ Minimum memory requested for map/reduce task container.
▪ Based on CPU cores and available memory on the node, this parameter can limit number of max containers
per node
– yarn.nodemanager.maximum-allocation-mb
▪ Default to yarn.nodemanager.resource.memory-mb
– yarn.mapreduce.map.memory.mb, yarn.mapreduce.reduce.memory.mb
▪ Default values for map/reduce task container memory. User can override them through job configuration
– yarn.nodemanager.vmem-pmem-ratio = 2.1
20. 20Pivotal Confidential–Internal Use Only
Hadoop Configuration Advisor - Tool
Given
– Data size, Growth rate, Workload profile, Latency/QoS requirements
Suggests
– Capacity requirement for Hadoop cluster (reasonable starting point)
▪ Resource requirements for various services roles
▪ Hardware profiles for master/worker nodes (No specific h/w vendor )
– Cluster services topology i.e. placement of service roles to nodes
– Optimal services configuration for given hardware specs
23. 23Pivotal Confidential–Internal Use Only
Optimizing M/R applications – key features
Speculative Execution
Use of Combiner
Data Compression
– Intermediate: LZO(native)/Snappy, Output: BZip2, Gzip
Avoid map side disk spills
– io.sort.mb
Increased replication factor for out-of-band Hdfs access
Distributed cache
Map output partitioner
Appropriate granularity for M/R tasks
– Mapreduce.map.minsplitsize
– Mapreduce.map.maxsplitsize
– Optimal number of reducers
24. 24Pivotal Confidential–Internal Use Only
Performance Benchmark – Teragen
Running teragen out-of-box will not utilize the cluster hardware
resources
Determine number of map tasks to run on each node to exploit
max I/O bandwidth
– Depends on number of disks on each node
Example
– 10 nodes, 5 disks/node, per node memory for m/r tasks 50GB
– Hadoop jar hadoop-mapreduce-examples-2.x.x.jar teragen
– -Dmapred.map.tasks=50 -Dmapreduce.map.memory.mb=10GB
– 10000000000 /teragenoutput
28. 28Pivotal Confidential–Internal Use Only
Hadoop Vaidya: Rule based performance diagnostic tool
• Rule based performance diagnosis of M/R jobs
– Set of pre-defined diagnostic rules
– Diagnostic rules execution against job config &
Job History logs
– Targeted advice for discovered problems
• Extensible framework
– You can add your own rules,
• Based on a rule template and published
job counters
– Write complex rules using existing simpler
rules
Vaidya: An expert (versed in his own
profession , esp. in medical science) ,
skilled in the art of healing , a physician
29. 29Pivotal Confidential–Internal Use Only
Hadoop Vaidya: Diagnostic Test Rule
<DiagnosticTest>
<Title>Balanaced Reduce Partitioning</Title>
<ClassName>
org.apache.hadoop.vaidya.postexdiagnosis.tests.BalancedReducePartitioning
</ClassName>
<Description>
This rule tests as to how well the input to reduce tasks is balanced
</Description>
<Importance>High</Importance>
<SuccessThreshold>0.40</SuccessThreshold>
<Prescription>advice</Prescription>
<InputElement>
<PercentReduceRecords>0.85</PercentReduceRecords>
</InputElement>
</DiagnosticTest>
30. 30Pivotal Confidential–Internal Use Only
Hadoop Vaidya: Report Element
<TestReportElement>
<TestTitle>Balanaced Reduce Partitioning</TestTitle>
<TestDescription>
This rule tests as to how well the input to reduce tasks is balanced
</TestDescription>
<TestImportance>HIGH</TestImportance>
<TestResult>POSITIVE(FAILED)</TestResult>
<TestSeverity>0.69</TestSeverity>
<ReferenceDetails>
* TotalReduceTasks: 4096
* BusyReduceTasks processing 0.85% of total records: 3373
* Impact: 0.70
</ReferenceDetails>
<TestPrescription>
* Use the appropriate partitioning function
* For streaming job consider following partitioner and hadoop config parameters
* org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
* -jobconf stream.map.output.field.separator, -jobconf stream.num.map.output.key.fields
</TestPrescription>
</TestReportElement>
31. 31Pivotal Confidential–Internal Use Only
Hadoop Vaidya: Example Rules
Balanced Reduce Partitioning
– Check if intermediate data is well partitioned among reducers.
Map/Reduce tasks reading HDFS files as side effect
– Checks if HDFS files are being read as side effect and causing the access bottleneck across
map/reduce tasks
Percent Re-execution of Map/Reduce tasks
Granularity of Map/Reduce task execution
Map tasks data locality
– This rule detects % data locality for Map tasks
Use of Combiner & Combiner efficiency
– Checks if there is a potential in using combiner after map stage
Intermediate data compression
– Checks if intermediate data is compressed to lower the shuffle time
32. 32Pivotal Confidential–Internal Use Only
Hadoop Vaidya: Demo
Thank you!
• For Hadoop Vaidya demo
• Download Pivotal HD single node VM
(https://network.gopivotal.com/products/pivotal-hd)
• Run M/R job
• After job completed, go Resource Manager UI
• Select Job History
• To look at the diagnostic report click on “Vaidya Report” link in left
menu
• Vaidya Patch submitted to Apache Hadoop
• https://issues.apache.org/jira/browse/MAPREDUCE-3202