© Hortonworks Inc. 2013
Top 10 things to get the most out of your Hadoop Cluster
Suresh Srinivas | @suresh_m_s
Sanjay Radia | @srr
About Me
• Architect & Founder at Hortonworks
• Long-time Apache Hadoop committer and PMC member
• Designed and developed many key Hadoop features
• Experience from supporting many clusters
–Including some of the world’s largest Hadoop clusters
Agenda
Best Practices, Tips and Tricks for
• Building a cluster
• Configuration
• Monitoring
• Reliability
• Multi-tenancy
Hardware and Cluster Sizing
• Considerations
–Larger clusters heal faster on node or disk failure
–Machines with huge storage take longer to recover
–More racks give more failure domains
• Recommendations
– Get good-quality commodity hardware
– Buy at the pricing sweet spot: 3TB disks, 96GB RAM, 8-12 cores
– More memory is better: real-time workloads are memory hungry!
– Before considering fatter machines (1U with 6 disks vs. 2U with 12 disks), get to 30-40 machines or 3-4 racks
–Use pilot cluster to learn about load patterns
– Balanced hardware for I/O, compute or memory bound
–Rule of thumb – network to compute cost of 20%
–More details - http://tinyurl.com/hwx-hadoop-hw
Configuration is Key
• Avoid JVM issues
–Use a 64-bit JVM for all daemons
– Compressed OOPs are enabled by default (Java 6u23 and later)
–Java heap size
– Set the same maximum and initial heap size: -Xmx == -Xms
– Avoid Java defaults: configure NewSize and MaxNewSize
– Use 1/8 to 1/6 of the max heap size for JVMs with heaps larger than 4 GB
–Use a low-latency garbage collector
– -XX:+UseConcMarkSweepGC, -XX:ParallelGCThreads=<N>
– High <N> on Namenode and JobTracker
–Important JVM configs to help debugging
– -verbose:gc -Xloggc:<file> -XX:+PrintGCDetails
– -XX:ErrorFile=<file>
– -XX:+HeapDumpOnOutOfMemoryError
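The heap and GC guidance above can be sketched as a small helper that assembles a daemon's JVM option string. This is an illustrative sketch, not a Hortonworks tool: the function name `jvm_opts`, the choice of 1/8 for the new generation (the low end of the range above), and the log directory are assumptions. The flags themselves are the ones listed on the slide and apply to the JVM versions of that era.

```python
def jvm_opts(heap_mb, gc_threads, log_dir="/var/log/hadoop"):
    """Assemble JVM flags following the guidance above.

    heap_mb    -- heap size in MB, used for both -Xms and -Xmx
    gc_threads -- value for -XX:ParallelGCThreads (higher on NN/JobTracker)
    log_dir    -- hypothetical directory for GC and crash logs
    """
    opts = [
        f"-Xms{heap_mb}m",  # initial heap == max heap
        f"-Xmx{heap_mb}m",
        "-XX:+UseConcMarkSweepGC",
        f"-XX:ParallelGCThreads={gc_threads}",
        "-verbose:gc",
        f"-Xloggc:{log_dir}/gc.log",
        "-XX:+PrintGCDetails",
        f"-XX:ErrorFile={log_dir}/hs_err_pid%p.log",
        "-XX:+HeapDumpOnOutOfMemoryError",
    ]
    if heap_mb > 4096:
        # Assumption: use 1/8 of the heap for the new generation.
        new_mb = heap_mb // 8
        opts[2:2] = [f"-XX:NewSize={new_mb}m", f"-XX:MaxNewSize={new_mb}m"]
    return " ".join(opts)


if __name__ == "__main__":
    print(jvm_opts(heap_mb=8192, gc_threads=8))
```

For heaps of 4 GB or less the sketch leaves the new-generation size unset, since the slide gives a ratio only for larger heaps.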
Configuration is Key…
• Multiple redundant dirs for namenode metadata
–One of the dfs.name.dir directories should be on NFS
–NFS soft mount options: tcp,soft,intr,timeo=20,retrans=5
• Configure open fd ulimit
–Default 1024 is too low
–16K for datanodes, 64K for Master nodes
• Set up cluster nodes with time synchronization
• Use version control for configuration!
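The open file descriptor recommendation above can be checked from a script. The sketch below is one possible check, assuming "16K" and "64K" mean 16384 and 65536; the role names and function names are hypothetical. It uses Python's standard `resource` module, which is available on Unix-like systems only.

```python
import resource

# Assumption: "16K" and "64K" are read as 16 * 1024 and 64 * 1024.
RECOMMENDED_NOFILE = {"datanode": 16 * 1024, "master": 64 * 1024}


def nofile_ok(role, soft_limit):
    """Return True if soft_limit meets the recommendation for the role."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= RECOMMENDED_NOFILE[role]


def check_current_process(role):
    """Check the open-fd soft limit of the process running this script."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return nofile_ok(role, soft)


if __name__ == "__main__":
    for role in RECOMMENDED_NOFILE:
        print(role, "ok" if check_current_process(role) else "limit too low")
```

The limit that matters is the one in effect for the user running the Hadoop daemons, so a check like this would need to run as that user.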
Configuration is Key…
• Use disk fail in place for datanodes
–Disk failure is no longer datanode failure
–Especially important for high-density nodes
• Set dfs.namenode.name.dir.restore to true
–Restores NN storage directory during checkpointing
• Take periodic backups of namenode metadata
–Make copies of the entire storage directory
• Master node OS device should be highly available
–RAID-1 (mirrored pair)
• Set aside a lot of disk space for NN logs
–NN logging is verbose: set aside multiple GBs
–Many installs configure this too small
– NN logs roll within minutes, making issues hard to debug
Checkpointing
• Secondary Namenode - confusing name
Checkpointing…
• Set up a single secondary namenode
–Periodically merges file system image with journal
–Two secondary namenodes not supported
– Many instances of accidental two secondary namenodes
– Known to cause metadata corruption!
• In an HA setup, the standby namenode replaces the secondary
• Ensure periodic checkpoints are happening
–Checkpoint time can be queried in scripts
– Shown in NN webUI as well
–Real incident
– A cluster was run for more than a year with no checkpoint!
– Namenode stopped when it ran out of disk space
– NN had been running for more than a year with no restart!!!
– Restoring the cluster was not fun!
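A periodic check for stale checkpoints, as recommended above, might look like the following sketch. It assumes the last checkpoint time has already been obtained as milliseconds since the epoch; how that value is fetched is deliberately not shown. The function names and the 6-hour alert threshold are assumptions chosen for illustration.

```python
import time

MS_PER_HOUR = 3_600_000


def checkpoint_age_hours(last_checkpoint_ms, now_ms=None):
    """Hours elapsed since the last checkpoint (both times in epoch ms)."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return (now_ms - last_checkpoint_ms) / MS_PER_HOUR


def checkpoint_is_stale(last_checkpoint_ms, max_age_hours=6.0, now_ms=None):
    """True when the last checkpoint is older than max_age_hours."""
    return checkpoint_age_hours(last_checkpoint_ms, now_ms) > max_age_hours


if __name__ == "__main__":
    # Hypothetical value: a checkpoint taken eight hours ago.
    eight_hours_ago = int(time.time() * 1000) - 8 * MS_PER_HOUR
    print("stale" if checkpoint_is_stale(eight_hours_ago) else "fresh")
```

Wiring a check like this into existing alerting would have flagged the year-long gap in the incident above long before the disk filled up.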
Don’t edit the metadata files!
• Editing can corrupt the cluster state
–Might result in loss of data
• Real incident
–NN misconfigured to point to another NN’s metadata
–DNs can’t register due to namespace ID mismatch
– System detected the problem correctly
– Safety net ignored by the admin!
–Admin edits the namenode VERSION file to match ids
What Happens Next?
Guard Against Accidental Deletion
• rm -r deletes the data at the speed of Hadoop!
–Ctrl-C of the command does not stop deletion!
–Undeleting files on datanodes is hard & time consuming
– Immediately shut down the NN and unmount disks on datanodes
– Recover deleted files
– Start namenode without the delete operation in edits
• Enable Trash
• Real Incident
–Customer is running a distro of Hadoop with trash not enabled
–Deletes a large dir (100 TB) and shuts down NN immediately
–Support person asks NN to be restarted to see if trash is enabled!
What happens next?
• Now HDFS has Snapshots!
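Enabling trash, as recommended above, is controlled by the `fs.trash.interval` property in core-site.xml, whose value is a retention time in minutes (0 disables trash). The sketch below renders that property block; the helper names and the one-day retention in the usage line are assumptions for illustration.

```python
from xml.sax.saxutils import escape


def hadoop_property(name, value):
    """Render one <property> element for a Hadoop *-site.xml file."""
    return (
        "<property>\n"
        f"  <name>{escape(name)}</name>\n"
        f"  <value>{escape(str(value))}</value>\n"
        "</property>"
    )


def trash_property(retention_minutes):
    """Render fs.trash.interval, refusing values that disable trash."""
    if retention_minutes <= 0:
        raise ValueError("0 disables trash; choose a positive retention")
    return hadoop_property("fs.trash.interval", retention_minutes)


if __name__ == "__main__":
    # Hypothetical retention: keep deleted files for one day.
    print(trash_property(24 * 60))
```

Trash protects deletes issued through the file system shell; a delete that passes -skipTrash bypasses it, so it complements backups and snapshots and does not replace them.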
Monitor Usage
• Cluster storage, nodes, files and blocks grow
– Update NN heap, handler count, number of DN xceivers
– Tweak other related config periodically
• Monitor the hardware usage for your work load
– Disk I/O, network I/O, CPU and memory usage
– Use this information when expanding cluster capacity
• Monitor the usage with Hadoop metrics
– JVM metrics – GC times, Memory used, Thread Status
– RPC metrics – especially latency to track slowdowns
– HDFS metrics
– Used storage, # of files and blocks, total load on the cluster
– File System operations
– MapReduce Metrics
– Slot utilization and Job status
• Tweak configurations during upgrades/maintenance
Monitoring Simplified With Ambari
[Screenshot: Cluster Metrics Summary]
[Screenshot: HDFS Metrics Summary]
[Screenshot: MapReduce Metrics Summary]
Monitor Failures
• If a large percentage of datanodes fail, put the NN into safemode
–Avoids unnecessary replication
–Bring back the datanodes or rack
• Track dead datanodes
–Bring back datanodes when the number grows
• Ensure cluster storage utilization is < 85%
–When the cluster is nearly full things slow down
• Monitor for corrupt blocks
–Delete tmp files with replication factor = 1 and missing blocks
• Have a portfolio of cluster validation tests/jobs
–Run them on restart, upgrade & config changes
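The failure checks above can be combined into one small alerting function. This is a sketch under stated assumptions: the 85% utilization limit comes from the slide, while the 10% dead-datanode threshold, the function name, and the alert wording are invented for illustration. The inputs are plain numbers that a monitoring job would supply.

```python
def cluster_alerts(live, dead, used_bytes, capacity_bytes,
                   dead_fraction_limit=0.10, utilization_limit=0.85):
    """Return a list of alert strings for the given cluster snapshot."""
    alerts = []
    total = live + dead
    if total and dead / total >= dead_fraction_limit:
        alerts.append(
            "large share of datanodes dead: consider safemode "
            "to avoid unnecessary replication"
        )
    if capacity_bytes and used_bytes / capacity_bytes >= utilization_limit:
        alerts.append("storage utilization at or above the limit")
    return alerts


if __name__ == "__main__":
    # Hypothetical snapshot: a rack of 20 nodes is down, storage is 90% full.
    for alert in cluster_alerts(live=80, dead=20,
                                used_bytes=90, capacity_bytes=100):
        print(alert)
```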
Tools To Manage Clusters
• Use Balancer periodically
–Distributes data and hence processing
–Important to run after expanding the cluster
–Use an appropriate balancer bandwidth; changing it does not need a restart
– dfsadmin -setBalancerBandwidth <bandwidth>
• Decommissioning
–Before removing/replacing DNs from the cluster
• Distcp for copying data to another cluster
–Backup, Disaster recovery
–More enhancements to come in the near future
• Tooling can be built around JMX and JMX over HTTP
–See the list - http://<nn>/jmx?get=Hadoop:service=NameNode
–All information equivalent to NN WebUI
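Tooling around the JMX HTTP endpoint mostly comes down to parsing its JSON output, which is an object with a "beans" array. The sketch below extracts selected attributes from such a payload. The sample text is made up for illustration, and the bean and attribute names in it are assumptions that vary by Hadoop version; no network call is made.

```python
import json


def bean_attributes(jmx_json_text, bean_name, wanted):
    """Return the wanted attributes of the named bean, or {} if absent."""
    payload = json.loads(jmx_json_text)
    for bean in payload.get("beans", []):
        if bean.get("name") == bean_name:
            return {key: bean[key] for key in wanted if key in bean}
    return {}


# Illustrative payload only; real output has many more beans and fields.
SAMPLE = """
{"beans": [{"name": "Hadoop:service=NameNode,name=FSNamesystem",
            "CapacityTotal": 1000, "CapacityUsed": 400,
            "FilesTotal": 12345}]}
"""

if __name__ == "__main__":
    attrs = bean_attributes(
        SAMPLE,
        "Hadoop:service=NameNode,name=FSNamesystem",
        ["CapacityUsed", "CapacityTotal"],
    )
    print(attrs)
```

In practice the text would come from an HTTP GET against the namenode's /jmx URL, and the extracted values would feed dashboards or alert checks.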
Further Simplify Management
• HDFS uses JBODs with replication, not RAID
–Monitors nodes, disks, block checksums
–Automatic recovery, in parallel and very fast
– Recovers an entire 12TB node in tens of minutes in a 100-node cluster
Compare with the cost & urgency of repairing a RAID 5!
• Spare cluster capacity further simplifies management
–Nodes/clusters continue to run on failures, with lower capacity
– Nodes and disks can be fixed when convenient (unlike RAID)
– Configure how many disk failures => node failure
–1 operator can manage 3-4K nodes
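The "tens of minutes" recovery figure above follows from simple arithmetic: the lost node's blocks are re-replicated by all surviving nodes at once. The sketch below is a back-of-the-envelope model only. The per-node re-replication rate of 50 MB/s is an assumption, and the model ignores throttling, network topology, and uneven block placement.

```python
def recovery_minutes(node_tb, cluster_nodes, per_node_mb_per_s):
    """Estimate minutes to re-replicate one failed node's data.

    node_tb           -- data held by the failed node, in TB (10**12 bytes)
    cluster_nodes     -- cluster size including the failed node
    per_node_mb_per_s -- assumed re-replication rate per surviving node
    """
    surviving = cluster_nodes - 1
    total_bytes = node_tb * 1e12
    aggregate_bytes_per_s = surviving * per_node_mb_per_s * 1e6
    return total_bytes / aggregate_bytes_per_s / 60


if __name__ == "__main__":
    minutes = recovery_minutes(node_tb=12, cluster_nodes=100,
                               per_node_mb_per_s=50)
    print(f"about {minutes:.0f} minutes")
```

Under these assumptions a 12TB node in a 100-node cluster recovers in roughly 40 minutes, and the estimate shrinks as the cluster grows, which is the sense in which larger clusters heal faster.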
Design For Multi-tenancy
• Share compute capacity with Capacity Scheduler
– Queue(s) and sub-queues with a guaranteed capacity per tenant
–Almost like dedicated hardware
–Better than a private cluster: access to unused capacity
–Resource limits for tasks
– Memory limits are monitored
– Cgroups support was just added to YARN
– Resource isolation without VM overhead!
• Share HDFS Storage
–Set quotas on per-user and per-project data directories
–Federation - Isolate categories of uses to separate namespaces
– Production vs. experimental, HBase etc.
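With Capacity Scheduler queues and sub-queues, the percentage capacities of the children of each parent queue must add up to 100. The sketch below validates a queue layout before it is turned into configuration. The nested-dictionary representation, the function name, and the queue names in the usage example are all hypothetical.

```python
def validate_capacities(queue_tree, path="root"):
    """Check that sibling queue capacities sum to 100 at every level.

    queue_tree maps a child queue name to (capacity_percent, subtree),
    where subtree is another mapping of the same shape ({} for a leaf).
    Returns a list of error strings; an empty list means the tree is valid.
    """
    errors = []
    if queue_tree:
        total = sum(capacity for capacity, _ in queue_tree.values())
        if abs(total - 100.0) > 1e-6:
            errors.append(f"{path}: children sum to {total}, expected 100")
        for name, (_, subtree) in queue_tree.items():
            errors.extend(validate_capacities(subtree, f"{path}.{name}"))
    return errors


if __name__ == "__main__":
    # Hypothetical tenants: production gets 70%, split across two sub-queues.
    layout = {
        "prod": (70, {"etl": (60, {}), "reports": (40, {})}),
        "dev": (30, {}),
    }
    print(validate_capacities(layout) or "layout is valid")
```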
Train Users
• Train users on best practices for writing apps
• Reduce storage use
–Delete unnecessary data periodically
–Move cold data into Hadoop archive
• Encourage using replication >= 3 for important data
–Hot data also needs higher replication
• Set up a small test cluster
–Users test their code before moving to production
–Avoid debugging in production cluster
• Set up a user mailing list for information exchange
• Encourage filing JIRAs with Apache
–Helps community identify issues, fix bugs, stabilize quickly
Thank You – Q&A
Summary
1. Choose suitable server hardware and cluster sizes
2. Configuration is key
3. Checkpointing
4. Don’t edit metadata files
5. Guard against accidental deletions
6. Monitor usage and failures
7. Use available tools for managing the cluster
8. Simplify management with spare capacity
9. Design for multi-tenancy
10. Train your users on best practices
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 

Top 10 Things to Get The Most Out of Your Hadoop Cluster

  • 1. © Hortonworks Inc. 2013 Top 10 things to get the most out of your Hadoop Cluster Suresh Srinivas | @suresh_m_s Sanjay Radia | @srr Page 1
  • 2. About Me
      • Architect & Founder at Hortonworks
      • Long-time Apache Hadoop committer and PMC member
      • Designed and developed many key Hadoop features
      • Experience from supporting many clusters
          – Including some of the world’s largest Hadoop clusters
  • 3. Agenda
      Best Practices, Tips and Tricks for
      • Building a cluster
      • Configuration
      • Monitoring
      • Reliability
      • Multi-tenancy
  • 4. Hardware and Cluster Sizing
      • Considerations
          – Larger clusters heal faster on node or disk failure
          – Machines with huge storage take longer to recover
          – More racks give more failure domains
      • Recommendations
          – Get good-quality commodity hardware
          – Buy the sweet spot in pricing: 3TB disk, 96GB, 8-12 cores
          – More memory is better – real time is memory hungry!
          – Before considering fatter machines (1U 6 disks vs. 2U 12 disks)
              – Get to 30-40 machines or 3-4 racks
          – Use pilot cluster to learn about load patterns
              – Balanced hardware for I/O, compute or memory bound
          – Rule of thumb – network to compute cost of 20%
          – More details - http://tinyurl.com/hwx-hadoop-hw
  • 5. Configuration is Key
      • Avoid JVM issues
          – Use 64-bit JVM for all daemons
              – Compressed OOPS enabled by default (Java 6u23 and later)
          – Java heap size
              – Set same max and starting heap size, Xmx == Xms
              – Avoid Java defaults – configure NewSize and MaxNewSize
              – Use 1/8 to 1/6 of max size for JVMs larger than 4G
          – Use low-latency GC collector
              – -XX:+UseConcMarkSweepGC, -XX:ParallelGCThreads=<N>
              – High <N> on Namenode and JobTracker
          – Important JVM configs to help debugging
              – -verbose:gc -Xloggc:<file> -XX:+PrintGCDetails
              – -XX:ErrorFile=<file>
              – -XX:+HeapDumpOnOutOfMemoryError
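
The heap and GC guidance on this slide can be assembled into one option string. The sketch below is only an illustration: the function name `jvm_opts` and its parameters are hypothetical, the flags are the Java 6-era ones named on the slide, and applying the 1/8 ratio to heaps of 4G or less is an assumption, since the slide gives the ratio only for larger JVMs.

```python
def jvm_opts(heap_mb, parallel_gc_threads, gc_log, error_file, new_fraction=8):
    """Build a JVM option string from the slide's recommendations.

    heap_mb is used for both -Xms and -Xmx (Xmx == Xms).
    The young generation is fixed at heap_mb / new_fraction; the slide
    suggests a fraction between 6 and 8 for heaps larger than 4G.
    """
    if heap_mb <= 0 or parallel_gc_threads <= 0 or new_fraction <= 0:
        raise ValueError("heap size, GC threads and fraction must be positive")
    new_mb = heap_mb // new_fraction
    opts = [
        f"-Xms{heap_mb}m",                 # starting heap == max heap
        f"-Xmx{heap_mb}m",
        f"-XX:NewSize={new_mb}m",          # avoid the JVM's default sizing
        f"-XX:MaxNewSize={new_mb}m",
        "-XX:+UseConcMarkSweepGC",         # low-latency collector from the slide
        f"-XX:ParallelGCThreads={parallel_gc_threads}",
        "-verbose:gc",                     # debugging aids from the slide
        f"-Xloggc:{gc_log}",
        "-XX:+PrintGCDetails",
        f"-XX:ErrorFile={error_file}",
        "-XX:+HeapDumpOnOutOfMemoryError",
    ]
    return " ".join(opts)


if __name__ == "__main__":
    # Hypothetical values for a master daemon with an 8 GB heap.
    print(jvm_opts(8192, 8, "/var/log/hadoop/gc.log", "/var/log/hadoop/hs_err.log"))
```

The resulting string would typically be placed in the daemon's options variable in the cluster's environment script; where exactly depends on the distribution.
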
  • 6. Configuration is Key…
      • Multiple redundant dirs for namenode metadata
          – One of dfs.name.dir should be on NFS
          – NFS softmount - tcp,soft,intr,timeo=20,retrans=5
      • Configure open fd ulimit
          – Default 1024 is too low
          – 16K for datanodes, 64K for Master nodes
      • Set up cluster nodes with time synchronization
      • Use version control for configuration!
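
A node can check its own open-file limit against the thresholds on this slide. This is a minimal sketch for Unix-like systems; the role names and the `fd_limit_ok` helper are hypothetical, and only the 16K and 64K figures come from the slide.

```python
import resource

# Minimum open-file limits from the slide: 16K for datanodes, 64K for masters.
MIN_OPEN_FILES = {"datanode": 16 * 1024, "master": 64 * 1024}


def fd_limit_ok(role, soft_limit=None):
    """Return True if the soft open-fd limit meets the slide's minimum for role.

    When soft_limit is omitted, the limit of the current process is read.
    """
    if soft_limit is None:
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= MIN_OPEN_FILES[role]


if __name__ == "__main__":
    for role in MIN_OPEN_FILES:
        print(role, "ok" if fd_limit_ok(role) else "open fd ulimit too low")
```

Run as the user that starts the Hadoop daemons, since limits are per user and per process.
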
  • 7. Configuration is Key…
      • Use disk fail in place for datanodes
          – Disk failure is no longer datanode failure
          – Especially important for large density nodes
      • Set dfs.namenode.name.dir.restore to true
          – Restores NN storage directory during checkpointing
      • Take periodic backups of namenode metadata
          – Make copies of the entire storage directory
      • Master node OS device should be highly available
          – RAID-1 (mirrored pair)
      • Set aside a lot of disk space for NN logs
          – It is verbose – set aside multiple GBs
          – Many installs configure this too small
              – NN logs roll within minutes – hard to debug issues
  • 8. Checkpointing
      • Secondary Namenode - confusing name
  • 9. Checkpointing…
      • Set up a single secondary namenode
          – Periodically merges file system image with journal
          – Two secondary namenodes not supported
              – Many instances of accidental two secondary namenodes
              – Known to cause metadata corruption!
      • In HA setup standby replaces secondary
      • Ensure periodic checkpoints are happening
          – Checkpoint time can be queried in scripts
              – Shown in NN webUI as well
          – Real incident
              – A cluster was run for more than a year with no checkpoint!
              – Namenode stopped when it ran out of disk space
              – NN was running for more than a year – no restart!!!
              – Restoring the cluster was not fun!
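
A scripted check for stale checkpoints can be as small as the sketch below. It assumes the last checkpoint time has already been obtained as milliseconds since the epoch (how it is fetched is left out), and the 24-hour threshold and the name `checkpoint_is_stale` are hypothetical choices, not values from the deck.

```python
import time


def checkpoint_is_stale(last_checkpoint_ms, max_age_hours=24.0, now_ms=None):
    """Return True if the last checkpoint is older than max_age_hours.

    last_checkpoint_ms: checkpoint time in milliseconds since the epoch.
    now_ms: current time in the same unit; defaults to the system clock.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    age_hours = (now_ms - last_checkpoint_ms) / 3_600_000
    return age_hours > max_age_hours


if __name__ == "__main__":
    # Hypothetical reading: a checkpoint taken 30 hours ago.
    thirty_hours_ago = int(time.time() * 1000) - 30 * 3_600_000
    if checkpoint_is_stale(thirty_hours_ago):
        print("ALERT: no recent namenode checkpoint")
```

Wiring this into an existing alerting system would have caught the year-long gap described in the incident long before the disk filled up.
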
  • 10. Don’t edit the metadata files!
      • Editing can corrupt the cluster state
          – Might result in loss of data
      • Real incident
          – NN misconfigured to point to another NN’s metadata
          – DNs can’t register due to namespace ID mismatch
              – System detected the problem correctly
              – Safety net ignored by the admin!
          – Admin edits the namenode VERSION file to match ids
      What happens next?
  • 11. Guard Against Accidental Deletion
      • rm -r deletes the data at the speed of Hadoop!
          – ctrl-c of the command does not stop deletion!
          – Undeleting files on datanodes is hard & time consuming
              – Immediately shut down NN, unmount disks on datanodes
              – Recover deleted files
              – Start namenode without the delete operation in edits
      • Enable Trash
      • Real incident
          – Customer is running a distro of Hadoop with trash not enabled
          – Deletes a large dir (100 TB) and shuts down NN immediately
          – Support person asks NN to be restarted to see if trash is enabled!
      What happens next?
      • Now HDFS has Snapshots!
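
Whether trash is enabled can be checked from the configuration file instead of restarting anything. The sketch below assumes trash is controlled by the `fs.trash.interval` property (a retention time in minutes, where 0 means disabled) and that the configuration uses the usual `<configuration><property>` layout; the helper names and the sample XML are illustrative.

```python
import xml.etree.ElementTree as ET


def trash_interval_minutes(core_site_xml):
    """Return the fs.trash.interval value from a core-site.xml string.

    Returns 0 (trash disabled) when the property is absent or empty.
    """
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name == "fs.trash.interval":
            value = (prop.findtext("value") or "0").strip()
            return int(value)
    return 0


def trash_enabled(core_site_xml):
    """Trash is active only when the retention interval is positive."""
    return trash_interval_minutes(core_site_xml) > 0


if __name__ == "__main__":
    sample = """
    <configuration>
      <property>
        <name>fs.trash.interval</name>
        <value>1440</value>
      </property>
    </configuration>
    """
    print("trash enabled:", trash_enabled(sample))
```
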
  • 12. Monitor Usage
      • Cluster storage, nodes, files, blocks grow
          – Update NN heap, handler count, number of DN xceivers
          – Tweak other related config periodically
      • Monitor the hardware usage for your workload
          – Disk I/O, network I/O, CPU and memory usage
          – Use this information when expanding cluster capacity
      • Monitor the usage with Hadoop metrics
          – JVM metrics – GC times, Memory used, Thread Status
          – RPC metrics – especially latency to track slowdowns
          – HDFS metrics
              – Used storage, # of files and blocks, total load on the cluster
              – File System operations
          – MapReduce Metrics – Slot utilization and Job status
      • Tweak configurations during upgrades/maintenance
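
Usage metrics such as file and block counts can be pulled out of the JSON that the NameNode's JMX HTTP endpoint returns. The sketch below assumes the response has a top-level `beans` array whose entries carry a `name` field; the bean name, attribute names and numbers in the sample are illustrative and should be checked against the output of your Hadoop version.

```python
import json


def bean_attributes(jmx_json, bean_name, wanted):
    """Extract selected attributes of one bean from a JMX JSON document.

    Returns a dict with only the requested keys that are present,
    or an empty dict if the bean is not found.
    """
    doc = json.loads(jmx_json)
    for bean in doc.get("beans", []):
        if bean.get("name") == bean_name:
            return {key: bean[key] for key in wanted if key in bean}
    return {}


if __name__ == "__main__":
    # Hypothetical response body; real output contains many more beans.
    sample = json.dumps({
        "beans": [
            {
                "name": "Hadoop:service=NameNode,name=FSNamesystem",
                "FilesTotal": 1200,
                "BlocksTotal": 3400,
                "CapacityUsed": 5000,
            }
        ]
    })
    print(bean_attributes(
        sample,
        "Hadoop:service=NameNode,name=FSNamesystem",
        ["FilesTotal", "BlocksTotal"],
    ))
```

Recording these values on a schedule gives the growth trend needed to decide when the NN heap and handler counts have to be revisited.
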
  • 13. Monitoring Simplified With Ambari: Cluster Metrics Summary
  • 14. Monitoring Simplified With Ambari: HDFS Metrics Summary
  • 15. Monitoring Simplified With Ambari: MapReduce Metrics Summary
  • 16. Monitor Failures
      • If a large % of datanodes fail, put NN into safemode
          – Avoids unnecessary replication
          – Bring back the datanodes or rack
      • Track dead datanodes
          – Bring back datanodes when the number grows
      • Ensure cluster storage utilization is < 85%
          – When the cluster is nearly full things slow down
      • Monitor for corrupt blocks
          – Delete tmp files with replication factor = 1 and missing blocks
      • Have a portfolio of cluster validation tests/jobs
          – Run them on restart, upgrade & config changes
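
The failure checks on this slide reduce to two ratios. The sketch below is one possible way to turn them into alerts: the 85% storage limit is from the slide, while the 30% dead-node fraction standing in for "a large %" and the alert labels are assumptions.

```python
def cluster_alerts(used_bytes, capacity_bytes, dead_nodes, total_nodes,
                   max_utilization=0.85, safemode_dead_fraction=0.30):
    """Return a list of alert labels for storage and datanode health.

    "storage":        utilization has reached max_utilization
    "safemode":       dead-node fraction is high enough to consider safemode
    "dead-datanodes": some datanodes are dead but below that fraction
    """
    if capacity_bytes <= 0 or total_nodes <= 0:
        raise ValueError("capacity and node count must be positive")
    alerts = []
    if used_bytes / capacity_bytes >= max_utilization:
        alerts.append("storage")
    if dead_nodes / total_nodes >= safemode_dead_fraction:
        alerts.append("safemode")
    elif dead_nodes > 0:
        alerts.append("dead-datanodes")
    return alerts


if __name__ == "__main__":
    # Hypothetical cluster: 90% full with 1 of 40 datanodes dead.
    print(cluster_alerts(90, 100, 1, 40))
```
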
  • 17. Tools To Manage Clusters
      • Use Balancer periodically
          – Distributes data and hence processing
          – Important to run after expanding the cluster
          – Use appropriate balancer bandwidth – does not need restart
              – dfsadmin -setBalancerBandwidth <bandwidth>
      • Decommissioning
          – Before removing/replacing DNs from the cluster
      • Distcp for copying data to another cluster
          – Backup, Disaster recovery
          – More enhancements to come in the near future
      • Tooling can be done around JMX/JMX http
          – See the list - http://<nn>/jmx?get=Hadoop:service=NameNode
          – All information equivalent to NN WebUI
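
The balancer bandwidth argument is given in bytes per second, which is easy to get wrong by a few orders of magnitude. The sketch below builds the argument list from a value in MB/s; the `hadoop` launcher name is an assumption (newer releases use `hdfs dfsadmin`), and the helper only constructs the command without running it.

```python
def balancer_bandwidth_command(mb_per_second, launcher="hadoop"):
    """Build the dfsadmin command that sets balancer bandwidth.

    mb_per_second is converted to bytes per second, the unit
    -setBalancerBandwidth expects.
    """
    if mb_per_second <= 0:
        raise ValueError("bandwidth must be positive")
    bytes_per_second = int(mb_per_second * 1024 * 1024)
    return [launcher, "dfsadmin", "-setBalancerBandwidth", str(bytes_per_second)]


if __name__ == "__main__":
    # Hypothetical setting: 10 MB/s per datanode.
    print(" ".join(balancer_bandwidth_command(10)))
```

The returned list can be passed to `subprocess.run` on a node with the Hadoop client installed.
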
  • 18. Further Simplify Management
      • HDFS uses JBODs with replication, not RAID
          – Monitors nodes, disks, block checksums
          – Automatic Recovery - parallel – very fast
              – Recovers entire 12TB node in 10s of minutes in a 100 node cluster
          Compare with the cost & urgency of repairing a RAID 5!
      • Spare cluster capacity further simplifies management
          – Nodes/clusters continue to run on failures, with lower capacity
              – Nodes and disks can be fixed when convenient (unlike RAID)
              – Configure how many disk failures => node failure
          – 1 operator can manage 3-4K nodes
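
The "10s of minutes" figure follows from recovery being spread across all surviving nodes. The back-of-the-envelope sketch below assumes the failed node's data is re-replicated evenly by the remaining nodes and that each contributes a steady 50 MB/s; that rate is a hypothetical number chosen for illustration, not one from the deck.

```python
def recovery_minutes(failed_node_tb, cluster_nodes, per_node_mb_per_s):
    """Estimate minutes to re-replicate one failed node's data.

    The work is divided evenly over the surviving nodes, each copying
    at per_node_mb_per_s. Uses decimal units (1 TB = 1,000,000 MB).
    """
    surviving = cluster_nodes - 1
    if surviving <= 0 or per_node_mb_per_s <= 0:
        raise ValueError("need at least two nodes and a positive rate")
    total_mb = failed_node_tb * 1_000_000
    per_node_mb = total_mb / surviving
    return per_node_mb / per_node_mb_per_s / 60


if __name__ == "__main__":
    for nodes in (10, 100, 1000):
        print(nodes, "nodes:", round(recovery_minutes(12, nodes, 50)), "minutes")
```

Under these assumptions a 12TB node in a 100 node cluster recovers in roughly 40 minutes, and the same failure in a 10 node cluster takes more than ten times longer, which is the sense in which larger clusters heal faster.
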
  • 19. Design For Multi-tenancy
      • Share compute capacity with Capacity Scheduler
          – Queue(s) and sub-queues with a guaranteed capacity per tenant
          – Almost like dedicated hardware
          – Better than private cluster – access to unused capacity
          – Resource limits for tasks
              – Memory limits are monitored
              – C-groups just got into Yarn
              – Resource isolation without VM overhead!
      • Share HDFS Storage
          – Set quotas per-user and per-project data directories
          – Federation - Isolate categories of uses to separate namespaces
              – Production vs. experimental, HBase etc.
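
Guaranteed capacity per tenant is expressed as a percentage per queue, and sibling queue percentages are expected to add up to 100. The sketch below checks a proposed split and converts it into guaranteed slots; the queue names, the slot count and both helper names are hypothetical.

```python
def validate_queue_capacities(capacities):
    """Check that queue capacities are non-negative and sum to 100 percent."""
    if any(pct < 0 for pct in capacities.values()):
        raise ValueError("queue capacity cannot be negative")
    total = sum(capacities.values())
    if abs(total - 100.0) > 1e-6:
        raise ValueError(f"queue capacities sum to {total}, expected 100")
    return True


def guaranteed_slots(capacities, cluster_slots):
    """Translate percentage guarantees into whole slots per queue."""
    validate_queue_capacities(capacities)
    return {queue: int(cluster_slots * pct / 100)
            for queue, pct in capacities.items()}


if __name__ == "__main__":
    # Hypothetical tenants sharing a 200-slot cluster.
    print(guaranteed_slots({"prod": 60, "research": 30, "adhoc": 10}, 200))
```

Each tenant can still borrow capacity that others leave idle, which is the advantage over private clusters noted on the slide.
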
  • 20. Train Users
      • Train users on best practices on writing apps
      • Reduce storage use
          – Delete unnecessary data periodically
          – Move cold data into Hadoop archive
      • Encourage using replication >= 3 for important data
          – Hot data also needs higher replication
      • Set up a small test cluster
          – Users test their code before moving to production
          – Avoid debugging in production cluster
      • Set up user mailing list for information exchange
      • Encourage creating jiras in Apache
          – Helps community identify issues, fix bugs, stabilize quickly
  • 21. Thank You – Q&A
      Summary
      1. Choose suitable server hardware and cluster sizes
      2. Configuration is key
      3. Checkpointing
      4. Don’t edit metadata files
      5. Guard against accidental deletions
      6. Monitor usage and failures
      7. Use available tools for managing the cluster
      8. Simplify management with spare capacity
      9. Design for multi-tenancy
      10. Train your users on best practices