SlideShare une entreprise Scribd logo
1  sur  53
Nagios XI Large ImplementationTips and Tricks by Nathan Brodericknroderick@nuskin.comnagios@aivector.com Nu Skin International
How to Start
Planning Determine what systems are important Determine the need for legacy monitors Get a company organization chart Break the monitoring into manageable components Make a written migration plan
Meat and Potato Monitoring 3 AM Rule Monitor critical items first The best monitors check core processes Keep it simple HTTP Checks, Ping Checks Only set up what is necessary Don’t over-monitor
Programs We Migrated
Programs BAC SiteScope 6 SiteScope 8 Foglight – monitors Java processes and WebLogic ProcessError ( Nu Skin Custom Email Solution ) Notify Attention Custom scripts on Linux systems
SiteScope 6, 8 SiteScope is one of HP tools for monitoring We gathered inventory on the alerts Migrated alerts one at a time Nagios has plugins for most alerts After migrating alerts, we uncovered existing problems
Migrating BAC(HP Business Availability Center) BAC was used to monitor complex business processes BAC alerts were ignored The cost per monitor was about $1300 BAC monitors were migrated one at a time to Selenium
Migrating NNM (Network Node Manager) NNM is an HP tool for monitoring network devices We built a Selenium script to migrate NNM Heavy load on system because of SNMP traps NNM spammed the Help Desk Monitors “Cried Wolf” frequently Difficult to manage
Replacing Foglight with JMX JMX enables monitoring of JMX attributes in Java systems The “check_jmx” plugin monitors Java systems We built a custom JMX module that collects data from WebLogic
Custom Scripts We used “send_nsca” with the scripts Scripts usually ran from Cron Where possible, we replaced the scripts with Nagiosplugins We worked with Administrators
Custom Email Solution Most the alerts received were too difficult to understand We migrated some alerts and stopped having others sent Most errors were ignored by Help Desk
Old Alerts ERRORS have occurred during copy and processing of the sovoldiff file to odyssey. The process did NOT complete successfully: For additional information, see /home/volgen/vandg/sovoldiff.odyssey.diag on icarus. Also, check sovoldiff.log: .sovoldiff:odyssey beginning remote copy of data [04/25/11 13:29 MDT] .sovoldiff:odyssey completed remote copy of data [04/25/11 13:29 MDT] .sovoldiff:odyssey beginning remote execution [04/25/11 13:29 MDT] .sovoldiff:odyssey beginning sovolhdrdiffsqlldr [04/25/11 13:29 MDT] ....Commit point reached - logical record count 200 .sovoldiff:odyssey completed sqlldr [04/25/11 13:29 MDT] .sovoldiff:odyssey WARNING: run of sqlldr on odyssey has generated 33 entries i .n sovolhdrdiff.bad [04/25/11 13:29 MDT] ...d|01-G5235390|||N||||08112010|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|CAW9181762|||N||||19042011|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK3045390|||N||||20032010|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK3053267|||N||||31012010|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8046272|||N||||23042011|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8110877|||N||||01042011|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8122941|||N||||02032010|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8126233|||N||||04042011|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8130967|||N||||08042011|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8131201|||N||||08042011|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8131462|||N||||04042011|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8139745|||N||||19102010|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8140172|||N||||09042011|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8140300|||N||||09042011|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8142160|||N||||09042011|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8143219|||N||||08022010|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8143457|||N||||23042010|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8144185|||N||||14052010|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8146497|||N||||05042011|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8146771|||N||||05042011|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| .sovoldiff:odyssey Bad data has been located, notification being emailed to bvc .handr@nuskin.com,jlpoulse@nuskin.com,swgiles@nuskin.net,bjbeck@nuskin.com [04/ .25/11 13:29 MDT] .sovoldiff:odyssey ERROR: 67 data errors, above threshold of 53 - more than 1 o .ut of every 1000 rows. See on odyssey in sovolhdrdiff.log ora1.out [04/25/11 1 .3:29 MDT] .sovoldiff:odyssey completed remote execution [04/25/11 13:29 MDT] .sovoldiff:odyssey ERROR: Did NOT successfully complete operations in minion [0 .4/25/11 13:29 MDT] .sovoldiff:odyssey ERROR: find: cannot stat */* .warnings .5/11 13:29 MDT]
[object Object],[object Object]
Choose the Nagios Monitoring Method Active Check from Nagios Server (normal)  Active Check performed by remote client NRPE, NSClient Passive Check – Listen to 3rd party monitors NSCA
Integrating with OEM ( Oracle Enterprise Manager ) OEM is used to monitor Oracle databases Monitoring is done by OEM and Grid Control Alerts are passed to Nagios using “send_nsca” Nagios alerts are cleared by the Help Desk Passive alerts or “buckets” in Nagios
Integrating SAP “NRPE” collects process data from SAP Status monitoring using “send_nsca” Monitored at a system level Nagios has plugins to monitor SAP
Tools We Use with Nagios
Selenium Testing tool for running software tests Freeware tool to enable Business Process Monitoring We use a headless Linux installation of Selenium Selenium was adapted to monitor web sites Easy to use
Recording in Selenium The following demo will show how you can record user’s interactions with websites in Selenium
Selenium Demo
Playback in Selenium The following demo will show you how to play back the Selenium recordings
Using the Selenium Script The following demo will show you how to use the Selenium recording
Working Demo of Nagios and Selenium The following is a practical demo of Selenium
Mechanize Open source Perl module for web pages Mechanize is for monitoring sites with a simple login and not many pages Mechanize requires technical ability Mechanize had better documentation than WebInject Fast and lightweight Better than Selenium for simple sites
NRPE ( Nagios Remote Program Execution ) Enables monitoring of important processes like diskspace, CPU, memory, processes, etc. Reduces load on the servers Increases monitor reliability Remote monitoring in other countries Checks are active – meaning they are initiated from Nagios Available through the Dag repository with “yum –y install nagios-nrpe”
NRPE Demo The following demo will show how to monitor disk space on a server with Nagios and NRPE
NRPE Demo NRPE Demo
SEND_NSCA Allows remote scripts and systems to contact  Nagios directly Send_nsca is easy to integrate  Very little overhead Reduces the load on the Nagios server Great replacement for SNMP Requires some technical knowledge
SEND_NSCA Demo The following demo will show you how to monitor disk space on a server using “send_nsca”
SEND_NSCA Demo Send_nsca demo
NSClient++ Client for monitoring Windows systems Can monitor disk, CPU, active directory, etc Reduces load on Nagios server Increases monitoring accuracy Easy integration with Nagios
NSClient++ The following demo shows how to monitor Windows using NSClient++
Nsclient ++ Demo
NRPE-NT New Client that acts like NRPE on Linux Helps monitor processes locally on systems Checks are initiated from Nagios Install NSClient++ on server Add the NRPE .dll file to enable port 5666
Challenges and Solutions
SNMP Challenges “Simple” Network Management Protocol not so simple Cryptic Messages Hard to understand alerts when problems happen Chatty – Likes to send messages every second New releases can change OIDs Outdated
SNMP Solutions Use a separate system to catch SNMP traps Process the messages and forward them to Nagios using “send_nsca” Try not to use SNMP; work with vendors to provide alternative solutions
Backup and Restore Nagios We needed a way to backup the Nagios application and system Backups are highly recommended! Nagios created a backup/restore script for everybody We created an automated script to perform a backup and a restore every week
Load Challenges At about 500 monitors we realized we needed more hardware At first we upgraded the number of CPU cores  At 1,000 we experienced high IO wait times We decreased the check times from every two minutes to every three minutes
Server Load Reduction Techniques Change the “temp_path” and “temp_file” locations in the nagios.cfg file Point the “status_file” parameter to the RAM disk Change the “object_cache_file” to the RAM disk Use EMC disk for /var/lib and /usr/local Ensure host and service latency times are low Use NRPE and “send_nsca” Run interpreted scripts on a separate system whenever possible
Items of Interest
Nagios Graphs Meaningful names come from the scripts or programs Renaming graphs can cause the graph to break We built a solution to clean up broken graphs Good graphs have the metric and a clear data source We would like the capability to control the data source
Example of a Good Graph
Example of a Bad Graph
Dev, Test, and Production Dev, Test, Prod environment is ideal One system is not enough New alerts should be tested before deployment New alerts can cause load on the system or unexpected problems New releases of Nagios should be tested and deployed after thorough testing
Customizing the Wizards We needed custom wizards Nagios team created wizards for us We adapted some wizards to our needs Requires PHP knowledge
Creating a Map We needed a map for the alerts We custom created our own map Our map queries the Nagios database directly for information
Custom Globe
Nagios Conference 2011 - Nate Broderick - Nagios XI Large Implementation Tips and Tricks

Contenu connexe

Tendances

Zabbix visión general del sistema - 04.12.2013
Zabbix   visión general del sistema - 04.12.2013Zabbix   visión general del sistema - 04.12.2013
Zabbix visión general del sistema - 04.12.2013
Emmanuel Arias
 

Tendances (20)

NagiosXI - Astiostech NagiosXI Event with NTT MSC Cyberjaya
NagiosXI - Astiostech NagiosXI Event with NTT MSC CyberjayaNagiosXI - Astiostech NagiosXI Event with NTT MSC Cyberjaya
NagiosXI - Astiostech NagiosXI Event with NTT MSC Cyberjaya
 
Nagios Conference 2014 - Andy Brist - Nagios XI Failover and HA Solutions
Nagios Conference 2014 - Andy Brist - Nagios XI Failover and HA SolutionsNagios Conference 2014 - Andy Brist - Nagios XI Failover and HA Solutions
Nagios Conference 2014 - Andy Brist - Nagios XI Failover and HA Solutions
 
Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring
Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios MonitoringNagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring
Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring
 
Nagios Conference 2014 - Leland Lammert - Distributed Heirarchical Nagios
Nagios Conference 2014 - Leland Lammert - Distributed Heirarchical NagiosNagios Conference 2014 - Leland Lammert - Distributed Heirarchical Nagios
Nagios Conference 2014 - Leland Lammert - Distributed Heirarchical Nagios
 
What is Nagios XI and how is it different from Nagios Core
What is Nagios XI and how is it different from Nagios CoreWhat is Nagios XI and how is it different from Nagios Core
What is Nagios XI and how is it different from Nagios Core
 
Nagios core vs. nagios xi presentation power point.pptx [diperbaiki]
Nagios core vs. nagios xi presentation power point.pptx [diperbaiki]Nagios core vs. nagios xi presentation power point.pptx [diperbaiki]
Nagios core vs. nagios xi presentation power point.pptx [diperbaiki]
 
Nagios Conference 2014 - Nick Winn - Using Nagios XI to Empower Your Develope...
Nagios Conference 2014 - Nick Winn - Using Nagios XI to Empower Your Develope...Nagios Conference 2014 - Nick Winn - Using Nagios XI to Empower Your Develope...
Nagios Conference 2014 - Nick Winn - Using Nagios XI to Empower Your Develope...
 
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing Nagios
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing NagiosNagios Conference 2014 - Dorance Martinez Cortes - Customizing Nagios
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing Nagios
 
Nagios Conference 2014 - Mike Merideth - The Art and Zen of Managing Nagios w...
Nagios Conference 2014 - Mike Merideth - The Art and Zen of Managing Nagios w...Nagios Conference 2014 - Mike Merideth - The Art and Zen of Managing Nagios w...
Nagios Conference 2014 - Mike Merideth - The Art and Zen of Managing Nagios w...
 
Nagios Conference 2013 - Eric Stanley and Andy Brist - API and Nagios
Nagios Conference 2013 - Eric Stanley and Andy Brist - API and NagiosNagios Conference 2013 - Eric Stanley and Andy Brist - API and Nagios
Nagios Conference 2013 - Eric Stanley and Andy Brist - API and Nagios
 
Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Ava...
Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Ava...Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Ava...
Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Ava...
 
Dave Williams - Nagios Log Server - Practical Experience
Dave Williams - Nagios Log Server - Practical ExperienceDave Williams - Nagios Log Server - Practical Experience
Dave Williams - Nagios Log Server - Practical Experience
 
Zabbix visión general del sistema - 04.12.2013
Zabbix   visión general del sistema - 04.12.2013Zabbix   visión general del sistema - 04.12.2013
Zabbix visión general del sistema - 04.12.2013
 
Nagios Conference 2011 - Daniel Wittenberg - Scaling Nagios At A Giant Insur...
Nagios Conference 2011 - Daniel Wittenberg -  Scaling Nagios At A Giant Insur...Nagios Conference 2011 - Daniel Wittenberg -  Scaling Nagios At A Giant Insur...
Nagios Conference 2011 - Daniel Wittenberg - Scaling Nagios At A Giant Insur...
 
Zabbix introduction ( RadixCloud Radix Technologies SA)
Zabbix introduction ( RadixCloud Radix Technologies SA)Zabbix introduction ( RadixCloud Radix Technologies SA)
Zabbix introduction ( RadixCloud Radix Technologies SA)
 
Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016
Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016
Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016
 
Dev Talk: Event Manipulation and Testing
Dev Talk: Event Manipulation and Testing Dev Talk: Event Manipulation and Testing
Dev Talk: Event Manipulation and Testing
 
Securing the Helix Platform at Citrix
Securing the Helix Platform at CitrixSecuring the Helix Platform at Citrix
Securing the Helix Platform at Citrix
 
Moving Windows Applications to the Cloud
Moving Windows Applications to the CloudMoving Windows Applications to the Cloud
Moving Windows Applications to the Cloud
 
OSMC 2014: Interesting use cases of Zabbix improvements in latest versions | ...
OSMC 2014: Interesting use cases of Zabbix improvements in latest versions | ...OSMC 2014: Interesting use cases of Zabbix improvements in latest versions | ...
OSMC 2014: Interesting use cases of Zabbix improvements in latest versions | ...
 

Similaire à Nagios Conference 2011 - Nate Broderick - Nagios XI Large Implementation Tips and Tricks

Upgrade to 2008 Best of PASS
Upgrade to 2008 Best of PASSUpgrade to 2008 Best of PASS
Upgrade to 2008 Best of PASS
sqlserver.co.il
 

Similaire à Nagios Conference 2011 - Nate Broderick - Nagios XI Large Implementation Tips and Tricks (20)

Nagios Conference 2011 - Jeff Sly - Case Study Nagios @ Nu Skin
Nagios Conference 2011 - Jeff Sly - Case Study Nagios @ Nu SkinNagios Conference 2011 - Jeff Sly - Case Study Nagios @ Nu Skin
Nagios Conference 2011 - Jeff Sly - Case Study Nagios @ Nu Skin
 
Monitoring Server Temperature with Opsview
Monitoring Server Temperature with OpsviewMonitoring Server Temperature with Opsview
Monitoring Server Temperature with Opsview
 
Lesson_08_Continuous_Monitoring.pdf
Lesson_08_Continuous_Monitoring.pdfLesson_08_Continuous_Monitoring.pdf
Lesson_08_Continuous_Monitoring.pdf
 
Linux Server Deep Dives (DrupalCon Amsterdam)
Linux Server Deep Dives (DrupalCon Amsterdam)Linux Server Deep Dives (DrupalCon Amsterdam)
Linux Server Deep Dives (DrupalCon Amsterdam)
 
EnCase Enterprise Basic File Collection
EnCase Enterprise Basic File Collection EnCase Enterprise Basic File Collection
EnCase Enterprise Basic File Collection
 
Upgrade to 2008 Best of PASS
Upgrade to 2008 Best of PASSUpgrade to 2008 Best of PASS
Upgrade to 2008 Best of PASS
 
Nagios En
Nagios EnNagios En
Nagios En
 
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios CoreNrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core
 
NRPE - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core 4 and others.
NRPE - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core 4 and others.NRPE - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core 4 and others.
NRPE - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core 4 and others.
 
Cl221
Cl221Cl221
Cl221
 
NGINX Installation and Tuning
NGINX Installation and TuningNGINX Installation and Tuning
NGINX Installation and Tuning
 
Infrastructure as Code to Maintain your Sanity
Infrastructure as Code to Maintain your SanityInfrastructure as Code to Maintain your Sanity
Infrastructure as Code to Maintain your Sanity
 
Performance Analysis of Idle Programs
Performance Analysis of Idle ProgramsPerformance Analysis of Idle Programs
Performance Analysis of Idle Programs
 
Devops with Python by Yaniv Cohen DevopShift
Devops with Python by Yaniv Cohen DevopShiftDevops with Python by Yaniv Cohen DevopShift
Devops with Python by Yaniv Cohen DevopShift
 
Nagios Conference 2014 - Frank Pantaleo - Nagios Monitoring of Netezza Databases
Nagios Conference 2014 - Frank Pantaleo - Nagios Monitoring of Netezza DatabasesNagios Conference 2014 - Frank Pantaleo - Nagios Monitoring of Netezza Databases
Nagios Conference 2014 - Frank Pantaleo - Nagios Monitoring of Netezza Databases
 
AEO Training - 2023.pdf
AEO Training - 2023.pdfAEO Training - 2023.pdf
AEO Training - 2023.pdf
 
ELC-E 2010: The Right Approach to Minimal Boot Times
ELC-E 2010: The Right Approach to Minimal Boot TimesELC-E 2010: The Right Approach to Minimal Boot Times
ELC-E 2010: The Right Approach to Minimal Boot Times
 
Monitoring at/with SUSE 2015
Monitoring at/with SUSE 2015Monitoring at/with SUSE 2015
Monitoring at/with SUSE 2015
 
Datafoucs 2014 on line digital forensic investigations damir delija 2
Datafoucs 2014 on line digital forensic investigations damir delija 2Datafoucs 2014 on line digital forensic investigations damir delija 2
Datafoucs 2014 on line digital forensic investigations damir delija 2
 
PaaSTA: Running applications at Yelp
PaaSTA: Running applications at YelpPaaSTA: Running applications at Yelp
PaaSTA: Running applications at Yelp
 

Plus de Nagios

Plus de Nagios (20)

Jesse Olson - Nagios Log Server Architecture Overview
Jesse Olson - Nagios Log Server Architecture OverviewJesse Olson - Nagios Log Server Architecture Overview
Jesse Olson - Nagios Log Server Architecture Overview
 
Trevor McDonald - Nagios XI Under The Hood
Trevor McDonald  - Nagios XI Under The HoodTrevor McDonald  - Nagios XI Under The Hood
Trevor McDonald - Nagios XI Under The Hood
 
Sean Falzon - Nagios - Resilient Notifications
Sean Falzon - Nagios - Resilient NotificationsSean Falzon - Nagios - Resilient Notifications
Sean Falzon - Nagios - Resilient Notifications
 
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise EditionMarcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
 
Janice Singh - Writing Custom Nagios Plugins
Janice Singh - Writing Custom Nagios PluginsJanice Singh - Writing Custom Nagios Plugins
Janice Singh - Writing Custom Nagios Plugins
 
Mike Weber - Nagios and Group Deployment of Service Checks
Mike Weber - Nagios and Group Deployment of Service ChecksMike Weber - Nagios and Group Deployment of Service Checks
Mike Weber - Nagios and Group Deployment of Service Checks
 
Mike Guthrie - Revamping Your 10 Year Old Nagios Installation
Mike Guthrie - Revamping Your 10 Year Old Nagios InstallationMike Guthrie - Revamping Your 10 Year Old Nagios Installation
Mike Guthrie - Revamping Your 10 Year Old Nagios Installation
 
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
 
Matt Bruzek - Monitoring Your Public Cloud With Nagios
Matt Bruzek - Monitoring Your Public Cloud With NagiosMatt Bruzek - Monitoring Your Public Cloud With Nagios
Matt Bruzek - Monitoring Your Public Cloud With Nagios
 
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
 
Eric Loyd - Fractal Nagios
Eric Loyd - Fractal NagiosEric Loyd - Fractal Nagios
Eric Loyd - Fractal Nagios
 
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
 
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...
 
Nagios World Conference 2015 - Scott Wilkerson Opening
Nagios World Conference 2015 - Scott Wilkerson OpeningNagios World Conference 2015 - Scott Wilkerson Opening
Nagios World Conference 2015 - Scott Wilkerson Opening
 
Nagios Log Server - Features
Nagios Log Server - FeaturesNagios Log Server - Features
Nagios Log Server - Features
 
Nagios Network Analyzer - Features
Nagios Network Analyzer - FeaturesNagios Network Analyzer - Features
Nagios Network Analyzer - Features
 
Nagios Conference 2014 - Mike Weber - Nagios Rapid Deployment Options
Nagios Conference 2014 - Mike Weber - Nagios Rapid Deployment OptionsNagios Conference 2014 - Mike Weber - Nagios Rapid Deployment Options
Nagios Conference 2014 - Mike Weber - Nagios Rapid Deployment Options
 
Nagios Conference 2014 - Trevor McDonald - Monitoring The Physical World With...
Nagios Conference 2014 - Trevor McDonald - Monitoring The Physical World With...Nagios Conference 2014 - Trevor McDonald - Monitoring The Physical World With...
Nagios Conference 2014 - Trevor McDonald - Monitoring The Physical World With...
 
Nagios Conference 2014 - Abbas Haider Ali - Proactive Alerting and Intelligen...
Nagios Conference 2014 - Abbas Haider Ali - Proactive Alerting and Intelligen...Nagios Conference 2014 - Abbas Haider Ali - Proactive Alerting and Intelligen...
Nagios Conference 2014 - Abbas Haider Ali - Proactive Alerting and Intelligen...
 
Nagios Conference 2014 - Sam Lansing - Utilizing Data Visualizations in Syste...
Nagios Conference 2014 - Sam Lansing - Utilizing Data Visualizations in Syste...Nagios Conference 2014 - Sam Lansing - Utilizing Data Visualizations in Syste...
Nagios Conference 2014 - Sam Lansing - Utilizing Data Visualizations in Syste...
 

Dernier

Dernier (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 

Nagios Conference 2011 - Nate Broderick - Nagios XI Large Implementation Tips and Tricks

  • 1. Nagios XI Large ImplementationTips and Tricks by Nathan Brodericknroderick@nuskin.comnagios@aivector.com Nu Skin International
  • 3. Planning Determine what systems are important Determine the need for legacy monitors Get a company organization chart Break the monitoring into manageable components Make a written migration plan
  • 4. Meat and Potato Monitoring 3 AM Rule Monitor critical items first The best monitors check core processes Keep it simple HTTP Checks, Ping Checks Only set up what is necessary Don’t over-monitor
  • 6. Programs BAC SiteScope 6 SiteScope 8 Foglight – monitors Java processes and WebLogic ProcessError ( Nu Skin Custom Email Solution ) Notify Attention Custom scripts on Linux systems
  • 7. SiteScope 6, 8 SiteScope is one of HP tools for monitoring We gathered inventory on the alerts Migrated alerts one at a time Nagios has plugins for most alerts After migrating alerts, we uncovered existing problems
  • 8. Migrating BAC(HP Business Availability Center) BAC was used to monitor complex business processes BAC alerts were ignored The cost per monitor was about $1300 BAC monitors were migrated one at a time to Selenium
  • 9. Migrating NNM (Network Node Manager) NNM is an HP tool for monitoring network devices We built a Selenium script to migrate NNM Heavy load on system because of SNMP traps NNM spammed the Help Desk Monitors “Cried Wolf” frequently Difficult to manage
  • 10. Replacing Foglight with JMX JMX enables monitoring of JMX attributes in Java systems The “check_jmx” plugin monitors Java systems We built a custom JMX module that collects data from WebLogic
  • 11. Custom Scripts We used “send_nsca” with the scripts Scripts usually ran from Cron Where possible, we replaced the scripts with Nagiosplugins We worked with Administrators
  • 12. Custom Email Solution Most the alerts received were too difficult to understand We migrated some alerts and stopped having others sent Most errors were ignored by Help Desk
  • 13. Old Alerts ERRORS have occurred during copy and processing of the sovoldiff file to odyssey. The process did NOT complete successfully: For additional information, see /home/volgen/vandg/sovoldiff.odyssey.diag on icarus. Also, check sovoldiff.log: .sovoldiff:odyssey beginning remote copy of data [04/25/11 13:29 MDT] .sovoldiff:odyssey completed remote copy of data [04/25/11 13:29 MDT] .sovoldiff:odyssey beginning remote execution [04/25/11 13:29 MDT] .sovoldiff:odyssey beginning sovolhdrdiffsqlldr [04/25/11 13:29 MDT] ....Commit point reached - logical record count 200 .sovoldiff:odyssey completed sqlldr [04/25/11 13:29 MDT] .sovoldiff:odyssey WARNING: run of sqlldr on odyssey has generated 33 entries i .n sovolhdrdiff.bad [04/25/11 13:29 MDT] ...d|01-G5235390|||N||||08112010|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|CAW9181762|||N||||19042011|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK3045390|||N||||20032010|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK3053267|||N||||31012010|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8046272|||N||||23042011|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8110877|||N||||01042011|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8122941|||N||||02032010|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8126233|||N||||04042011|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8130967|||N||||08042011|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8131201|||N||||08042011|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8131462|||N||||04042011|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8139745|||N||||19102010|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8140172|||N||||09042011|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8140300|||N||||09042011|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8142160|||N||||09042011|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8143219|||N||||08022010|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8143457|||N||||23042010|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8144185|||N||||14052010|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8146497|||N||||05042011|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| ...d|HK8146771|||N||||05042011|OPEN|0.00|0.00|0.00|0.00||0||N|N||||||||ADP|AF| .sovoldiff:odyssey Bad data has been located, notification being emailed to bvc .handr@nuskin.com,jlpoulse@nuskin.com,swgiles@nuskin.net,bjbeck@nuskin.com [04/ .25/11 13:29 MDT] .sovoldiff:odyssey ERROR: 67 data errors, above threshold of 53 - more than 1 o .ut of every 1000 rows. See on odyssey in sovolhdrdiff.log ora1.out [04/25/11 1 .3:29 MDT] .sovoldiff:odyssey completed remote execution [04/25/11 13:29 MDT] .sovoldiff:odyssey ERROR: Did NOT successfully complete operations in minion [0 .4/25/11 13:29 MDT] .sovoldiff:odyssey ERROR: find: cannot stat */* .warnings .5/11 13:29 MDT]
  • 14.
  • 15. Choose the Nagios Monitoring Method Active Check from Nagios Server (normal) Active Check performed by remote client NRPE, NSClient Passive Check – Listen to 3rd party monitors NSCA
  • 16. Integrating with OEM ( Oracle Enterprise Manager ) OEM is used to monitor Oracle databases Monitoring is done by OEM and Grid Control Alerts are passed to Nagios using “send_nsca” Nagios alerts are cleared by the Help Desk Passive alerts or “buckets” in Nagios
  • 17. Integrating SAP “NRPE” collects process data from SAP Status monitoring using “send_nsca” Monitored at a system level Nagios has plugins to monitor SAP
  • 18. Tools We Use with Nagios
  • 19. Selenium Testing tool for running software tests Freeware tool to enable Business Process Monitoring We use a headless Linux installation of Selenium Selenium was adapted to monitor web sites Easy to use
  • 20. Recording in Selenium The following demo will show how you can record user’s interactions with websites in Selenium
  • 22. Playback in Selenium The following demo will show you how to play back the Selenium recordings
  • 23.
  • 24. Using the Selenium Script The following demo will show you how to use the Selenium recording
  • 25.
  • 26. Working Demo of Nagios and Selenium The following is a practical demo of Selenium
  • 27.
  • 28. Mechanize Open source Perl module for web pages Mechanize is for monitoring sites with a simple login and not many pages Mechanize requires technical ability Mechanize had better documentation than WebInject Fast and lightweight Better than Selenium for simple sites
  • 29. NRPE ( Nagios Remote Program Execution ) Enables monitoring of important processes like diskspace, CPU, memory, processes, etc. Reduces load on the servers Increases monitor reliability Remote monitoring in other countries Checks are active – meaning they are initiated from Nagios Available through the Dag repository with “yum –y install nagios-nrpe”
  • 30. NRPE Demo The following demo will show how to monitor disk space on a server with Nagios and NRPE
  • 32. SEND_NSCA Allows remote scripts and systems to contact Nagios directly Send_nsca is easy to integrate Very little overhead Reduces the load on the Nagios server Great replacement for SNMP Requires some technical knowledge
  • 33. SEND_NSCA Demo The following demo will show you how to monitor disk space on a server using “send_nsca”
  • 35. NSClient++ Client for monitoring Windows systems Can monitor disk, CPU, active directory, etc Reduces load on Nagios server Increases monitoring accuracy Easy integration with Nagios
  • 36. NSClient++ The following demo shows how to monitor Windows using NSClient++
  • 38. NRPE-NT New Client that acts like NRPE on Linux Helps monitor processes locally on systems Checks are initiated from Nagios Install NSClient++ on server Add the NRPE .dll file to enable port 5666
  • 40. SNMP Challenges “Simple” Network Management Protocol not so simple Cryptic Messages Hard to understand alerts when problems happen Chatty – Likes to send messages every second New releases can change OIDs Outdated
  • 41. SNMP Solutions Use a separate system to catch SNMP traps Process the messages and forward them to Nagios using “send_nsca” Try not to use SNMP; work with vendors to provide alternative solutions
  • 42. Backup and Restore Nagios We needed a way to backup the Nagios application and system Backups are highly recommended! Nagios created a backup/restore script for everybody We created an automated script to perform a backup and a restore every week
  • 43. Load Challenges At about 500 monitors we realized we needed more hardware At first we upgraded the number of CPU cores At 1,000 we experienced high IO wait times We decreased the check times from every two minutes to every three minutes
  • 44. Server Load Reduction Techniques Change the “temp_path” and “temp_file” locations in the nagios.cfg file Point the “status_file” parameter to the RAM disk Change the “object_cache_file” to the RAM disk Use EMC disk for /var/lib and /usr/local Ensure host and service latency times are low Use NRPE and “send_nsca” Run interpreted scripts on a separate system whenever possible
  • 46. Nagios Graphs Meaningful names come from the scripts or programs Renaming graphs can cause the graph to break We built a solution to clean up broken graphs Good graphs have the metric and a clear data source We would like the capability to control the data source
  • 47. Example of a Good Graph
  • 48. Example of a Bad Graph
  • 49. Dev, Test, and Production Dev, Test, Prod environment is ideal One system is not enough New alerts should be tested before deployment New alerts can cause load on the system or unexpected problems New releases of Nagios should be tested and deployed after thorough testing
  • 50. Customizing the Wizards We needed custom wizards Nagios team created wizards for us We adapted some wizards to our needs Requires PHP knowledge
  • 51. Creating a Map We needed a map for the alerts We custom created our own map Our map queries the Nagios database directly for information

Notes de l'éditeur

  1. Weblogic, we had to expose part of the JVM – we expose remote JMX port from the JVM. Central server where the job daemon is running. Runs on port 5697 We maintain a connection to the Weblogic servers. Collects
  2. Are you recording websites or user interactions with websites? I presume that the recording is of the user’s interactions with a website for later, or repeated, playback of those same interactions.
  3. Whoa! Now Send_NSCA has an underline between send and nsca! What should it be, really???The SEND is all capitals in the title but not in the bulleted text in this or previous slides. See the title of the following slide for yet another variation.
  4. I recommend you drop the metric in the third bullet: At 1,000 monitors, we experienced high IO wait times {The added comma to designate one thousand and the capitalization of IO is intentional.}
  5. Way cool globe!