Cyrus Dasadia introduces CitoEngine, an open source alert management and automation tool that helps teams manage the growing number of alerts generated by monitoring systems. CitoEngine allows users to easily define events, create actions for those events, and integrate with third party applications. It creates unique incidents for each alert, provides detailed dashboards to view current and acknowledged incidents, and generates reports. CitoEngine is designed to help network operations centers, DevOps, and operations teams manage alerts more efficiently.
2. Who is Cyrus?
● Sysadmin / part-time programmer for 14+ yrs.
● Monty Python fan.
● Sr. Tech Lead at InMobi.
Twitter: @ExtremeUnix
Email: cyrus@extremeunix.com
3. Why the long face?
● You installed the best monitoring application.
● You have awesome monitoring scripts.
● You purchased a monitoring service.
● You have the best NOC/Incident management team.
but...
4. Even the best teams succumb to it.
Cthulhu image: http://ordinary-gentlemen.com/blog/2013/10/10/god-digs-ambiguity
5. So, what leads to this problem?
● As servers and teams grow, there are even more alerts to manage.
● Alerts are not constantly tuned for changing thresholds.
● Monitoring tools generate false positive events.
● Teams lack the discipline to ack alerts during releases/outages.
6. What can help manage this chaos?
all logos are trademarks of their respective companies
7. What can really help?
A tool that:
● Lets me easily define events.
● Lets me create actions on such events.
● Easily integrates with 3rd party applications.
Most of you are sysadmins, developers or devops. You've been through this process a lot, especially those of you from the NOC (I feel your pain).
When it rains, it pours. A single service can create so much noise that you end up missing other alerts.
As you grow, monitoring needs fine-tuning: you have to keep changing thresholds, and as always there is that guy who forgot to disable notifications during a maintenance or outage.
AWS CloudWatch:
Can alert on almost any AWS service.
Can be used to trigger SNS.
Mostly limited to AWS infrastructure.
Not free.
Sensu:
Has the ability to add conditional routers.
Mostly a monitoring framework.
Needs sensu client.
M/Monit:
Scope limited to process, files or directories.
riemann.io:
Good stream management service.
Somewhat steep learning curve.
and the list goes on...
Emphasis on simplicity of use.
Integrates with any monitoring system: define events and take actions against them by invoking plugins. It's as simple as that.
Overview:
CitoEngine accepts events via a simple REST API, sends each one as a message to RabbitMQ or SQS, then consumes the messages and takes actions on them.
Actions on events are done by invoking plugins/scripts via a plugin server.
Emphasis on simplicity of use and architecture.
It's an external server which can be run in isolation.
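The flow described in the overview (accept an event, queue it, consume it, invoke a plugin) can be sketched roughly as below. This is an illustration only, not CitoEngine's actual code or API: the field names (`event_code`, `element`, `message`), the `EVENT_ACTIONS` mapping, and the `notify_oncall` plugin are all hypothetical, an in-memory `queue.Queue` stands in for RabbitMQ or SQS, and a plain Python function stands in for a script run by the plugin server.

```python
import json
import queue

# Hypothetical event definitions: event code -> name of the plugin to invoke.
EVENT_ACTIONS = {
    "DISK_FULL": "notify_oncall",
}


def notify_oncall(event):
    """Stand-in plugin; a real one would be a script run by the plugin server."""
    return f"paged on-call for {event['element']}: {event['message']}"


# Hypothetical plugin registry, keyed by plugin name.
PLUGINS = {"notify_oncall": notify_oncall}

# In-memory stand-in for the RabbitMQ or SQS queue.
event_queue = queue.Queue()


def accept_event(payload):
    """Stand-in for the REST endpoint: validate the JSON body and enqueue it."""
    event = json.loads(payload)
    for field in ("event_code", "element", "message"):
        if field not in event:
            raise ValueError(f"missing field: {field}")
    event_queue.put(event)


def consume_events():
    """Drain the queue and run the plugin mapped to each event code."""
    results = []
    while not event_queue.empty():
        event = event_queue.get()
        plugin_name = EVENT_ACTIONS.get(event["event_code"])
        if plugin_name is None:
            results.append(f"no action defined for {event['event_code']}")
            continue
        results.append(PLUGINS[plugin_name](event))
    return results
```

A monitoring system would call `accept_event` with a JSON body for each alert, and a separate worker would call `consume_events`. Keeping the queue between the two is what lets the accepting side stay fast while actions run at their own pace.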