Datadog + Rundeck at DASH 2020


Find it. Fix it. Fast.
Datadog has alerted you to a problem.
The clock is ticking. How do you take action?

  2. Monitoring Engineer: Is there any way to streamline these repeatable tasks? Chasing all these alerts is time-consuming. I can see where the problems are; I just don’t have a way to fix them. How am I going to hit my KPI of reducing alert counts and MTTR?
     Service Owner: Why am I getting woken up at all hours by my monitoring team? How am I going to hit my KPI of service availability and reliability? How could I give that team the access they need to troubleshoot before they call? I have these scripts; what if they could just run them for me?
  3. How do we make it easier for the first line of defense to take action? How much time are your subject matter experts spending on tasks that can be automated? How fast can we gather additional troubleshooting information or attempt a fix? Monitoring solutions today know a lot about the health of your infrastructure, but lack the ability to do something about it.
  4. Runbook Automation (RBA): Without RBA vs. With RBA. The 3 options without RBA:
     1. Decipher the wiki (what does it mean? how old is it?)
     2. Ad-hoc tool/script usage (where is it? what is the syntax?)
     3. ESCALATE!
  5. Can I see an example of automating a fix using Rundeck? Of course!
     1. Our application has two NGINX servers.
     2. If these servers go down, the first troubleshooting step is always “Restart the Service”.
     3. Using Datadog to track the service status, we can automate this procedure by firing a webhook to Rundeck.
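The flow on this slide can be sketched roughly as follows. This is a minimal illustration, not the setup used in the demo: it assumes a small relay that receives a Datadog webhook payload and calls the Rundeck job-run API with a token. The host `rundeck.example.com`, the job ID, the API version number, the `host` job option, and the payload field names (`transition`, `service`, `host`) are all hypothetical; in practice the payload fields are whatever you define in the Datadog webhook's custom payload template.

```python
import json
import urllib.request

# Hypothetical values; replace with your own Rundeck host, API version, and job ID.
RUNDECK_URL = "https://rundeck.example.com"
API_VERSION = 41                    # assumption: any version your server supports
RESTART_JOB_ID = "restart-nginx"    # hypothetical job that restarts the service


def should_restart(alert: dict) -> bool:
    """Act only when the NGINX monitor has just entered the alerting state.

    The field names are assumptions about a custom webhook payload,
    not a fixed Datadog schema.
    """
    return alert.get("service") == "nginx" and alert.get("transition") == "Triggered"


def build_run_request(alert: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST to Rundeck's job-run endpoint."""
    # Pass the affected host to the job as an option (hypothetical option name).
    body = json.dumps({"options": {"host": alert["host"]}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{RUNDECK_URL}/api/{API_VERSION}/job/{RESTART_JOB_ID}/run",
        data=body,
        method="POST",
        headers={
            "X-Rundeck-Auth-Token": token,
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )


def handle_alert(alert: dict, token: str):
    """Return a ready-to-send request for alerts that warrant a restart, else None."""
    if not should_restart(alert):
        return None
    return build_run_request(alert, token)
```

Passing the returned request to `urllib.request.urlopen` would start the job. Keeping the decision (`should_restart`) separate from the HTTP call makes the "try this first every time" rule easy to test without touching a live Rundeck server.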
  6. Demo Time!
  7. So why Datadog + Rundeck Automation? Rundeck streamlines repeatable automation to turn monitoring into resolving.
     Safely provide task execution to teams that don’t directly manage a service or infrastructure.
     Reduce the burden on subject matter experts and allow them to focus on critical issues.
     Automate the first-line-of-defense tasks. Any “try this first every time” action is a likely candidate for automation.