@esigler
Eric Sigler, Head of DevOps, PagerDuty
Incident Response & Coordination
@esigler
Everyone can improve their
Incident Response process
@esigler
Take the time to clearly define
your process today
@esigler
Why should organizations invest
time improving it?
@esigler Puppet / DORA “State of DevOps 2016 Report”
@esigler
What is Incident Response?
@esigler
Prepare Execute Improve
Incident Response
“Outer Loop”
@esigler
Prepare:
Monitoring
Alerting
Process
Practice
@esigler
“If Engineering at Etsy has a religion,
it’s the Church of Graphs. If it moves,
we track it.”
@esigler
Don’t forget the
business metrics
@esigler
Prepare:
Monitoring
Alerting
Process
Practice
@esigler
Setting up an alarm
in AWS CloudWatch
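
The slides that follow were console screenshots, which did not survive in this transcript. As a rough stand-in, here is a minimal sketch of the same idea in Python: it only builds the keyword arguments for CloudWatch's PutMetricAlarm call. The metric, threshold, instance ID, and SNS topic ARN are illustrative assumptions, not values from the talk.

```python
from typing import Any, Dict


def build_cpu_alarm(instance_id: str, sns_topic_arn: str) -> Dict[str, Any]:
    """Return keyword arguments for CloudWatch's PutMetricAlarm call.

    All values below are illustrative assumptions, not settings from the talk.
    """
    return {
        "AlarmName": f"prod-high-cpu-{instance_id}",
        "AlarmDescription": "CPU above 80% for 10 minutes on a production host",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds per datapoint
        "EvaluationPeriods": 2,   # two consecutive breaching datapoints
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # where the alarm notification is sent
    }


if __name__ == "__main__":
    params = build_cpu_alarm(
        "i-0123456789abcdef0",  # hypothetical instance ID
        "arn:aws:sns:us-east-1:123456789012:ops-alerts",  # hypothetical topic
    )
    # With boto3 installed and credentials configured, this could be sent with:
    #   boto3.client("cloudwatch").put_metric_alarm(**params)
    print(params["AlarmName"])
```

Keeping the parameters in a plain function makes the alarm definition reviewable and testable before anything is created in an AWS account.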
@esigler
“I don’t want to get
woken up at 3AM.”
@esigler
Scope down your alerts!
@esigler
Make it Immediate
@esigler
A problem in Production at 3AM?
I’m there with bells on.
A problem in Staging at 3AM?
Maybe less so.
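
One way to encode the Production-versus-Staging distinction is a small routing rule. The sketch below is an assumption about how such a rule might look; the environment labels, the 08:00 to 18:00 working hours, and the three outcomes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    environment: str  # hypothetical labels such as "production" or "staging"
    summary: str


def route(alert: Alert, hour: int) -> str:
    """Decide how urgently an alert should reach a human.

    Returns "page" (wake someone up), "notify" (low-urgency message),
    or "ticket" (wait for working hours). `hour` is local time, 0-23.
    """
    if alert.environment == "production":
        return "page"  # production problems are immediate at any hour
    off_hours = hour < 8 or hour >= 18  # assumed working day of 08:00-18:00
    return "ticket" if off_hours else "notify"


if __name__ == "__main__":
    print(route(Alert("production", "checkout errors"), hour=3))  # page
    print(route(Alert("staging", "checkout errors"), hour=3))     # ticket
```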
@esigler
Make it Human
@esigler
Humans are terrible shell script
interpreters, especially at 3AM.
@esigler
Make it Actionable
The “everything’s OK”
alarm.
@esigler
Alerts should be:
Immediately
Human
Actionable
@esigler
… so relate them to the business.
“We now have more time to focus on other proje
Connie-Lynne Villani
Senior Manager
@esigler
Configure AWS CloudWatch to
integrate with PagerDuty
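
A common way to wire this up is to have the CloudWatch alarm publish to an SNS topic and subscribe a PagerDuty integration URL to that topic over HTTPS. The sketch below only builds the arguments for the SNS Subscribe call. The endpoint shown is a placeholder whose shape is an assumption; the real URL comes from the PagerDuty service's integration settings.

```python
from typing import Dict

# Placeholder only: copy the real integration URL from the PagerDuty service.
PAGERDUTY_ENDPOINT = (
    "https://events.pagerduty.com/integration/YOUR_INTEGRATION_KEY/enqueue"
)


def build_subscription(
    topic_arn: str, endpoint: str = PAGERDUTY_ENDPOINT
) -> Dict[str, str]:
    """Return keyword arguments for the SNS Subscribe call."""
    if not endpoint.startswith("https://"):
        raise ValueError("endpoint must be an HTTPS URL")
    return {"TopicArn": topic_arn, "Protocol": "https", "Endpoint": endpoint}


if __name__ == "__main__":
    params = build_subscription(
        "arn:aws:sns:us-east-1:123456789012:ops-alerts"  # hypothetical topic
    )
    # With boto3 installed and credentials configured, this could be sent with:
    #   boto3.client("sns").subscribe(**params)
    print(params["Protocol"])
```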
@esigler
Grouping Alerts
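
The grouping slides were screenshots and are not reproduced here. As an illustration of the general idea only (not PagerDuty's actual grouping logic), the sketch below folds alerts from the same service into one incident when they arrive within a time window. The tuple layout and the 5-minute window are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (service name, unix timestamp in seconds, message): a hypothetical shape
AlertTuple = Tuple[str, int, str]


def group_alerts(
    alerts: List[AlertTuple], window_seconds: int = 300
) -> List[List[AlertTuple]]:
    """Group alerts per service, starting a new incident after a quiet gap."""
    by_service: Dict[str, List[AlertTuple]] = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a[1]):
        by_service[alert[0]].append(alert)

    incidents: List[List[AlertTuple]] = []
    for service_alerts in by_service.values():
        current = [service_alerts[0]]
        for alert in service_alerts[1:]:
            if alert[1] - current[-1][1] <= window_seconds:
                current.append(alert)      # same burst, same incident
            else:
                incidents.append(current)  # gap too long, close the incident
                current = [alert]
        incidents.append(current)
    return incidents


if __name__ == "__main__":
    sample = [
        ("db", 0, "replica lag"),
        ("web", 50, "5xx rate"),
        ("db", 100, "replica lag"),
        ("db", 1000, "disk full"),
    ]
    for incident in group_alerts(sample):
        print(incident[0][0], len(incident))
```

Grouping this way means a responder is notified once per burst rather than once per alert.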
@esigler
AWS Incidents in PagerDuty
@esigler
Prepare:
Monitoring
Alerting
Process
Practice
@esigler
“Like getting lawn care advice from the
superintendent of Augusta National”
response.pagerduty.com
@esigler
Know Your Role(s)…
@esigler
Incident Commander
@esigler
Consider a volunteer
IC schedule
@esigler
Deputy Incident Commander
@esigler
Scribe
@esigler
Subject Matter Experts
@esigler
What criteria should you use to
launch an Incident Response?
@esigler
Post incident criteria widely.
Don’t litigate during a call.
@esigler
Prepare:
Monitoring
Alerting
Process
Practice
@esigler
Practice your Incident Response
plan beforehand
@esigler
Consider injecting failure, and
testing “all of the above”.
@esigler
Don’t forget to include the rest of
the business in your Incident
Coordination
@esigler
Prepare Execute Improve
@esigler
Assess & Triage Resolve or Remediate Learn & Review
Incident Response
“Inner Loop”
@esigler
Elect a leader (IC) at the
beginning of the call
@esigler
How to give a status update
@esigler
Assess & Triage Resolve or Remediate Learn & Review
@esigler
Delegating tasks on a call
@esigler
Don’t forget to check back in.
@esigler
Have a clear mechanism for
making decisions.
@esigler
“IC, I think we should do X”
“The proposed action is X,
is there any strong objection?”
@esigler
Dealing with
communications “challenges”
@esigler
Humor is best in context.
@esigler
DT5: Roger that
GND: Delta Tug 5, you can go right on bravo
DT5: Right on bravo, taxi.
(…): Testing, testing. 1-2-3-4.
GND: Well, you can count to 4. It’s a step in the right direction. Find
another frequency to test on now.
(…): Sorry
@esigler
Assess & Triage Resolve or Remediate Learn & Review
@esigler
Capture everything, and call out
what’s important now vs. later.
@esigler
Decreasing the scope of a call
@esigler
Capture everything for the
postmortem / learning review.
@esigler
Prepare Execute Improve
@esigler
“You can’t fire your way to
reliability.” Ensure your
postmortems are blameless
@esigler
Beware of:
Counterfactual Reasoning
Normative Language
Mechanistic Reasoning
@esigler
Maintain every postmortem
in a collection / archive.
@esigler
Review your Incident
Response process
@esigler
“We’ve handed out responsibility for handling
alerts to the teams that know the most about
the service. They’re the people who can
generally fix things fastest.”
Sam Eaton
Vice President of Engineering
@esigler
FD: “OK, why don’t, you gotta pass the data for the crew checklist anyway onboard, d
MC: “Right”
FD: “Don’tcha got a page update? Well why don't we read it up to them and that'll se
MC: “Alright.”
FD: “Both that mattered as well as what page you want it in the checklist?”
MC: “OK.”
@esigler
TELMU: "Flight, TELMU.”
FD: "Go TELMU.”
TELMU: "We show the LEM overhead hatch is closed, and the heater current looks n
FD: "OK."
GUIDE: "Flight, Guidance."
FD: "Go Guidance"
GUIDE: "We've had a hardware restart, I don't know what it was."
@esigler
FD: "GNC, you wanna look at it? See if you've seen a problem"
Lovell: "Houston, we've had a problem ..."
FD: "Rog, we're copying it CAPCOM, we see a hardware restart"
Lovell: "... Main B Bus undervolt"
FD: "You see an AC bus undervolt there guidance, er, ah, EECOM?"
EECOM: "Negative flight"
FD: "I believe the crew reported it."
???: "We got a main B undervolt"
@esigler
EECOM: "OK flight we've got some instrumentation issues ... let me add em up”
FD: "Rog"
CAPCOM: "OK stand by 13 we're looking at it"
EECOM: "We may have had an instrumentation problem flight"
FD: "Rog"
INCO: "Flight, INCO”
FD: "Go INCO”
INCO: "We switched to wide beam about the time he had that problem"
@esigler
Haise: "...the voltage is looking good. And we had a pretty large bang associated with
FD: "OK"
CAPCOM: "Roger, Fred."
FD: "INCO, you said you went to wide beam with that?"
INCO: "Yes"
FD: "Let's see if we can correlate those times get the time when you went to wide-beam
INCO: "OK"
@esigler
Challenge: Audit your alarms this
week. Are they all immediately
human actionable?
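
A first pass at this audit can be automated. The sketch below assumes each alarm is a dictionary shaped like an entry in the `MetricAlarms` list returned by CloudWatch's DescribeAlarms call, and flags alarms that notify nobody or give the responder no guidance. The specific checks are assumptions about what "actionable" means, so adjust them to your own criteria.

```python
from typing import Any, Dict, List


def audit_alarms(alarms: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map each problematic alarm name to a list of findings."""
    findings: Dict[str, List[str]] = {}
    for alarm in alarms:
        problems: List[str] = []
        if not alarm.get("ActionsEnabled", True):
            problems.append("actions are disabled")
        if not alarm.get("AlarmActions"):
            problems.append("no alarm action, so nobody is notified")
        if not (alarm.get("AlarmDescription") or "").strip():
            problems.append("no description telling the responder what to do")
        if problems:
            findings[alarm["AlarmName"]] = problems
    return findings


if __name__ == "__main__":
    # Hypothetical sample data; with boto3 the list could come from
    #   boto3.client("cloudwatch").describe_alarms()["MetricAlarms"]
    sample = [
        {
            "AlarmName": "prod-high-cpu",
            "ActionsEnabled": True,
            "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
            "AlarmDescription": "See the runbook for scaling the web tier.",
        },
        {"AlarmName": "old-test-alarm", "ActionsEnabled": True, "AlarmActions": []},
    ]
    for name, problems in audit_alarms(sample).items():
        print(name, problems)
```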
@esigler Puppet / DORA “State of DevOps 2016 Report”
@esigler
You'll sleep better at night
See You At AWS Summit San
Francisco!
April 18-19
aws.amazon.com/summits/san-francisco/
@esigler
LEARN MORE
response.pagerduty.com
@esigler
Eric Sigler, Head of DevOps, PagerDuty
Incident Response & Coordination

Event 4 Introduction to Open Source.pptxaryanv1753
 
Mathan flower ppt.pptx slide orchids ✨🌸
Mathan flower ppt.pptx slide orchids ✨🌸Mathan flower ppt.pptx slide orchids ✨🌸
Mathan flower ppt.pptx slide orchids ✨🌸mathanramanathan2005
 
DGT @ CTAC 2024 Valencia: Most crucial invest to digitalisation_Sven Zoelle_v...
DGT @ CTAC 2024 Valencia: Most crucial invest to digitalisation_Sven Zoelle_v...DGT @ CTAC 2024 Valencia: Most crucial invest to digitalisation_Sven Zoelle_v...
DGT @ CTAC 2024 Valencia: Most crucial invest to digitalisation_Sven Zoelle_v...Henrik Hanke
 

Dernier (19)

Dutch Power - 26 maart 2024 - Henk Kras - Circular Plastics
Dutch Power - 26 maart 2024 - Henk Kras - Circular PlasticsDutch Power - 26 maart 2024 - Henk Kras - Circular Plastics
Dutch Power - 26 maart 2024 - Henk Kras - Circular Plastics
 
proposal kumeneger edited.docx A kumeeger
proposal kumeneger edited.docx A kumeegerproposal kumeneger edited.docx A kumeeger
proposal kumeneger edited.docx A kumeeger
 
Call Girls In Aerocity 🤳 Call Us +919599264170
Call Girls In Aerocity 🤳 Call Us +919599264170Call Girls In Aerocity 🤳 Call Us +919599264170
Call Girls In Aerocity 🤳 Call Us +919599264170
 
Early Modern Spain. All about this period
Early Modern Spain. All about this periodEarly Modern Spain. All about this period
Early Modern Spain. All about this period
 
The Ten Facts About People With Autism Presentation
The Ten Facts About People With Autism PresentationThe Ten Facts About People With Autism Presentation
The Ten Facts About People With Autism Presentation
 
SaaStr Workshop Wednesday w/ Kyle Norton, Owner.com
SaaStr Workshop Wednesday w/ Kyle Norton, Owner.comSaaStr Workshop Wednesday w/ Kyle Norton, Owner.com
SaaStr Workshop Wednesday w/ Kyle Norton, Owner.com
 
PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.
PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.
PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.
 
CHROMATOGRAPHY and its types with procedure,diagrams,flow charts,advantages a...
CHROMATOGRAPHY and its types with procedure,diagrams,flow charts,advantages a...CHROMATOGRAPHY and its types with procedure,diagrams,flow charts,advantages a...
CHROMATOGRAPHY and its types with procedure,diagrams,flow charts,advantages a...
 
Application of GIS in Landslide Disaster Response.pptx
Application of GIS in Landslide Disaster Response.pptxApplication of GIS in Landslide Disaster Response.pptx
Application of GIS in Landslide Disaster Response.pptx
 
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATIONRACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
 
INDIAN GCP GUIDELINE. for Regulatory affair 1st sem CRR
INDIAN GCP GUIDELINE. for Regulatory  affair 1st sem CRRINDIAN GCP GUIDELINE. for Regulatory  affair 1st sem CRR
INDIAN GCP GUIDELINE. for Regulatory affair 1st sem CRR
 
Internship Presentation | PPT | CSE | SE
Internship Presentation | PPT | CSE | SEInternship Presentation | PPT | CSE | SE
Internship Presentation | PPT | CSE | SE
 
Chizaram's Women Tech Makers Deck. .pptx
Chizaram's Women Tech Makers Deck.  .pptxChizaram's Women Tech Makers Deck.  .pptx
Chizaram's Women Tech Makers Deck. .pptx
 
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
 
Quality by design.. ppt for RA (1ST SEM
Quality by design.. ppt for  RA (1ST SEMQuality by design.. ppt for  RA (1ST SEM
Quality by design.. ppt for RA (1ST SEM
 
Engaging Eid Ul Fitr Presentation for Kindergartners.pptx
Engaging Eid Ul Fitr Presentation for Kindergartners.pptxEngaging Eid Ul Fitr Presentation for Kindergartners.pptx
Engaging Eid Ul Fitr Presentation for Kindergartners.pptx
 
Event 4 Introduction to Open Source.pptx
Event 4 Introduction to Open Source.pptxEvent 4 Introduction to Open Source.pptx
Event 4 Introduction to Open Source.pptx
 
Mathan flower ppt.pptx slide orchids ✨🌸
Mathan flower ppt.pptx slide orchids ✨🌸Mathan flower ppt.pptx slide orchids ✨🌸
Mathan flower ppt.pptx slide orchids ✨🌸
 
DGT @ CTAC 2024 Valencia: Most crucial invest to digitalisation_Sven Zoelle_v...
DGT @ CTAC 2024 Valencia: Most crucial invest to digitalisation_Sven Zoelle_v...DGT @ CTAC 2024 Valencia: Most crucial invest to digitalisation_Sven Zoelle_v...
DGT @ CTAC 2024 Valencia: Most crucial invest to digitalisation_Sven Zoelle_v...
 

Improve Your Incident Response Process

Editor’s notes

  1. Howdy! I’m Eric Sigler, Head of DevOps at PagerDuty, and I’m here to talk to y’all today about Incident Response & Coordination.
  2. First, let’s start with a couple of audience questions. (Who currently uses PagerDuty?) Who here has been on an “outage call” of some kind? Who’s been on an outage call for more than 4 hours? Who’s been on an outage call with over 20 people? Who’s been on a call when the CEO or another high ranking person “swooped” in? These are all examples of … less than ideal incident response behaviors. As an industry, we inherited a lot of baggage and cruft from earlier eras. But I believe…
  3. … Everyone can improve their Incident Response process. We have modern ways of building stacks, like EC2 & ECS, and modern ways of structuring our applications (Microservices, Lambdas, etc). Organizations that want to be the best need to have modern ways of responding to problems as well.
  4. And I’m going to call out that every organization should do so, and do so now. An ounce of prevention today is worth, etc, etc. You’ll have an easier, faster time responding to issues, and a better time preventing small issues from cascading into large ones.
  5. So, as with everything else, there’s a cost here, right? Why should organizations invest in it? What’s the ROI on having a great Incident Coordination process?
  6. Well, for one thing, as you adopt a lot of these practices, you’ll find that you end up spending less time on unplanned work. Getting back 20% of your engineering time to spend on more features usually sounds pretty good.
  7. War Rooms, Outage Calls, Incident Response, Incident Command, Incident Coordination, there are lots of different ways of outlining the same set of topics. So, let’s put one together.
  8. Here’s one way to think about Incident Response that we’ve seen from a large number of our customers across various industries - as a series of loops. There’s an “outer loop”, if you will, that goes like this - prepare (by setting up monitoring, alerting, and processes), then execute (once you’ve actually hit an incident), then improve (learning from your incident, and making improvements for next time). This all closes out in a constant feedback cycle where one incident, for better or worse, influences the next.
  9. So then, the first part of our outer loop - prepare. This is where I’ll point out that the more effort you spend here, the less impact you’ll see and the easier a time you’ll have on the other two items.
  10. OK, so within prepare, there’s a ton of stuff. (For those of you keeping score, we’ve got lists and loops - I promise to stop before hash tables.) Here’s a few different things we’ll go over.
  11. First up, monitoring. It’s actually pretty easy - Etsy has it right …
  12. Log everything. _Everything_. The point here is to make this so cheap to collect, that it costs more NOT to. Common libraries, centralized / shared services, and convenience are key. Business, application, infrastructure, client, deployment pipeline, and so on.
  13. Order velocity, customer contact rate, these aren’t necessarily the first things you think about hooking up to your monitoring tool of choice, but they’re actually probably the most valuable, as we’ll get to in a bit.
  14. Now that you’ve got all this data, you need to alert when it’s not right!
  15. OK, so you’ve probably already seen this a few times today, but I wanted to walk through setting up an alarm in CloudWatch super quick for anyone who hasn’t…
  16. Go into the AWS Control Panel, select CloudWatch…
  17. Avoid flappy alarms!
  18. But don’t do that! Most of this should be in the form of code. CloudFormation, for example, lets you configure CloudWatch alarms programmatically. Preferably, code that’s under the control of your service-owning-teams. Remember - with responsibility comes authority, and they have to be allowed to make their own choices.
  19. And that brings us to the #1 most common complaint we’ve heard for all time - people don’t want to get alerted! I’ve got a quick solution…
  20. … don’t have as many!
  21. First, it should be an alarm that requires immediate attention. By that, I mean it shouldn’t be something that “can wait until tomorrow”. Alert appropriately. Another way of looking at it…
  22. … is like this. (YMMV on follow-the-sun issues.) There are tools out there that have urgencies and severities, make sure to leverage those where possible. Not all Immediate things are created equal.
  23. Second, you want to have something that requires a human. If “the first step in the run book” is to run a script to purge /tmp, then just do that automatically, because …
  24. … because Humans are terrible compilers. Don’t turn them into one, especially if you don’t have to. Lambda functions, autoscaling systems, heck even just cron triggers and touch files are all better than operational toil. The next thing to remember on an alert is …
  25. If the alert is basically “something happened”, well, that’s not super useful. It tends to very quickly condition people to ignore it, and that’s the worst thing of all to do. The classic version of this is the antipattern…
  26. … the “everything’s OK alarm”. In the Simpsons, Homer comes up with an alarm that goes off every 3 seconds, to indicate everything is OK. I know that sounds ridiculous, but how many times have you asked yourself, “I haven’t gotten an alert in a while, is the monitoring system still working?” - which is a great indicator of this pattern. The last question you should ask when reviewing alerts is …
  27. Pulling it all together, here’s what should go through your head every time you create something that will try to get your attention. And it turns out that probably has a curious side effect…
  28. And when you go down this process of distilling your alerts, you’ll find that more often than not, they’ll tend to be related to your business needs. Huh. Guess collecting that business data was important after all.
  29. Here’s a screenshot of our product for a bit of context. Originally, this area below wasn’t this way - we accepted events from customers, but didn’t do a lot with the data we accepted. As we grew the product, we wanted to provide more value for our customers, and tease out more details. But there was a challenge …
  30. Next up, now that we have all this awesome data, and are only hearing about it when there’s an issue, we need some process to actually deal with all of it! First, let’s figure out the “who”.
  31. And here might be a good place to pause and take a 10 minute break. A lot of what I’m going to say is captured here at our open source response documentation - so if you want spoilers, feel free to read ahead.
  32. You need to define who’s going to do what. There’s a thing called the “National Incident Management System”, that’s used by firefighters and other first responders. If you think about it, this is a system where you pick up the phone, dial a number, and within 10 minutes receive several million dollars of hardware and a dozen well-trained engineers. There’s probably something to learn here. (These terms are slightly military focused, because that’s where NIMS comes from.)
  33. And it starts with the “IC”. Not individual contributor, but Incident Commander. This is someone who _isn’t_ doing the work, but is responsible for the meta-process. The mechanical turning of the crank. They rally the team together, dispatch tasks, check in on status, update stakeholders (potentially), make choices if needed, escalate priorities, and finally disband once the incident is over.
  34. Airbnb, PagerDuty, lots of organizations use a volunteer schedule for their IC on-call rotation. It sounds absolutely crazy, I know, but think about it for a second. An IC is going to need gravitas, and be able to kick maybe even the CEO off the call. That’s not something you want to force people to do.
  35. First rule - never have a SPOF.
  36. The Scribe can also act as a bridge between voice and chat. Everyone else is too busy, so it’s this person’s job to capture the details that fall between the cracks.
  37. Aha! That’s you! OK, at least most of you. The point of ALL of this process is to give SMEs the room to solve the problem. At the end of the day, hopefully the majority of your incidents are novel - so you need SMEs focused on understanding, mitigating, and resolving them, instead of dealing with the “meta process” around it.
  38. Different alerts will require different levels of response, like we talked about before. A lot of organizations use some sort of graduated scale, usually based on Customer Impact. Some organizations won’t decrease their severity level, others will. But the key takeaway in defining your criteria is …
  39. … Whatever your criteria are, post them widely. Litigating “is this really worth it?” during an incident call is a huge waste of time and resources. Agree to them, and stick with them.
  40. All right, so we’ve got all that, I’m in New York, so I have to ask, how do you get to Carnegie Hall?
  41. Game Days, Failure Fridays, Tabletop’ing, all are valid ways of exercising. The point is to work out the kinks in your process when everything is OK, and it’s not 3AM, and revenue isn’t on the line. This gives responders “muscle memory”, which becomes incredibly helpful so that you aren’t figuring things out on the fly.
  42. Netflix does a great job of this. They’re known for Chaos Monkey, a tool that goes through and does controlled, plausible, real failure injection into your infrastructure. And it turns out, this is a great thing for not only checking your automation, but for testing your monitoring, alerting, and response processes all at the same time.
  43. In both planning and execution (which we’ll get to later) … long gone are the days of the lone sysadmin saving the day. In fact - if I’m honest - kill your hero culture. “Firefighters get all the glory, but building inspectors save all the lives.” Be a building inspector.
  44. “What’s with the preparing you’re always preparing!” OK, now that we have our monitoring, alerting, and process in order, and we’ve practiced it, time to wait for an incident! Don’t worry, you probably won’t have to wait too long. This in turn results in what we consider an “inner loop” of Incident Response, and it looks very similar…
  45. Assess & Triage. Maybe it’s automatically handled, maybe it’s suppressed, maybe it’s escalated to a human.
  46. OK! So, our alarms have fired! And maybe a human has looked at the alarm, and decided this was a high enough urgency that they kick off the Incident process (or it kicks off automatically). What’s first?
  47. Zookeeper example. This is critical, or the call will be chaos.
  48. State facts, separate from observations & hypotheses, separate from proposed actions. ICs can poll for status, or SMEs can push status.
  49. The IC should act as a coordinator. SME X proposes an idea, and IC gets consensus, then distributes the task.
  50. “SME X, do Y, I’ll check back in with you in Z minutes.”
  51. OK, so now you have an idea of what you need to do. But maybe two of the SMEs aren’t agreeing. Instead of “soooo, what should we do?”, here’s a different way of asking the same thing.
  52. “Do you want to be an IC?”
  53. Once you’ve identified the key SMEs, take steps to decrease the impact and cost of the response. Side tangent: “Scope of control”. Think in teams of 7-10. Anything bigger? Break out into sub-groups.
  54. At the end of your incident coordination lifecycle, there’s the aspect of capturing and learning that feeds into the next iteration.
  55. So then, the first part of our outer loop - prepare. This is where I’ll point out that the more effort you spend here, the less impact you’ll see and the easier a time you’ll have on the other two items.
  56. It’s true. See also recent issue with GitLab - “we let that person down”.
  57. Here’s a fresh-off-the-presses slide - if you check my Twitter feed you’ll find a great YouTube video on cognitive biases to be careful of in your postmortems: “would likely have” (counterfactual), “insufficient” (normative), and assuming the system was in perfect working order until the incident (mechanistic).
  58. This becomes a part of your organization’s deep history. There’s significant value to be mined from archives like this - and that’s part of why even less significant issues should go in here.
  59. Don’t forget to do some meta-analysis as well about your IC process. What went well? What should be changed next time? Are you getting better or worse? More folks or fewer folks?
  60. Here’s an example postmortem template … let’s go through a few pieces of it. Apologies for the eye chart.
  61. Yelp went down this path - and it’s enabled them to scale, grow along their own devops journey, and handle things far faster than before. I’m going to wrap up by giving you a challenge…
  62. Go, find those alarms this week. It’s only Thursday. Remember …
  63. … you can get back your time, and at the very least …
  64. … you’ll get a better night’s sleep. Thank you everyone!
  65. Questions?
  66. And I’ll leave you with that URL to our open sourced response documentation. Thank you very much!
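
To make the point from notes 17 and 18 concrete (keep alarms in code, and avoid flappy ones), here is a minimal Python sketch that builds a CloudFormation template containing a single `AWS::CloudWatch::Alarm` resource. Everything specific in it is an assumption for illustration and not from the talk: the service name, the load balancer value, the threshold, the SNS topic ARN (assumed to be already wired to your paging tool), and the choice of the Application Load Balancer metric `HTTPCode_Target_5XX_Count`. Requiring three consecutive breaching one-minute periods is one common way to cut down on flapping.

```python
import json


def build_alarm_template(service, load_balancer, threshold, topic_arn):
    """Return a CloudFormation template (as a dict) holding one CloudWatch alarm.

    All arguments are hypothetical placeholders supplied by the owning team.
    """
    alarm = {
        "Type": "AWS::CloudWatch::Alarm",
        "Properties": {
            # Say what is wrong and where, so the alert is actionable.
            "AlarmDescription": f"{service}: elevated 5xx responses in production",
            "Namespace": "AWS/ApplicationELB",
            "MetricName": "HTTPCode_Target_5XX_Count",
            "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
            "Statistic": "Sum",
            "Period": 60,              # seconds per evaluation period
            "EvaluationPeriods": 3,    # three breaching periods in a row
            "Threshold": threshold,
            "ComparisonOperator": "GreaterThanThreshold",
            # Quiet periods emit no datapoints; do not page on silence.
            "TreatMissingData": "notBreaching",
            # Assumed: an SNS topic that forwards to the on-call tool.
            "AlarmActions": [topic_arn],
        },
    }
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {"ErrorRateAlarm": alarm},
    }


if __name__ == "__main__":
    template = build_alarm_template(
        service="checkout",
        load_balancer="app/checkout/0123456789abcdef",
        threshold=50,
        topic_arn="arn:aws:sns:us-east-1:123456789012:page-oncall",
    )
    print(json.dumps(template, indent=2))
```

The printed JSON could be checked into the service-owning team’s repository and deployed with whatever CloudFormation tooling they already use, which keeps the alarm definition reviewable alongside the code it watches.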