Making Disaster Routine
Anticipating and Practicing Failures Using Active Monitoring and Chaos Engineering
Peter Varhol and Gerie Owen
About me
• International speaker and writer
• Graduate degrees in Math, CS, Psychology
• Technology communicator
• Former university professor, tech journalist
• Cat owner and distance runner
• peter@petervarhol.com
Gerie Owen
• QA Evangelist, test manager
• Subject matter expert on testing for
TechTarget’s SearchSoftwareQuality.com
• International and domestic conference
presenter
• Marathon runner & running coach
• gowen@qualitestgroup.com
Agenda
• DevOps and disaster
• Preparing for disaster
• Principles of chaos
• Monitoring for disaster
• Getting back on your feet
• Conclusions
What is DevOps?
• Containerized development, rapid iteration with real-time
performance insights, intelligent feedback, diagnostic services, an
integrated DevOps pipeline, and deployment to the cloud
• My God!
• In layman’s terms:
• We automatically integrate and build every time there is a valid check-in
• We run automated tests at all stages, including production
• We send the app to production when it has been integrated and tested
• Automation makes it all work like a Swiss watch
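The layman's description above amounts to a gated pipeline: each stage runs only if the previous one passed, and the app reaches production only after integration and tests succeed. A minimal sketch, assuming each stage is a plain callable that returns True on success; the stage names and the `run_pipeline` helper are hypothetical, not taken from any particular CI tool:

```python
def run_pipeline(stages):
    """Run (name, step) pairs in order and stop at the first failing step.

    Returns (succeeded, completed_stage_names, failed_stage_name_or_None).
    """
    completed = []
    for name, step in stages:
        if not step():
            # A failed stage blocks everything after it, including deployment.
            return False, completed, name
        completed.append(name)
    return True, completed, None


if __name__ == "__main__":
    # Hypothetical stages; real ones would call the build and test tooling.
    stages = [
        ("integrate_and_build", lambda: True),
        ("automated_tests", lambda: True),
        ("deploy_to_production", lambda: True),
    ]
    print(run_pipeline(stages))
```

When a stage fails, the returned stage name says where the automated workflow broke down, which is the first question to answer in the disaster cases that follow.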
What is a Disaster?
• A serious disruption, occurring over a relatively short time and causing
loss and impacts, which exceeds the ability of the affected community or
society to cope using its own resources.
• Disruption
• Short timeframe
• Exceeds the ability to cope
What is a Disaster?
• Consistency becomes uncertain
• Automated workflow breaks down
• Build fails; smoke tests are blocked
• Server farm goes offline
• Application won’t start again
• Showstopper bug in production
• Anything that disrupts consistency
Preparing for Disaster
• We don’t react well when things go wrong
• Disbelief
• Uncertainty
• Panic
• How can we prepare for the unknown?
We Can Learn from Aircrews
• US Airways Flight 1549
• Sullenberger and Skiles had first met only days earlier
• Yet worked from established procedures
• Practiced for hundreds of hours
• Immediately turned to checklists
• About three and a half minutes after the bird strike, they were in the Hudson
• You have to practice this
We Can Learn From Aircrews
• Indecision and panic are killers
• Checklists drive decision-making by focusing on essentials
• Courses of action are defined fast
• Practice makes disasters just another day in the office
• Clear and structured communication is essential
We Can Also Practice Disaster
• Chaos engineering
• Failure scenarios
• Application health monitoring
Chaos Engineering
• Distributed systems at scale
• Experiments to uncover systemic weaknesses
• Defining normal behavior
• Set your null and alternative hypotheses
• Introduce variables that reflect real world events
• servers crash
• hard drives malfunction
• network connections lost
• Try to disprove the null hypothesis
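The experiment loop above can be sketched in a few lines. This is only an illustration under stated assumptions: `Cluster` is a hypothetical in-memory stand-in for a real fleet, the null hypothesis is "at least `min_replicas` replicas keep serving", and every injected event simply takes one replica out of service. A real experiment would act on real infrastructure and measure real traffic.

```python
import random
from dataclasses import dataclass, field

# The three real-world events listed on the slide.
EVENTS = {"server_crash", "disk_failure", "network_loss"}


@dataclass
class Cluster:
    """Hypothetical stand-in for a fleet: just a set of healthy replica names."""
    healthy: set = field(default_factory=set)

    def inject(self, event, target):
        if event not in EVENTS:
            raise ValueError(f"unknown event: {event}")
        # In this toy model every event removes the target from service.
        self.healthy.discard(target)

    def serves_traffic(self, min_replicas):
        return len(self.healthy) >= min_replicas


def run_experiment(cluster, events, min_replicas, seed=0):
    """Inject each event into a randomly chosen healthy replica.

    Returns (null_hypothesis_holds, log_of_injections).
    """
    rng = random.Random(seed)  # seeded so a run can be repeated exactly
    log = []
    for event in events:
        if not cluster.healthy:
            break  # nothing left to break
        target = rng.choice(sorted(cluster.healthy))
        cluster.inject(event, target)
        log.append((event, target))
    return cluster.serves_traffic(min_replicas), log


if __name__ == "__main__":
    cluster = Cluster({"web-1", "web-2", "web-3"})
    holds, log = run_experiment(cluster, ["server_crash"], min_replicas=2)
    print("steady state held:", holds, "injected:", log)
```

If the function returns False, the null hypothesis has been disproved and the experiment has found a weakness worth fixing.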
Chaos Engineering
• Practice in production
• Vary real world events
• Yes, there could be customer impact
• It is incumbent upon the chaos engineer to minimize customer impact
• But that is the point of the experiment
Chaos Monkey
• Grew into the Simian Army suite
• Developed by Netflix
• Causes breakdowns in their production environment
• Now consists of a variety of tools
• It’s all about resiliency
• Can our application survive?
Practice Failure Scenarios
• Each team member contributes one or more scenarios
• The more unlikely, the better
• Write up the scenarios
• Only the team leader sees them beforehand
• They can be real failures experienced or thought exercises
Practice Failure Scenarios
• Describe the scenarios to the team
• “Load is remaining constant but performance is gradually
deteriorating. We’re starting to get 404 and related errors. The server
farm seems to be operating correctly; it’s an application issue. Pings
are slowing down, but not drastically.”
• How do we diagnose and address?
We Don’t Need Another Hero
• Heroes use superhuman efforts to fix a disaster
• In doing so, they break with team conventions
• Teams function better together
• If a team has a hero:
• the team may not try as hard in the future
• the hero is not replicable
• the hero can’t solve every problem
Monitor Application Health in Production
• Ping just doesn’t cut it any more
• Availability and performance data
• Synthetic testing
• Health over time
• Track trends of performance, page painting, database calls
• Whatever might give you health trends
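One way to turn "health over time" into a number is the slope of a metric across evenly spaced samples. A minimal sketch, assuming response-time samples in milliseconds collected by a synthetic check at a fixed interval; the function names and the threshold are illustrative, not from any monitoring product:

```python
from statistics import mean


def trend_slope(samples):
    """Least-squares slope of evenly spaced samples, in units per sample interval."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    x_bar = (n - 1) / 2          # mean of the sample indices 0..n-1
    y_bar = mean(samples)
    num = sum((i - x_bar) * (y - y_bar) for i, y in enumerate(samples))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


def is_degrading(samples, max_slope):
    """True when the metric is rising faster than the tolerated slope."""
    return trend_slope(samples) > max_slope


if __name__ == "__main__":
    latencies_ms = [100, 110, 120, 130]   # made-up synthetic-check results
    print(trend_slope(latencies_ms), is_degrading(latencies_ms, max_slope=5))
```

A steady positive slope under constant load is the kind of gradual deterioration a ping would not reveal.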
Directions for Monitoring
• Watermarks for action
• E.g., 25 percent of pages take longer than 2 seconds to load
• AI for prediction
• Based on similar results in the past, the application is likely to fail in six hours
• Analytics for trends
• A combination of six measures indicates unhealthy trends
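The watermark example above (25 percent of pages taking longer than 2 seconds) is easy to express as a check. A minimal sketch, assuming page load times arrive as a list of seconds for one monitoring window; the function name and default thresholds are illustrative:

```python
def watermark_breached(load_times_s, slow_after_s=2.0, max_slow_fraction=0.25):
    """True when the share of slow page loads reaches the watermark.

    load_times_s: page load times in seconds for one monitoring window.
    """
    if not load_times_s:
        return False  # no traffic in the window, nothing to act on
    slow = sum(1 for t in load_times_s if t > slow_after_s)
    return slow / len(load_times_s) >= max_slow_fraction


if __name__ == "__main__":
    window = [0.8, 1.2, 2.4, 3.1]  # made-up sample window
    if watermark_breached(window):
        print("watermark reached: start the agreed procedure")
```

The thresholds are defaults only; each team would set its own watermark and decide in advance which action it triggers.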
The Power of Checklists
• Checklists are part of our daily lives
• They
• relieve the cognitive load of remembering to-dos
• organize complicated decision-making
• reduce risk in complicated activities by ensuring that critical tasks are not
overlooked.
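A checklist can also live next to the code so that overlooked critical tasks are reported rather than remembered. A minimal sketch, assuming a checklist is a list of items flagged as critical or not; the `Item` type and the sample entries are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Item:
    text: str
    critical: bool = False
    done: bool = False


def outstanding_critical(items):
    """Return the text of every critical item that has not been completed."""
    return [item.text for item in items if item.critical and not item.done]


if __name__ == "__main__":
    # Made-up incident checklist for illustration only.
    checklist = [
        Item("Confirm the alert is real", critical=True, done=True),
        Item("Notify the on-call lead", critical=True),
        Item("Update the status page"),
    ]
    print(outstanding_critical(checklist))
```

An empty result means no critical step is still open, which is the property the checklist exists to guarantee.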
Using Checklists in DevOps
• Checklists can be used to:
• Replace Test Cases
• Supplement Test Cases
• Verify Entry and Exit Criteria
• Sanity Testing
• Ambiguity Reviews
• Dev Estimates
Types of Checklists
• Project Set Up
• Application Specific Regression
• Process type specific
• Website Graphics
• Browser Dependencies
• Usability checks
What Does Thinking Of Failure Accomplish?
• Failure doesn’t come as a surprise
• It does so all too often
• We have procedures to deal with failure
• We have practice dealing with failure
• Failure is just another day at the office
A Final Lesson
• You are not alone
Conclusions
• Things will go wrong
• Don’t yell or panic
• Practice non-conforming situations regularly
• Make up unlikely scenarios; chances are they will happen
• Structured practices and communications may make work boring, but
they help when things start going wrong
• Ease into chaos engineering for resiliency
• Use your experiences to create checklists
Making disaster routine

  • 1. Making Disaster Routine: Anticipating and Practicing Failures Using Active Monitoring and Chaos Engineering • Peter Varhol and Gerie Owen
  • 2. About me • International speaker and writer • Graduate degrees in Math, CS, Psychology • Technology communicator • Former university professor, tech journalist • Cat owner and distance runner • peter@petervarhol.com
  • 3. Gerie Owen • QA Evangelist, test manager • Subject matter expert on testing for TechTarget’s SearchSoftwareQuality.com • International and domestic conference presenter • Marathon runner & running coach • gowen@qualitestgroup.com
  • 4. Agenda • DevOps and disaster • Preparing for disaster • Principles of chaos • Monitoring for disaster • Getting back on your feet • Conclusions
  • 5. What is DevOps? • Containerized development, rapid iteration with real-time performance insights, intelligent feedback, diagnostic services, an integrated DevOps pipeline, and deployment to the cloud • Bozhe moi! (My God!) • In layman’s terms: • We automatically integrate and build every time there is a valid check-in • We run automated tests at all stages, including production • We send the app to production when it has been integrated and tested • Automation makes it all work like a Swiss watch
  • 6. What is a Disaster? • A serious disruption, occurring over a relatively short time, that causes loss and impacts exceeding the ability of the affected community or society to cope using its own resources. • Disruption • Short timeframe • Exceeds the ability to cope
  • 7. What is a Disaster? • Consistency becomes uncertain • Automated workflow breaks down • Build fails; smoke tests are blocked • Server farm goes offline • Application won’t start again • Showstopper bug in production • Anything that disrupts consistency
  • 8. Preparing for Disaster • We don’t react well when things go wrong • Disbelief • Uncertainty • Panic • How can we prepare for the unknown?
  • 9. We Can Learn from Aircrews • US Airways Flight 1549 • Sullenberger and Skiles had never met before that day • Yet worked from established procedures • Practiced for hundreds of hours • Immediately turned to checklists • 90 seconds after the bird strike, they were in the Hudson • You have to practice this
  • 10. We Can Learn From Aircrews • Indecision and panic are killers • Checklists drive decision-making by focusing on essentials • Courses of action are defined fast • Practice makes disasters just another day in the office • Clear and structured communication is essential
  • 11. We Can Also Practice Disaster • Chaos engineering • Failure scenarios • Application health monitoring
  • 12. Chaos Engineering • Distributed systems at scale • Experiments to uncover systemic weaknesses • Define normal behavior • Set your null and alternative hypotheses • Introduce variables that reflect real-world events • servers crash • hard drives malfunction • network connections are lost • Try to disprove the null hypothesis
  • 13. Chaos Engineering • Practice in production • Vary real world events • Yes, there could be customer impact • It is incumbent upon the chaos engineer to minimize customer impact • But that is the point of the experiment
  • 14. Chaos Monkey • Developed by Netflix • Causes breakdowns in their production environment • Grew into the Simian Army, a variety of related tools • It’s all about resiliency • Can our application survive?
  • 15. Practice Failure Scenarios • Each team member contributes one or more scenarios • The more unlikely, the better • Write up the scenarios • Only the team leader sees them beforehand • They can be real failures experienced or thought exercises
  • 16. Practice Failure Scenarios • Describe the scenarios to the team • “Load is remaining constant but performance is gradually deteriorating. We’re starting to get 404 and related errors. The server farm seems to be operating correctly; it’s an application issue. Pings are slowing down, but not drastically.” • How do we diagnose and address?
  • 17. We Don’t Need Another Hero • Heroes use superhuman efforts to fix a disaster • In doing so, they break with team conventions • Teams function better together • If a team has a hero: • the team may not try as hard in the future • the hero is not replicable • the hero can’t solve every problem
  • 18. Monitor Application Health in Production • Ping just doesn’t cut it any more • Availability and performance data • Synthetic testing • Health over time • Track trends of performance, page painting, database calls • Whatever might give you health trends
  • 19. Directions for Monitoring • Watermarks for action • E.g., 25 percent of pages take longer than 2 seconds to load • AI for prediction • Based on similar results in the past, the application is likely to fail in six hours • Analytics for trends • A combination of six measures indicates unhealthy trends
  • 20. The Power of Checklists • Checklists are part of our daily lives • They: • relieve the cognitive load of remembering to-dos • organize complicated decision-making • reduce risk in complicated activities by ensuring that critical tasks are not overlooked
  • 22. Using Checklists in DevOps • Checklists can be used to: • Replace Test Cases • Supplement Test Cases • Verify Entry and Exit Criteria • Sanity Testing • Ambiguity Reviews • Dev Estimates
  • 23. Types of Checklists • Project Set Up • Application Specific Regression • Process type specific • Website Graphics • Browser Dependencies • Usability checks
  • 24. What Does Thinking Of Failure Accomplish? • Failure doesn’t come as a surprise • It does so all too often • We have procedures to deal with failure • We have practice dealing with failure • Failure is just another day at the office
  • 25. A Final Lesson • You are not alone
  • 26. Conclusions • Things will go wrong • Don’t yell or panic • Practice non-conforming situations regularly • Make up unlikely scenarios; chances are they will happen • Structured practices and communications may make work boring, but they help when things start going wrong • Ease into chaos engineering for resiliency • Use your experiences to create checklists
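The experiment loop from slide 12 (define normal behavior, state a null hypothesis, inject a real-world event, try to disprove the hypothesis) can be sketched as a toy simulation. Everything here is an illustrative assumption rather than something from the deck: the three-replica service, the random routing with client retries, the 99 percent success target, and the names `success_rate` and `run_experiment`. A real experiment would run against a live system, not a simulation.

```python
import random


def success_rate(replica_up, retries, n_requests, rng):
    """Fraction of simulated requests that reach a healthy replica."""
    ok = 0
    for _ in range(n_requests):
        # One initial attempt plus `retries` retries, each sent to a random replica.
        for _ in range(retries + 1):
            if replica_up[rng.randrange(len(replica_up))]:
                ok += 1
                break
    return ok / n_requests


def run_experiment(retries, target=0.99, n_requests=10_000, seed=1):
    """Null hypothesis: the success rate stays at or above `target` after a crash."""
    rng = random.Random(seed)
    # Control group: normal behavior with all replicas healthy.
    control = success_rate([True, True, True], retries, n_requests, rng)
    # Experimental group: inject a real-world event (one replica crashes).
    experiment = success_rate([True, True, False], retries, n_requests, rng)
    return {
        "control": control,
        "experiment": experiment,
        "null_hypothesis_holds": experiment >= target,
    }


if __name__ == "__main__":
    for retries in (0, 5):
        print(retries, run_experiment(retries))
```

In this toy model the hypothesis is disproved when clients never retry and survives when they retry several times, which is the kind of systemic weakness such an experiment is meant to surface.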
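The watermark example from slide 19 (25 percent of pages take longer than 2 seconds to load) reduces to a small threshold check. This is a minimal sketch; the function name `watermark_breached`, its parameter names, and the choice to treat exactly 25 percent as a breach are assumptions made for illustration.

```python
def watermark_breached(load_times_s, slow_after_s=2.0, max_slow_fraction=0.25):
    """Return True when the share of slow page loads reaches the watermark.

    load_times_s: page load times in seconds for one monitoring window.
    """
    if not load_times_s:
        return False  # no samples in the window, so nothing to act on
    slow = sum(1 for t in load_times_s if t > slow_after_s)
    return slow / len(load_times_s) >= max_slow_fraction


if __name__ == "__main__":
    window = [0.5, 0.8, 2.5, 1.0]  # one of four loads is slower than 2 seconds
    print(watermark_breached(window))
```

Keeping the threshold and the slow-load cutoff as parameters lets a team tune the watermark per application instead of hard-coding it.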
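Tracking health over time, as slide 18 suggests, means looking at the direction of a measure rather than a single reading. One simple way to do that, assumed here and not prescribed by the deck, is the least-squares slope of equally spaced samples. The name `trend_slope` and the sample values are hypothetical.

```python
def trend_slope(samples):
    """Least-squares slope of equally spaced samples, in units per interval."""
    n = len(samples)
    if n < 2:
        return 0.0  # a single reading has no trend
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


if __name__ == "__main__":
    # Hypothetical response times in seconds, sampled once per hour.
    response_times = [0.8, 0.9, 1.1, 1.2, 1.5]
    print(trend_slope(response_times))
```

A persistently positive slope on response time, even while every individual reading still looks acceptable, is the kind of gradual deterioration described in the failure scenario on slide 16.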