SlideShare une entreprise Scribd logo
1  sur  46
@garethbowles
I Don’t Test Often …
… But When I Do, I Test in Production
@garethbowles
Building distributed systems is hard
esting them exhaustively is even harde
@garethbowles
Gareth Bowles
@garethbowles
@garethbowles
television network with more than 57 million
members in 50 countries enjoying more
than one billion hours of TV shows and
movies per month.
We account for up to 34% of downstream
US internet traffic. Source: http://ir.netflix.com
@garethbowles
Personaliza-
tion Engine
User Info
Movie
Metadata
Movie
Ratings
Similar
Movies
API
Reviews
A/B Test
Engine
2B requests
per day
into the Netflix
API
12B outbound
requests per
day to API
dependencies
@garethbowles
A Complex Distributed
System
@garethbowles
Our Deployment Platform
@garethbowles
What AWS Provides
• Machine Images (AMI)
• Instances (EC2)
• Elastic Load Balancers
• Security groups / Autoscaling
groups
• Availability zones and regions
@garethbowles
…No (single) body knows how everything
works.
Our system has become so complex
that…
@garethbowles
How AWS Can Go Wrong -1
• Service goes down in one or more
availability zones
• 6/29/12 - storm related power outage
caused loss of EC2 and RDS instances
in Eastern US
• https://gigaom.com/2012/06/29/some-of-
amazon-web-services-are-down-again/
@garethbowles
How AWS Can Go Wrong - 2
• Loss of service in an entire region
• 12/24/12 - operator error caused loss of
multiple ELBs in Eastern US
• http://techblog.netflix.com/2012/12/a-
closer-look-at-christmas-eve-outage.html
@garethbowles
How AWS Can Go Wrong - 3
• Large number of instances get rebooted
• 9/25/14 to 9/30/14 - rolling reboot of
1000s of instances to patch a security
bug
• http://techblog.netflix.com/2014/10/a-
state-of-xen-chaos-monkey-
cassandra.html
@garethbowles
Our Goal is Availability
• Members can stream Netflix whenever
they want
• New users can explore and sign up
• New members can activate their service
and add devices
@garethbowles
http://www.slideshare.net/reed2001/culture-1798664
@garethbowles
Freedom and Responsibility
• Developers deploy when
they want
• They also manage their
own capacity and
autoscaling
• And are on-call to fix
anything that breaks at
3am!
@garethbowles
How the heck do you test this stuff
?
@garethbowles
Failure is All Around Us
• Disks fail
• Power goes out - and your backup
generator fails
• Software bugs are introduced
• People make mistakes
@garethbowles
Design to Avoid Failure
• Exception handling
• Redundancy
• Fallback or degraded experience (circuit
breakers)
• But is it enough ?
@garethbowles
It’s Not Enough
• How do we know we’ve succeeded ?
• Does the system work as designed ?
• Is it as resilient as we believe ?
• How do we avoid drifting into failure ?
@garethbowles
More Testing !
• Unit
• Integration
• Stress
• Exhaustive testing to simulate all failure
modes
@garethbowles
Exhaustive Testing ~
Impossible
• Massive, rapidly changing data sets
• Internet scale traffic
• Complex interaction and information flow
• Independently-controlled services
• All while innovating and building features
@garethbowles
Another Way
• Cause failure deliberately to validate
resiliency
• Test design assumptions by stressing
them
• Don’t wait for random failure. Remove
its uncertainty by forcing it regularly
@garethbowles
Introducing the Army
@garethbowles
Chaos Monkey
@garethbowles
Chaos Monkey
• The original Monkey (2009)
• Randomly terminates instances in a cluster
• Simulates failures inherent to running in the
cloud
• During business hours
• Default for production services
@garethbowles
What did we do once we were
able to handle Chaos Monkey ?
Bring in bigger Monkeys !
@garethbowles
Chaos Gorilla
@garethbowles
Chaos Gorilla
• Simulate an Availability Zone becoming
unavailable
• Validate multi-AZ redundancy
• Deploy to multiple AZs by default
• Run regularly (but not continually !)
@garethbowles
Chaos Kong
@garethbowles
Chaos Kong
• “One louder” than Chaos Gorilla
• Simulate an entire region outage
• Used to validate our “active-active” region
strategy
• Traffic has to be switched to the new region
• Run once every few months
@garethbowles
Latency Monkey
@garethbowles
Latency Monkey• Simulate degraded instances
• Ensure degradation doesn’t affect other
services
• Multiple scenarios: network, CPU, I/O,
memory
• Validate that your service can handle
degradation
• Find effects on other services, then validate
that they can handle it too
@garethbowles
Conformity Monkey
@garethbowles
Conformity Monkey• Apply a set of conformity rules to all
instances
• Notify owners with a list of instances and
problems
• Example rules
• Standard security groups not applied
• Instance age is too old
• No health check URL
@garethbowles
Failure Injection Testing (FIT)
• Latency Monkey adds delay / failure on server side of
requests
• Impacts all calling apps - whether they want to
participate or not
• FIT decorates requests with failure data
• Can limit failures to specific accounts or devices, then
dial up
• http://techblog.netflix.com/2014/10/fit-failure-injection-
testing.html
@garethbowles
Try it out !
• Open sourced and available at
https://github.com/Netflix/SimianArmy
and
https://github.com/Netflix/security_monke
y
• Chaos, Conformity, Janitor and Security
available now; more to come
• VMware as well as AWS
@garethbowles
What’s Next ?
• New failure modes
• Run monkeys more frequently and
aggressively
• Make chaos testing as well-understood
as regular regression testing
@garethbowles
A message from the owners
“Use Chaos Monkey to induce various kinds of
failures in a controlled environment.”
AWS blog post following the mass instance
reboot in Sep 2014:
http://aws.amazon.com/blogs/aws/ec2-
maintenance-update-2/
@garethbowles
Production Code Coverage
If you don’t run code coverage analysis in prod,
you’re not doing it properly
@garethbowles
What We Get
• Real time code usage patterns
• Focus testing by prioritizing frequently
executed paths with low test coverage
• Identify dead code that can be removed
@garethbowles
How We Do It
• Use Cobertura as it counts how many
times each LOC is executed
• Easy to enable - Cobertura JARs
included in our base AMI, set a flag to
add them to Tomcat’s classpath
• Enable on a single instance
• Very low performance hit
@garethbowles
Canary Deployments
@garethbowles
Canaries
• Push changes to a small number of
instances
• Use Asgard for red / black push
• Monitor closely to detect regressions
• Automated canary analysis
• Automatically cancel deployment if
problems occur
@garethbowles
Closing Thoughts
• Don’t be scared to test in production !
• You’ll get tons of data that you couldn’t get
from test …
• … and hopefully sleep better at night
@garethbowles
Thanks, QA or the Highway !
Email: gbowles@{gmail,netflix}.com
Twitter: @garethbowles
Linkedin:
www.linkedin.com/in/garethbowles

Contenu connexe

Tendances

NLUUG print conference May 26 2016
NLUUG print conference May 26 2016NLUUG print conference May 26 2016
NLUUG print conference May 26 2016Igmar Palsenberg
 
ElasticBeanstalk で新規事業を爆速ローンチする
ElasticBeanstalk で新規事業を爆速ローンチするElasticBeanstalk で新規事業を爆速ローンチする
ElasticBeanstalk で新規事業を爆速ローンチするRyo Shibayama
 
Amazon inspector で自動セキュリティ診断
Amazon inspector で自動セキュリティ診断Amazon inspector で自動セキュリティ診断
Amazon inspector で自動セキュリティ診断Ryo Shibayama
 
Micro Services - Small is Beautiful
Micro Services - Small is BeautifulMicro Services - Small is Beautiful
Micro Services - Small is BeautifulEberhard Wolff
 
Immutable Infrastructure: Rise of the Machine Images
Immutable Infrastructure: Rise of the Machine ImagesImmutable Infrastructure: Rise of the Machine Images
Immutable Infrastructure: Rise of the Machine ImagesC4Media
 
Riot Games Scalable Data Warehouse Lecture at UCSB / UCLA
Riot Games Scalable Data Warehouse Lecture at UCSB / UCLARiot Games Scalable Data Warehouse Lecture at UCSB / UCLA
Riot Games Scalable Data Warehouse Lecture at UCSB / UCLAsean_seannery
 
How Netflix thinks of DevOps. Spoiler: we don’t.
How Netflix thinks of DevOps. Spoiler: we don’t.How Netflix thinks of DevOps. Spoiler: we don’t.
How Netflix thinks of DevOps. Spoiler: we don’t.Dianne Marsh
 
Deploy Faster Without Failing Faster - Metrics-Driven - Dynatrace User Groups...
Deploy Faster Without Failing Faster - Metrics-Driven - Dynatrace User Groups...Deploy Faster Without Failing Faster - Metrics-Driven - Dynatrace User Groups...
Deploy Faster Without Failing Faster - Metrics-Driven - Dynatrace User Groups...Andreas Grabner
 
Is Serverless The New Swiss Cheese? - AWS Seattle User Group
Is Serverless The New Swiss Cheese? - AWS Seattle User GroupIs Serverless The New Swiss Cheese? - AWS Seattle User Group
Is Serverless The New Swiss Cheese? - AWS Seattle User GroupChase Douglas
 
The Netflix API for a global service
The Netflix API for a global serviceThe Netflix API for a global service
The Netflix API for a global serviceKatharina Probst
 
Principles Of Chaos Engineering - Chaos Engineering Hamburg
Principles Of Chaos Engineering - Chaos Engineering HamburgPrinciples Of Chaos Engineering - Chaos Engineering Hamburg
Principles Of Chaos Engineering - Chaos Engineering HamburgNils Meder
 
Security as Code: DOES15
Security as Code: DOES15Security as Code: DOES15
Security as Code: DOES15Ed Bellis
 
Scaling Your First 1000 Containers with Docker
Scaling Your First 1000 Containers with DockerScaling Your First 1000 Containers with Docker
Scaling Your First 1000 Containers with DockerAtlassian
 
Micro Service – The New Architecture Paradigm
Micro Service – The New Architecture ParadigmMicro Service – The New Architecture Paradigm
Micro Service – The New Architecture ParadigmEberhard Wolff
 
Nordstrom Data Lab Recommendo API with Node.js
Nordstrom Data Lab Recommendo API with Node.jsNordstrom Data Lab Recommendo API with Node.js
Nordstrom Data Lab Recommendo API with Node.jsDavid Von Lehman
 
Saturn 2014. Engineering Velocity: Continuous Delivery at Netflix
Saturn 2014. Engineering Velocity: Continuous Delivery at NetflixSaturn 2014. Engineering Velocity: Continuous Delivery at Netflix
Saturn 2014. Engineering Velocity: Continuous Delivery at NetflixDianne Marsh
 
Continuous Integration and Deployment Best Practices on AWS
Continuous Integration and Deployment Best Practices on AWS Continuous Integration and Deployment Best Practices on AWS
Continuous Integration and Deployment Best Practices on AWS Amazon Web Services
 
AWS Meetup - Nordstrom Data Lab and the AWS Cloud
AWS Meetup - Nordstrom Data Lab and the AWS CloudAWS Meetup - Nordstrom Data Lab and the AWS Cloud
AWS Meetup - Nordstrom Data Lab and the AWS CloudNordstromDataLab
 

Tendances (20)

NLUUG print conference May 26 2016
NLUUG print conference May 26 2016NLUUG print conference May 26 2016
NLUUG print conference May 26 2016
 
ElasticBeanstalk で新規事業を爆速ローンチする
ElasticBeanstalk で新規事業を爆速ローンチするElasticBeanstalk で新規事業を爆速ローンチする
ElasticBeanstalk で新規事業を爆速ローンチする
 
Amazon inspector で自動セキュリティ診断
Amazon inspector で自動セキュリティ診断Amazon inspector で自動セキュリティ診断
Amazon inspector で自動セキュリティ診断
 
Micro Services - Small is Beautiful
Micro Services - Small is BeautifulMicro Services - Small is Beautiful
Micro Services - Small is Beautiful
 
Immutable Infrastructure: Rise of the Machine Images
Immutable Infrastructure: Rise of the Machine ImagesImmutable Infrastructure: Rise of the Machine Images
Immutable Infrastructure: Rise of the Machine Images
 
Event driven infrastructure
Event driven infrastructureEvent driven infrastructure
Event driven infrastructure
 
Heroku
HerokuHeroku
Heroku
 
Riot Games Scalable Data Warehouse Lecture at UCSB / UCLA
Riot Games Scalable Data Warehouse Lecture at UCSB / UCLARiot Games Scalable Data Warehouse Lecture at UCSB / UCLA
Riot Games Scalable Data Warehouse Lecture at UCSB / UCLA
 
How Netflix thinks of DevOps. Spoiler: we don’t.
How Netflix thinks of DevOps. Spoiler: we don’t.How Netflix thinks of DevOps. Spoiler: we don’t.
How Netflix thinks of DevOps. Spoiler: we don’t.
 
Deploy Faster Without Failing Faster - Metrics-Driven - Dynatrace User Groups...
Deploy Faster Without Failing Faster - Metrics-Driven - Dynatrace User Groups...Deploy Faster Without Failing Faster - Metrics-Driven - Dynatrace User Groups...
Deploy Faster Without Failing Faster - Metrics-Driven - Dynatrace User Groups...
 
Is Serverless The New Swiss Cheese? - AWS Seattle User Group
Is Serverless The New Swiss Cheese? - AWS Seattle User GroupIs Serverless The New Swiss Cheese? - AWS Seattle User Group
Is Serverless The New Swiss Cheese? - AWS Seattle User Group
 
The Netflix API for a global service
The Netflix API for a global serviceThe Netflix API for a global service
The Netflix API for a global service
 
Principles Of Chaos Engineering - Chaos Engineering Hamburg
Principles Of Chaos Engineering - Chaos Engineering HamburgPrinciples Of Chaos Engineering - Chaos Engineering Hamburg
Principles Of Chaos Engineering - Chaos Engineering Hamburg
 
Security as Code: DOES15
Security as Code: DOES15Security as Code: DOES15
Security as Code: DOES15
 
Scaling Your First 1000 Containers with Docker
Scaling Your First 1000 Containers with DockerScaling Your First 1000 Containers with Docker
Scaling Your First 1000 Containers with Docker
 
Micro Service – The New Architecture Paradigm
Micro Service – The New Architecture ParadigmMicro Service – The New Architecture Paradigm
Micro Service – The New Architecture Paradigm
 
Nordstrom Data Lab Recommendo API with Node.js
Nordstrom Data Lab Recommendo API with Node.jsNordstrom Data Lab Recommendo API with Node.js
Nordstrom Data Lab Recommendo API with Node.js
 
Saturn 2014. Engineering Velocity: Continuous Delivery at Netflix
Saturn 2014. Engineering Velocity: Continuous Delivery at NetflixSaturn 2014. Engineering Velocity: Continuous Delivery at Netflix
Saturn 2014. Engineering Velocity: Continuous Delivery at Netflix
 
Continuous Integration and Deployment Best Practices on AWS
Continuous Integration and Deployment Best Practices on AWS Continuous Integration and Deployment Best Practices on AWS
Continuous Integration and Deployment Best Practices on AWS
 
AWS Meetup - Nordstrom Data Lab and the AWS Cloud
AWS Meetup - Nordstrom Data Lab and the AWS CloudAWS Meetup - Nordstrom Data Lab and the AWS Cloud
AWS Meetup - Nordstrom Data Lab and the AWS Cloud
 

En vedette

Web Scale Applications using NeflixOSS Cloud Platform
Web Scale Applications using NeflixOSS Cloud PlatformWeb Scale Applications using NeflixOSS Cloud Platform
Web Scale Applications using NeflixOSS Cloud PlatformSudhir Tonse
 
Open Business Conference: Continuous Delivery At Netflix -- Powered by Open S...
Open Business Conference: Continuous Delivery At Netflix -- Powered by Open S...Open Business Conference: Continuous Delivery At Netflix -- Powered by Open S...
Open Business Conference: Continuous Delivery At Netflix -- Powered by Open S...Dianne Marsh
 
The Journey of Chaos Engineering Begins with a Single Step
The Journey of Chaos Engineering Begins with a Single StepThe Journey of Chaos Engineering Begins with a Single Step
The Journey of Chaos Engineering Begins with a Single StepBruce Wong
 
Transition::IT -- Leadership and Cultural Change
Transition::IT -- Leadership and Cultural ChangeTransition::IT -- Leadership and Cultural Change
Transition::IT -- Leadership and Cultural Changemike d. kail
 
Automated Fault Tolerance Testing
Automated Fault Tolerance TestingAutomated Fault Tolerance Testing
Automated Fault Tolerance TestingAjay Kumar Vaddadi
 
Past present and future of Recommender Systems: an Industry Perspective
Past present and future of Recommender Systems: an Industry PerspectivePast present and future of Recommender Systems: an Industry Perspective
Past present and future of Recommender Systems: an Industry PerspectiveXavier Amatriain
 
Scaling the Cloud - Cloud Security
Scaling the Cloud - Cloud SecurityScaling the Cloud - Cloud Security
Scaling the Cloud - Cloud SecurityBill Burns
 
Netflix security monkey overview
Netflix security monkey overviewNetflix security monkey overview
Netflix security monkey overviewRyan Hodgin
 
Big Data Testing: Ensuring MongoDB Data Quality
Big Data Testing: Ensuring MongoDB Data QualityBig Data Testing: Ensuring MongoDB Data Quality
Big Data Testing: Ensuring MongoDB Data QualityRTTS
 
AWS Re:Invent 2012 - Chaos Monkey & The Netflix Simian Army
AWS Re:Invent 2012 - Chaos Monkey & The Netflix Simian ArmyAWS Re:Invent 2012 - Chaos Monkey & The Netflix Simian Army
AWS Re:Invent 2012 - Chaos Monkey & The Netflix Simian ArmyAriel Tseitlin
 
From resilient to antifragile - Chaos Engineering Primer DevSecCon
From resilient to antifragile - Chaos Engineering Primer DevSecConFrom resilient to antifragile - Chaos Engineering Primer DevSecCon
From resilient to antifragile - Chaos Engineering Primer DevSecConSergiu Bodiu
 
PagerDuty | OSCON 2016 Failure Testing
PagerDuty | OSCON 2016 Failure TestingPagerDuty | OSCON 2016 Failure Testing
PagerDuty | OSCON 2016 Failure TestingPagerDuty
 

En vedette (13)

Web Scale Applications using NeflixOSS Cloud Platform
Web Scale Applications using NeflixOSS Cloud PlatformWeb Scale Applications using NeflixOSS Cloud Platform
Web Scale Applications using NeflixOSS Cloud Platform
 
Open Business Conference: Continuous Delivery At Netflix -- Powered by Open S...
Open Business Conference: Continuous Delivery At Netflix -- Powered by Open S...Open Business Conference: Continuous Delivery At Netflix -- Powered by Open S...
Open Business Conference: Continuous Delivery At Netflix -- Powered by Open S...
 
The Journey of Chaos Engineering Begins with a Single Step
The Journey of Chaos Engineering Begins with a Single StepThe Journey of Chaos Engineering Begins with a Single Step
The Journey of Chaos Engineering Begins with a Single Step
 
Transition::IT -- Leadership and Cultural Change
Transition::IT -- Leadership and Cultural ChangeTransition::IT -- Leadership and Cultural Change
Transition::IT -- Leadership and Cultural Change
 
Automated Fault Tolerance Testing
Automated Fault Tolerance TestingAutomated Fault Tolerance Testing
Automated Fault Tolerance Testing
 
Past present and future of Recommender Systems: an Industry Perspective
Past present and future of Recommender Systems: an Industry PerspectivePast present and future of Recommender Systems: an Industry Perspective
Past present and future of Recommender Systems: an Industry Perspective
 
Scaling the Cloud - Cloud Security
Scaling the Cloud - Cloud SecurityScaling the Cloud - Cloud Security
Scaling the Cloud - Cloud Security
 
Netflix security monkey overview
Netflix security monkey overviewNetflix security monkey overview
Netflix security monkey overview
 
Big Data Testing: Ensuring MongoDB Data Quality
Big Data Testing: Ensuring MongoDB Data QualityBig Data Testing: Ensuring MongoDB Data Quality
Big Data Testing: Ensuring MongoDB Data Quality
 
AWS Re:Invent 2012 - Chaos Monkey & The Netflix Simian Army
AWS Re:Invent 2012 - Chaos Monkey & The Netflix Simian ArmyAWS Re:Invent 2012 - Chaos Monkey & The Netflix Simian Army
AWS Re:Invent 2012 - Chaos Monkey & The Netflix Simian Army
 
From resilient to antifragile - Chaos Engineering Primer DevSecCon
From resilient to antifragile - Chaos Engineering Primer DevSecConFrom resilient to antifragile - Chaos Engineering Primer DevSecCon
From resilient to antifragile - Chaos Engineering Primer DevSecCon
 
PagerDuty | OSCON 2016 Failure Testing
PagerDuty | OSCON 2016 Failure TestingPagerDuty | OSCON 2016 Failure Testing
PagerDuty | OSCON 2016 Failure Testing
 
Culture
CultureCulture
Culture
 

Similaire à I Don't Test Often ...

Performance Metrics for your Build Pipeline - presented at Vienna WebPerf Oct...
Performance Metrics for your Build Pipeline - presented at Vienna WebPerf Oct...Performance Metrics for your Build Pipeline - presented at Vienna WebPerf Oct...
Performance Metrics for your Build Pipeline - presented at Vienna WebPerf Oct...Andreas Grabner
 
London Atlassian User Group - February 2014
London Atlassian User Group - February 2014London Atlassian User Group - February 2014
London Atlassian User Group - February 2014Steve Smith
 
Cloud patterns forwardjs April Ottawa 2019
Cloud patterns forwardjs April Ottawa 2019Cloud patterns forwardjs April Ottawa 2019
Cloud patterns forwardjs April Ottawa 2019Taswar Bhatti
 
Continuous Delivery with NetflixOSS
Continuous Delivery with NetflixOSSContinuous Delivery with NetflixOSS
Continuous Delivery with NetflixOSSDaniel Woods
 
8 cloud design patterns you ought to know - Update Conference 2018
8 cloud design patterns you ought to know - Update Conference 20188 cloud design patterns you ought to know - Update Conference 2018
8 cloud design patterns you ought to know - Update Conference 2018Taswar Bhatti
 
Test driven infrastructure development (2 - puppetconf 2013 edition)
Test driven infrastructure development (2 - puppetconf 2013 edition)Test driven infrastructure development (2 - puppetconf 2013 edition)
Test driven infrastructure development (2 - puppetconf 2013 edition)Tomas Doran
 
20111110 how puppet-fits_into_your_existing_infrastructure_and_change_managem...
20111110 how puppet-fits_into_your_existing_infrastructure_and_change_managem...20111110 how puppet-fits_into_your_existing_infrastructure_and_change_managem...
20111110 how puppet-fits_into_your_existing_infrastructure_and_change_managem...garrett honeycutt
 
Tech Talk on Cloud Computing
Tech Talk on Cloud ComputingTech Talk on Cloud Computing
Tech Talk on Cloud ComputingITviec
 
Creating airplane mode proof (Xamarin) applications
Creating airplane mode proof (Xamarin) applicationsCreating airplane mode proof (Xamarin) applications
Creating airplane mode proof (Xamarin) applicationsGerald Versluis
 
Resiliency through Failure @ OSCON 2013
Resiliency through Failure @ OSCON 2013Resiliency through Failure @ OSCON 2013
Resiliency through Failure @ OSCON 2013Ariel Tseitlin
 
Design for Scale / Surge 2010
Design for Scale / Surge 2010Design for Scale / Surge 2010
Design for Scale / Surge 2010Christopher Brown
 
Virtual Flink Forward 2020: How Streaming Helps Your Staging Environment and ...
Virtual Flink Forward 2020: How Streaming Helps Your Staging Environment and ...Virtual Flink Forward 2020: How Streaming Helps Your Staging Environment and ...
Virtual Flink Forward 2020: How Streaming Helps Your Staging Environment and ...Flink Forward
 
Immutable infrastructure isn’t the answer
Immutable infrastructure isn’t the answerImmutable infrastructure isn’t the answer
Immutable infrastructure isn’t the answerSam Bashton
 
MongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
MongoDB World 2018: Tutorial - MongoDB Meets Chaos MonkeyMongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
MongoDB World 2018: Tutorial - MongoDB Meets Chaos MonkeyMongoDB
 
Continuous Deployment of your Application - SpringOne Tour Dallas
Continuous Deployment of your Application - SpringOne Tour DallasContinuous Deployment of your Application - SpringOne Tour Dallas
Continuous Deployment of your Application - SpringOne Tour DallasVMware Tanzu
 
Continuous Deployment of your Application @SpringOne
Continuous Deployment of your Application @SpringOneContinuous Deployment of your Application @SpringOne
Continuous Deployment of your Application @SpringOneciberkleid
 
Why real integration developers ride Camels
Why real integration developers ride CamelsWhy real integration developers ride Camels
Why real integration developers ride CamelsChristian Posta
 
Continuous Deployment of your Application @jSession#5
Continuous Deployment of your Application @jSession#5Continuous Deployment of your Application @jSession#5
Continuous Deployment of your Application @jSession#5Marcin Grzejszczak
 

Similaire à I Don't Test Often ... (20)

Performance Metrics for your Build Pipeline - presented at Vienna WebPerf Oct...
Performance Metrics for your Build Pipeline - presented at Vienna WebPerf Oct...Performance Metrics for your Build Pipeline - presented at Vienna WebPerf Oct...
Performance Metrics for your Build Pipeline - presented at Vienna WebPerf Oct...
 
London Atlassian User Group - February 2014
London Atlassian User Group - February 2014London Atlassian User Group - February 2014
London Atlassian User Group - February 2014
 
Cloud patterns forwardjs April Ottawa 2019
Cloud patterns forwardjs April Ottawa 2019Cloud patterns forwardjs April Ottawa 2019
Cloud patterns forwardjs April Ottawa 2019
 
Continuous Delivery with NetflixOSS
Continuous Delivery with NetflixOSSContinuous Delivery with NetflixOSS
Continuous Delivery with NetflixOSS
 
8 cloud design patterns you ought to know - Update Conference 2018
8 cloud design patterns you ought to know - Update Conference 20188 cloud design patterns you ought to know - Update Conference 2018
8 cloud design patterns you ought to know - Update Conference 2018
 
Test driven infrastructure development (2 - puppetconf 2013 edition)
Test driven infrastructure development (2 - puppetconf 2013 edition)Test driven infrastructure development (2 - puppetconf 2013 edition)
Test driven infrastructure development (2 - puppetconf 2013 edition)
 
20111110 how puppet-fits_into_your_existing_infrastructure_and_change_managem...
20111110 how puppet-fits_into_your_existing_infrastructure_and_change_managem...20111110 how puppet-fits_into_your_existing_infrastructure_and_change_managem...
20111110 how puppet-fits_into_your_existing_infrastructure_and_change_managem...
 
Tech Talk on Cloud Computing
Tech Talk on Cloud ComputingTech Talk on Cloud Computing
Tech Talk on Cloud Computing
 
Creating airplane mode proof (Xamarin) applications
Creating airplane mode proof (Xamarin) applicationsCreating airplane mode proof (Xamarin) applications
Creating airplane mode proof (Xamarin) applications
 
Resiliency through Failure @ OSCON 2013
Resiliency through Failure @ OSCON 2013Resiliency through Failure @ OSCON 2013
Resiliency through Failure @ OSCON 2013
 
Design for Scale / Surge 2010
Design for Scale / Surge 2010Design for Scale / Surge 2010
Design for Scale / Surge 2010
 
Virtual Flink Forward 2020: How Streaming Helps Your Staging Environment and ...
Virtual Flink Forward 2020: How Streaming Helps Your Staging Environment and ...Virtual Flink Forward 2020: How Streaming Helps Your Staging Environment and ...
Virtual Flink Forward 2020: How Streaming Helps Your Staging Environment and ...
 
Immutable infrastructure isn’t the answer
Immutable infrastructure isn’t the answerImmutable infrastructure isn’t the answer
Immutable infrastructure isn’t the answer
 
MongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
MongoDB World 2018: Tutorial - MongoDB Meets Chaos MonkeyMongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
MongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
 
Continuous Deployment of your Application - SpringOne Tour Dallas
Continuous Deployment of your Application - SpringOne Tour DallasContinuous Deployment of your Application - SpringOne Tour Dallas
Continuous Deployment of your Application - SpringOne Tour Dallas
 
Continuous Deployment of your Application @SpringOne
Continuous Deployment of your Application @SpringOneContinuous Deployment of your Application @SpringOne
Continuous Deployment of your Application @SpringOne
 
ChaosEngineeringITEA.pptx
ChaosEngineeringITEA.pptxChaosEngineeringITEA.pptx
ChaosEngineeringITEA.pptx
 
Why real integration developers ride Camels
Why real integration developers ride CamelsWhy real integration developers ride Camels
Why real integration developers ride Camels
 
Codemgmt
CodemgmtCodemgmt
Codemgmt
 
Continuous Deployment of your Application @jSession#5
Continuous Deployment of your Application @jSession#5Continuous Deployment of your Application @jSession#5
Continuous Deployment of your Application @jSession#5
 

Dernier

What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 

Dernier (20)

What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 

I Don't Test Often ...

Notes de l'éditeur

  1. Notes
  2. Hi, everyone. Thanks for coming today. This is my first keynote, so I hope I can do it justice ! Just to set some expectations - I heard that at Matt’s keynote last year, he was giving out $20 bills. I’m afraid I don’t have any cash, but I do have plenty of snazzy Simian Army stickers up here. I’m going to talk about some big testing challenges that we face at Netflix. In particular, we have such a large and complex distributed system that that testing it exhaustively in an isolated environment is next to impossible. To meet that challenge we came up with a few different approaches, and I’m going to talk about three of those today: the Simian Army, which is a set of tools that induces failures in production; code coverage analysis on production servers; and using canaries to test new versions in production. I’ll spend a bit of time talking about Netflix and our streaming service to set up the problem space, then go over some of the things we need to test for and how more traditional test practices can fall short. Finally I’ll go into some detail on the tools themselves.
  3. A little bit about me. I’ve been with Netflix for 4 1/2 years and I’m part of our Engineering Tools team. We’re responsible for developer productivity, with the goal that any engineer can build and deploy their code with the minimum possible effort. Before Netflix I spent a long time in test engineering and technical operations, so once I got to Netflix I was fascinated to see how such a complex system gets tested. Let’s take a look at how it all works.
  4. Who here ISN’T familiar with Netflix ? Any customers ? Thanks very much ! Netflix is first & foremost an entertainment company, but you can also look at us as an engineering company that creates all the technology to serve up that entertainment, and also collects a ton of data on who watches what, when they watch it, and how much they watch. We continuously analyze all that data to improve our customers’ entertainment experience by making it easy for them to find things they want to watch, and making sure they have a top quality viewing experience when they get comfy on the sofa (or the bus, or in the park, or wherever they can get connected). So there’s a lot of engineering that goes on behind the scenes to make all that possible.
  5. Some data that might be new to some of you guys. Our membership is growing fast; about two thirds of our members are in the US, but we’re now in more than 50 other countries - all countries in North and South America, plus most of Europe. The amount of content our viewers are watching is growing, too. And we’re doing our best to break the internet.
  6. This is an overview of our current architecture. 2 billion requests flow in from all kinds of connected devices - game consoles, PCs and Macs, phones, tablets, TVs, DVD players, and more. Those generate 12 billion outbound requests to individual services. The diagram shows some of the main ones: the personalization engine that recommends what to watch based on your viewing history and ratings, movie metadata to give you information about what you’re watching, and one of the most important - the A/B test engine that lets us serve up a different customer experience to a set of users and measure its effect on how much they watch, what they watch and when the watch it.
  7. I like to pause for audience reaction here. Anyone care to suggest what this diagram shows ? We call it the “Shock and Awe” diagram at Netflix ! It’s generated by one of our monitoring systems and shows the interconnections between all of our services and data. Don’t look too hard, you won’t make out much detail - it’s only meant to illustrate the complexity of the system.
  8. We run our production systems on Amazon Web Services, which many of you are probably familiar with. We’re one of AWS’ biggest customers - apart from Amazon itself, who uses AWS to power their e-commerce sites as well as their streaming service that competes with Netflix.
  9. Using Amazon Web Services lets us stop worrying about procurement of hardware - servers, network switches, storage, firewalls, load balancers ... AWS allows us to scale up and down without worrying about exceeding or underusing data center capacity. And since every AWS service has an API, we can automate our deployments and throw away all those runbooks. Each AWS service is available in multiple regions (geographic areas) and multiple availability zones (data centers) within each region.
  10. So now that I’ve described how everything works at a really high level, here’s a big problem - in actual fact, there’s nobody who knows how it all works in depth. Although we have world experts in areas such as personalization, video encoding and machine learning, there’s just too much going on and it’s changing too fast for any individual to keep up.
  11. AWS is increasingly reliable, but it has had some fairly spectacular outages as well as many smaller ones. When you run on a cloud platform that’s not under your control, you have to be able to cope with these outages. On June 29th 2012, a storm caused a widespread power outage in Northern Virginia that took out many instances and database servers. Netflix streaming was affected for a while.
  12. An even bigger outage happened on Christmas Eve 2012, when an Amazon engineer made a mistake that took out many Elastic Load Balancers in the US East region, which at that time was Netflix’ primary region for serving traffic. ELBs aren’t replicated between availability zones, they only apply to a given region. Many Netflix customers were affected; luckily for us, Christmas Eve is a much less busy day than Christmas Day when everyone gets given Netflix subscriptions, and the problem was fixed by the 25th.
  13. In late September this year, AWS restarted a large number of instances, in multiple regions, to patch a security bug in the virtualization software that the instances run on. This time we were hardly affected at all and there was negligible impact on customers - again, more in our tech blog post.
  14. Given those kinds of problems, we need to work pretty hard to deal with them. Netflix wants our 50 million plus members to be able to play movies and TV shows whenever and wherever they want. We also want to make it as simple and fast as possible for people to sign up and start using their new subscriptions.
  15. One little digression that’s an important part of how we meet these challenges - we couldn’t do it without our company culture, which we’re quite proud of - proud enough to publish the 126-slide deck I’ve linked here. All those slides can be boiled down to one key takeaway, which we call Freedom and Responsibility.
  16. Here’s how the “freedom & responsibility” principle applies to our technical development and deployment. Freedom - engineers deploy when and how often they need to, and control their own production capacity and scaling. Responsibility - every engineer in each service team is in the PagerDuty rotation in case things go wrong.
  17. So how can these teams be confident that their new versions will still work with all their dependencies, under all kinds of failure conditions ? It’s a very tough problem.
  18. Failure is unavoidable. We already saw some ways that our AWS platform can fail. Add to that bugs in our own software, and the inevitable human errors that you get when there are actual people involved in the development and deployment pipeline.
  19. We can do a lot to make our code handle failure gracefully. We can catch errors as exceptions and make sure they are handled in a way that doesn’t crash the code. We can run multiple instances of our services to avoid single points of failure. And we can use technologies such as the circuit breaker pattern to have services provide a degraded experience if one of their dependent services goes offline. For example, if our recommendation service is unavailable, we don’t show you a blank list of recommendations on your Netflix page - we fall back to a list of the most-watched content.
  20. But that only gets us so far. We want to make sure all our features work properly without waiting for customers to tell us they don’t. We want to know that we are as resilient as we think we are, without waiting for an outage to happen. Given the scale we run at, it’s effectively impossible to create a realistic test system for running load and reliability tests. We also want to be sure that the configuration of our services doesn’t diverge as we redeploy them over time - this can lead to errors that are very hard to debug given the thousands of instances that we run in production.
  21. So, most of you guys are testers and probably have this reaction - let’s do more testing ! But can we effectively simulate such a large- scale distributed system - and what’s more, can we predict every possible failure mode and encode it into our tests ?
  22. Today’s large internet systems have become too big and complex to just rely on traditional testing - but don’t get me wrong, all those types of testing I just mentioned have a very important place at Netflix, and we have some of the best test engineers in the business working on them. But here are some of the things they struggle with. It’s very hard to find realistic test data. It would be hugely expensive for us to create a similarly-sized copy of our production system for testing - not quite “copying the internet”, but getting there. Because teams deploy their own changes on different schedules, it’s difficult to keep up with changes and code them into integration tests.
  23. So we came up with the idea of deliberately triggering failures in production, to augment our more traditional testing. By causing our own failures on a known schedule, we can be prepared to deal with their effects and test our assumptions in a predictable way, rather than having a fire drill when a “real” outage happens.
  24. So with all that context done with, let’s take a look at the lovable monkeys who make up our Simian Army.
  25. Chaos Monkey was the one who started it all.
  26. Chaos Monkey has been around in some form for about 5 years. It’s a service that looks for groups of instances (known as clusters) of each of our services and picks a random instance to terminate, on a defined schedule and with a defined probability of termination. This simulates a fairly frequent thing in AWS (although not nearly as frequent as it used to be) where instances are terminated unexpectedly, usually due to a failure in the underlying hardware. We run Chaos Monkey during business hours so that engineers are on hand to diagnose and fix problems,rather than getting a 3am page. If you deploy a new Netflix service, Chaos Monkey will be enabled for it unless you explicitly turn it off. We’ve got to a point where Chaos Monkey instance terminations go virtually unnoticed.
  27. We didn’t want to stop once we were happy that we could deal with individual instances dying.
  28. Gorillas are bigger than monkeys and can carry bigger weapons.
  29. Chaos Gorilla takes out an entire Availability Zone. AWS has multiple regions in different parts of the world, such Eastern USA, Western USA, Asia Pacific and Western Europe. Each region has multiple Availability Zones. These are equivalent to physical data centers in different geographic locations - for example, the US East region is located in Virginia but has 3 separate Availability Zones. Running Chaos Gorilla ensures that our service is running correctly in multiple Availability Zones, and that we have sufficient capacity in each zone to handle our traffic load. Runs of Chaos Gorilla are announced ahead of time, and our Reliability Engineering team sets up an incident room where engineers from each service team can watch progress.
  30. So what next ? As we already picked the gorilla, we had to resort to a fictional creature to cause even bigger chaos.
  31. Once we were happy that we could survive an Availability Zone outage, we wanted to go a step further and see if we could cope with an entire region being taken out. This hasn’t happened in reality yet, but there’s a small possibility that it could - for example, the us-west-1 region is in Northern California, so a really big earthquake could feasibly take out all of its Availability Zones. And the Elastic Load Balancer outage at the end of 2012 did have the effect of bringing down a key service in an entire region. To handle this, we had to rearchitect to an “active-active” setup where we have complete copies of our services and data running in two different regions. If an Availability Zone goes down we just have to make sure we have enough capacity in the surviving zones to handle all our traffic, but if we lose a region we also have to reroute all traffic to the backup region. Chaos Kong gets an outing every few months, again with an incident room where engineers can watch progress and react to any problems.
  32. So we can deal with instances disappearing individually, or in large numbers. But what if instances are still there, but running in a degraded state ? Because our architecture involves so many interdependent services, we have to be careful that problems with one service don’t cascade to other services.
  33. This is where Latency Monkey comes in. It can simulate multiple types of degradation: network connections maxing out, high CPU loads, high disk I/O and running out of memory. Degradation in a service oriented architecture is extremely hard to test exhaustively. With Latency Monkey we can introduce degradation in a controlled way (like the recommendations example I mentioned earlier), find any problems in dependent services and fix them, then verify those fixes. Some service teams even discovered dependencies they didn’t know they had by running Latency Monkey and finding unexpected degradation in the dependent services.
  34. One other key aspect of running so many services, all with dozens or hundreds of instances, is that it’s very important to keep all of the instances consistent. They should all have the same system configuration and the same version and configuration of the service, for example. This decreases the complexity / surface area of the testing
  35. We use Conformity Monkey to automate these checks. It runs over all services at a fixed interval, and notifies service owners when any of the conditions are not met. An email is sent containing a list of rule violations, each with a list of non-conforming instances. Here are a few examples of the things we check: Instances should be in the correct security groups so that they are reachable by other services and our monitoring and deployment tools. Instances shouldn’t have been running for more than a given time, which depends on how often the service is deployed. Older instances could be running an out of date version of the service. Instances should all have a valid health check URL so that our monitoring tools can know whether they are running properly.
  36. We just came up with a system called FIT, for Failure Injection Testing. We need to come up with a monkey for this one ! Latency Monkey injects delays or failures on the server side and thus affects all calling services. If all those calling services don’t have proper fallbacks and timeouts implemented, they can stop working and impact customers - which we obviously want to avoid. FIT allows failures to be simulated on the client side. We can add failure data to a specific set of API calls and propagate it through the system so that only the services we want to test are affected. We usually start with a specific test customer account or a particular client device, then dial up the failures to affect more and more production traffic if the initial results look good. Check the Netflix Tech Blog for more details.
  37. You can try the Monkeys out for yourself by going to the Netflix GitHub page. There’s an active community of users, and Netflix engineers regularly monitor the mailing list. Chaos, Conformity, Janitor and Security Monkeys are currently available, with more to come. For those of you not using AWS but with an in house VMWare setup - you can use the Monkeys too, thanks to some of our open source contributors.
  38. There’s a lot more we want to do with the Monkeys. We’d like to have a way to induce failures that are more chaotic than the individual instances that Chaos Monkey knocks out, but less impactful than having Chaos Gorilla take out an entire availability zone. We’re constantly on the lookout for new failure modes that we can trigger - hopefully before they happen in the wild. We’re working on an effort to increase the frequency and reach of monkey runs. This is in response to some interesting data - our uptime degraded when we ran the monkeys less frequently. Eventually, we’d like to make chaos testing in large distributed systems as well understood and commonly practiced as regular regression testing.
  39. After the mass instance reboot I mentioned early, Amazon themselves recommended that AWS customers use Chaos Monkey to test their resilience. High praise indeed !
  40. Let’s move on to our second way of testing in production - code coverage analysis. Our view is that if you’re not doing it in prod, you’re missing out on a ton of useful data.
  41. In contrast to just showing what code paths are covered by test, we get data on the paths which are actually used in production, plus how often each path is run. We compare the production code coverage data with the results from our test environment. This enables us to focus our testing, for example by identifying commonly used code paths with low test coverage. We can also find dead code that is never used in production, and remove that code and its tests to make maintenance easier.
  42. All of our services run on the JVM, so we needed a Java code coverage tool. We picked Cobertura because it counts how many times each line of code is executed, in contrast to most other tools which just give you a binary result of whether or not the line was executed. We put the Cobertura JAR files on our base machine image that all of the services build on, and set a flag at runtime to enable code coverage analysis. We’ll usually run code coverage on a single instance in a service cluster, and leave it running for a day or so to make sure that all the code paths are hit. We’ve found the performance hit to be very low - typically less than 5% degradation in performance.
  43. Our third way of testing in production is the use of canary deployments. The term came from coal mining, when miners wanted to detect dangerous levels of coal gas in the mine shaft. They would take down a canary, which was very sensitive to the gas; if the canary keeled over or died, it was time to get out of there before the miners did the same.
  44. Automated rollback happens near beginning of canary if there is a drastic regression - we’re still learning what is “bad enough”, varies by team. If regression is less severe, team will make decision to put the new service in prod, or cancel the push. Each team can have a different level of tolerance for regressions. Teams that deploy very frequently will tend to have a lower tolerance than teams who deploy less often and maybe don’t have their automated analysis fully developed.
  45. And that’s the end of my talk; before we go to some questions, I’d like to give a big thanks to the organizers for putting on such a great conference, and to all of you for coming. I’ve been really impressed by what a great testing community you have here.