@garethbowles
Release the Monkeys!
Testing in the Wild @Netflix
Building distributed systems is hard 
Testing them exhaustively is even harder 
Gareth Bowles
Netflix is the world’s leading Internet 
television network with more than 50 million 
members in 50 countries enjoying more 
than one billion hours of TV shows and 
movies per month. 
We account for up to 34% of downstream 
US internet traffic. 
Source: http://ir.netflix.com
[Diagram: the Netflix API and its dependencies: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine. 2B requests per day into the Netflix API; 12B outbound requests per day to API dependencies.]
A Complex Distributed 
System
Our Deployment Platform
What AWS Provides 
• Machine Images (AMI) 
• Instances (EC2) 
• Elastic Load Balancers 
• Security groups / Autoscaling groups 
• Availability zones and regions
Our system has become so complex that…
…No (single) body knows how everything works.
How AWS Can Go Wrong - 1
• Service goes down in one or more
availability zones
• 6/29/12 - storm-related power outage
caused loss of EC2 and RDS instances in
Eastern US
• https://gigaom.com/2012/06/29/some-of-amazon-web-services-are-down-again/
How AWS Can Go Wrong - 2 
• Loss of service in an entire region 
• 12/24/12 - operator error caused loss of 
multiple ELBs in Eastern US 
• http://techblog.netflix.com/2012/12/a-closer-look-at-christmas-eve-outage.html
How AWS Can Go Wrong - 3 
• Large number of instances get rebooted 
• 9/25/14 to 9/30/14 - rolling reboot of 1000s 
of instances to patch a security bug 
• http://techblog.netflix.com/2014/10/a-state-of-xen-chaos-monkey-cassandra.html
Our Goal is Availability 
• Members can stream Netflix whenever they 
want 
• New users can explore and sign up 
• New members can activate their service 
and add devices
http://www.slideshare.net/reed2001/culture-1798664
Freedom and Responsibility 
• Developers deploy when 
they want 
• They also manage their 
own capacity and 
autoscaling 
• And are on-call to fix 
anything that breaks at 
3am!
How the heck do you test this stuff?
Failure is All Around Us 
• Disks fail 
• Power goes out - and your backup 
generator fails 
• Software bugs are introduced 
• People make mistakes
Design to Avoid Failure 
• Exception handling 
• Redundancy 
• Fallback or degraded experience (circuit 
breakers) 
• But is it enough?
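The fallback / circuit breaker bullet can be illustrated with a minimal sketch. This is not Netflix's implementation; the class name, the thresholds and the injectable `clock` parameter are assumptions made for the example. After a configurable number of consecutive failures the breaker stops calling the dependency and serves the fallback until a cool-down has elapsed, then lets one trial call through.

```python
import time


class CircuitBreaker:
    """Hypothetical minimal circuit breaker: fail fast after repeated errors."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds the circuit stays open
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None

    def call(self, primary, fallback):
        # While the circuit is open, skip the dependency and serve the fallback.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # cool-down elapsed: allow one trial call
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        self.failures = 0  # a success closes the circuit again
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_ratings, lambda: [])`, so that members get a degraded page instead of an error when the ratings service is down (`fetch_ratings` is a hypothetical function).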
It’s Not Enough 
• How do we know we’ve succeeded?
• Does the system work as designed?
• Is it as resilient as we believe?
• How do we avoid drifting into failure?
More Testing!
• Unit 
• Integration 
• Stress 
• Exhaustive testing to simulate all failure 
modes
Exhaustive Testing ~ 
Impossible 
• Massive, rapidly changing data sets 
• Internet scale traffic 
• Complex interaction and information flow 
• Independently-controlled services 
• All while innovating and building features
Another Way 
• Cause failure deliberately to validate 
resiliency 
• Test design assumptions by stressing them 
• Don’t wait for random failure. Remove its 
uncertainty by forcing it regularly
Introducing the Army
Chaos Monkey 
• The original Monkey (2009) 
• Randomly terminates instances in a cluster 
• Simulates failures inherent to running in the 
cloud 
• During business hours 
• Default for production services
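The selection logic described above can be sketched in a few lines. This is an illustration only, not the SimianArmy code: the function names, the 9-to-15 weekday window and the `probability` parameter are assumptions, and the sketch only picks a victim; actually terminating it is left to the caller.

```python
import random
from datetime import datetime


def in_business_hours(now, start_hour=9, end_hour=15):
    """True on weekdays between start_hour (inclusive) and end_hour (exclusive)."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victim(cluster, now, probability=1.0, rng=random):
    """Return one instance ID to terminate, or None if nothing should be killed.

    cluster: list of instance IDs in one cluster (hypothetical input).
    probability: chance that this run terminates anything at all.
    """
    if not cluster or not in_business_hours(now):
        return None
    if rng.random() >= probability:
        return None
    return rng.choice(cluster)
```

Restricting runs to business hours means engineers are at their desks when an instance disappears, which is the point of forcing the failure rather than waiting for it.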
What did we do once we
were able to handle Chaos
Monkey?
Bring in bigger Monkeys!
Chaos Gorilla 
• Simulate an Availability Zone becoming 
unavailable 
• Validate multi-AZ redundancy 
• Deploy to multiple AZs by default 
• Run regularly (but not continually!)
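Before taking a zone away for real, the multi-AZ assumption can be checked on paper. The sketch below is a hypothetical helper, not part of any Netflix tool: given the zone of each running instance, it reports whether enough instances would remain if the most heavily used zone disappeared.

```python
from collections import Counter


def survives_zone_loss(instance_zones, required_capacity):
    """Check whether enough instances remain after losing any one zone.

    instance_zones: list with the zone name of each running instance.
    required_capacity: minimum number of instances needed to serve traffic.
    """
    per_zone = Counter(instance_zones)
    total = len(instance_zones)
    # Losing the most populated zone is the worst case.
    worst_loss = max(per_zone.values(), default=0)
    return total - worst_loss >= required_capacity
```

A cluster spread evenly over three zones passes this check with far less spare capacity than one concentrated in a single zone, which is one reason to deploy to multiple AZs by default.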
Chaos Kong 
• “One louder” than Chaos Gorilla 
• Simulate an entire region outage 
• Used to validate our “active-active” region 
strategy 
• Traffic has to be switched to the new region 
• Run once every few months
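The traffic switch can be pictured as recomputing routing weights without the failed region. This is a simplified sketch with an assumed data shape (a dict of region name to traffic share); how traffic is actually redirected is not described in the slides.

```python
def evacuate_region(weights, failed_region):
    """Return new traffic weights with failed_region's share spread over the rest.

    weights: dict mapping region name to its share of traffic (hypothetical).
    """
    survivors = {r: w for r, w in weights.items() if r != failed_region}
    total = sum(survivors.values())
    if total <= 0:
        raise ValueError("no healthy region left to take traffic")
    # Rescale so the surviving shares again add up to 1.
    return {r: w / total for r, w in survivors.items()}
```

The surviving regions must have enough headroom to absorb the extra share, which is what an exercise of this kind is meant to confirm.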
Latency Monkey
• Simulate degraded instances
• Ensure degradation doesn’t affect other 
services 
• Multiple scenarios: network, CPU, I/O, 
memory 
• Validate that your service can handle 
degradation 
• Find effects on other services, then validate 
that they can handle it too
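A small sketch of the network-latency scenario: wrap a dependency call so it responds slowly, then check that the caller gives up after its timeout and serves a fallback. The helper names are hypothetical and the delay is injected in-process with `time.sleep`, which is a simplification of degrading a real instance.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def with_latency(func, delay_seconds):
    """Wrap func so every call is delayed, simulating a degraded dependency."""
    def slow(*args, **kwargs):
        time.sleep(delay_seconds)
        return func(*args, **kwargs)
    return slow


def call_with_timeout(func, timeout_seconds, fallback):
    """Call func in a worker thread; return fallback if it exceeds the timeout."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(func).result(timeout=timeout_seconds)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block the caller on the slow call that is still running.
        pool.shutdown(wait=False)
```

If the caller had no timeout, the injected delay would propagate upstream; that is the kind of effect on other services the last bullet asks you to look for.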
Conformity Monkey
• Apply a set of conformity rules to all
instances
• Notify owners with a list of instances and 
problems 
• Example rules 
• Standard security groups not applied 
• Instance age is too old 
• No health check URL
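The three example rules can be written as simple checks over instance records. This is a sketch under stated assumptions: instances are plain dicts, and the required group name and the 180-day age limit are invented for the example rather than taken from the slides.

```python
from datetime import datetime, timedelta

REQUIRED_GROUPS = {"base-security"}  # hypothetical standard group name
MAX_AGE = timedelta(days=180)        # hypothetical age limit


def check_instance(instance, now):
    """Return the list of conformity problems for one instance (a plain dict)."""
    problems = []
    if not REQUIRED_GROUPS <= set(instance.get("security_groups", [])):
        problems.append("standard security groups not applied")
    if now - instance["launched"] > MAX_AGE:
        problems.append("instance age is too old")
    if not instance.get("health_check_url"):
        problems.append("no health check URL")
    return problems


def report_by_owner(instances, now):
    """Group non-conforming instances by owner, ready for notification."""
    report = {}
    for inst in instances:
        problems = check_instance(inst, now)
        if problems:
            report.setdefault(inst["owner"], []).append((inst["id"], problems))
    return report
```

Each owner then receives one list of their instances and the rules each one breaks, matching the notification step above.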
Some More Monkeys 
• Janitor Monkey - clean up unused 
resources 
• Security Monkey - analyze, audit and 
notify on AWS security profile changes 
• Howler Monkey - warn if we’re reaching 
resource limits
Try it out!
• Open sourced and available at 
https://github.com/Netflix/SimianArmy and 
https://github.com/Netflix/security_monkey 
• Chaos, Conformity, Janitor and Security 
available now; more to come 
• Ported to VMware
What’s Next?
• Finer grained control 
• New failure modes 
• Make chaos testing as well-understood as 
regular regression testing
A message from the owners 
“Use Chaos Monkey to induce various kinds of 
failures in a controlled environment.” 
AWS blog post following the mass instance 
reboot in Sep 2014: 
http://aws.amazon.com/blogs/aws/ec2-maintenance-update-2/
Thank You!
Email: gbowles@{gmail,netflix}.com 
Twitter: @garethbowles 
LinkedIn: www.linkedin.com/in/garethbowles
We’re hiring test engineers and chaos engineers! http://jobs.netflix.com

 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfJamie (Taka) Wang
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
Valere | Digital Solutions & AI Transformation Portfolio | 2024
Valere | Digital Solutions & AI Transformation Portfolio | 2024Valere | Digital Solutions & AI Transformation Portfolio | 2024
Valere | Digital Solutions & AI Transformation Portfolio | 2024Alexander Turgeon
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
99.99% of Your Traces Are (Probably) Trash (SRECon NA 2024).pdf
99.99% of Your Traces  Are (Probably) Trash (SRECon NA 2024).pdf99.99% of Your Traces  Are (Probably) Trash (SRECon NA 2024).pdf
99.99% of Your Traces Are (Probably) Trash (SRECon NA 2024).pdfPaige Cruz
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
IEEE Computer Society’s Strategic Activities and Products including SWEBOK Guide
IEEE Computer Society’s Strategic Activities and Products including SWEBOK GuideIEEE Computer Society’s Strategic Activities and Products including SWEBOK Guide
IEEE Computer Society’s Strategic Activities and Products including SWEBOK GuideHironori Washizaki
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 

Dernier (20)

UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
100+ ChatGPT Prompts for SEO Optimization
100+ ChatGPT Prompts for SEO Optimization100+ ChatGPT Prompts for SEO Optimization
100+ ChatGPT Prompts for SEO Optimization
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
Governance in SharePoint Premium:What's in the box?
Governance in SharePoint Premium:What's in the box?Governance in SharePoint Premium:What's in the box?
Governance in SharePoint Premium:What's in the box?
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
Valere | Digital Solutions & AI Transformation Portfolio | 2024
Valere | Digital Solutions & AI Transformation Portfolio | 2024Valere | Digital Solutions & AI Transformation Portfolio | 2024
Valere | Digital Solutions & AI Transformation Portfolio | 2024
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
99.99% of Your Traces Are (Probably) Trash (SRECon NA 2024).pdf
99.99% of Your Traces  Are (Probably) Trash (SRECon NA 2024).pdf99.99% of Your Traces  Are (Probably) Trash (SRECon NA 2024).pdf
99.99% of Your Traces Are (Probably) Trash (SRECon NA 2024).pdf
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
IEEE Computer Society’s Strategic Activities and Products including SWEBOK Guide
IEEE Computer Society’s Strategic Activities and Products including SWEBOK GuideIEEE Computer Society’s Strategic Activities and Products including SWEBOK Guide
IEEE Computer Society’s Strategic Activities and Products including SWEBOK Guide
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 

Release the Monkeys ! Testing in the Wild at Netflix

  • 1. @garethbowles Release the Monkeys ! Testing in the Wild @Netflix
  • 2. Building distributed systems is hard. Testing them exhaustively is even harder.
  • 5. Netflix is the world’s leading Internet television network with more than 50 million members in 50 countries enjoying more than one billion hours of TV shows and movies per month. We account for up to 34% of downstream US internet traffic. Source: http://ir.netflix.com
  • 6. Diagram: the Netflix API and its dependencies (Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine) • 2B requests per day into the Netflix API • 12B outbound requests per day to API dependencies
  • 7. A Complex Distributed System
  • 8. Our Deployment Platform
  • 9. What AWS Provides • Machine Images (AMI) • Instances (EC2) • Elastic Load Balancers • Security groups / Autoscaling groups • Availability zones and regions
  • 10. Our system has become so complex that… …No (single) body knows how everything works.
  • 11. How AWS Can Go Wrong - 1 • Service goes down in one or more availability zones • 6/29/12 - storm related power outage caused loss of EC2 and RDS instances in Eastern US • https://gigaom.com/2012/06/29/some-of-amazon-web-services-are-down-again/
  • 12. How AWS Can Go Wrong - 2 • Loss of service in an entire region • 12/24/12 - operator error caused loss of multiple ELBs in Eastern US • http://techblog.netflix.com/2012/12/a-closer-look-at-christmas-eve-outage.html
  • 13. How AWS Can Go Wrong - 3 • Large number of instances get rebooted • 9/25/14 to 9/30/14 - rolling reboot of 1000s of instances to patch a security bug • http://techblog.netflix.com/2014/10/a-state-of-xen-chaos-monkey-cassandra.html
  • 14. Our Goal is Availability • Members can stream Netflix whenever they want • New users can explore and sign up • New members can activate their service and add devices
  • 16. Freedom and Responsibility • Developers deploy when they want • They also manage their own capacity and autoscaling • And are on-call to fix anything that breaks at 3am!
  • 17. How the heck do you test this stuff ?
  • 18. Failure is All Around Us • Disks fail • Power goes out - and your backup generator fails • Software bugs are introduced • People make mistakes
  • 19. Design to Avoid Failure • Exception handling • Redundancy • Fallback or degraded experience (circuit breakers) • But is it enough ?
  • 20. It’s Not Enough • How do we know we’ve succeeded ? • Does the system work as designed ? • Is it as resilient as we believe ? • How do we avoid drifting into failure ?
  • 21. More Testing ! • Unit • Integration • Stress • Exhaustive testing to simulate all failure modes
  • 22. Exhaustive Testing ~ Impossible • Massive, rapidly changing data sets • Internet scale traffic • Complex interaction and information flow • Independently-controlled services • All while innovating and building features
  • 23. Another Way • Cause failure deliberately to validate resiliency • Test design assumptions by stressing them • Don’t wait for random failure. Remove its uncertainty by forcing it regularly
  • 26. Chaos Monkey • The original Monkey (2009) • Randomly terminates instances in a cluster • Simulates failures inherent to running in the cloud • During business hours • Default for production services
  • 27. What did we do once we were able to handle Chaos Monkey ? Bring in bigger Monkeys !
  • 29. Chaos Gorilla • Simulate an Availability Zone becoming unavailable • Validate multi-AZ redundancy • Deploy to multiple AZs by default • Run regularly (but not continually !)
  • 31. Chaos Kong • “One louder” than Chaos Gorilla • Simulate an entire region outage • Used to validate our “active-active” region strategy • Traffic has to be switched to the new region • Run once every few months
  • 33. Latency Monkey • Simulate degraded instances • Ensure degradation doesn’t affect other services • Multiple scenarios: network, CPU, I/O, memory • Validate that your service can handle degradation • Find effects on other services, then validate that they can handle it too
  • 35. Conformity Monkey • Apply a set of conformity rules to all instances • Notify owners with a list of instances and problems • Example rules: standard security groups not applied; instance age is too old; no health check URL
  • 36. Some More Monkeys • Janitor Monkey - clean up unused resources • Security Monkey - analyze, audit and notify on AWS security profile changes • Howler Monkey - warn if we’re reaching resource limits
  • 37. Try it out ! • Open sourced and available at https://github.com/Netflix/SimianArmy and https://github.com/Netflix/security_monkey • Chaos, Conformity, Janitor and Security available now; more to come • Ported to VMware
  • 38. What’s Next ? • Finer grained control • New failure modes • Make chaos testing as well-understood as regular regression testing
  • 39. A message from the owners “Use Chaos Monkey to induce various kinds of failures in a controlled environment.” AWS blog post following the mass instance reboot in Sep 2014: http://aws.amazon.com/blogs/aws/ec2-maintenance-update-2/
  • 40. Thank You ! Email: gbowles@{gmail,netflix}.com Twitter: @garethbowles LinkedIn: www.linkedin.com/in/garethbowles We’re hiring - test engineers, and chaos engineers ! http://jobs.netflix.com
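
As a rough illustration of the Chaos Monkey behaviour outlined on slide 26 and in its speaker notes (pick a random instance per cluster, with a defined probability, only during business hours, and enabled unless a service opts out), here is a minimal Python sketch. Every name in it (`pick_victims`, `BUSINESS_HOURS`, the 9-to-15 window, the cluster dictionary) is an assumption made for illustration. It is not the SimianArmy code or API, and it only selects instances; it does not terminate anything.

```python
import random
from datetime import datetime

# Hypothetical window: the talk only says "business hours", not which hours.
BUSINESS_HOURS = range(9, 15)


def pick_victims(clusters, opted_out, probability, now, rng):
    """Return {cluster_name: instance_id} for instances selected this run.

    clusters:    dict mapping cluster name -> list of instance ids
    opted_out:   set of cluster names whose owners switched chaos off
    probability: chance (0..1) that a cluster loses one instance this run
    now:         datetime of the run, used for the business-hours check
    rng:         random.Random instance, injected so runs are repeatable
    """
    # Only act on weekdays during business hours, so engineers are on hand.
    if now.weekday() >= 5 or now.hour not in BUSINESS_HOURS:
        return {}
    victims = {}
    for name, instances in sorted(clusters.items()):
        # Chaos is the default; a cluster is skipped only if it opted out.
        if name in opted_out or not instances:
            continue
        if rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims
```

Called with `probability=1.0` on a weekday morning, `pick_victims` returns one instance for every cluster that has not opted out; outside the assumed window it returns an empty dict. Passing in the clock value and the random generator keeps the selection logic easy to test without touching real infrastructure.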

Editor's notes

  1. Notes
  2. Hi, everyone. Thanks for coming today. I’m going to talk about some testing challenges we face at Netflix. In particular, we have such a large and complex distributed system that testing it exhaustively in an isolated environment is next to impossible. To meet that challenge we came up with the Simian Army, a set of tools that actually run in production, and test our ability to survive various types of failure. I’ll spend a bit of time talking about Netflix and our streaming service to set up the problem space, then go over some of the things we need to test for and how more traditional test practices can fall short. Finally I’ll go into some detail on the Monkeys themselves. Feel free to ask questions during the talk.
  3. A little bit about me. I’ve been with Netflix for 4 years and I’m part of our Engineering Tools team. We’re responsible for developer productivity, with the goal that any engineer can build and deploy their code with the minimum possible effort. Before Netflix I spent a long time in test engineering and technical operations, so once I got to Netflix I was fascinated to see how such a complex system gets tested. Let’s take a look at how it all works.
  4. Who here ISN’T familiar with Netflix ? Any customers ? Thanks very much ! Netflix is first & foremost an entertainment company, but you can also look at us as an engineering company that creates all the technology to serve up that entertainment, and also collects a ton of data on who watches what, when they watch it, and how much they watch. We continuously analyze all that data to improve our customers’ entertainment experience by making it easy for them to find things they want to watch, and making sure they have a top quality viewing experience when they get comfy on the sofa (or the bus, or in the park, or wherever they can get connected). So there’s a lot of engineering that goes on behind the scenes to make all that possible.
  5. Some data that might be new to some of you guys. Our membership is growing fast and we’re now in more than 50 countries - all countries in North and South America, plus most of Europe. The amount of content our viewers are watching is growing, too. Fun fact - a billion hours of video is about 114,000 years. 114,000 years ago, Homo sapiens had just started to migrate out of Africa and you could find hippos in the River Thames in England.
  6. This is an overview of our current architecture. 2 billion requests a day flow in from all kinds of connected devices - game consoles, PCs and Macs, phones, tablets, TVs, DVD players, and more. Those generate 12 billion outbound requests a day to individual services. The diagram shows some of the main ones: the personalization engine that recommends what to watch based on your viewing history and ratings, movie metadata to give you information about what you’re watching, and one of the most important - the A/B test engine that lets us serve up a different customer experience to a set of users and measure its effect on how much they watch, what they watch and when they watch it.
  7. Pause for audience reaction here. Anyone care to suggest what this diagram shows ? We call it the “Shock and Awe” diagram at Netflix ! It’s generated by one of our monitoring systems and shows the interconnections between all of our services and data. Don’t look too hard, you won’t make out much detail - it’s only meant to illustrate the complexity of the system.
  8. We run our production systems on Amazon Web Services, which many of you are probably familiar with. We’re one of AWS’ biggest customers, apart from Amazon themselves.
  9. Using Amazon Web Services lets us stop worrying about procurement of hardware - servers, network switches, storage, firewalls, load balancers ... AWS allows us to scale up and down without worrying about exceeding or underusing data center capacity. And since every AWS service has an API, we can automate our deployments and throw away all those runbooks. Each AWS service is available in multiple regions (geographic areas) and multiple availability zones (data centers) within each region.
  10. So now that I’ve sort of described how everything works, here’s a big problem - in actual fact, there’s nobody who knows how it all works in depth. Although we have world experts in areas such as personalization, video encoding and machine learning, there’s just too much going on and it’s changing too fast for any individual to keep up.
  11. AWS is increasingly reliable, but it has had some fairly spectacular outages as well as many smaller ones. When you run on a cloud platform that’s not under your control, you have to be able to cope with these outages. On June 29th 2012, a storm caused a widespread power outage in Northern Virginia that took out many instances and database servers. Netflix streaming was affected for a while.
  12. An even bigger outage happened on Christmas Eve 2012, when an Amazon engineer made a mistake that took out many Elastic Load Balancers in the US East region, which at that time was Netflix’ primary region for serving traffic. ELBs aren’t replicated between availability zones, they only apply to a given region. Many Netflix customers were affected; luckily for us, Christmas Eve is a much less busy day than Christmas Day when everyone gets given Netflix subscriptions, and the problem was fixed by the 25th.
  13. In late September this year, AWS restarted a large number of instances, in multiple regions, to patch a security bug in the virtualization software that the instances run on. This time we were hardly affected at all and there was negligible impact on customers - again, more in our tech blog post.
  14. Given those kinds of problems, we need to work pretty hard to work around them. Netflix wants our 50 million plus members to be able to play movies and TV shows whenever and wherever they want. We also want to make it as simple and fast as possible for people to sign up and start using their new subscriptions.
  15. One little digression that’s an important part of how we meet these challenges - we couldn’t do it without our company culture, which we’re quite proud of - proud enough to publish the 126-slide deck I’ve linked here. All those slides can be boiled down to one key takeaway, which we call Freedom and Responsibility.
  16. Here’s how the “freedom & responsibility” principle applies to our technical development and deployment. Freedom - engineers deploy when and how often they need to, and control their own production capacity and scaling. Responsibility - every engineer in each service team is in the PagerDuty rotation in case things go wrong.
  17. So how can these teams be confident that their new versions will still work with all their dependencies, under all kinds of failure conditions ? It’s a very tough problem.
  18. Failure is unavoidable. We already saw some ways that our AWS platform can fail. Add to that bugs in our own software, and the inevitable human errors that you get when there are actual people involved in the development and deployment pipeline.
  19. We can do a lot to make our code handle failure gracefully. We can catch errors as exceptions and make sure they are handled in a way that doesn’t crash the code. We can run multiple instances of our services to avoid single points of failure. And we can use technologies such as the circuit breaker pattern to have services provide a degraded experience if one of their dependent services goes offline. For example, if our recommendation service is unavailable, we don’t show you a blank list of recommendations on your Netflix page - we fall back to a list of the most-watched content.
  20. But that only gets us so far. We want to make sure all our features work properly without waiting for customers to tell us they don’t. We want to know that we are as resilient as we think we are, without waiting for an outage to happen. Given the scale we run at, it’s effectively impossible to create a realistic test system for running load and reliability tests. We also want to be sure that the configuration of our services doesn’t diverge as we redeploy them over time - this can lead to errors that are very hard to debug given the thousands of instances that we run in production.
  21. So, most of you guys are testers and probably have this reaction - let’s do more testing ! But can we effectively simulate such a large-scale distributed system - and what’s more, can we predict every possible failure mode and encode it into our tests ?
  22. Today’s large internet systems have become too big and complex to just rely on traditional testing - but don’t get me wrong, all those types of testing I just mentioned have a very important place at Netflix, and we have some of the best test engineers in the business working on them. But here are some of the things they struggle with. It’s very hard to find realistic test data. It would be hugely expensive for us to create a similarly-sized copy of our production system for testing - not quite “copying the internet”, but getting there. Because teams deploy their own changes on different schedules, it’s difficult to keep up with changes and code them into integration tests.
  23. So we came up with the idea of deliberately triggering failures in production, to augment our more traditional testing. By causing our own failures on a known schedule, we can be prepared to deal with their effects and test our assumptions in a predictable way, rather than having a fire drill when a “real” outage happens.
  24. So with all that context done with, let’s take a look at our lovable monkeys.
  25. Chaos Monkey was the one who started it all.
  26. Chaos Monkey has been around in some form for about 5 years. It’s a service that looks for groups of instances (known as clusters) of each of our services and picks a random instance to terminate, on a defined schedule and with a defined probability of termination. This simulates a fairly frequent thing in AWS (although not nearly as frequent as it used to be) where instances are terminated unexpectedly, usually due to a failure in the underlying hardware. We run Chaos Monkey during business hours so that engineers are on hand to diagnose and fix problems, rather than getting a 3am page. If you deploy a new Netflix service, Chaos Monkey will be enabled for it unless you explicitly turn it off. We’ve got to a point where Chaos Monkey instance terminations go virtually unnoticed.
  27. Gorillas are bigger than monkeys and can carry bigger weapons.
  28. Chaos Gorilla takes out an entire Availability Zone. AWS has multiple regions in different parts of the world, such as Eastern USA, Western USA, Asia Pacific and Western Europe. Each region has multiple Availability Zones. These are equivalent to physical data centers in different geographic locations - for example, the US East region is located in Virginia but has 3 separate Availability Zones. Running Chaos Gorilla ensures that our services are running correctly in multiple Availability Zones, and that we have sufficient capacity in each zone to handle our traffic load. Runs of Chaos Gorilla are announced ahead of time, and our Reliability Engineering team sets up an incident room where engineers from each service team can watch progress.
  29. As we already picked the gorilla, we had to resort to a fictional creature to cause even bigger chaos.
  30. Once we were happy that we could survive an Availability Zone outage, we wanted to go a step further and see if we could cope with an entire region being taken out. This hasn’t happened in reality yet, but there’s a small possibility that it could - for example, the us-west-1 region is in Northern California, so a really big earthquake could feasibly take out all of its Availability Zones. And the Elastic Load Balancer outage at the end of 2012 did have the effect of bringing down a key service in an entire region. To handle this, we had to rearchitect to an “active-active” setup where we have complete copies of our services and data running in two different regions. If an Availability Zone goes down we just have to make sure we have enough capacity in the surviving zones to handle all our traffic, but if we lose a region we also have to reroute all traffic to the backup region. Chaos Kong gets an outing every few months, again with an incident room where engineers can watch progress and react to any problems.
  31. So we can deal with instances disappearing individually, or in large numbers. But what if instances are still there, but running in a degraded state ? Because our architecture involves so many interdependent services, we have to be careful that problems with one service don’t cascade to other services.
  32. This is where Latency Monkey comes in. It can simulate multiple types of degradation: network connections maxing out, high CPU loads, high disk I/O and running out of memory. Degradation in a service oriented architecture is extremely hard to test exhaustively. With Latency Monkey we can introduce degradation in a controlled way (like the recommendations example I mentioned earlier), find any problems in dependent services and fix them, then verify those fixes. Some service teams even discovered dependencies they didn’t know they had by running Latency Monkey and finding unexpected degradation in the dependent services.
  33. One other key aspect of running so many services, all with dozens or hundreds of instances, is that it’s very important to keep all of the instances consistent. They should all have the same system configuration and the same version and configuration of the service, for example. This decreases the complexity / surface area of the testing.
  34. We use Conformity Monkey to automate these checks. It runs over all services at a fixed interval, and notifies service owners when any of the conditions are not met. An email is sent containing a list of rule violations, each with a list of non-conforming instances. Here are a few examples of the things we check: Instances should be in the correct security groups so that they are reachable by other services and our monitoring and deployment tools. Instances shouldn’t have been running for more than a given time, which depends on how often the service is deployed. Older instances could be running an out of date version of the service. Instances should all have a valid health check URL so that our monitoring tools can know whether they are running properly.
  35. I’d like to give an honourable mention to a few more Simian Army members. Janitor Monkey looks for unused resources and deletes them. Examples are machine images or launch configurations that are no longer in use by any running instances. Security Monkey is similar to Conformity Monkey, but specializes in tracking and evaluating security changes. Howler Monkey warns us if we’re running out of AWS resources - for example, if we have nearly maxed out on the number of instances of a given type that are available in a particular region. It also looks for SSL certificates that are about to expire. We don’t seem to have a Netflix logo for this one, so I had to get clearance from a real howler monkey.
  36. There’s a lot more we want to do with the Monkeys. We’d like to have a way to induce failures that are more chaotic than the individual instances that Chaos Monkey knocks out, but less impactful than having Chaos Gorilla take out an entire availability zone. We’re constantly on the lookout for new failure modes that we can trigger - hopefully before they happen in the wild. Eventually, we’d like to make chaos testing in large distributed systems as well understood and commonly practiced as regular regression testing.
  37. After the mass instance reboot I mentioned earlier, Amazon themselves recommended that AWS customers use Chaos Monkey to test their resilience. High praise indeed !
  38. You can contact me in one of these ways, or ask your question now. Thanks again.
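
The fallback behaviour described in note 19 (if the recommendation service is unavailable, serve the most-watched list rather than a blank page) can be sketched with a very small circuit breaker. This is a minimal illustration under stated assumptions, not Netflix's implementation: the class name `CircuitBreaker`, the thresholds, and the helper `recommendations_for` with its `fetch_personalized` and `most_watched` callables are all hypothetical.

```python
import time


class CircuitBreaker:
    """After too many consecutive failures, stop calling the dependency for a
    cool-down period and serve the fallback instead (hypothetical sketch)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # failures before the circuit opens
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable for testing
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()         # open: skip the failing dependency
            # Cool-down over: allow one trial call. The failure count is
            # kept, so a failed trial re-opens the circuit immediately.
            self.opened_at = None
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()             # degraded, but not blank
        self.failures = 0                 # success closes the circuit fully
        return result


def recommendations_for(user_id, breaker, fetch_personalized, most_watched):
    """Personalized titles when possible, most-watched titles otherwise."""
    return breaker.call(lambda: fetch_personalized(user_id), most_watched)
```

The point of the design is that a struggling dependency stops receiving calls while the circuit is open, which gives it room to recover and keeps the caller's latency predictable. This is the kind of degradation handling that Latency Monkey is meant to exercise.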