How Netflix tests in production to augment more traditional testing methods. This talk covers the Simian Army (Chaos Monkey & friends), code coverage in production, and canary testing.
Netflix is the world's leading Internet television network, with more than 57 million members in 50 countries enjoying more than one billion hours of TV shows and movies per month.
We account for up to 34% of downstream
US internet traffic. Source: http://ir.netflix.com
What AWS Provides
• Machine Images (AMI)
• Instances (EC2)
• Elastic Load Balancers
• Security groups / Autoscaling groups
• Availability zones and regions
How AWS Can Go Wrong - 1
• Service goes down in one or more
availability zones
• 6/29/12 - storm-related power outage caused loss of EC2 and RDS instances in Eastern US
• https://gigaom.com/2012/06/29/some-of-amazon-web-services-are-down-again/
How AWS Can Go Wrong - 2
• Loss of service in an entire region
• 12/24/12 - operator error caused loss of
multiple ELBs in Eastern US
• http://techblog.netflix.com/2012/12/a-closer-look-at-christmas-eve-outage.html
How AWS Can Go Wrong - 3
• Large number of instances get rebooted
• 9/25/14 to 9/30/14 - rolling reboot of
1000s of instances to patch a security
bug
• http://techblog.netflix.com/2014/10/a-state-of-xen-chaos-monkey-cassandra.html
Our Goal is Availability
• Members can stream Netflix whenever
they want
• New users can explore and sign up
• New members can activate their service
and add devices
Freedom and Responsibility
• Developers deploy when
they want
• They also manage their
own capacity and
autoscaling
• And are on-call to fix
anything that breaks at
3am!
Failure is All Around Us
• Disks fail
• Power goes out - and your backup
generator fails
• Software bugs are introduced
• People make mistakes
Design to Avoid Failure
• Exception handling
• Redundancy
• Fallback or degraded experience (circuit
breakers)
• But is it enough ?
It’s Not Enough
• How do we know we’ve succeeded ?
• Does the system work as designed ?
• Is it as resilient as we believe ?
• How do we avoid drifting into failure ?
Exhaustive Testing ~ Impossible
• Massive, rapidly changing data sets
• Internet scale traffic
• Complex interaction and information flow
• Independently-controlled services
• All while innovating and building features
Another Way
• Cause failure deliberately to validate
resiliency
• Test design assumptions by stressing
them
• Don’t wait for random failure. Remove
its uncertainty by forcing it regularly
Chaos Monkey
• The original Monkey (2009)
• Randomly terminates instances in a cluster
• Simulates failures inherent to running in the
cloud
• During business hours
• Default for production services
Chaos Gorilla
• Simulate an Availability Zone becoming
unavailable
• Validate multi-AZ redundancy
• Deploy to multiple AZs by default
• Run regularly (but not continually !)
Chaos Kong
• “One louder” than Chaos Gorilla
• Simulate an entire region outage
• Used to validate our “active-active” region
strategy
• Traffic has to be switched to the new region
• Run once every few months
Latency Monkey
• Simulate degraded instances
• Ensure degradation doesn’t affect other
services
• Multiple scenarios: network, CPU, I/O,
memory
• Validate that your service can handle
degradation
• Find effects on other services, then validate
that they can handle it too
Conformity Monkey
• Apply a set of conformity rules to all instances
• Notify owners with a list of instances and
problems
• Example rules
• Standard security groups not applied
• Instance is too old
• No health check URL
Failure Injection Testing (FIT)
• Latency Monkey adds delay / failure on server side of
requests
• Impacts all calling apps - whether they want to
participate or not
• FIT decorates requests with failure data
• Can limit failures to specific accounts or devices, then
dial up
• http://techblog.netflix.com/2014/10/fit-failure-injection-testing.html
Try it out !
• Open sourced and available at
https://github.com/Netflix/SimianArmy
and
https://github.com/Netflix/security_monkey
• Chaos, Conformity, Janitor and Security
available now; more to come
• VMware as well as AWS
What’s Next ?
• New failure modes
• Run monkeys more frequently and
aggressively
• Make chaos testing as well-understood
as regular regression testing
A message from the owners
“Use Chaos Monkey to induce various kinds of
failures in a controlled environment.”
AWS blog post following the mass instance
reboot in Sep 2014:
http://aws.amazon.com/blogs/aws/ec2-maintenance-update-2/
What We Get
• Real-time code usage patterns
• Focus testing by prioritizing frequently
executed paths with low test coverage
• Identify dead code that can be removed
How We Do It
• Use Cobertura as it counts how many
times each LOC is executed
• Easy to enable - Cobertura JARs
included in our base AMI, set a flag to
add them to Tomcat’s classpath
• Enable on a single instance
• Very low performance hit
Canaries
• Push changes to a small number of
instances
• Use Asgard for red / black push
• Monitor closely to detect regressions
• Automated canary analysis
• Automatically cancel deployment if
problems occur
Closing Thoughts
• Don’t be scared to test in production !
• You’ll get tons of data that you couldn’t get
from test …
• … and hopefully sleep better at night
Thanks, QA or the Highway !
Email: gbowles@{gmail,netflix}.com
Twitter: @garethbowles
LinkedIn: www.linkedin.com/in/garethbowles
Editor's notes
Hi, everyone. Thanks for coming today. This is my first keynote, so I hope I can do it justice !
Just to set some expectations - I heard that at Matt’s keynote last year, he was giving out $20 bills. I’m afraid I don’t have any cash, but I do have plenty of snazzy Simian Army stickers up here.
I’m going to talk about some big testing challenges that we face at Netflix. In particular, we have such a large and complex distributed system that testing it exhaustively in an isolated environment is next to impossible. To meet that challenge we came up with a few different approaches, and I’m going to talk about three of those today: the Simian Army, which is a set of tools that induces failures in production; code coverage analysis on production servers; and using canaries to test new versions in production.
I’ll spend a bit of time talking about Netflix and our streaming service to set up the problem space, then go over some of the things we need to test for and how more traditional test practices can fall short. Finally I’ll go into some detail on the tools themselves.
A little bit about me. I’ve been with Netflix for 4 1/2 years and I’m part of our Engineering Tools team. We’re responsible for developer productivity, with the goal that any engineer can build and deploy their code with the minimum possible effort.
Before Netflix I spent a long time in test engineering and technical operations, so once I got to Netflix I was fascinated to see how such a complex system gets tested.
Let’s take a look at how it all works.
Who here ISN’T familiar with Netflix ?
Any customers ? Thanks very much !
Netflix is first & foremost an entertainment company, but you can also look at us as an engineering company that creates all the technology to serve up that entertainment, and also collects a ton of data on who watches what, when they watch it, and how much they watch. We continuously analyze all that data to improve our customers’ entertainment experience by making it easy for them to find things they want to watch, and making sure they have a top quality viewing experience when they get comfy on the sofa (or the bus, or in the park, or wherever they can get connected).
So there’s a lot of engineering that goes on behind the scenes to make all that possible.
Some data that might be new to some of you guys. Our membership is growing fast; about two thirds of our members are in the US, but we’re now in more than 50 other countries - all countries in North and South America, plus most of Europe.
The amount of content our viewers are watching is growing, too.
And we’re doing our best to break the internet.
This is an overview of our current architecture.
2 billion requests flow in from all kinds of connected devices - game consoles, PCs and Macs, phones, tablets, TVs, DVD players, and more.
Those generate 12 billion outbound requests to individual services. The diagram shows some of the main ones: the personalization engine that recommends what to watch based on your viewing history and ratings, movie metadata to give you information about what you’re watching, and one of the most important - the A/B test engine that lets us serve up a different customer experience to a set of users and measure its effect on how much they watch, what they watch and when they watch it.
I like to pause for audience reaction here. Anyone care to suggest what this diagram shows ?
We call it the “Shock and Awe” diagram at Netflix ! It’s generated by one of our monitoring systems and shows the interconnections between all of our services and data. Don’t look too hard, you won’t make out much detail - it’s only meant to illustrate the complexity of the system.
We run our production systems on Amazon Web Services, which many of you are probably familiar with. We’re one of AWS’ biggest customers - apart from Amazon itself, who uses AWS to power their e-commerce sites as well as their streaming service that competes with Netflix.
Using Amazon Web Services lets us stop worrying about procurement of hardware - servers, network switches, storage, firewalls, load balancers ...
AWS allows us to scale up and down without worrying about exceeding or underusing data center capacity.
And since every AWS service has an API, we can automate our deployments and throw away all those runbooks.
Each AWS service is available in multiple regions (geographic areas) and multiple availability zones (data centers) within each region.
So now that I’ve described how everything works at a really high level, here’s a big problem - in actual fact, there’s nobody who knows how it all works in depth. Although we have world experts in areas such as personalization, video encoding and machine learning, there’s just too much going on and it’s changing too fast for any individual to keep up.
AWS is increasingly reliable, but it has had some fairly spectacular outages as well as many smaller ones. When you run on a cloud platform that’s not under your control, you have to be able to cope with these outages.
On June 29th 2012, a storm caused a widespread power outage in Northern Virginia that took out many instances and database servers. Netflix streaming was affected for a while.
An even bigger outage happened on Christmas Eve 2012, when an Amazon engineer made a mistake that took out many Elastic Load Balancers in the US East region, which at that time was Netflix’s primary region for serving traffic. ELBs aren’t replicated between availability zones; they only apply to a given region. Many Netflix customers were affected; luckily for us, Christmas Eve is a much less busy day than Christmas Day, when everyone gets new Netflix subscriptions as gifts, and the problem was fixed by the 25th.
In late September this year, AWS restarted a large number of instances, in multiple regions, to patch a security bug in the virtualization software that the instances run on. This time we were hardly affected at all and there was negligible impact on customers - again, more in our tech blog post.
Given those kinds of problems, we need to work pretty hard to deal with them. Netflix wants our 50 million plus members to be able to play movies and TV shows whenever and wherever they want. We also want to make it as simple and fast as possible for people to sign up and start using their new subscriptions.
One little digression that’s an important part of how we meet these challenges - we couldn’t do it without our company culture, which we’re quite proud of - proud enough to publish the 126-slide deck I’ve linked here. All those slides can be boiled down to one key takeaway, which we call Freedom and Responsibility.
Here’s how the “freedom & responsibility” principle applies to our technical development and deployment.
Freedom - engineers deploy when and how often they need to, and control their own production capacity and scaling.
Responsibility - every engineer in each service team is in the PagerDuty rotation in case things go wrong.
So how can these teams be confident that their new versions will still work with all their dependencies, under all kinds of failure conditions ?
It’s a very tough problem.
Failure is unavoidable. We already saw some ways that our AWS platform can fail. Add to that bugs in our own software, and the inevitable human errors that you get when there are actual people involved in the development and deployment pipeline.
We can do a lot to make our code handle failure gracefully.
We can catch errors as exceptions and make sure they are handled in a way that doesn’t crash the code.
We can run multiple instances of our services to avoid single points of failure.
And we can use technologies such as the circuit breaker pattern to have services provide a degraded experience if one of their dependent services goes offline. For example, if our recommendation service is unavailable, we don’t show you a blank list of recommendations on your Netflix page - we fall back to a list of the most-watched content.
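To make the fallback idea concrete, here is a minimal Java sketch, assuming a hypothetical RecommendationsWithFallback wrapper and hypothetical service/fallback suppliers; it is illustrative only and not Netflix's implementation, which uses a full circuit-breaker library (Netflix's open-source Hystrix) rather than a bare try/catch.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

// Illustrative sketch only, not Netflix code. If the recommendations
// dependency fails, serve a cached "most watched" list so the member
// still sees something useful instead of a blank row.
public class RecommendationsWithFallback {

    private final Function<String, List<String>> recommendationService; // remote call
    private final Supplier<List<String>> mostWatchedFallback;           // cached locally

    public RecommendationsWithFallback(Function<String, List<String>> recommendationService,
                                       Supplier<List<String>> mostWatchedFallback) {
        this.recommendationService = recommendationService;
        this.mostWatchedFallback = mostWatchedFallback;
    }

    public List<String> recommendationsFor(String memberId) {
        try {
            return recommendationService.apply(memberId);  // normal path
        } catch (RuntimeException dependencyFailure) {
            return mostWatchedFallback.get();              // degraded experience
        }
    }
}
```

A real circuit breaker also tracks the failure rate and stops calling the dependency for a while once it trips, instead of paying for a failed call on every request.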
But that only gets us so far.
We want to make sure all our features work properly without waiting for customers to tell us they don’t.
We want to know that we are as resilient as we think we are, without waiting for an outage to happen. Given the scale we run at, it’s effectively impossible to create a realistic test system for running load and reliability tests.
We also want to be sure that the configuration of our services doesn’t diverge as we redeploy them over time - this can lead to errors that are very hard to debug given the thousands of instances that we run in production.
So, most of you guys are testers and probably have this reaction - let’s do more testing !
But can we effectively simulate such a large-scale distributed system - and what’s more, can we predict every possible failure mode and encode it into our tests ?
Today’s large internet systems have become too big and complex to just rely on traditional testing - but don’t get me wrong, all those types of testing I just mentioned have a very important place at Netflix, and we have some of the best test engineers in the business working on them. But here are some of the things they struggle with.
It’s very hard to find realistic test data.
It would be hugely expensive for us to create a similarly-sized copy of our production system for testing - not quite “copying the internet”, but getting there.
Because teams deploy their own changes on different schedules, it’s difficult to keep up with changes and code them into integration tests.
So we came up with the idea of deliberately triggering failures in production, to augment our more traditional testing. By causing our own failures on a known schedule, we can be prepared to deal with their effects and test our assumptions in a predictable way, rather than having a fire drill when a “real” outage happens.
So with all that context done with, let’s take a look at the lovable monkeys who make up our Simian Army.
Chaos Monkey was the one who started it all.
Chaos Monkey has been around in some form for about 5 years.
It’s a service that looks for groups of instances (known as clusters) of each of our services and picks a random instance to terminate, on a defined schedule and with a defined probability of termination.
This simulates a fairly frequent thing in AWS (although not nearly as frequent as it used to be) where instances are terminated unexpectedly, usually due to a failure in the underlying hardware.
We run Chaos Monkey during business hours so that engineers are on hand to diagnose and fix problems, rather than getting a 3am page.
If you deploy a new Netflix service, Chaos Monkey will be enabled for it unless you explicitly turn it off.
We’ve got to a point where Chaos Monkey instance terminations go virtually unnoticed.
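As a rough illustration of that behaviour, here is a minimal Java sketch of the selection logic; the class name, the business-hours window and the probability handling are assumptions made for the example - the real thing lives at https://github.com/Netflix/SimianArmy.

```java
import java.time.LocalTime;
import java.util.List;
import java.util.Random;

// Minimal sketch of the behaviour described above: on each scheduled run,
// maybe pick one random instance from a cluster to terminate, and only
// during business hours. Illustrative only; not the real Chaos Monkey.
public class ChaosMonkeySketch {

    private final Random random = new Random();
    private final double terminationProbability; // chance that a run acts at all

    public ChaosMonkeySketch(double terminationProbability) {
        this.terminationProbability = terminationProbability;
    }

    /** Returns the instance id chosen for termination, or null to skip this run. */
    public String pickVictim(List<String> clusterInstanceIds) {
        LocalTime now = LocalTime.now();
        boolean businessHours = !now.isBefore(LocalTime.of(9, 0))
                && now.isBefore(LocalTime.of(17, 0));
        if (!businessHours
                || clusterInstanceIds.isEmpty()
                || random.nextDouble() > terminationProbability) {
            return null;
        }
        return clusterInstanceIds.get(random.nextInt(clusterInstanceIds.size()));
    }
}
```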
We didn’t want to stop once we were happy that we could deal with individual instances dying.
Gorillas are bigger than monkeys and can carry bigger weapons.
Chaos Gorilla takes out an entire Availability Zone.
AWS has multiple regions in different parts of the world, such as Eastern USA, Western USA, Asia Pacific and Western Europe.
Each region has multiple Availability Zones. These are equivalent to physical data centers in different geographic locations - for example, the US East region is located in Virginia but has 3 separate Availability Zones.
Running Chaos Gorilla ensures that our service is running correctly in multiple Availability Zones, and that we have sufficient capacity in each zone to handle our traffic load.
Runs of Chaos Gorilla are announced ahead of time, and our Reliability Engineering team sets up an incident room where engineers from each service team can watch progress.
So what next ? As we already picked the gorilla, we had to resort to a fictional creature to cause even bigger chaos.
Once we were happy that we could survive an Availability Zone outage, we wanted to go a step further and see if we could cope with an entire region being taken out.
This hasn’t happened in reality yet, but there’s a small possibility that it could - for example, the us-west-1 region is in Northern California, so a really big earthquake could feasibly take out all of its Availability Zones.
And the Elastic Load Balancer outage at the end of 2012 did have the effect of bringing down a key service in an entire region.
To handle this, we had to rearchitect to an “active-active” setup where we have complete copies of our services and data running in two different regions.
If an Availability Zone goes down we just have to make sure we have enough capacity in the surviving zones to handle all our traffic, but if we lose a region we also have to reroute all traffic to the backup region.
Chaos Kong gets an outing every few months, again with an incident room where engineers can watch progress and react to any problems.
So we can deal with instances disappearing individually, or in large numbers. But what if instances are still there, but running in a degraded state ? Because our architecture involves so many interdependent services, we have to be careful that problems with one service don’t cascade to other services.
This is where Latency Monkey comes in. It can simulate multiple types of degradation: network connections maxing out, high CPU loads, high disk I/O and running out of memory.
Degradation in a service oriented architecture is extremely hard to test exhaustively. With Latency Monkey we can introduce degradation in a controlled way (like the recommendations example I mentioned earlier), find any problems in dependent services and fix them, then verify those fixes.
Some service teams even discovered dependencies they didn’t know they had by running Latency Monkey and finding unexpected degradation in the dependent services.
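Here is a hedged Java sketch of what server-side degradation injection can look like, in the spirit of Latency Monkey; the wrapper class, the affected fraction and the fixed delay are assumptions for the example, and the real monkey also simulates CPU, disk I/O and memory pressure rather than just latency.

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Illustrative only: wrap a request handler and, for a fraction of calls,
// add an artificial delay so that callers' timeouts and fallbacks actually
// get exercised. Not the Latency Monkey implementation.
public class LatencyInjector<T> {

    private final Random random = new Random();
    private final double affectedFraction;   // e.g. 0.05 = degrade 5% of calls
    private final long addedLatencyMillis;   // simulated slowness

    public LatencyInjector(double affectedFraction, long addedLatencyMillis) {
        this.affectedFraction = affectedFraction;
        this.addedLatencyMillis = addedLatencyMillis;
    }

    public T call(Supplier<T> handler) {
        if (random.nextDouble() < affectedFraction) {
            try {
                TimeUnit.MILLISECONDS.sleep(addedLatencyMillis); // degraded path
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return handler.get();
    }
}
```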
One other key aspect of running so many services, all with dozens or hundreds of instances, is that it’s very important to keep all of the instances consistent. They should all have the same system configuration and the same version and configuration of the service, for example. This decreases the complexity and surface area of the testing.
We use Conformity Monkey to automate these checks.
It runs over all services at a fixed interval, and notifies service owners when any of the conditions are not met. An email is sent containing a list of rule violations, each with a list of non-conforming instances.
Here are a few examples of the things we check:
Instances should be in the correct security groups so that they are reachable by other services and our monitoring and deployment tools.
Instances shouldn’t have been running for more than a given time, which depends on how often the service is deployed. Older instances could be running an out of date version of the service.
Instances should all have a valid health check URL so that our monitoring tools can know whether they are running properly.
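To show roughly what one of these checks looks like, here is a small Java sketch of an instance-age rule; the Instance type, the rule shape and the limits are hypothetical names for illustration, not the Conformity Monkey API.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Optional;

// Illustrative conformity rule: flag instances that have been running
// longer than the allowed age, so owners know to redeploy them.
public class InstanceAgeRule {

    /** Hypothetical minimal view of an instance; not an AWS or Netflix type. */
    public record Instance(String id, Instant launchTime) {}

    private final Duration maxAge;

    public InstanceAgeRule(Duration maxAge) {
        this.maxAge = maxAge;
    }

    /** Returns a violation message if the instance is too old. */
    public Optional<String> check(Instance instance) {
        Duration age = Duration.between(instance.launchTime(), Instant.now());
        if (age.compareTo(maxAge) > 0) {
            return Optional.of(instance.id() + " has been running for " + age.toDays()
                    + " days (limit " + maxAge.toDays() + ")");
        }
        return Optional.empty();
    }

    /** Violations for a whole cluster, to go into the email sent to the owners. */
    public List<String> checkAll(List<Instance> instances) {
        return instances.stream()
                .map(this::check)
                .flatMap(Optional::stream)
                .toList();
    }
}
```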
We just came up with a system called FIT, for Failure Injection Testing. We need to come up with a monkey for this one !
Latency Monkey injects delays or failures on the server side and thus affects all calling services. If all those calling services don’t have proper fallbacks and timeouts implemented, they can stop working and impact customers - which we obviously want to avoid.
FIT allows failures to be simulated on the client side. We can add failure data to a specific set of API calls and propagate it through the system so that only the services we want to test are affected. We usually start with a specific test customer account or a particular client device, then dial up the failures to affect more and more production traffic if the initial results look good.
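Here is a very rough Java sketch of the idea of decorating requests with failure scope; the header name, fields and methods are invented for illustration and are not the real FIT protocol.

```java
import java.util.Map;
import java.util.Random;
import java.util.Set;

// Illustrative only: attach failure metadata to individual requests so that
// only chosen test accounts (or a small, dial-up-able fraction of traffic)
// see the injected failure, instead of every caller of the degraded service.
public class FailureInjectionContext {

    public static final String HEADER = "X-Failure-Injection"; // hypothetical header

    private final Random random = new Random();
    private final Set<String> targetAccounts;  // start with test accounts...
    private final double trafficFraction;      // ...then dial up gradually

    public FailureInjectionContext(Set<String> targetAccounts, double trafficFraction) {
        this.targetAccounts = targetAccounts;
        this.trafficFraction = trafficFraction;
    }

    /** Should this particular request carry injected-failure metadata? */
    public boolean shouldInject(String accountId) {
        return targetAccounts.contains(accountId) || random.nextDouble() < trafficFraction;
    }

    /** Decorate outbound headers so downstream services see the failure scope. */
    public void decorate(Map<String, String> headers, String serviceToFail) {
        headers.put(HEADER, "service=" + serviceToFail + ";mode=fail");
    }
}
```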
Check the Netflix Tech Blog for more details.
You can try the Monkeys out for yourself by going to the Netflix GitHub page. There’s an active community of users, and Netflix engineers regularly monitor the mailing list.
Chaos, Conformity, Janitor and Security Monkeys are currently available, with more to come.
For those of you not using AWS but with an in-house VMware setup - you can use the Monkeys too, thanks to some of our open source contributors.
There’s a lot more we want to do with the Monkeys.
We’d like to have a way to induce failures that are more chaotic than the individual instances that Chaos Monkey knocks out, but less impactful than having Chaos Gorilla take out an entire availability zone.
We’re constantly on the lookout for new failure modes that we can trigger - hopefully before they happen in the wild.
We’re working on an effort to increase the frequency and reach of monkey runs. This is in response to some interesting data - our uptime degraded when we ran the monkeys less frequently.
Eventually, we’d like to make chaos testing in large distributed systems as well understood and commonly practiced as regular regression testing.
After the mass instance reboot I mentioned earlier, Amazon themselves recommended that AWS customers use Chaos Monkey to test their resilience. High praise indeed !
Let’s move on to our second way of testing in production - code coverage analysis. Our view is that if you’re not doing it in prod, you’re missing out on a ton of useful data.
In contrast to just showing what code paths are covered by tests, we get data on the paths which are actually used in production, plus how often each path is run. We compare the production code coverage data with the results from our test environment.
This enables us to focus our testing, for example by identifying commonly used code paths with low test coverage. We can also find dead code that is never used in production, and remove that code and its tests to make maintenance easier.
All of our services run on the JVM, so we needed a Java code coverage tool. We picked Cobertura because it counts how many times each line of code is executed, in contrast to most other tools which just give you a binary result of whether or not the line was executed.
We put the Cobertura JAR files on our base machine image that all of the services build on, and set a flag at runtime to enable code coverage analysis.
We’ll usually run code coverage on a single instance in a service cluster, and leave it running for a day or so to make sure that all the code paths are hit. We’ve found the performance hit to be very low - typically less than 5% degradation in performance.
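As a sketch of how that data can be used, the following Java fragment ranks code locations by production hit count against test coverage; the LineStats shape is hypothetical - in practice the hit counts come out of Cobertura's per-line reports and the test coverage from the regular build.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Illustrative only: combine per-line production hit counts with test
// coverage to find hot-but-untested code, and dead code that never runs.
public class CoveragePrioritizer {

    /** Hypothetical record; real data would come from Cobertura reports. */
    public record LineStats(String location, long productionHits, boolean coveredByTests) {}

    /** Untested lines first, hottest first within that, so testers know where to focus. */
    public List<LineStats> testingPriorities(Map<String, LineStats> statsByLocation) {
        return statsByLocation.values().stream()
                .sorted(Comparator.comparing(LineStats::coveredByTests) // false (untested) first
                        .thenComparing(Comparator.comparingLong(LineStats::productionHits)
                                .reversed()))                           // hottest first
                .toList();
    }

    /** Candidates for removal: never executed in production. */
    public List<LineStats> deadCode(Map<String, LineStats> statsByLocation) {
        return statsByLocation.values().stream()
                .filter(s -> s.productionHits() == 0)
                .toList();
    }
}
```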
Our third way of testing in production is the use of canary deployments.
The term came from coal mining, when miners wanted to detect dangerous levels of coal gas in the mine shaft. They would take down a canary, which was very sensitive to the gas; if the canary keeled over or died, it was time to get out of there before the miners did the same.
Automated rollback happens near the beginning of the canary if there is a drastic regression - we’re still learning what counts as “bad enough”, and it varies by team.
If the regression is less severe, the team will make the decision to put the new service in prod, or cancel the push. Each team can have a different level of tolerance for regressions. Teams that deploy very frequently will tend to have a lower tolerance than teams who deploy less often and maybe don’t have their automated analysis fully developed.
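Here is a simplified Java sketch of the kind of decision logic just described; the metrics, thresholds and decision names are invented for illustration and are not Netflix's actual canary analysis.

```java
// Illustrative only: compare a canary's metrics against the baseline
// instances and decide whether to keep going, cancel the push, or roll
// back immediately on a drastic regression.
public class CanaryAnalysisSketch {

    public record Metrics(double errorRate, double p99LatencyMillis) {}

    public enum Decision { CONTINUE, CANCEL_PUSH, ROLL_BACK_NOW }

    private final double drasticFactor;    // e.g. 5.0 = immediate rollback
    private final double tolerableFactor;  // per-team tolerance, e.g. 1.2

    public CanaryAnalysisSketch(double drasticFactor, double tolerableFactor) {
        this.drasticFactor = drasticFactor;
        this.tolerableFactor = tolerableFactor;
    }

    public Decision evaluate(Metrics baseline, Metrics canary) {
        double errorRatio = ratio(canary.errorRate(), baseline.errorRate());
        double latencyRatio = ratio(canary.p99LatencyMillis(), baseline.p99LatencyMillis());

        if (errorRatio > drasticFactor) {
            return Decision.ROLL_BACK_NOW;   // drastic regression, cancel automatically
        }
        if (errorRatio > tolerableFactor || latencyRatio > tolerableFactor) {
            return Decision.CANCEL_PUSH;     // the owning team reviews and decides
        }
        return Decision.CONTINUE;            // keep rolling the new version forward
    }

    private static double ratio(double canary, double baseline) {
        if (baseline == 0) {
            return canary == 0 ? 1.0 : Double.POSITIVE_INFINITY;
        }
        return canary / baseline;
    }
}
```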
And that’s the end of my talk; before we go to some questions, I’d like to give a big thanks to the organizers for putting on such a great conference, and to all of you for coming. I’ve been really impressed by what a great testing community you have here.