Netflix has built a highly available architecture from microservices running across AWS availability zones. They deliberately induce failures in production with Simian Army tools such as Chaos Monkey and Latency Monkey to test resiliency; doing so validated that their designs worked as intended and surfaced issues before customers hit them. Netflix has since open sourced many of its cloud tools and libraries as NetflixOSS, including projects like Hystrix and Eureka.
2. @atseitlin
About Netflix
Netflix is the world’s leading Internet television network, with more than 38 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series.[1]
[1] http://ir.netflix.com/
4. @atseitlin
How Netflix Streaming Works
[Architecture diagram: a customer device (PC, PS3, TV…) Browses via the Web Site or Discovery API (backed by User Data and Personalization), Plays via the Streaming API (backed by DRM, QoS Logging, OpenConnect CDN Management and Steering, and Content Encoding), and Watches streams from CDN boxes. Browse and Play are served from AWS Cloud Services; Watch is served from CDN edge locations to consumer electronics.]
6. @atseitlin
Web Server Dependencies Flow
(Home page business transaction as seen by AppDynamics)
[Dependency diagram, marked “Start Here” at the web server: calls fan out to memcached, Cassandra, other web services, an S3 bucket, and the personalization movie group chooser. Each icon represents three to a few hundred instances spread across three AWS zones.]
8. @atseitlin
Three Balanced Availability Zones
Test with Chaos Gorilla
[Diagram: load balancers feed three balanced availability zones (A, B, and C), each holding Cassandra and EVCache replicas.]
9. @atseitlin
Triple Replicated Persistence
Cassandra maintenance affects individual replicas
[Diagram: the same three zones (A, B, and C) behind load balancers, each zone holding Cassandra and EVCache replicas.]
10. @atseitlin
Isolated Regions
Will someday test with Chaos Kong
[Diagram: two isolated regions. US-East load balancers front Cassandra replicas in zones A, B, and C; EU-West load balancers front their own Cassandra replicas in zones A, B, and C.]
11. @atseitlin
Failure Modes and Effects

Failure Mode         Probability   Current Mitigation Plan
Application Failure  High          Automatic degraded response
AWS Region Failure   Low           Wait for region to recover
AWS Zone Failure     Medium        Continue to run on 2 out of 3 zones
Datacenter Failure   Medium        Migrate more functions to cloud
Data store failure   Low           Restore from S3 backups
S3 failure           Low           Restore from remote archive

Until we got really good at mitigating high- and medium-probability failures, the ROI for mitigating regional failures didn’t make sense. Getting there…
13. @atseitlin
Run What You Wrote
• Make developers responsible for failures
– Then they learn and write code that doesn’t fail
• Use Incident Reviews to find gaps to fix
– Make sure it’s not about finding “who to blame”
• Keep timeouts short, fail fast
– Don’t let cascading timeouts stack up (a sketch follows below)
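A minimal sketch of the fail-fast idea in Java (the endpoint and timeout values are hypothetical, not Netflix’s actual settings): give every remote call an explicit, short timeout so a slow dependency becomes a quick, handleable error instead of a stacked-up wait.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class FailFastClient {
    // Hypothetical dependency endpoint, for illustration only
    private static final String ENDPOINT = "http://some-dependency/api/v1/data";

    public static String fetch() throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(ENDPOINT).openConnection();
        conn.setConnectTimeout(250); // fail fast if the dependency is unreachable
        conn.setReadTimeout(500);    // never wait long on a slow dependency
        // A SocketTimeoutException propagates quickly to the caller,
        // which can degrade gracefully instead of stacking up waits.
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes());
        }
    }
}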
14. @atseitlin
Rapid Detection
• If your pilot had no instrument panel, would you ever board the plane?
– Never run your service blind
• Monitor services, not instances
– Make instance failure a non-event
• Don’t pay people to watch screens
– Instead pay them to build alerting (a sketch of the service-level idea follows)
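One way to read “monitor services, not instances” in code — a plain-Java sketch with hypothetical names and threshold (Netflix’s own metrics ran through tools like Servo): alert on the aggregate error rate of the whole service, so one dead instance among hundreds stays a non-event.

import java.util.Map;

public class ServiceLevelAlert {
    // errorsByInstance / requestsByInstance: hypothetical per-instance
    // counters gathered by whatever poller collects your metrics.
    public static boolean shouldAlert(Map<String, Long> errorsByInstance,
                                      Map<String, Long> requestsByInstance) {
        long errors = errorsByInstance.values().stream().mapToLong(Long::longValue).sum();
        long requests = requestsByInstance.values().stream().mapToLong(Long::longValue).sum();
        if (requests == 0) return true; // the whole service went silent: page someone
        // Alert on the service-level rate, never on a single instance.
        return (double) errors / requests > 0.05; // hypothetical 5% threshold
    }
}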
16. @atseitlin
Edda Query Examples
Find any instances that have ever had a specific public IP address
$ curl "http://edda/api/v2/view/instances;publicIpAddress=1.2.3.4;_since=0"
["i-0123456789","i-012345678a","i-012345678b”]
Show the most recent change to a security group
$ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2"
--- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810
+++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504
@@ -1,33 +1,33 @@
{
…
"ipRanges" : [
"10.10.1.1/32",
"10.10.1.2/32",
+ "10.10.1.3/32",
- "10.10.1.4/32"
…
}
17. @atseitlin
Rapid Rollback
• Use a new Autoscale Group to push code
• Leave existing ASG in place, switch traffic
• If OK, auto-delete old ASG a few hours later
• If “whoops”, switch traffic back in seconds (a sketch of the idea follows)
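Netflix automated this red/black push pattern in Asgard; what follows is only a rough sketch of its shape using the AWS SDK for Java, with hypothetical group, launch-config, and ELB names (real traffic switching was more controlled than simply sharing an ELB):

import com.amazonaws.services.autoscaling.AmazonAutoScaling;
import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder;
import com.amazonaws.services.autoscaling.model.CreateAutoScalingGroupRequest;
import com.amazonaws.services.autoscaling.model.DeleteAutoScalingGroupRequest;

public class RedBlackPush {
    public static void main(String[] args) {
        AmazonAutoScaling autoscaling = AmazonAutoScalingClientBuilder.defaultClient();

        // Push new code as a fresh ASG behind the same ELB; the old
        // group ("api-v041") keeps running untouched as the fallback.
        autoscaling.createAutoScalingGroup(new CreateAutoScalingGroupRequest()
                .withAutoScalingGroupName("api-v042")
                .withLaunchConfigurationName("api-v042-lc")
                .withMinSize(10).withMaxSize(10)
                .withAvailabilityZones("us-east-1a", "us-east-1b", "us-east-1c")
                .withLoadBalancerNames("api-elb"));

        // If the push looks healthy for a few hours, delete the old group.
        // On a "whoops", delete the new group instead and traffic falls
        // back to the old instances in seconds.
        autoscaling.deleteAutoScalingGroup(new DeleteAutoScalingGroupRequest()
                .withAutoScalingGroupName("api-v041")
                .withForceDelete(true));
    }
}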
21. @atseitlin
Our goal is availability
• Members can stream Netflix whenever they want
• New users can explore and sign up for the service
• New members can activate their service and add new devices
22. @atseitlin
Failure is all around us
• Disks fail
• Power goes out. And your generator fails.
• Software bugs introduced
• People make mistakes
Failure is unavoidable
23. @atseitlin
We design around failure
• Exception handling
• Clusters
• Redundancy
• Fault tolerance
• Fall-back or degraded experience (Hystrix; a sketch follows this list)
• All to insulate our users from failure
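A minimal sketch of the Hystrix fall-back pattern named above (the command, service call, and fallback value are hypothetical):

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Wraps a call to a hypothetical personalization service; when the
// call fails or times out, Hystrix serves the degraded response.
public class PersonalizedRowCommand extends HystrixCommand<String> {
    private final String memberId;

    public PersonalizedRowCommand(String memberId) {
        super(HystrixCommandGroupKey.Factory.asKey("Personalization"));
        this.memberId = memberId;
    }

    @Override
    protected String run() {
        // Stand-in for the real remote call, shown here failing.
        throw new RuntimeException("personalization timed out for " + memberId);
    }

    @Override
    protected String getFallback() {
        // Degraded experience: a popular, unpersonalized row.
        return "top-10-overall";
    }
}

Here new PersonalizedRowCommand("m123").execute() returns "top-10-overall" instead of surfacing the failure to the user.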
Is that enough?
24. @atseitlin
It’s not enough
• How do we know if we’ve succeeded?
• Does the system work as designed?
• Is it as resilient as we believe?
• How do we prevent drifting into failure?
The typical answer is…
25. @atseitlin
More testing!
• Unit testing
• Integration testing
• Stress testing
• Exhaustive test suites to simulate and test all failure modes
Can we effectively simulate a large-scale distributed system?
26. @atseitlin
Building distributed systems is hard
Testing them exhaustively is even harder
• Massive data sets and changing shape
• Internet-scale traffic
• Complex interaction and information flow
• Asynchronous nature
• 3rd party services
• All while innovating and building features
Prohibitively expensive, if not impossible, for most large-scale systems
28. @atseitlin
There is another way
• Cause failure to validate resiliency
• Test design assumptions by stressing them
• Don’t wait for random failure. Remove its uncertainty by forcing it periodically (a minimal sketch follows)
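Chaos Monkey is the canonical example, and it is open sourced (see the project list below); stripped to its essence, the idea fits in a few lines of the AWS SDK for Java — the group name here is hypothetical:

import java.util.List;
import java.util.Random;
import com.amazonaws.services.autoscaling.AmazonAutoScaling;
import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder;
import com.amazonaws.services.autoscaling.model.DescribeAutoScalingGroupsRequest;
import com.amazonaws.services.autoscaling.model.Instance;
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.TerminateInstancesRequest;

public class TinyChaosMonkey {
    public static void main(String[] args) {
        AmazonAutoScaling autoscaling = AmazonAutoScalingClientBuilder.defaultClient();
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

        // Pick a random instance from a (hypothetical) production ASG...
        List<Instance> instances = autoscaling
                .describeAutoScalingGroups(new DescribeAutoScalingGroupsRequest()
                        .withAutoScalingGroupNames("api-prod"))
                .getAutoScalingGroups().get(0).getInstances();
        Instance victim = instances.get(new Random().nextInt(instances.size()));

        // ...and terminate it. A resilient service absorbs this as a non-event.
        ec2.terminateInstances(new TerminateInstancesRequest()
                .withInstanceIds(victim.getInstanceId()));
    }
}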
35. @atseitlin
Chaos Gorilla taught us…
• Hidden assumptions on deployment topology
• Infrastructure control plane can be a bottleneck
• Large scale events are hard to simulate
• Rapidly shifting traffic is error prone
• Smooth recovery is a challenge
• Cassandra works as expected
42. @atseitlin
Latency Monkey taught us
• Startup resiliency is often missed
• An ongoing, unified approach to runtime dependency management is important (visibility and transparency get missed otherwise)
• Know thy neighbor (unknown dependencies)
• Fall-backs can fail too
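Latency Monkey exercised exactly these lessons by injecting artificial delays into service calls. A toy illustration of the mechanism as a servlet filter — the rate, delay, and placement are hypothetical, not Latency Monkey’s actual implementation:

import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

// Delays a small fraction of requests so callers' timeouts and
// fall-backs get exercised continuously, in a controlled way.
// (Servlet 4.0+: Filter.init/destroy have default implementations.)
public class LatencyInjectionFilter implements Filter {
    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        if (ThreadLocalRandom.current().nextDouble() < 0.01) { // 1% of requests
            try {
                Thread.sleep(2_000); // 2s: well past any fail-fast timeout
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        chain.doFilter(req, res);
    }
}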
47. @atseitlin
Ranks of the Simian Army
• Chaos Monkey
• Chaos Gorilla
• Latency Monkey
• Janitor Monkey
• Conformity Monkey
• Circus Monkey
• Doctor Monkey
• Howler Monkey
• Security Monkey
• Chaos Kong
• Efficiency Monkey
48. @atseitlin
Observability is key
• Don’t exacerbate real customer issues with failure exercises
• Deep system visibility is key to root-causing failures and understanding the system
49. @atseitlin
Organizational elements
• Every engineer is an operator of the service
• Each failure is an opportunity to learn
• Blameless culture
Goal is to create a learning organization
52. @atseitlin
Open Source Projects
(Slide legend: projects marked as Github / Techblog, Apache Contributions, Techblog Post, or Coming Soon)
• Priam – Cassandra as a Service
• Astyanax – Cassandra client for Java
• CassJMeter – Cassandra test suite
• Cassandra – Multi-region EC2 datastore support
• Aegisthus – Hadoop ETL for Cassandra
• Ice – Spend analytics
• Governator – Library lifecycle and dependency injection
• Odin – Cloud orchestration
• Blitz4j – Async logging
• Exhibitor – Zookeeper as a Service
• Curator – Zookeeper patterns
• EVCache – Memcached as a Service
• Eureka / Discovery – Service directory
• Archaius – Dynamic properties service
• Edda – Config state with history
• Denominator
• Ribbon – REST client + mid-tier load balancer
• Karyon – Instrumented REST base server
• Servo and Autoscaling Scripts
• Genie – Hadoop PaaS
• Hystrix – Robust service pattern
• RxJava – Reactive patterns
• Asgard – AutoScaleGroup-based AWS console
• Chaos Monkey – Robustness verification
• Latency Monkey
• Janitor Monkey
• Bakeries / Aminator
56. @atseitlin
We’re hiring!
• Simian Army
• Cloud Tools
• NetflixOSS
• Cloud Operations
• Reliability Engineering
• Edge Services
• Many, many more
jobs.netflix.com
57. @atseitlin
Takeaways
Create fine-grained microservices. Don’t trust your dependencies.
Regularly inducing failure in your production environment validates resiliency and increases availability.
Netflix has built and deployed a scalable, global, highly available Platform as a Service and open sourced it (NetflixOSS).
http://netflix.github.com
http://techblog.netflix.com
http://slideshare.net/Netflix
http://www.linkedin.com/in/atseitlin
@atseitlin @NetflixOSS