SlideShare a Scribd company logo
1 of 58
@atseitlin
Resiliency through failure
Netflix's Approach to Extreme Availability in the Cloud
Ariel Tseitlin
http://www.linkedin.com/in/atseitlin
@atseitlin
@atseitlin
About Netflix
Netflix is the world’s
leading Internet
television network with
more than 38 million
members in 40
countries enjoying more
than one billion hours
of TV shows and movies
per month, including
original series[1]
[1] http://ir.netflix.com/
@atseitlin
A complex distributed system
@atseitlin
How Netflix Streaming Works
Customer Device
(PC, PS3, TV…)
Web Site or
Discovery API
User Data
Personalization
Streaming API
DRM
QoS Logging
OpenConnect
CDN Boxes
CDN
Management and
Steering
Content Encoding
Consumer
Electronics
AWS Cloud
Services
CDN Edge
Locations
Browse
Play
Watch
@atseitlin
Highly Available Architecture
Micro-services, redundancy,
resiliency
@atseitlin
Web Server Dependencies Flow
(Home page business transaction as seen by AppDynamics)
Start Here
memcached
Cassandra
Web service
S3 bucket
Personalization movie
group chooser
Each icon is
three to a few
hundred
instances
across three
AWS zones
@atseitlin
Component Micro-Services
Test With Chaos Monkey, Latency Monkey
@atseitlin
Three Balanced Availability Zones
Test with Chaos Gorilla
Cassandra and Evcache
Replicas
Zone A
Cassandra and Evcache
Replicas
Zone B
Cassandra and Evcache
Replicas
Zone C
Load Balancers
@atseitlin
Triple Replicated Persistence
Cassandra maintenance affects individual replicas
Cassandra and Evcache
Replicas
Zone A
Cassandra and Evcache
Replicas
Zone B
Cassandra and Evcache
Replicas
Zone C
Load Balancers
@atseitlin
Isolated Regions
Will someday test with Chaos Kong
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
US-East Load Balancers
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
EU-West Load Balancers
@atseitlin
Failure Modes and Effects
Failure Mode Probability Current Mitigation Plan
Application Failure High Automatic degraded response
AWS Region Failure Low Wait for region to recover
AWS Zone Failure Medium Continue to run on 2 out of 3 zones
Datacenter Failure Medium Migrate more functions to cloud
Data store failure Low Restore from S3 backups
S3 failure Low Restore from remote archive
Until we got really good at mitigating high and medium
probability failures, the ROI for mitigating regional
failures didn’t make sense. Getting there…
@atseitlin
Application Resilience
Run what you wrote
Rapid detection
Rapid Response
Fail often
@atseitlin
Run What You Wrote
• Make developers responsible for failures
– Then they learn and write code that doesn’t fail
• Use Incident Reviews to find gaps to fix
– Make sure its not about finding “who to blame”
• Keep timeouts short, fail fast
– Don’t let cascading timeouts stack up
@atseitlin
Rapid Detection
• If your pilot had no instument panel, would
you ever board fly on a plane?
– Never run your service blind
• Monitor services, not instances
– Make instance failure a non-event
• Don’t pay people to watch screens
– Instead pay them to build alerting
@atseitlin
Edda
AWS
Instances, ASGs, et
c.
Eureka Services
metadata
AppDynamics
Request flow
Edda – Configuration History
http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html
@atseitlin
Edda Query Examples
Find any instances that have ever had a specific public IP address
$ curl "http://edda/api/v2/view/instances;publicIpAddress=1.2.3.4;_since=0"
["i-0123456789","i-012345678a","i-012345678b”]
Show the most recent change to a security group
$ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2"
--- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810
+++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504
@@ -1,33 +1,33 @@
{
…
"ipRanges" : [
"10.10.1.1/32",
"10.10.1.2/32",
+ "10.10.1.3/32",
- "10.10.1.4/32"
…
}
@atseitlin
Rapid Rollback
• Use a new Autoscale Group to push code
• Leave existing ASG in place, switch traffic
• If OK, auto-delete old ASG a few hours later
• If “whoops”, switch traffic back in seconds
@atseitlin
Asgard
http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html
@atseitlin
@atseitlin
@atseitlin
Our goal is availability
• Members can stream Netflix whenever they
want
• New users can explore and sign up for the
service
• New members can activate their service and
add new devices
@atseitlin
Failure is all around us
• Disks fail
• Power goes out. And your generator fails.
• Software bugs introduced
• People make mistakes
Failure is unavoidable
@atseitlin
We design around failure
• Exception handling
• Clusters
• Redundancy
• Fault tolerance
• Fall-back or degraded experience (Hystrix)
• All to insulate our users from failure
Is that enough?
@atseitlin
It’s not enough
• How do we know if we’ve succeeded?
• Does the system work as designed?
• Is it as resilient as we believe?
• How do we prevent drifting into failure?
The typical answer is…
@atseitlin
More testing!
• Unit testing
• Integration testing
• Stress testing
• Exhaustive test suites to simulate and test all
failure mode
Can we effectively simulate a large-
scale distributed system?
@atseitlin
Building distributed systems is hard
Testing them exhaustively is even harder
• Massive data sets and changing shape
• Internet-scale traffic
• Complex interaction and information flow
• Asynchronous nature
• 3rd party services
• All while innovating and building features
Prohibitively expensive, if not impossible,
for most large-scale systems
@atseitlin
What if we could reduce variability of failures?
@atseitlin
There is another way
• Cause failure to validate resiliency
• Test design assumption by stressing them
• Don’t wait for random failure. Remove its
uncertainty by forcing it periodically
@atseitlin
And that’s exactly what we did
@atseitlin
Instances fail
@atseitlin
@atseitlin
Chaos Monkey taught us…
• State is bad
• Clusters are good
• Surviving single instance failure is not enough
@atseitlin
Lots of instances fail
@atseitlin
Chaos Gorilla
@atseitlin
Chaos Gorilla taught us…
• Hidden assumptions on deployment topology
• Infrastructure control plane can be a
bottleneck
• Large scale events are hard to simulate
• Rapidly shifting traffic is error prone
• Smooth recovery is a challenge
• Cassandra works as expected
@atseitlin
What about larger catastrophes?
Anyone remember Sandy?
@atseitlin
Chaos Kong (*some day soon*)
@atseitlin
The Sick and Wounded
@atseitlin
Latency Monkey
@atseitlin
@atseitlin
Resilient Design – Hystrix, RxJava
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
@atseitlin
Latency Monkey taught us
• Startup resiliency is often missed
• An ongoing unified approach to runtime
dependency management is important (visibility &
transparency gets missed otherwise)
• Know thy neighbor (unknown dependencies)
• Fall backs can fail too
@atseitlin
Entropy
@atseitlin
Clutter accumulates
• Complexity
• Cruft
• Vulnerabilities
• Cost
@atseitlin
Janitor Monkey
@atseitlin
Janitor Monkey taught us…
• Label everything
• Clutter builds up
@atseitlin
Ranks of the Simian Army
• Chaos Monkey
• Chaos Gorilla
• Latency Monkey
• Janitor Monkey
• Conformity
Monkey
• Circus Monkey
• Doctor Monkey
• Howler Monkey
• Security Monkey
• Chaos Kong
• Efficiency Monkey
@atseitlin
Observability is key
• Don’t exacerbate real customer issues with
failure exercises
• Deep system visibility is key to root-cause
failures and understand the system
@atseitlin
Organizational elements
• Every engineer is an operator of the service
• Each failure is an opportunity to learn
• Blameless culture
Goal is to create a learning organization
@atseitlin
Assembling the Puzzle
@atseitlin
Netflix Highly Available Platform
now open
@NetflixOSS
@atseitlin
Open Source Projects
Github / Techblog
Apache Contributions
Techblog Post
Coming Soon
Priam
Cassandra as a Service
Astyanax
Cassandra client for Java
CassJMeter
Cassandra test suite
Cassandra
Multi-region EC2 datastore
support
Aegisthus
Hadoop ETL for Cassandra
Ice
Spend analytics
Governator
Library lifecycle and dependency
injection
Odin
Cloud orchestration
Blitz4j Async logging
Exhibitor
Zookeeper as a Service
Curator
Zookeeper Patterns
EVCache
Memcached as a Service
Eureka / Discovery
Service Directory
Archaius
Dynamics Properties Service
Edda
Config state with history
Denominator
Ribbon
REST Client + mid-tier LB
Karyon
Instrumented REST Base Serve
Servo and Autoscaling Scripts
Genie
Hadoop PaaS
Hystrix
Robust service pattern
RxJava Reactive Patterns
Asgard
AutoScaleGroup based AWS
console
Chaos Monkey
Robustness verification
Latency Monkey
Janitor Monkey
Bakeries / Aminotor
Legend
@atseitlin
How does it all fit together?
@atseitlin
@atseitlin
Our Current Catalog of Releases
Free code available at http://netflix.github.com
@atseitlin
We’re hiring!
• Simian Army
• Cloud Tools
• NetflixOSS
• Cloud Operations
• Reliability Engineering
• Edge Services
• Many, many more
jobs.netflix.com
@atseitlin
Takeaways
Create fine-grained micro-services. Don’t trust your dependencies.
Regularly inducing failure in your production environment validates resiliency
and increases availability
Netflix has built and deployed a scalable global and highly available Platform
as a Service and opened sourced it (NetflixOSS)
http://netflix.github.com
http://techblog.netflix.com
http://slideshare.net/Netflix
http://www.linkedin.com/in/atseitlin
@atseitlin @NetflixOSS
@atseitlin
Thank you!
Any questions?
Ariel Tseitlin
http://www.linkedin.com/in/atseitlin
@atseitlin

More Related Content

What's hot

Reactive programming and Hystrix fault tolerance by Max Myslyvtsev
Reactive programming and Hystrix fault tolerance by Max MyslyvtsevReactive programming and Hystrix fault tolerance by Max Myslyvtsev
Reactive programming and Hystrix fault tolerance by Max MyslyvtsevJavaDayUA
 
Security as Code
Security as CodeSecurity as Code
Security as CodeEd Bellis
 
Chaos Engineering - Limiting Damage During Chaos Experiments
Chaos Engineering - Limiting Damage During Chaos ExperimentsChaos Engineering - Limiting Damage During Chaos Experiments
Chaos Engineering - Limiting Damage During Chaos ExperimentsNils Meder
 
I Don't Test Often ...
I Don't Test Often ...I Don't Test Often ...
I Don't Test Often ...Gareth Bowles
 
OpenStack in the Enterprise - NJ VMUG June 9, 2015 - Melissa Palmer
OpenStack in the Enterprise - NJ VMUG June 9, 2015 - Melissa PalmerOpenStack in the Enterprise - NJ VMUG June 9, 2015 - Melissa Palmer
OpenStack in the Enterprise - NJ VMUG June 9, 2015 - Melissa Palmervmiss33
 
HealthConDX Virtual Summit 2021 - How Security Chaos Engineering is Changing ...
HealthConDX Virtual Summit 2021 - How Security Chaos Engineering is Changing ...HealthConDX Virtual Summit 2021 - How Security Chaos Engineering is Changing ...
HealthConDX Virtual Summit 2021 - How Security Chaos Engineering is Changing ...Aaron Rinehart
 
Slam Dunk with Splunk and Stash Data Center
Slam Dunk with Splunk and Stash Data CenterSlam Dunk with Splunk and Stash Data Center
Slam Dunk with Splunk and Stash Data CenterAtlassian
 
Dockercon USA 2016 - Immutable Awesomeness
Dockercon USA 2016 - Immutable Awesomeness Dockercon USA 2016 - Immutable Awesomeness
Dockercon USA 2016 - Immutable Awesomeness John Willis
 
Go Reactive: Event-Driven, Scalable, Resilient & Responsive Systems (Soft-Sha...
Go Reactive: Event-Driven, Scalable, Resilient & Responsive Systems (Soft-Sha...Go Reactive: Event-Driven, Scalable, Resilient & Responsive Systems (Soft-Sha...
Go Reactive: Event-Driven, Scalable, Resilient & Responsive Systems (Soft-Sha...mircodotta
 
Immutable Service Delivery Shenzhen 2016
Immutable Service Delivery   Shenzhen 2016Immutable Service Delivery   Shenzhen 2016
Immutable Service Delivery Shenzhen 2016John Willis
 
Vertafore: Database Evaluation - Selecting Apache Cassandra
Vertafore: Database Evaluation - Selecting Apache CassandraVertafore: Database Evaluation - Selecting Apache Cassandra
Vertafore: Database Evaluation - Selecting Apache CassandraDataStax Academy
 
DevOps: Cultural and Tooling Tips Around the World
DevOps: Cultural and Tooling Tips Around the WorldDevOps: Cultural and Tooling Tips Around the World
DevOps: Cultural and Tooling Tips Around the WorldDynatrace
 
DOES16 London - Better Faster Cheaper .. How?
DOES16 London - Better Faster Cheaper .. How? DOES16 London - Better Faster Cheaper .. How?
DOES16 London - Better Faster Cheaper .. How? John Willis
 
Mobile User Experience: Auto Drive through Performance Metrics
Mobile User Experience:Auto Drive through Performance MetricsMobile User Experience:Auto Drive through Performance Metrics
Mobile User Experience: Auto Drive through Performance MetricsAndreas Grabner
 
BTD2015 - Your Place In DevTOps is Finding Solutions - Not Just Bugs!
BTD2015 - Your Place In DevTOps is Finding Solutions - Not Just Bugs!BTD2015 - Your Place In DevTOps is Finding Solutions - Not Just Bugs!
BTD2015 - Your Place In DevTOps is Finding Solutions - Not Just Bugs!Andreas Grabner
 
Get Loose! Microservices and Loosely Coupled Architectures
Get Loose! Microservices and Loosely Coupled ArchitecturesGet Loose! Microservices and Loosely Coupled Architectures
Get Loose! Microservices and Loosely Coupled ArchitecturesDeborah Schalm
 
Pets versus Cattle: servers evolved
Pets versus Cattle: servers evolvedPets versus Cattle: servers evolved
Pets versus Cattle: servers evolvedPhil Cryer
 
cdSummit Austin - Orchestrating the continuous delivery process - Andy Pemberton
cdSummit Austin - Orchestrating the continuous delivery process - Andy PembertoncdSummit Austin - Orchestrating the continuous delivery process - Andy Pemberton
cdSummit Austin - Orchestrating the continuous delivery process - Andy PembertonMiles Blatstein
 
Monktoberfest Fast Delivery
Monktoberfest Fast DeliveryMonktoberfest Fast Delivery
Monktoberfest Fast DeliveryAdrian Cockcroft
 
All Change how the economics of Cloud will make you think differently about Java
All Change how the economics of Cloud will make you think differently about JavaAll Change how the economics of Cloud will make you think differently about Java
All Change how the economics of Cloud will make you think differently about JavaSteve Poole
 

What's hot (20)

Reactive programming and Hystrix fault tolerance by Max Myslyvtsev
Reactive programming and Hystrix fault tolerance by Max MyslyvtsevReactive programming and Hystrix fault tolerance by Max Myslyvtsev
Reactive programming and Hystrix fault tolerance by Max Myslyvtsev
 
Security as Code
Security as CodeSecurity as Code
Security as Code
 
Chaos Engineering - Limiting Damage During Chaos Experiments
Chaos Engineering - Limiting Damage During Chaos ExperimentsChaos Engineering - Limiting Damage During Chaos Experiments
Chaos Engineering - Limiting Damage During Chaos Experiments
 
I Don't Test Often ...
I Don't Test Often ...I Don't Test Often ...
I Don't Test Often ...
 
OpenStack in the Enterprise - NJ VMUG June 9, 2015 - Melissa Palmer
OpenStack in the Enterprise - NJ VMUG June 9, 2015 - Melissa PalmerOpenStack in the Enterprise - NJ VMUG June 9, 2015 - Melissa Palmer
OpenStack in the Enterprise - NJ VMUG June 9, 2015 - Melissa Palmer
 
HealthConDX Virtual Summit 2021 - How Security Chaos Engineering is Changing ...
HealthConDX Virtual Summit 2021 - How Security Chaos Engineering is Changing ...HealthConDX Virtual Summit 2021 - How Security Chaos Engineering is Changing ...
HealthConDX Virtual Summit 2021 - How Security Chaos Engineering is Changing ...
 
Slam Dunk with Splunk and Stash Data Center
Slam Dunk with Splunk and Stash Data CenterSlam Dunk with Splunk and Stash Data Center
Slam Dunk with Splunk and Stash Data Center
 
Dockercon USA 2016 - Immutable Awesomeness
Dockercon USA 2016 - Immutable Awesomeness Dockercon USA 2016 - Immutable Awesomeness
Dockercon USA 2016 - Immutable Awesomeness
 
Go Reactive: Event-Driven, Scalable, Resilient & Responsive Systems (Soft-Sha...
Go Reactive: Event-Driven, Scalable, Resilient & Responsive Systems (Soft-Sha...Go Reactive: Event-Driven, Scalable, Resilient & Responsive Systems (Soft-Sha...
Go Reactive: Event-Driven, Scalable, Resilient & Responsive Systems (Soft-Sha...
 
Immutable Service Delivery Shenzhen 2016
Immutable Service Delivery   Shenzhen 2016Immutable Service Delivery   Shenzhen 2016
Immutable Service Delivery Shenzhen 2016
 
Vertafore: Database Evaluation - Selecting Apache Cassandra
Vertafore: Database Evaluation - Selecting Apache CassandraVertafore: Database Evaluation - Selecting Apache Cassandra
Vertafore: Database Evaluation - Selecting Apache Cassandra
 
DevOps: Cultural and Tooling Tips Around the World
DevOps: Cultural and Tooling Tips Around the WorldDevOps: Cultural and Tooling Tips Around the World
DevOps: Cultural and Tooling Tips Around the World
 
DOES16 London - Better Faster Cheaper .. How?
DOES16 London - Better Faster Cheaper .. How? DOES16 London - Better Faster Cheaper .. How?
DOES16 London - Better Faster Cheaper .. How?
 
Mobile User Experience: Auto Drive through Performance Metrics
Mobile User Experience:Auto Drive through Performance MetricsMobile User Experience:Auto Drive through Performance Metrics
Mobile User Experience: Auto Drive through Performance Metrics
 
BTD2015 - Your Place In DevTOps is Finding Solutions - Not Just Bugs!
BTD2015 - Your Place In DevTOps is Finding Solutions - Not Just Bugs!BTD2015 - Your Place In DevTOps is Finding Solutions - Not Just Bugs!
BTD2015 - Your Place In DevTOps is Finding Solutions - Not Just Bugs!
 
Get Loose! Microservices and Loosely Coupled Architectures
Get Loose! Microservices and Loosely Coupled ArchitecturesGet Loose! Microservices and Loosely Coupled Architectures
Get Loose! Microservices and Loosely Coupled Architectures
 
Pets versus Cattle: servers evolved
Pets versus Cattle: servers evolvedPets versus Cattle: servers evolved
Pets versus Cattle: servers evolved
 
cdSummit Austin - Orchestrating the continuous delivery process - Andy Pemberton
cdSummit Austin - Orchestrating the continuous delivery process - Andy PembertoncdSummit Austin - Orchestrating the continuous delivery process - Andy Pemberton
cdSummit Austin - Orchestrating the continuous delivery process - Andy Pemberton
 
Monktoberfest Fast Delivery
Monktoberfest Fast DeliveryMonktoberfest Fast Delivery
Monktoberfest Fast Delivery
 
All Change how the economics of Cloud will make you think differently about Java
All Change how the economics of Cloud will make you think differently about JavaAll Change how the economics of Cloud will make you think differently about Java
All Change how the economics of Cloud will make you think differently about Java
 

Viewers also liked

AWS August Webinar Series - DDoS Resiliency
AWS August Webinar Series - DDoS ResiliencyAWS August Webinar Series - DDoS Resiliency
AWS August Webinar Series - DDoS ResiliencyAmazon Web Services
 
Crowdfunding strategie sessie IVN
Crowdfunding strategie sessie IVNCrowdfunding strategie sessie IVN
Crowdfunding strategie sessie IVNRonald Kleverlaan
 
Fast, reliable, secure @ Velocity 2015
Fast, reliable, secure @  Velocity 2015Fast, reliable, secure @  Velocity 2015
Fast, reliable, secure @ Velocity 2015Ariel Tseitlin
 
Innovation, Service, and Shared References
Innovation, Service, and Shared ReferencesInnovation, Service, and Shared References
Innovation, Service, and Shared ReferencesEric Reiss
 
Quiz Topics For 1st Quiz 8th
Quiz Topics For 1st Quiz 8thQuiz Topics For 1st Quiz 8th
Quiz Topics For 1st Quiz 8thawltech
 
Mohamed Sayed C.V.
Mohamed Sayed C.V.Mohamed Sayed C.V.
Mohamed Sayed C.V.darsh0225
 
Crowdfunding in de Zorg - Verschillende vormen en tips en trucs
Crowdfunding in de Zorg - Verschillende vormen en tips en trucsCrowdfunding in de Zorg - Verschillende vormen en tips en trucs
Crowdfunding in de Zorg - Verschillende vormen en tips en trucsRonald Kleverlaan
 
Nationaal Monumentencongres - Crowdfunding
Nationaal Monumentencongres - CrowdfundingNationaal Monumentencongres - Crowdfunding
Nationaal Monumentencongres - CrowdfundingRonald Kleverlaan
 
Crowdfunding in de Zorg - Novartis Patient Academy
Crowdfunding in de Zorg - Novartis Patient AcademyCrowdfunding in de Zorg - Novartis Patient Academy
Crowdfunding in de Zorg - Novartis Patient AcademyRonald Kleverlaan
 
091710 NTNUMUN說明會
091710 NTNUMUN說明會091710 NTNUMUN說明會
091710 NTNUMUN說明會Peitung Wang
 
The role of the information architect
The role of the information architectThe role of the information architect
The role of the information architectEric Reiss
 
Innovation at Israel Mobile Monetization Summit
Innovation at Israel Mobile Monetization SummitInnovation at Israel Mobile Monetization Summit
Innovation at Israel Mobile Monetization SummitEric Reiss
 
Keynote at UX Sofia 2013
Keynote at UX Sofia 2013Keynote at UX Sofia 2013
Keynote at UX Sofia 2013Eric Reiss
 
Happy New Year 2009 how will you Celebrate
Happy New Year 2009 how will you CelebrateHappy New Year 2009 how will you Celebrate
Happy New Year 2009 how will you CelebrateBillen
 
Using New media to create a band of youth social journalists
Using New media to create a band of youth social journalistsUsing New media to create a band of youth social journalists
Using New media to create a band of youth social journalistsKeerthi Kiran K
 

Viewers also liked (20)

AWS August Webinar Series - DDoS Resiliency
AWS August Webinar Series - DDoS ResiliencyAWS August Webinar Series - DDoS Resiliency
AWS August Webinar Series - DDoS Resiliency
 
Crowdfunding strategie sessie IVN
Crowdfunding strategie sessie IVNCrowdfunding strategie sessie IVN
Crowdfunding strategie sessie IVN
 
Create a Loyal Following of Customers
Create a Loyal Following of CustomersCreate a Loyal Following of Customers
Create a Loyal Following of Customers
 
Fast, reliable, secure @ Velocity 2015
Fast, reliable, secure @  Velocity 2015Fast, reliable, secure @  Velocity 2015
Fast, reliable, secure @ Velocity 2015
 
Innovation, Service, and Shared References
Innovation, Service, and Shared ReferencesInnovation, Service, and Shared References
Innovation, Service, and Shared References
 
Quiz Topics For 1st Quiz 8th
Quiz Topics For 1st Quiz 8thQuiz Topics For 1st Quiz 8th
Quiz Topics For 1st Quiz 8th
 
Mohamed Sayed C.V.
Mohamed Sayed C.V.Mohamed Sayed C.V.
Mohamed Sayed C.V.
 
Crowdfunding in de Zorg - Verschillende vormen en tips en trucs
Crowdfunding in de Zorg - Verschillende vormen en tips en trucsCrowdfunding in de Zorg - Verschillende vormen en tips en trucs
Crowdfunding in de Zorg - Verschillende vormen en tips en trucs
 
Nationaal Monumentencongres - Crowdfunding
Nationaal Monumentencongres - CrowdfundingNationaal Monumentencongres - Crowdfunding
Nationaal Monumentencongres - Crowdfunding
 
Community Mill: Data, Media & Communities
Community Mill: Data, Media & CommunitiesCommunity Mill: Data, Media & Communities
Community Mill: Data, Media & Communities
 
Crowdfunding in de Zorg - Novartis Patient Academy
Crowdfunding in de Zorg - Novartis Patient AcademyCrowdfunding in de Zorg - Novartis Patient Academy
Crowdfunding in de Zorg - Novartis Patient Academy
 
How to access to capitol
How to access to capitolHow to access to capitol
How to access to capitol
 
091710 NTNUMUN說明會
091710 NTNUMUN說明會091710 NTNUMUN說明會
091710 NTNUMUN說明會
 
The role of the information architect
The role of the information architectThe role of the information architect
The role of the information architect
 
Dengue
DengueDengue
Dengue
 
Crowdfunding algeracorridor
Crowdfunding algeracorridorCrowdfunding algeracorridor
Crowdfunding algeracorridor
 
Innovation at Israel Mobile Monetization Summit
Innovation at Israel Mobile Monetization SummitInnovation at Israel Mobile Monetization Summit
Innovation at Israel Mobile Monetization Summit
 
Keynote at UX Sofia 2013
Keynote at UX Sofia 2013Keynote at UX Sofia 2013
Keynote at UX Sofia 2013
 
Happy New Year 2009 how will you Celebrate
Happy New Year 2009 how will you CelebrateHappy New Year 2009 how will you Celebrate
Happy New Year 2009 how will you Celebrate
 
Using New media to create a band of youth social journalists
Using New media to create a band of youth social journalistsUsing New media to create a band of youth social journalists
Using New media to create a band of youth social journalists
 

Similar to Resiliency through Failure @ OSCON 2013

Resiliency through failure @ QConNY 2013
Resiliency through failure @ QConNY 2013Resiliency through failure @ QConNY 2013
Resiliency through failure @ QConNY 2013Ariel Tseitlin
 
LF_APIStrat17_Don't Build a Death Star
LF_APIStrat17_Don't Build a Death StarLF_APIStrat17_Don't Build a Death Star
LF_APIStrat17_Don't Build a Death StarLF_APIStrat
 
I don't always test...but when I do I test in production - Gareth Bowles
I don't always test...but when I do I test in production - Gareth BowlesI don't always test...but when I do I test in production - Gareth Bowles
I don't always test...but when I do I test in production - Gareth BowlesQA or the Highway
 
Netflix presents at MassTLC Cloud Summit 2013
Netflix presents at MassTLC Cloud Summit 2013Netflix presents at MassTLC Cloud Summit 2013
Netflix presents at MassTLC Cloud Summit 2013MassTLC
 
Cloud Native Future
Cloud Native FutureCloud Native Future
Cloud Native FutureJulie Coonce
 
(SPOT302) Availability: The New Kind of Innovator’s Dilemma
(SPOT302) Availability: The New Kind of Innovator’s Dilemma(SPOT302) Availability: The New Kind of Innovator’s Dilemma
(SPOT302) Availability: The New Kind of Innovator’s DilemmaAmazon Web Services
 
Antifragile, Microservices and DevOps - A Study
Antifragile, Microservices and DevOps - A StudyAntifragile, Microservices and DevOps - A Study
Antifragile, Microservices and DevOps - A StudyWilliam Yang
 
Release the Monkeys ! Testing in the Wild at Netflix
Release the Monkeys !  Testing in the Wild at NetflixRelease the Monkeys !  Testing in the Wild at Netflix
Release the Monkeys ! Testing in the Wild at NetflixGareth Bowles
 
Create an architecture for web test automation
Create an architecture for web test automationCreate an architecture for web test automation
Create an architecture for web test automationElias Nogueira
 
Practical Cloud & Workflow Orchestration
Practical Cloud & Workflow OrchestrationPractical Cloud & Workflow Orchestration
Practical Cloud & Workflow OrchestrationChris Dagdigian
 
GDG Cloud Southlake 29 Jimmy Mesta OWASP Top 10 for Kubernetes
GDG Cloud Southlake 29 Jimmy Mesta OWASP Top 10 for KubernetesGDG Cloud Southlake 29 Jimmy Mesta OWASP Top 10 for Kubernetes
GDG Cloud Southlake 29 Jimmy Mesta OWASP Top 10 for KubernetesJames Anderson
 
Green Custard Friday Talk 19: Chaos Engineering
Green Custard Friday Talk 19: Chaos EngineeringGreen Custard Friday Talk 19: Chaos Engineering
Green Custard Friday Talk 19: Chaos EngineeringGreen Custard
 
Architecting for failure - Why are distributed systems hard?
Architecting for failure - Why are distributed systems hard?Architecting for failure - Why are distributed systems hard?
Architecting for failure - Why are distributed systems hard?Markus Eisele
 
Planning to Fail #phpuk13
Planning to Fail #phpuk13Planning to Fail #phpuk13
Planning to Fail #phpuk13Dave Gardner
 
Site reliability in the Serverless age - Serverless Boston 2019
Site reliability in the Serverless age  - Serverless Boston 2019Site reliability in the Serverless age  - Serverless Boston 2019
Site reliability in the Serverless age - Serverless Boston 2019Erik Peterson
 

Similar to Resiliency through Failure @ OSCON 2013 (20)

Resiliency through failure @ QConNY 2013
Resiliency through failure @ QConNY 2013Resiliency through failure @ QConNY 2013
Resiliency through failure @ QConNY 2013
 
Mini-Training: Netflix Simian Army
Mini-Training: Netflix Simian ArmyMini-Training: Netflix Simian Army
Mini-Training: Netflix Simian Army
 
LF_APIStrat17_Don't Build a Death Star
LF_APIStrat17_Don't Build a Death StarLF_APIStrat17_Don't Build a Death Star
LF_APIStrat17_Don't Build a Death Star
 
I don't always test...but when I do I test in production - Gareth Bowles
I don't always test...but when I do I test in production - Gareth BowlesI don't always test...but when I do I test in production - Gareth Bowles
I don't always test...but when I do I test in production - Gareth Bowles
 
Netflix presents at MassTLC Cloud Summit 2013
Netflix presents at MassTLC Cloud Summit 2013Netflix presents at MassTLC Cloud Summit 2013
Netflix presents at MassTLC Cloud Summit 2013
 
Running a Lean Startup with AWS
Running a Lean Startup with AWSRunning a Lean Startup with AWS
Running a Lean Startup with AWS
 
Fantastic Elastic
Fantastic ElasticFantastic Elastic
Fantastic Elastic
 
Cloud Native Future
Cloud Native FutureCloud Native Future
Cloud Native Future
 
(SPOT302) Availability: The New Kind of Innovator’s Dilemma
(SPOT302) Availability: The New Kind of Innovator’s Dilemma(SPOT302) Availability: The New Kind of Innovator’s Dilemma
(SPOT302) Availability: The New Kind of Innovator’s Dilemma
 
Antifragile, Microservices and DevOps - A Study
Antifragile, Microservices and DevOps - A StudyAntifragile, Microservices and DevOps - A Study
Antifragile, Microservices and DevOps - A Study
 
Release the Monkeys ! Testing in the Wild at Netflix
Release the Monkeys !  Testing in the Wild at NetflixRelease the Monkeys !  Testing in the Wild at Netflix
Release the Monkeys ! Testing in the Wild at Netflix
 
ChaosEngineeringITEA.pptx
ChaosEngineeringITEA.pptxChaosEngineeringITEA.pptx
ChaosEngineeringITEA.pptx
 
Create an architecture for web test automation
Create an architecture for web test automationCreate an architecture for web test automation
Create an architecture for web test automation
 
Practical Cloud & Workflow Orchestration
Practical Cloud & Workflow OrchestrationPractical Cloud & Workflow Orchestration
Practical Cloud & Workflow Orchestration
 
GDG Cloud Southlake 29 Jimmy Mesta OWASP Top 10 for Kubernetes
GDG Cloud Southlake 29 Jimmy Mesta OWASP Top 10 for KubernetesGDG Cloud Southlake 29 Jimmy Mesta OWASP Top 10 for Kubernetes
GDG Cloud Southlake 29 Jimmy Mesta OWASP Top 10 for Kubernetes
 
Green Custard Friday Talk 19: Chaos Engineering
Green Custard Friday Talk 19: Chaos EngineeringGreen Custard Friday Talk 19: Chaos Engineering
Green Custard Friday Talk 19: Chaos Engineering
 
Architecting for failure - Why are distributed systems hard?
Architecting for failure - Why are distributed systems hard?Architecting for failure - Why are distributed systems hard?
Architecting for failure - Why are distributed systems hard?
 
Planning to Fail #phpuk13
Planning to Fail #phpuk13Planning to Fail #phpuk13
Planning to Fail #phpuk13
 
Chaos engineering
Chaos engineering Chaos engineering
Chaos engineering
 
Site reliability in the Serverless age - Serverless Boston 2019
Site reliability in the Serverless age  - Serverless Boston 2019Site reliability in the Serverless age  - Serverless Boston 2019
Site reliability in the Serverless age - Serverless Boston 2019
 

Recently uploaded

How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 

Recently uploaded (20)

How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 

Resiliency through Failure @ OSCON 2013

  • 1. @atseitlin Resiliency through failure Netflix's Approach to Extreme Availability in the Cloud Ariel Tseitlin http://www.linkedin.com/in/atseitlin @atseitlin
  • 2. @atseitlin About Netflix Netflix is the world’s leading Internet television network with more than 38 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series[1] [1] http://ir.netflix.com/
  • 4. @atseitlin How Netflix Streaming Works Customer Device (PC, PS3, TV…) Web Site or Discovery API User Data Personalization Streaming API DRM QoS Logging OpenConnect CDN Boxes CDN Management and Steering Content Encoding Consumer Electronics AWS Cloud Services CDN Edge Locations Browse Play Watch
  • 6. @atseitlin Web Server Dependencies Flow (Home page business transaction as seen by AppDynamics) Start Here memcached Cassandra Web service S3 bucket Personalization movie group chooser Each icon is three to a few hundred instances across three AWS zones
  • 7. @atseitlin Component Micro-Services Test With Chaos Monkey, Latency Monkey
  • 8. @atseitlin Three Balanced Availability Zones Test with Chaos Gorilla Cassandra and Evcache Replicas Zone A Cassandra and Evcache Replicas Zone B Cassandra and Evcache Replicas Zone C Load Balancers
  • 9. @atseitlin Triple Replicated Persistence Cassandra maintenance affects individual replicas Cassandra and Evcache Replicas Zone A Cassandra and Evcache Replicas Zone B Cassandra and Evcache Replicas Zone C Load Balancers
  • 10. @atseitlin Isolated Regions Will someday test with Chaos Kong Cassandra Replicas Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C US-East Load Balancers Cassandra Replicas Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C EU-West Load Balancers
  • 11. @atseitlin Failure Modes and Effects Failure Mode Probability Current Mitigation Plan Application Failure High Automatic degraded response AWS Region Failure Low Wait for region to recover AWS Zone Failure Medium Continue to run on 2 out of 3 zones Datacenter Failure Medium Migrate more functions to cloud Data store failure Low Restore from S3 backups S3 failure Low Restore from remote archive Until we got really good at mitigating high and medium probability failures, the ROI for mitigating regional failures didn’t make sense. Getting there…
  • 12. @atseitlin Application Resilience Run what you wrote Rapid detection Rapid Response Fail often
  • 13. @atseitlin Run What You Wrote • Make developers responsible for failures – Then they learn and write code that doesn’t fail • Use Incident Reviews to find gaps to fix – Make sure its not about finding “who to blame” • Keep timeouts short, fail fast – Don’t let cascading timeouts stack up
  • 14. @atseitlin Rapid Detection • If your pilot had no instument panel, would you ever board fly on a plane? – Never run your service blind • Monitor services, not instances – Make instance failure a non-event • Don’t pay people to watch screens – Instead pay them to build alerting
  • 15. @atseitlin Edda AWS Instances, ASGs, et c. Eureka Services metadata AppDynamics Request flow Edda – Configuration History http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html
  • 16. @atseitlin Edda Query Examples Find any instances that have ever had a specific public IP address $ curl "http://edda/api/v2/view/instances;publicIpAddress=1.2.3.4;_since=0" ["i-0123456789","i-012345678a","i-012345678b”] Show the most recent change to a security group $ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2" --- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810 +++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504 @@ -1,33 +1,33 @@ { … "ipRanges" : [ "10.10.1.1/32", "10.10.1.2/32", + "10.10.1.3/32", - "10.10.1.4/32" … }
  • 17. @atseitlin Rapid Rollback • Use a new Autoscale Group to push code • Leave existing ASG in place, switch traffic • If OK, auto-delete old ASG a few hours later • If “whoops”, switch traffic back in seconds
  • 21. @atseitlin Our goal is availability • Members can stream Netflix whenever they want • New users can explore and sign up for the service • New members can activate their service and add new devices
  • 22. @atseitlin Failure is all around us • Disks fail • Power goes out. And your generator fails. • Software bugs introduced • People make mistakes Failure is unavoidable
  • 23. @atseitlin We design around failure • Exception handling • Clusters • Redundancy • Fault tolerance • Fall-back or degraded experience (Hystrix) • All to insulate our users from failure Is that enough?
  • 24. @atseitlin It’s not enough • How do we know if we’ve succeeded? • Does the system work as designed? • Is it as resilient as we believe? • How do we prevent drifting into failure? The typical answer is…
  • 25. @atseitlin More testing! • Unit testing • Integration testing • Stress testing • Exhaustive test suites to simulate and test all failure mode Can we effectively simulate a large- scale distributed system?
  • 26. @atseitlin Building distributed systems is hard Testing them exhaustively is even harder • Massive data sets and changing shape • Internet-scale traffic • Complex interaction and information flow • Asynchronous nature • 3rd party services • All while innovating and building features Prohibitively expensive, if not impossible, for most large-scale systems
  • 27. @atseitlin What if we could reduce variability of failures?
  • 28. @atseitlin There is another way • Cause failure to validate resiliency • Test design assumption by stressing them • Don’t wait for random failure. Remove its uncertainty by forcing it periodically
  • 32. @atseitlin Chaos Monkey taught us… • State is bad • Clusters are good • Surviving single instance failure is not enough
  • 35. @atseitlin Chaos Gorilla taught us… • Hidden assumptions on deployment topology • Infrastructure control plane can be a bottleneck • Large scale events are hard to simulate • Rapidly shifting traffic is error prone • Smooth recovery is a challenge • Cassandra works as expected
  • 36. @atseitlin What about larger catastrophes? Anyone remember Sandy?
  • 41. @atseitlin Resilient Design – Hystrix, RxJava http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
  • 42. @atseitlin Latency Monkey taught us • Startup resiliency is often missed • An ongoing unified approach to runtime dependency management is important (visibility & transparency gets missed otherwise) • Know thy neighbor (unknown dependencies) • Fall backs can fail too
  • 44. @atseitlin Clutter accumulates • Complexity • Cruft • Vulnerabilities • Cost
  • 46. @atseitlin Janitor Monkey taught us… • Label everything • Clutter builds up
  • 47. @atseitlin Ranks of the Simian Army • Chaos Monkey • Chaos Gorilla • Latency Monkey • Janitor Monkey • Conformity Monkey • Circus Monkey • Doctor Monkey • Howler Monkey • Security Monkey • Chaos Kong • Efficiency Monkey
  • 48. @atseitlin Observability is key • Don’t exacerbate real customer issues with failure exercises • Deep system visibility is key to root-cause failures and understand the system
  • 49. @atseitlin Organizational elements • Every engineer is an operator of the service • Each failure is an opportunity to learn • Blameless culture Goal is to create a learning organization
  • 51. @atseitlin Netflix Highly Available Platform now open @NetflixOSS
  • 52. @atseitlin Open Source Projects Github / Techblog Apache Contributions Techblog Post Coming Soon Priam Cassandra as a Service Astyanax Cassandra client for Java CassJMeter Cassandra test suite Cassandra Multi-region EC2 datastore support Aegisthus Hadoop ETL for Cassandra Ice Spend analytics Governator Library lifecycle and dependency injection Odin Cloud orchestration Blitz4j Async logging Exhibitor Zookeeper as a Service Curator Zookeeper Patterns EVCache Memcached as a Service Eureka / Discovery Service Directory Archaius Dynamics Properties Service Edda Config state with history Denominator Ribbon REST Client + mid-tier LB Karyon Instrumented REST Base Serve Servo and Autoscaling Scripts Genie Hadoop PaaS Hystrix Robust service pattern RxJava Reactive Patterns Asgard AutoScaleGroup based AWS console Chaos Monkey Robustness verification Latency Monkey Janitor Monkey Bakeries / Aminotor Legend
  • 53. @atseitlin How does it all fit together?
  • 55. @atseitlin Our Current Catalog of Releases Free code available at http://netflix.github.com
  • 56. @atseitlin We’re hiring! • Simian Army • Cloud Tools • NetflixOSS • Cloud Operations • Reliability Engineering • Edge Services • Many, many more jobs.netflix.com
  • 57. @atseitlin Takeaways Create fine-grained micro-services. Don’t trust your dependencies. Regularly inducing failure in your production environment validates resiliency and increases availability Netflix has built and deployed a scalable global and highly available Platform as a Service and opened sourced it (NetflixOSS) http://netflix.github.com http://techblog.netflix.com http://slideshare.net/Netflix http://www.linkedin.com/in/atseitlin @atseitlin @NetflixOSS
  • 58. @atseitlin Thank you! Any questions? Ariel Tseitlin http://www.linkedin.com/in/atseitlin @atseitlin

Editor's Notes

  1. The genre box shots were chosen because we have rights to use them, we are starting to make specific logos for each project going forward.