Slides presented at JPL February 2013. An updated version of the Re:Invent slides with NetflixOSS description. Black background for a change to help projector contrast.
High Availability Architecture and NetflixOSS
1. Highly Available Architecture at
Netflix
JPL - February 2013
Adrian Cockcroft
@adrianco #netflixcloud @NetflixOSS
http://www.linkedin.com/in/adriancockcroft
2. Netflix Inc.
Netflix is the world’s leading Internet television network
with more than 33 million members in 40 countries
enjoying more than one billion hours of TV shows and
movies per month, including original series.
Source: http://ir.netflix.com
3. Abstract
• Highly Available Architecture
• Taxonomy of Outage Failure Modes
• Real World Effects and Mitigation
• Architecture Components from @NetflixOSS
4. Blah Blah Blah
(I’m skipping all the cloud intro etc. Netflix
runs in the cloud, if you hadn’t figured that
out already you aren’t paying attention and
should read slideshare.net/netflix)
6. Things We Do Do…
In production
at Netflix
• Big Data/Hadoop 2009
• AWS Cloud 2009
• Application Performance Management 2010
• Integrated DevOps Practices 2010
• Continuous Integration/Delivery 2010
• NoSQL, Globally Distributed 2010
• Platform as a Service; Micro-Services 2010
• Social coding, open development/github 2011
7. How Netflix Streaming Works
(Architecture diagram) A customer device (consumer electronics: PC, PS3, TV…) follows the Browse path into the AWS Cloud, which runs the Web Site or Discovery API, User Data, Personalization, DRM, the Play/Streaming API, QoS Logging, CDN Management and Steering, and Content Encoding. The Watch path goes to CDN Edge Locations running OpenConnect CDN boxes.
8. Web Server Dependencies Flow
(Home page business transaction as seen by AppDynamics)
Each icon is three to a few hundred instances across three AWS zones. Starting at the web service, the flow fans out to memcached, Cassandra, an S3 bucket, and the personalization movie group chooser.
10. Three Balanced Availability Zones
Test with Chaos Gorilla
(Diagram) Load balancers spread traffic across Zone A, Zone B, and Zone C, each holding Cassandra and Evcache replicas.
11. Triple Replicated Persistence
Cassandra maintenance affects individual replicas
(Diagram) Load balancers spread traffic across Zone A, Zone B, and Zone C, each holding Cassandra and Evcache replicas.
12. Isolated Regions
(Diagram) US-East load balancers front Zones A, B, and C with Cassandra replicas; EU-West load balancers front their own Zones A, B, and C with Cassandra replicas.
13. Failure Modes and Effects
Failure Mode | Probability | Current Mitigation Plan
Application Failure | High | Automatic degraded response
AWS Region Failure | Low | Wait for region to recover
AWS Zone Failure | Medium | Continue to run on 2 out of 3 zones
Datacenter Failure | Medium | Migrate more functions to cloud
Data store failure | Low | Restore from S3 backups
S3 failure | Low | Restore from remote archive
Until we got really good at mitigating high and medium
probability failures, the ROI for mitigating regional
failures didn’t make sense. Getting there…
15. Run What You Wrote
• Make developers responsible for failures
– Then they learn and write code that doesn’t fail
• Use Incident Reviews to find gaps to fix
– Make sure it's not about finding “who to blame”
• Keep timeouts short, fail fast
– Don’t let cascading timeouts stack up
• Dynamic configuration options - Archaius
– http://techblog.netflix.com/2012/06/annoucing-archaius-dynamic-properties.html
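As a rough illustration (not from the deck), a service might read an Archaius dynamic property like this; the property name and default value are hypothetical:

import com.netflix.config.DynamicIntProperty;
import com.netflix.config.DynamicPropertyFactory;

public class TimeoutConfig {
    // Hypothetical property name and default. The value can be changed at
    // runtime through whatever configuration source Archaius polls, without
    // redeploying the service.
    private static final DynamicIntProperty READ_TIMEOUT_MS =
        DynamicPropertyFactory.getInstance()
            .getIntProperty("myservice.client.readTimeoutMs", 500);

    public static int readTimeoutMs() {
        // get() returns the most recently published value
        return READ_TIMEOUT_MS.get();
    }
}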
18. Edda – Configuration History
http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html
(Diagram) Edda collects and stores metadata from AWS (instances, ASGs, etc.), Eureka (service registrations), and AppDynamics (request flow); the monkeys and other tools query Edda for current and historical state.
19. Edda Query Examples
Find any instances that have ever had a specific public IP address
$ curl "http://edda/api/v2/view/instances;publicIpAddress=1.2.3.4;_since=0"
["i-0123456789","i-012345678a","i-012345678b”]
Show the most recent change to a security group
$ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2"
--- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810
+++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504
@@ -1,33 +1,33 @@
{
…
"ipRanges" : [
"10.10.1.1/32",
"10.10.1.2/32",
+ "10.10.1.3/32",
- "10.10.1.4/32"
…
}
20. Distributed Operational Model
• Developers
– Provision and run their own code in production
– Take turns to be on call if it breaks (pagerduty)
– Configure autoscalers to handle capacity needs
• DevOps and PaaS (aka NoOps)
– DevOps is used to build and run the PaaS
– PaaS constrains Dev to use automation instead
– PaaS puts more responsibility on Dev, with tools
21. Rapid Rollback
• Use a new Autoscale Group to push code
• Leave existing ASG in place, switch traffic
• If OK, auto-delete old ASG a few hours later
• If “whoops”, switch traffic back in seconds
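A sketch of the red/black push sequence above, written against a hypothetical CloudDeployer interface rather than the real Asgard/AWS APIs; all method names are illustrative:

// Hypothetical interface standing in for the Asgard/AWS autoscaling calls.
interface CloudDeployer {
    String createAsgFromAmi(String appName, String amiId); // returns new ASG name
    void attachToElb(String asgName);                      // start taking traffic
    void detachFromElb(String asgName);                    // stop taking traffic
    boolean looksHealthy(String asgName);                  // canary / error-rate check
    void deleteAsgLater(String asgName, long delayMillis); // scheduled cleanup
}

class RedBlackPush {
    static final long FEW_HOURS = 3L * 60 * 60 * 1000;

    void push(CloudDeployer cloud, String app, String newAmi, String oldAsg) {
        String newAsg = cloud.createAsgFromAmi(app, newAmi); // new ASG alongside the old one
        cloud.attachToElb(newAsg);
        cloud.detachFromElb(oldAsg);                         // switch traffic; old ASG keeps running
        if (cloud.looksHealthy(newAsg)) {
            cloud.deleteAsgLater(oldAsg, FEW_HOURS);         // auto-delete old ASG a few hours later
        } else {
            cloud.attachToElb(oldAsg);                       // "whoops": traffic back in seconds
            cloud.detachFromElb(newAsg);
        }
    }
}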
25. Zone Failure Modes
• Power Outage
– Instances lost, ephemeral state lost
– Clean break and recovery, fail fast, “no route to host”
• Network Outage
– Instances isolated, state inconsistent
– More complex symptoms, recovery issues, transients
• Dependent Service Outage
– Cascading failures, misbehaving instances, human errors
– Confusing symptoms, recovery issues, byzantine effects
26. Zone Power Failure
• June 29, 2012 AWS US-East - The Big Storm
– http://aws.amazon.com/message/67457/
– http://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.html
• Highlights
– One of 10+ US-East datacenters failed generator startup
– UPS depleted -> 10min power outage for 7% of instances
• Result
– Netflix lost power to most of a zone, evacuated the zone
– Small/brief user impact due to errors and retries
27. Zone Failure Modes
(Diagram) The two-region picture (US-East and EU-West load balancers over Zones A, B, and C with Cassandra replicas), annotated with the three zone-level failures: zone power outage, zone network outage, and zone dependent service outage.
28. Regional Failure Modes
• Network Failure Takes Region Offline
– DNS configuration errors
– Bugs and configuration errors in routers
– Network capacity overload
• Control Plane Overload Affecting Entire Region
– Consequence of other outages
– Lose control of the remaining zones' infrastructure
– Cascading service failure, hard to diagnose
29. Regional Control Plane Overload
• April 2011 – “The big EBS Outage”
– http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
– Human error during network upgrade triggered cascading failure
– Zone level failure, with brief regional control plane overload
• Netflix Infrastructure Impact
– Instances in one zone hung and could not launch replacements
– Overload prevented other zones from launching instances
– Some MySQL slaves offline for a few days
• Netflix Customer Visible Impact
– Higher latencies for a short time
– Higher error rates for a short time
– Outage was at a low traffic level time, so no capacity issues
30. Regional Failure Modes
(Diagram) The same two-region picture, annotated with a regional network outage and a regional control plane overload.
31. Dependent Services Failure
• June 29, 2012 AWS US-East - The Big Storm
– Power failure recovery overloaded EBS storage service
– Backlog of instance startups using EBS root volumes
• ELB (Load Balancer) Impacted
– ELB instances couldn’t scale because EBS was backlogged
– ELB control plane also became backlogged
• Mitigation Plans Mentioned
– Multiple control plane request queues to isolate backlog
– Rapid DNS based traffic shifting between zones
32. Application Routing Failure
June 29, 2012 AWS US-East - The Big Storm
The Eureka service directory failed to mark down dead instances due to a configuration error, so applications not using zone-aware routing kept trying to talk to the dead instances and timing out.
Effect: higher latency and errors
Mitigation: fixed the configuration, and made zone-aware routing the default
(Diagram) Two-region picture with a zone power outage in US-East.
33. Dec 24th 2012
Partial Regional ELB Outage
(Diagram) Two-region picture; the US-East regional load balancer layer is affected.
• ELB (Load Balancer) Impacted
– ELB control plane database state accidentally corrupted
– Hours to detect, hours to restore from backups
• Mitigation Plans Mentioned
– Tighter process for access to control plane
– Better zone isolation
34. Global Failure Modes
• Software Bugs
– Externally triggered (e.g. leap year/leap second)
– Memory leaks and other delayed action failures
• Global configuration errors
– Usually human error
– Both infrastructure and application level
• Cascading capacity overload
– Customers migrating away from a failure
– Lack of cross region service isolation
35. Global Software Bug Outages
• AWS S3 Global Outage in 2008
– Gossip protocol propagated errors worldwide
– No data loss, but service offline for up to 9hrs
– Extra error detection fixes, no big issues since
• Microsoft Azure Leap Day Outage in 2012
– Bug failed to generate certificates ending 2/29/13
– Failure to launch new instances for up to 13hrs
– One line code fix.
• Netflix Configuration Error in 2012
– Global property updated to broken value
– Streaming stopped worldwide for ~1hr until we changed back
– Fix planned to keep history of properties for quick rollback
36. Global Failure Modes
Cascading Capacity Overload
(Diagram) Software bugs and global configuration errors hit one region; capacity demand migrates to the other region and overloads it in turn ("Oops…").
38. Managing Multi-Region Availability
(Diagram) DNS services (AWS Route53, DynECT, UltraDNS) steer users to regional load balancers, each fronting Zones A, B, and C with Cassandra replicas.
What we need is a portable way to manage multiple DNS providers….
39. Denominator
“The next version is more portable…” for DNS
(Diagram) Use cases such as Edda and multi-region failover sit on a common model, Denominator, which drives DNS vendor plug-ins: AWS Route53 (IAM key auth, REST), DynECT (user/pwd, REST), UltraDNS (user/pwd, SOAP), etc. The vendor API models are varied and mostly broken.
Currently being built by Adrian Cole (the jClouds guy, he works for Netflix now…)
41. Micro-Service Pattern
(Diagram) Many different single-function REST clients call a stateless data access REST service built on the Astyanax Cassandra client, which talks to a single-function Cassandra cluster managed by Priam (between 6 and 72 nodes). One keyspace replaces a single table or materialized view, with an optional datacenter update flow. Each icon represents a horizontally scaled service of three to hundreds of instances deployed over three availability zones; AppDynamics provides service flow visualization.
42. Stateless Micro-Service Architecture
(Instance stack) Linux Base AMI (CentOS or Ubuntu); optional Apache frontend, memcached, and non-Java apps; Java (JDK 6 or 7); Tomcat running the application war file, base servlet, platform and client interface jars, and Astyanax; healthcheck and status servlets, JMX interface, and Servo autoscale; monitoring via the AppDynamics appagent and machineagent plus Epic/Atlas; log rotation to S3; GC and thread dump logging.
43. Astyanax
Available at http://github.com/netflix
• Features
– Complete abstraction of connection pool from RPC protocol
– Fluent Style API
– Operation retry with backoff
– Token aware
• Recipes
– Distributed row lock (without zookeeper)
– Multi-DC row lock
– Uniqueness constraint
– Multi-row uniqueness constraint
– Chunked and multi-threaded large file storage
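As a rough sketch (not from the deck), the keyspace handle and column family used in the next slide's query example might be set up along these lines; the cluster name, keyspace, seeds, and pool settings are hypothetical, and the context accessor is getClient() in recent Astyanax releases (getEntity() in older ones):

import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.serializers.StringSerializer;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class CassandraAccess {
    // Column family with String row keys and String column names
    static final ColumnFamily<String, String> CF_STANDARD1 =
        new ColumnFamily<String, String>("Standard1",
            StringSerializer.get(), StringSerializer.get());

    static Keyspace connect() {
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
            .forCluster("TestCluster")                       // hypothetical cluster name
            .forKeyspace("TestKeyspace")                     // hypothetical keyspace
            .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE))
            .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("MyPool")
                .setPort(9160)
                .setMaxConnsPerHost(3)
                .setSeeds("127.0.0.1:9160"))                 // hypothetical seed node
            .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
            .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();
        return context.getClient();
    }
}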
44. Astyanax Query Example
Paginate through all columns in a row
ColumnList<String> columns;
int pageSize = 10;
try {
    RowQuery<String, String> query = keyspace
        .prepareQuery(CF_STANDARD1)
        .getKey("A")
        .setIsPaginating()
        .withColumnRange(new RangeBuilder().setMaxSize(pageSize).build());
    while (!(columns = query.execute().getResult()).isEmpty()) {
        for (Column<String> c : columns) {
            // process each column in the current page
        }
    }
} catch (ConnectionException e) {
    // handle connection or timeout errors from the Cassandra cluster
}
45. Astyanax - Cassandra Write Data Flows
Single Region, Multiple Availability Zone, Token Aware
1. Token-aware client writes to the local Cassandra coordinator
2. Coordinator writes to the other zones
3. Nodes return acks
4. Data written to internal commit log disks (no more than 10 seconds later)
If a node goes offline, hinted handoff completes the write when the node comes back up.
Requests can choose to wait for one node, a quorum, or all nodes to ack the write.
SSTable disk writes and compactions occur asynchronously.
(Diagram shows Cassandra nodes with local disks across Zones A, B, and C.)
46. Data Flows for Multi-Region Writes
Token Aware, Consistency Level = Local Quorum
1. Client writes to local replicas
2. Local write acks returned to the client, which continues when 2 of 3 local nodes are committed
3. Local coordinator writes to the remote coordinator (100+ms latency between regions)
4. When data arrives, the remote coordinator node acks and copies to the other remote zones
5. Remote nodes ack to the local coordinator
6. Data flushed to internal commit log disks (no more than 10 seconds later)
If a node or region goes offline, hinted handoff completes the write when the node comes back up. Nightly global compare and repair jobs ensure everything stays consistent.
(Diagram shows US and EU clients writing to Cassandra replicas across Zones A, B, and C in each region.)
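As an illustrative sketch (not from the deck), a write at LOCAL_QUORUM with the Astyanax client might look like this, reusing the hypothetical keyspace and CF_STANDARD1 from the earlier setup sketch; the row key and column name are made up:

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.ConsistencyLevel;

public class ViewingStateWriter {
    // The call returns once a local quorum (2 of 3 local replicas) has
    // committed; the remote region catches up asynchronously.
    static void saveViewingPosition(Keyspace keyspace, String memberId, String position)
            throws ConnectionException {
        MutationBatch m = keyspace.prepareMutationBatch();
        m.setConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM);
        m.withRow(CassandraAccess.CF_STANDARD1, memberId)   // hypothetical row key
            .putColumn("viewingPosition", position, null);  // hypothetical column, no TTL
        m.execute();
    }
}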
47. Cassandra Instance Architecture
(Instance stack) Linux Base AMI (CentOS or Ubuntu); Java (JDK 7) running the Cassandra server plus Tomcat and Priam (healthcheck and status); monitoring via the AppDynamics appagent and machineagent plus Epic/Atlas; GC and thread dump logging; local ephemeral disk space – 2TB of SSD or 1.6TB disk holding the commit log and SSTables.
48. Priam – Cassandra Automation
Available at http://github.com/netflix
• Netflix Platform Tomcat Code
• Zero touch auto-configuration
• State management for Cassandra JVM
• Token allocation and assignment
• Broken node auto-replacement
• Full and incremental backup to S3
• Restore sequencing from S3
• Grow/Shrink Cassandra “ring”
49. ETL for Cassandra
• Data is de-normalized over many clusters!
• Too many to restore from backups for ETL
• Solution – read backup files using Hadoop
• Aegisthus
– http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html
– High throughput raw SSTable processing
– Re-normalizes many clusters to a consistent view
– Extract, Transform, then Load into Teradata
54. Three Questions
Why is Netflix doing this?
How does it all fit together?
What is coming next?
55. Netflix Deconstructed
Content as a Service on a Platform
(Diagram) Exclusive, extensive content provides the long-term strategic barriers to competition. The personalized service has to be easy to use, agile, reliable, scalable, and secure. The low-cost, global platform enables the business, but doesn't differentiate Netflix against large competitors.
56. Platform Evolution
2009-2010: Bleeding Edge Innovation
2011-2012: Common Pattern
2013-2014: Shared Pattern
Netflix ended up several years ahead of the industry, but it's not a sustainable position
57. Making it easy to follow
Exploring the wild west each time vs. laying down a shared route
58. Goals
• Establish our solutions as Best Practices / Standards
• Hire, Retain and Engage Top Engineers
• Build up Netflix Technology Brand
• Benefit from a shared ecosystem
61. Our Current Catalog of Releases
Free code available at http://netflix.github.com
62. Open Source Projects
Legend: Github / Techblog, Apache Contributions, Techblog Post, Coming Soon
• Priam – Cassandra as a Service
• Astyanax – Cassandra client for Java
• CassJMeter – Cassandra test suite
• Cassandra – Multi-region EC2 datastore support
• Aegisthus – Hadoop ETL for Cassandra
• Exhibitor – Zookeeper as a Service
• Curator – Zookeeper Patterns
• EVCache – Memcached as a Service
• Eureka / Discovery – Service Directory
• Archaius – Dynamic Properties
• Edda – Config state with history
• Denominator – portable DNS management (announced today)
• Genie – Hadoop PaaS
• Hystrix – Robust service pattern
• RxJava – Reactive Patterns
• Asgard – AutoScaleGroup based AWS console
• Odin – Orchestration for Asgard
• Ribbon – REST Client + mid-tier LB
• Karyon – Instrumented REST Base Server
• Governator – Library lifecycle and dependency injection
• Blitz4j – Async logging
• Servo and Autoscaling Scripts
• Bakeries and AMI
• Explorers
• Chaos Monkey – Robustness verification
• Latency Monkey
• Janitor Monkey
63. NetflixOSS Continuous Build and Deployment
(Diagram) NetflixOSS source on Github is built by Jenkins using Dynaslave AWS build slaves and published to Maven Central; the AWS Bakery turns the base AMI into baked AMIs; Asgard (+ Frigga), driven through its console and API with Odin orchestration, deploys them into the AWS account.
64. NetflixOSS Services Scope
(Diagram) Within an AWS account: the Asgard console, Archaius config service, cross-region Priam C*, Eureka registry, Explorers dashboards, Exhibitor ZK, and Genie Hadoop services. Within each of multiple AWS regions and its 3 AWS zones: application clusters in autoscale groups of instances, Priam-managed Cassandra for persistent storage, Evcache memcached for ephemeral storage, Atlas monitoring, Edda history, and the Simian Army.
65. NetflixOSS Instance Libraries
Initialization
• Baked AMI – Tomcat, Apache, your code
• Governator – Guice based dependency injection
• Archaius – dynamic configuration properties client
• Eureka – service registration client
Service
• Karyon – Base Server for inbound requests
• RxJava – Reactive pattern
• Hystrix/Turbine – dependencies and real-time status
Requests
• Ribbon – REST Client for outbound calls
Data Access
• Astyanax – Cassandra client and pattern library
• Evcache – Zone aware Memcached client
• Curator – Zookeeper patterns
Logging
• Blitz4j – non-blocking logging
• Servo – metrics export for autoscaling
• Atlas – high volume instrumentation
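As a rough sketch (not from the deck) of the robust service pattern Hystrix provides, an outbound dependency call might be wrapped like this; the command, group key, and fallback value are hypothetical:

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class GetRecommendationsCommand extends HystrixCommand<String> {
    private final String memberId;

    public GetRecommendationsCommand(String memberId) {
        super(HystrixCommandGroupKey.Factory.asKey("RecommendationService"));
        this.memberId = memberId;
    }

    @Override
    protected String run() {
        // The real outbound call (e.g. via Ribbon) would go here; Hystrix runs
        // it with a timeout, a bounded thread pool, and a circuit breaker.
        return callRemoteService(memberId);
    }

    @Override
    protected String getFallback() {
        // Automatic degraded response when the dependency fails or times out
        return "generic-popular-titles";
    }

    private String callRemoteService(String memberId) {
        // Hypothetical placeholder for the real REST call
        return "personalized-titles-for-" + memberId;
    }
}

A caller would invoke new GetRecommendationsCommand(id).execute() and get back either the live result or the fallback.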
66. NetflixOSS Testing and Automation
Test Tools
• CassJMeter – load testing for C*
• Circus Monkey – test rebalancing
Maintenance
• Janitor Monkey
• Efficiency Monkey
• Doctor Monkey
• Howler Monkey
Availability
• Chaos Monkey – Instances
• Chaos Gorilla – Availability Zones
• Chaos Kong – Regions
• Latency Monkey – latency and error injection
Security
• Security Monkey
• Conformity Monkey
67. What’s Coming Next?
More Features
• Better portability
• Higher availability
• Easier to deploy
• Contributions from end users
• Contributions from vendors
More Use Cases
68. Functionality and scale now, portability coming
Moving from parts to a platform in 2013
Netflix is fostering an ecosystem
Rapid Evolution - Low MTBIAMSH
(Mean Time Between Idea And Making Stuff Happen)
69. Takeaway
Netflix has built and deployed a scalable global and highly available
Platform as a Service.
We encourage you to adopt or extend the NetflixOSS platform
ecosystem.
http://netflix.github.com
http://techblog.netflix.com
http://slideshare.net/Netflix
http://www.linkedin.com/in/adriancockcroft
@adrianco #netflixcloud @NetflixOSS
Editor's Notes
Content, delivered by a service running on a platform. However, our much larger competitors also have the same platform advantages.
When Netflix first moved to cloud it was bleeding edge innovation; we figured stuff out and made stuff up from first principles. Over the last two years more large companies have moved to cloud, and the principles, practices, and patterns have become better understood and adopted. At this point there is intense interest in how Netflix runs in the cloud, and several forward-looking organizations are adopting our architectures and starting to use some of the code we have shared. Over the coming years, we want to make it easier for people to share the patterns we use.
The railroad made it possible for California to be developed quickly; by creating an easy-to-follow path we can create a much bigger ecosystem around the Netflix platform.
We have shared parts of our platform bit by bit through the year, and it's starting to get traction now.
The genre box shots were chosen because we have rights to use them; we are starting to make specific logos for each project going forward.