Slides presented at JPL February 2013. An updated version of the Re:Invent slides with NetflixOSS description. Black background for a change to help projector contrast.
High Availability Architecture and NetflixOSS
1. Highly Available Architecture at
Netflix
JPL - February 2013
Adrian Cockcroft
@adrianco #netflixcloud @NetflixOSS
http://www.linkedin.com/in/adriancockcroft
2. Netflix Inc.
Netflix is the world’s leading Internet television network
with more than 33 million members in 40 countries
enjoying more than one billion hours of TV shows and
movies per month, including original series.
Source: http://ir.netflix.com
3. Abstract
• Highly Available Architecture
• Taxonomy of Outage Failure Modes
• Real World Effects and Mitigation
• Architecture Components from @NetflixOSS
4. Blah Blah Blah
(I’m skipping all the cloud intro etc. Netflix
runs in the cloud, if you hadn’t figured that
out already you aren’t paying attention and
should read slideshare.net/netflix)
6. Things We Do Do…
In production
at Netflix
• Big Data/Hadoop 2009
• AWS Cloud 2009
• Application Performance Management 2010
• Integrated DevOps Practices 2010
• Continuous Integration/Delivery 2010
• NoSQL, Globally Distributed 2010
• Platform as a Service; Micro-Services 2010
• Social coding, open development/github 2011
7. How Netflix Streaming Works
(Architecture diagram) A customer device (consumer electronics: PC, PS3, TV…) follows the Browse path into the AWS Cloud, which runs the Web Site or Discovery API, User Data, Personalization, DRM, the Play/Streaming API, QoS Logging, CDN Management and Steering, and Content Encoding. The Watch path goes to CDN Edge Locations running OpenConnect CDN boxes.
8. Web Server Dependencies Flow
(Home page business transaction as seen by AppDynamics)
Each icon is three to a few hundred instances across three AWS zones. Starting at the web service, the flow fans out to memcached, Cassandra, an S3 bucket, and the personalization movie group chooser.
10. Three Balanced Availability Zones
Test with Chaos Gorilla
(Diagram) Load balancers spread traffic across Zone A, Zone B, and Zone C, each holding Cassandra and Evcache replicas.
11. Triple Replicated Persistence
Cassandra maintenance affects individual replicas
(Diagram) Load balancers spread traffic across Zone A, Zone B, and Zone C, each holding Cassandra and Evcache replicas.
12. Isolated Regions
(Diagram) US-East load balancers front Zones A, B, and C with Cassandra replicas; EU-West load balancers front their own Zones A, B, and C with Cassandra replicas.
13. Failure Modes and Effects
Failure Mode | Probability | Current Mitigation Plan
Application Failure | High | Automatic degraded response
AWS Region Failure | Low | Wait for region to recover
AWS Zone Failure | Medium | Continue to run on 2 out of 3 zones
Datacenter Failure | Medium | Migrate more functions to cloud
Data store failure | Low | Restore from S3 backups
S3 failure | Low | Restore from remote archive
Until we got really good at mitigating high and medium
probability failures, the ROI for mitigating regional
failures didn’t make sense. Getting there…
15. Run What You Wrote
• Make developers responsible for failures
– Then they learn and write code that doesn’t fail
• Use Incident Reviews to find gaps to fix
– Make sure it's not about finding “who to blame”
• Keep timeouts short, fail fast
– Don’t let cascading timeouts stack up
• Dynamic configuration options - Archaius
– http://techblog.netflix.com/2012/06/annoucing-archaius-dynamic-properties.html
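As a rough illustration (not from the deck), a service might read an Archaius dynamic property like this; the property name and default value are hypothetical:

import com.netflix.config.DynamicIntProperty;
import com.netflix.config.DynamicPropertyFactory;

public class TimeoutConfig {
    // Hypothetical property name and default. The value can be changed at
    // runtime through whatever configuration source Archaius polls, without
    // redeploying the service.
    private static final DynamicIntProperty READ_TIMEOUT_MS =
        DynamicPropertyFactory.getInstance()
            .getIntProperty("myservice.client.readTimeoutMs", 500);

    public static int readTimeoutMs() {
        // get() returns the most recently published value
        return READ_TIMEOUT_MS.get();
    }
}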
18. Edda – Configuration History
http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html
(Diagram) Edda collects and stores metadata from AWS (instances, ASGs, etc.), Eureka (service registrations), and AppDynamics (request flow); the monkeys and other tools query Edda for current and historical state.
19. Edda Query Examples
Find any instances that have ever had a specific public IP address
$ curl "http://edda/api/v2/view/instances;publicIpAddress=1.2.3.4;_since=0"
["i-0123456789","i-012345678a","i-012345678b”]
Show the most recent change to a security group
$ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2"
--- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810
+++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504
@@ -1,33 +1,33 @@
{
…
"ipRanges" : [
"10.10.1.1/32",
"10.10.1.2/32",
+ "10.10.1.3/32",
- "10.10.1.4/32"
…
}
20. Distributed Operational Model
• Developers
– Provision and run their own code in production
– Take turns to be on call if it breaks (pagerduty)
– Configure autoscalers to handle capacity needs
• DevOps and PaaS (aka NoOps)
– DevOps is used to build and run the PaaS
– PaaS constrains Dev to use automation instead
– PaaS puts more responsibility on Dev, with tools
21. Rapid Rollback
• Use a new Autoscale Group to push code
• Leave existing ASG in place, switch traffic
• If OK, auto-delete old ASG a few hours later
• If “whoops”, switch traffic back in seconds
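A sketch of the red/black push sequence above, written against a hypothetical CloudDeployer interface rather than the real Asgard/AWS APIs; all method names are illustrative:

// Hypothetical interface standing in for the Asgard/AWS autoscaling calls.
interface CloudDeployer {
    String createAsgFromAmi(String appName, String amiId); // returns new ASG name
    void attachToElb(String asgName);                      // start taking traffic
    void detachFromElb(String asgName);                    // stop taking traffic
    boolean looksHealthy(String asgName);                  // canary / error-rate check
    void deleteAsgLater(String asgName, long delayMillis); // scheduled cleanup
}

class RedBlackPush {
    static final long FEW_HOURS = 3L * 60 * 60 * 1000;

    void push(CloudDeployer cloud, String app, String newAmi, String oldAsg) {
        String newAsg = cloud.createAsgFromAmi(app, newAmi); // new ASG alongside the old one
        cloud.attachToElb(newAsg);
        cloud.detachFromElb(oldAsg);                         // switch traffic; old ASG keeps running
        if (cloud.looksHealthy(newAsg)) {
            cloud.deleteAsgLater(oldAsg, FEW_HOURS);         // auto-delete old ASG a few hours later
        } else {
            cloud.attachToElb(oldAsg);                       // "whoops": traffic back in seconds
            cloud.detachFromElb(newAsg);
        }
    }
}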
25. Zone Failure Modes
• Power Outage
– Instances lost, ephemeral state lost
– Clean break and recovery, fail fast, “no route to host”
• Network Outage
– Instances isolated, state inconsistent
– More complex symptoms, recovery issues, transients
• Dependent Service Outage
– Cascading failures, misbehaving instances, human errors
– Confusing symptoms, recovery issues, byzantine effects
26. Zone Power Failure
• June 29, 2012 AWS US-East - The Big Storm
– http://aws.amazon.com/message/67457/
– http://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.html
• Highlights
– One of 10+ US-East datacenters failed generator startup
– UPS depleted -> 10min power outage for 7% of instances
• Result
– Netflix lost power to most of a zone, evacuated the zone
– Small/brief user impact due to errors and retries
27. Zone Failure Modes
(Diagram) The two-region picture (US-East and EU-West load balancers over Zones A, B, and C with Cassandra replicas), annotated with the three zone-level failures: zone power outage, zone network outage, and zone dependent service outage.
28. Regional Failure Modes
• Network Failure Takes Region Offline
– DNS configuration errors
– Bugs and configuration errors in routers
– Network capacity overload
• Control Plane Overload Affecting Entire Region
– Consequence of other outages
– Lose control of the remaining zones' infrastructure
– Cascading service failure, hard to diagnose
29. Regional Control Plane Overload
• April 2011 – “The big EBS Outage”
– http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
– Human error during network upgrade triggered cascading failure
– Zone level failure, with brief regional control plane overload
• Netflix Infrastructure Impact
– Instances in one zone hung and could not launch replacements
– Overload prevented other zones from launching instances
– Some MySQL slaves offline for a few days
• Netflix Customer Visible Impact
– Higher latencies for a short time
– Higher error rates for a short time
– Outage was at a low traffic level time, so no capacity issues
30. Regional Failure Modes
(Diagram) The same two-region picture, annotated with a regional network outage and a regional control plane overload.
31. Dependent Services Failure
• June 29, 2012 AWS US-East - The Big Storm
– Power failure recovery overloaded EBS storage service
– Backlog of instance startups using EBS root volumes
• ELB (Load Balancer) Impacted
– ELB instances couldn’t scale because EBS was backlogged
– ELB control plane also became backlogged
• Mitigation Plans Mentioned
– Multiple control plane request queues to isolate backlog
– Rapid DNS based traffic shifting between zones
32. Application Routing Failure
June 29, 2012 AWS US-East - The Big Storm
The Eureka service directory failed to mark down dead instances due to a configuration error, so applications not using zone-aware routing kept trying to talk to the dead instances and timing out.
Effect: higher latency and errors
Mitigation: fixed the configuration, and made zone-aware routing the default
(Diagram) Two-region picture with a zone power outage in US-East.
33. Dec 24th 2012
Partial Regional ELB Outage
(Diagram) Two-region picture; the US-East regional load balancer layer is affected.
• ELB (Load Balancer) Impacted
– ELB control plane database state accidentally corrupted
– Hours to detect, hours to restore from backups
• Mitigation Plans Mentioned
– Tighter process for access to control plane
– Better zone isolation
34. Global Failure Modes
• Software Bugs
– Externally triggered (e.g. leap year/leap second)
– Memory leaks and other delayed action failures
• Global configuration errors
– Usually human error
– Both infrastructure and application level
• Cascading capacity overload
– Customers migrating away from a failure
– Lack of cross region service isolation
35. Global Software Bug Outages
• AWS S3 Global Outage in 2008
– Gossip protocol propagated errors worldwide
– No data loss, but service offline for up to 9hrs
– Extra error detection fixes, no big issues since
• Microsoft Azure Leap Day Outage in 2012
– Bug failed to generate certificates ending 2/29/13
– Failure to launch new instances for up to 13hrs
– One line code fix.
• Netflix Configuration Error in 2012
– Global property updated to broken value
– Streaming stopped worldwide for ~1hr until we changed back
– Fix planned to keep history of properties for quick rollback
36. Global Failure Modes
Cascading Capacity Overload
(Diagram) Software bugs and global configuration errors hit one region; capacity demand migrates to the other region and overloads it in turn ("Oops…").
38. Managing Multi-Region Availability
(Diagram) DNS services (AWS Route53, DynECT, UltraDNS) steer users to regional load balancers, each fronting Zones A, B, and C with Cassandra replicas.
What we need is a portable way to manage multiple DNS providers….
39. Denominator
“The next version is more portable…” for DNS
(Diagram) Use cases such as Edda and multi-region failover sit on a common model, Denominator, which drives DNS vendor plug-ins: AWS Route53 (IAM key auth, REST), DynECT (user/pwd, REST), UltraDNS (user/pwd, SOAP), etc. The vendor API models are varied and mostly broken.
Currently being built by Adrian Cole (the jClouds guy, he works for Netflix now…)
41. Micro-Service Pattern
(Diagram) Many different single-function REST clients call a stateless data access REST service built on the Astyanax Cassandra client, which talks to a single-function Cassandra cluster managed by Priam (between 6 and 72 nodes). One keyspace replaces a single table or materialized view, with an optional datacenter update flow. Each icon represents a horizontally scaled service of three to hundreds of instances deployed over three availability zones; AppDynamics provides service flow visualization.
42. Stateless Micro-Service Architecture
(Instance stack) Linux Base AMI (CentOS or Ubuntu); optional Apache frontend, memcached, and non-Java apps; Java (JDK 6 or 7); Tomcat running the application war file, base servlet, platform and client interface jars, and Astyanax; healthcheck and status servlets, JMX interface, and Servo autoscale; monitoring via the AppDynamics appagent and machineagent plus Epic/Atlas; log rotation to S3; GC and thread dump logging.
43. Astyanax
Available at http://github.com/netflix
• Features
– Complete abstraction of connection pool from RPC protocol
– Fluent Style API
– Operation retry with backoff
– Token aware
• Recipes
– Distributed row lock (without zookeeper)
– Multi-DC row lock
– Uniqueness constraint
– Multi-row uniqueness constraint
– Chunked and multi-threaded large file storage
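As a rough sketch (not from the deck), the keyspace handle and column family used in the next slide's query example might be set up along these lines; the cluster name, keyspace, seeds, and pool settings are hypothetical, and the context accessor is getClient() in recent Astyanax releases (getEntity() in older ones):

import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.serializers.StringSerializer;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class CassandraAccess {
    // Column family with String row keys and String column names
    static final ColumnFamily<String, String> CF_STANDARD1 =
        new ColumnFamily<String, String>("Standard1",
            StringSerializer.get(), StringSerializer.get());

    static Keyspace connect() {
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
            .forCluster("TestCluster")                       // hypothetical cluster name
            .forKeyspace("TestKeyspace")                     // hypothetical keyspace
            .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE))
            .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("MyPool")
                .setPort(9160)
                .setMaxConnsPerHost(3)
                .setSeeds("127.0.0.1:9160"))                 // hypothetical seed node
            .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
            .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();
        return context.getClient();
    }
}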
44. Astyanax Query Example
Paginate through all columns in a row
ColumnList<String> columns;
int pageSize = 10;
try {
    RowQuery<String, String> query = keyspace
        .prepareQuery(CF_STANDARD1)
        .getKey("A")
        .setIsPaginating()
        .withColumnRange(new RangeBuilder().setMaxSize(pageSize).build());
    while (!(columns = query.execute().getResult()).isEmpty()) {
        for (Column<String> c : columns) {
            // process each column in the current page
        }
    }
} catch (ConnectionException e) {
    // handle connection or timeout errors from the Cassandra cluster
}
45. Astyanax - Cassandra Write Data Flows
Single Region, Multiple Availability Zone, Token Aware
1. Token-aware client writes to the local Cassandra coordinator
2. Coordinator writes to the other zones
3. Nodes return acks
4. Data written to internal commit log disks (no more than 10 seconds later)
If a node goes offline, hinted handoff completes the write when the node comes back up.
Requests can choose to wait for one node, a quorum, or all nodes to ack the write.
SSTable disk writes and compactions occur asynchronously.
(Diagram shows Cassandra nodes with local disks across Zones A, B, and C.)
46. Data Flows for Multi-Region Writes
Token Aware, Consistency Level = Local Quorum
1. Client writes to local replicas
2. Local write acks returned to the client, which continues when 2 of 3 local nodes are committed
3. Local coordinator writes to the remote coordinator (100+ms latency between regions)
4. When data arrives, the remote coordinator node acks and copies to the other remote zones
5. Remote nodes ack to the local coordinator
6. Data flushed to internal commit log disks (no more than 10 seconds later)
If a node or region goes offline, hinted handoff completes the write when the node comes back up. Nightly global compare and repair jobs ensure everything stays consistent.
(Diagram shows US and EU clients writing to Cassandra replicas across Zones A, B, and C in each region.)
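As an illustrative sketch (not from the deck), a write at LOCAL_QUORUM with the Astyanax client might look like this, reusing the hypothetical keyspace and CF_STANDARD1 from the earlier setup sketch; the row key and column name are made up:

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.ConsistencyLevel;

public class ViewingStateWriter {
    // The call returns once a local quorum (2 of 3 local replicas) has
    // committed; the remote region catches up asynchronously.
    static void saveViewingPosition(Keyspace keyspace, String memberId, String position)
            throws ConnectionException {
        MutationBatch m = keyspace.prepareMutationBatch();
        m.setConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM);
        m.withRow(CassandraAccess.CF_STANDARD1, memberId)   // hypothetical row key
            .putColumn("viewingPosition", position, null);  // hypothetical column, no TTL
        m.execute();
    }
}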
47. Cassandra Instance Architecture
(Instance stack) Linux Base AMI (CentOS or Ubuntu); Java (JDK 7) running the Cassandra server plus Tomcat and Priam (healthcheck and status); monitoring via the AppDynamics appagent and machineagent plus Epic/Atlas; GC and thread dump logging; local ephemeral disk space – 2TB of SSD or 1.6TB disk holding the commit log and SSTables.
48. Priam – Cassandra Automation
Available at http://github.com/netflix
• Netflix Platform Tomcat Code
• Zero touch auto-configuration
• State management for Cassandra JVM
• Token allocation and assignment
• Broken node auto-replacement
• Full and incremental backup to S3
• Restore sequencing from S3
• Grow/Shrink Cassandra “ring”
49. ETL for Cassandra
• Data is de-normalized over many clusters!
• Too many to restore from backups for ETL
• Solution – read backup files using Hadoop
• Aegisthus
– http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html
– High throughput raw SSTable processing
– Re-normalizes many clusters to a consistent view
– Extract, Transform, then Load into Teradata
54. Three Questions
Why is Netflix doing this?
How does it all fit together?
What is coming next?
55. Netflix Deconstructed
Content as a Service on a Platform
(Diagram) Exclusive, extensive content provides the long-term strategic barriers to competition. The personalized service has to be easy to use, agile, reliable, scalable, and secure. The low-cost, global platform enables the business, but doesn't differentiate Netflix against large competitors.
56. Platform Evolution
2009-2010: Bleeding Edge Innovation
2011-2012: Common Pattern
2013-2014: Shared Pattern
Netflix ended up several years ahead of the industry, but it's not a sustainable position
57. Making it easy to follow
Exploring the wild west each time vs. laying down a shared route
58. Goals
• Establish our solutions as Best Practices / Standards
• Hire, Retain and Engage Top Engineers
• Build up Netflix Technology Brand
• Benefit from a shared ecosystem
61. Our Current Catalog of Releases
Free code available at http://netflix.github.com
62. Open Source Projects
Legend: Github / Techblog, Apache Contributions, Techblog Post, Coming Soon
• Priam – Cassandra as a Service
• Astyanax – Cassandra client for Java
• CassJMeter – Cassandra test suite
• Cassandra – Multi-region EC2 datastore support
• Aegisthus – Hadoop ETL for Cassandra
• Exhibitor – Zookeeper as a Service
• Curator – Zookeeper Patterns
• EVCache – Memcached as a Service
• Eureka / Discovery – Service Directory
• Archaius – Dynamic Properties
• Edda – Config state with history
• Denominator – portable DNS management (announced today)
• Genie – Hadoop PaaS
• Hystrix – Robust service pattern
• RxJava – Reactive Patterns
• Asgard – AutoScaleGroup based AWS console
• Odin – Orchestration for Asgard
• Ribbon – REST Client + mid-tier LB
• Karyon – Instrumented REST Base Server
• Governator – Library lifecycle and dependency injection
• Blitz4j – Async logging
• Servo and Autoscaling Scripts
• Bakeries and AMI
• Explorers
• Chaos Monkey – Robustness verification
• Latency Monkey
• Janitor Monkey
63. NetflixOSS Continuous Build and Deployment
(Diagram) NetflixOSS source on Github is built by Jenkins using Dynaslave AWS build slaves and published to Maven Central; the AWS Bakery turns the base AMI into baked AMIs; Asgard (+ Frigga), driven through its console and API with Odin orchestration, deploys them into the AWS account.
64. NetflixOSS Services Scope
(Diagram) Within an AWS account: the Asgard console, Archaius config service, cross-region Priam C*, Eureka registry, Explorers dashboards, Exhibitor ZK, and Genie Hadoop services. Within each of multiple AWS regions and its 3 AWS zones: application clusters in autoscale groups of instances, Priam-managed Cassandra for persistent storage, Evcache memcached for ephemeral storage, Atlas monitoring, Edda history, and the Simian Army.
65. NetflixOSS Instance Libraries
Initialization
• Baked AMI – Tomcat, Apache, your code
• Governator – Guice based dependency injection
• Archaius – dynamic configuration properties client
• Eureka – service registration client
Service
• Karyon – Base Server for inbound requests
• RxJava – Reactive pattern
• Hystrix/Turbine – dependencies and real-time status
Requests
• Ribbon – REST Client for outbound calls
Data Access
• Astyanax – Cassandra client and pattern library
• Evcache – Zone aware Memcached client
• Curator – Zookeeper patterns
Logging
• Blitz4j – non-blocking logging
• Servo – metrics export for autoscaling
• Atlas – high volume instrumentation
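As a rough sketch (not from the deck) of the robust service pattern Hystrix provides, an outbound dependency call might be wrapped like this; the command, group key, and fallback value are hypothetical:

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class GetRecommendationsCommand extends HystrixCommand<String> {
    private final String memberId;

    public GetRecommendationsCommand(String memberId) {
        super(HystrixCommandGroupKey.Factory.asKey("RecommendationService"));
        this.memberId = memberId;
    }

    @Override
    protected String run() {
        // The real outbound call (e.g. via Ribbon) would go here; Hystrix runs
        // it with a timeout, a bounded thread pool, and a circuit breaker.
        return callRemoteService(memberId);
    }

    @Override
    protected String getFallback() {
        // Automatic degraded response when the dependency fails or times out
        return "generic-popular-titles";
    }

    private String callRemoteService(String memberId) {
        // Hypothetical placeholder for the real REST call
        return "personalized-titles-for-" + memberId;
    }
}

A caller would invoke new GetRecommendationsCommand(id).execute() and get back either the live result or the fallback.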
66. NetflixOSS Testing and Automation
Test Tools
• CassJMeter – load testing for C*
• Circus Monkey – test rebalancing
Maintenance
• Janitor Monkey
• Efficiency Monkey
• Doctor Monkey
• Howler Monkey
Availability
• Chaos Monkey – Instances
• Chaos Gorilla – Availability Zones
• Chaos Kong – Regions
• Latency Monkey – latency and error injection
Security
• Security Monkey
• Conformity Monkey
67. What’s Coming Next?
More Features
• Better portability
• Higher availability
• Easier to deploy
• Contributions from end users
• Contributions from vendors
More Use Cases
68. Functionality and scale now, portability coming
Moving from parts to a platform in 2013
Netflix is fostering an ecosystem
Rapid Evolution - Low MTBIAMSH
(Mean Time Between Idea And Making Stuff Happen)
69. Takeaway
Netflix has built and deployed a scalable global and highly available
Platform as a Service.
We encourage you to adopt or extend the NetflixOSS platform
ecosystem.
http://netflix.github.com
http://techblog.netflix.com
http://slideshare.net/Netflix
http://www.linkedin.com/in/adriancockcroft
@adrianco #netflixcloud @NetflixOSS
Editor's Notes
Content, delivered by a service running on a platform. However, our much larger competitors also have the same platform advantages.
When Netflix first moved to cloud it was bleeding edge innovation; we figured stuff out and made stuff up from first principles. Over the last two years more large companies have moved to cloud, and the principles, practices, and patterns have become better understood and adopted. At this point there is intense interest in how Netflix runs in the cloud, and several forward-looking organizations are adopting our architectures and starting to use some of the code we have shared. Over the coming years, we want to make it easier for people to share the patterns we use.
The railroad made it possible for California to be developed quickly; by creating an easy-to-follow path we can create a much bigger ecosystem around the Netflix platform.
We have shared parts of our platform bit by bit through the year, and it's starting to get traction now.
The genre box shots were chosen because we have rights to use them; we are starting to make specific logos for each project going forward.