Highly Available Architecture at Netflix
JPL - February 2013
Adrian Cockcroft
@adrianco #netflixcloud @NetflixOSS
http://www.linkedin.com/in/adriancockcroft
Netflix Inc.

Netflix is the world’s leading Internet television network with more than 33 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series.

Source: http://ir.netflix.com
Abstract
• Highly Available Architecture

• Taxonomy of Outage Failure Modes

• Real World Effects and Mitigation

• Architecture Components from @NetflixOSS
Blah Blah Blah

(I’m skipping all the cloud intro etc. Netflix runs in the cloud, if you hadn’t figured that out already you aren’t paying attention and should read slideshare.net/netflix)
Things we don’t do
Things We Do Do…                             In production at Netflix
•   Big Data/Hadoop                          2009
•   AWS Cloud                                2009
•   Application Performance Management       2010
•   Integrated DevOps Practices              2010
•   Continuous Integration/Delivery          2010
•   NoSQL, Globally Distributed              2010
•   Platform as a Service; Micro-Services    2010
•   Social coding, open development/github   2011
How Netflix Streaming Works

[Diagram] A customer device (PC, PS3, TV…) drives three flows:
• Browse → Web Site or Discovery API
• Play → Streaming API
• Watch → OpenConnect CDN Boxes at the CDN Edge Locations

Behind the APIs sit the AWS Cloud Services: User Data, Personalization, DRM, QoS Logging, CDN Management and Steering, and Content Encoding.
Web Server Dependencies Flow
(Home page business transaction as seen by AppDynamics)

[Diagram] Each icon is three to a few hundred instances across three AWS zones. Starting from the web service entry point, the home page transaction fans out to Cassandra, memcached, an S3 bucket, and the personalization movie group chooser.
Component Micro-Services
  Test With Chaos Monkey, Latency Monkey
Three Balanced Availability Zones
Test with Chaos Gorilla

[Diagram] Load Balancers spread traffic across Zone A, Zone B, and Zone C; each zone holds Cassandra and Evcache replicas.
Triple Replicated Persistence
Cassandra maintenance affects individual replicas

[Diagram] The same three-zone layout: Load Balancers over Zone A, Zone B, and Zone C, each holding Cassandra and Evcache replicas, so maintenance on one replica leaves the other two serving.
Isolated Regions

[Diagram] US-East Load Balancers and EU-West Load Balancers each front their own Zone A, Zone B, and Zone C of Cassandra replicas; nothing is shared between the regions.
Failure Modes and Effects
Failure Mode                  Probability   Current Mitigation Plan
Application Failure           High          Automatic degraded response
AWS Region Failure            Low           Wait for region to recover
AWS Zone Failure              Medium        Continue to run on 2 out of 3 zones
Datacenter Failure            Medium        Migrate more functions to cloud
Data store failure            Low           Restore from S3 backups
S3 failure                    Low           Restore from remote archive



                Until we got really good at mitigating high and medium
                probability failures, the ROI for mitigating regional
                failures didn’t make sense. Getting there…
Application Resilience

   Run what you wrote
     Rapid detection
     Rapid Response
Run What You Wrote
• Make developers responsible for failures
  – Then they learn and write code that doesn’t fail
• Use Incident Reviews to find gaps to fix
  – Make sure it’s not about finding “who to blame”
• Keep timeouts short, fail fast
  – Don’t let cascading timeouts stack up
• Dynamic configuration options - Archaius
  – http://techblog.netflix.com/2012/06/annoucing-archaius-dynamic-properties.html
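As a sketch of what the Archaius bullet above looks like in code (the property name and the feature it guards are hypothetical, shown only to illustrate the API):

import com.netflix.config.DynamicBooleanProperty;
import com.netflix.config.DynamicPropertyFactory;

public class FeatureFlags {
    // Hypothetical property name; the value can be flipped at runtime with no redeploy
    private static final DynamicBooleanProperty USE_NEW_CODE_PATH =
        DynamicPropertyFactory.getInstance()
            .getBooleanProperty("hypothetical.feature.newCodePath", false);

    public void handleRequest() {
        if (USE_NEW_CODE_PATH.get()) { // current value, re-read on every call
            // new behavior
        } else {
            // old, known-good behavior
        }
    }
}

Because the property is read on every call, a bad change can be reverted as quickly as it was made.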
Resilient Design – Hystrix, RxJava
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
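A minimal Hystrix command sketch (the dependency and fallback value are hypothetical) showing the fail-fast-with-fallback pattern the techblog post describes:

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class GetRecommendationsCommand extends HystrixCommand<String> {

    public GetRecommendationsCommand() {
        super(HystrixCommandGroupKey.Factory.asKey("RecommendationService"));
    }

    @Override
    protected String run() throws Exception {
        // The remote call runs isolated in its own thread pool, with a timeout
        return callRecommendationService(); // hypothetical dependency call
    }

    @Override
    protected String getFallback() {
        // Degraded response when the dependency is slow, failing, or the circuit is open
        return "unpersonalized-default-row";
    }

    private String callRecommendationService() throws Exception {
        return "personalized-rows";
    }
}

Usage: new GetRecommendationsCommand().execute() returns either the real result or the fallback, never a stack of cascading timeouts.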
Chaos Monkey
http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html


• Computers (Datacenter or AWS) randomly die
   – Fact of life, but too infrequent to test resiliency


• Test to make sure systems are resilient
   – Kill individual instances without customer impact


• Latency Monkey (coming soon)
   – Inject extra latency and error return codes
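The core of the instance-killing idea fits in a few lines of the AWS Java SDK; a sketch, assuming credentials are configured and the caller supplies the candidate instance list (in the real Chaos Monkey that list comes from the ASGs):

import com.amazonaws.services.ec2.AmazonEC2Client;
import com.amazonaws.services.ec2.model.TerminateInstancesRequest;
import java.util.List;
import java.util.Random;

public class MiniChaosMonkey {
    private final AmazonEC2Client ec2 = new AmazonEC2Client();
    private final Random random = new Random();

    // Terminate one randomly chosen instance, e.g. from one AutoScaling group
    public void killRandomInstance(List<String> instanceIds) {
        if (instanceIds.isEmpty()) return;
        String victim = instanceIds.get(random.nextInt(instanceIds.size()));
        ec2.terminateInstances(new TerminateInstancesRequest().withInstanceIds(victim));
        // The ASG launches a replacement; the test passes if customers never notice
    }
}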
Edda – Configuration History
http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html

[Diagram] Edda collects AWS state (Instances, ASGs, etc.), Eureka services metadata, and AppDynamics request flow, and serves that history to the Monkeys.
Edda Query Examples
Find any instances that have ever had a specific public IP address
$ curl "http://edda/api/v2/view/instances;publicIpAddress=1.2.3.4;_since=0"
 ["i-0123456789","i-012345678a","i-012345678b”]

Show the most recent change to a security group
$ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2"
--- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810
+++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504
@@ -1,33 +1,33 @@
 {
…
      "ipRanges" : [
        "10.10.1.1/32",
        "10.10.1.2/32",
+        "10.10.1.3/32",
-       "10.10.1.4/32"
…
 }
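The same API is easy to call from code; a minimal sketch using only the JDK, reusing the placeholder edda host and IP from the curl example above:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class EddaQuery {
    public static void main(String[] args) throws Exception {
        // Same query as the first curl example: instances that ever held this public IP
        URL url = new URL("http://edda/api/v2/view/instances;publicIpAddress=1.2.3.4;_since=0");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON array of instance ids
            }
        }
    }
}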
Distributed Operational Model
• Developers
  – Provision and run their own code in production
  – Take turns to be on call if it breaks (pagerduty)
  – Configure autoscalers to handle capacity needs

• DevOps and PaaS (aka NoOps)
  – DevOps is used to build and run the PaaS
  – PaaS constrains Dev to use automation instead
  – PaaS puts more responsibility on Dev, with tools
Rapid Rollback
• Use a new Autoscale Group to push code

• Leave existing ASG in place, switch traffic

• If OK, auto-delete old ASG a few hours later

• If “whoops”, switch traffic back in seconds
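In code terms the flow looks roughly like this; every helper below is a hypothetical stand-in for steps that Asgard (next slide) automates:

// Hypothetical sketch of the red/black push flow; not a real Asgard or AWS API
public class RedBlackPush {

    public void push(String app, String newAmi) {
        String oldAsg = currentAsg(app);        // e.g. "api-v041"
        String newAsg = createAsg(app, newAmi); // e.g. "api-v042", same capacity
        waitUntilHealthy(newAsg);
        switchTraffic(oldAsg, newAsg);          // point ELB/discovery at the new ASG

        if (looksGood(newAsg)) {
            scheduleDelete(oldAsg);             // auto-delete the old ASG hours later
        } else {
            switchTraffic(newAsg, oldAsg);      // "whoops": traffic back in seconds
            delete(newAsg);
        }
    }

    // Stand-in stubs for the real automation
    private String currentAsg(String app) { return app + "-v041"; }
    private String createAsg(String app, String ami) { return app + "-v042"; }
    private void waitUntilHealthy(String asg) { }
    private void switchTraffic(String from, String to) { }
    private boolean looksGood(String asg) { return true; }
    private void scheduleDelete(String asg) { }
    private void delete(String asg) { }
}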
Asgard
http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html
Platform Outage Taxonomy

Classify and name the different types
     of things that can go wrong
YOLO
Zone Failure Modes
• Power Outage
  – Instances lost, ephemeral state lost
  – Clean break and recovery, fail fast, “no route to host”

• Network Outage
  – Instances isolated, state inconsistent
  – More complex symptoms, recovery issues, transients

• Dependent Service Outage
  – Cascading failures, misbehaving instances, human errors
  – Confusing symptoms, recovery issues, byzantine effects
Zone Power Failure
• June 29, 2012 AWS US-East - The Big Storm
   – http://aws.amazon.com/message/67457/
   – http://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.html



• Highlights
   – One of 10+ US-East datacenters failed generator startup
   – UPS depleted -> 10min power outage for 7% of instances

• Result
   – Netflix lost power to most of a zone, evacuated the zone
   – Small/brief user impact due to errors and retries
Zone Failure Modes

[Diagram] The two-region layout (US-East and EU-West load balancers over Zones A, B, and C of Cassandra replicas) annotated with the three zone-level failures: Zone Power Outage, Zone Network Outage, and Zone Dependent Service Outage.
Regional Failure Modes
• Network Failure Takes Region Offline
  – DNS configuration errors
  – Bugs and configuration errors in routers
  – Network capacity overload


• Control Plane Overload Affecting Entire Region
  – Consequence of other outages
  – Lose control of remaining zones’ infrastructure
  – Cascading service failure, hard to diagnose
Regional Control Plane Overload
• April 2011 – “The big EBS Outage”
   – http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
   – Human error during network upgrade triggered cascading failure
   – Zone level failure, with brief regional control plane overload

• Netflix Infrastructure Impact
   – Instances in one zone hung and could not launch replacements
   – Overload prevented other zones from launching instances
   – Some MySQL slaves offline for a few days

• Netflix Customer Visible Impact
   – Higher latencies for a short time
   – Higher error rates for a short time
   – Outage was at a low traffic level time, so no capacity issues
Regional Failure Modes

[Diagram] The same two-region layout annotated with the region-level failures: a Regional Network Outage taking all of US-East offline, and a Control Plane Overload affecting the entire region.
Dependent Services Failure
• June 29, 2012 AWS US-East - The Big Storm
   – Power failure recovery overloaded EBS storage service
   – Backlog of instance startups using EBS root volumes

• ELB (Load Balancer) Impacted
   – ELB instances couldn’t scale because EBS was backlogged
   – ELB control plane also became backlogged

• Mitigation Plans Mentioned
   – Multiple control plane request queues to isolate backlog
   – Rapid DNS based traffic shifting between zones
Application Routing Failure
June 29, 2012 AWS US-East - The Big Storm

The Eureka service directory failed to mark down dead instances due to a configuration error. Applications that were not using zone-aware routing kept trying to talk to the dead instances and timing out.

Effect: higher latency and errors
Mitigation: fixed the configuration, and made zone-aware routing the default

[Diagram] The two-region layout with a Zone Power Outage in one US-East zone.
Dec 24th 2012
Partial Regional ELB Outage

[Diagram] The two-region layout, with the ELB outage confined to US-East.

• ELB (Load Balancer) Impacted
  – ELB control plane database state accidentally corrupted
  – Hours to detect, hours to restore from backups

• Mitigation Plans Mentioned
  – Tighter process for access to control plane
  – Better zone isolation
Global Failure Modes
• Software Bugs
   – Externally triggered (e.g. leap year/leap second)
   – Memory leaks and other delayed action failures

• Global configuration errors
   – Usually human error
   – Both infrastructure and application level

• Cascading capacity overload
   – Customers migrating away from a failure
   – Lack of cross region service isolation
Global Software Bug Outages
• AWS S3 Global Outage in 2008
    – Gossip protocol propagated errors worldwide
    – No data loss, but service offline for up to 9hrs
    – Extra error detection fixes, no big issues since

• Microsoft Azure Leap Day Outage in 2012
    – A date bug gave certificates an invalid 2/29/13 end date, so they failed to generate
    – Failure to launch new instances for up to 13hrs
    – One line code fix.

• Netflix Configuration Error in 2012
    – Global property updated to broken value
    – Streaming stopped worldwide for ~1hr until we changed back
    – Fix planned to keep history of properties for quick rollback
Global Failure Modes
Cascading Capacity Overload

[Diagram] Capacity demand migrates from a failing US-East into EU-West, while software bugs and global configuration errors (“Oops…”) hit both regions at once.
Managing Multi-Region Availability

[Diagram] Three DNS services (AWS Route53, UltraDNS, and DynECT DNS) sit above the regional load balancers, steering traffic between the US-East and EU-West zones of Cassandra replicas.

What we need is a portable way to manage multiple DNS providers….
Denominator
“The next version is more portable…” for DNS

[Diagram] Use cases such as Edda and multi-region failover drive a common model implemented by Denominator, which calls per-vendor DNS plug-ins: AWS Route53 (IAM key auth, REST), DynECT (user/pwd, REST), UltraDNS (user/pwd, SOAP), etc. The underlying API models are varied and mostly broken.

Currently being built by Adrian Cole (the jClouds guy, he works for Netflix now…)
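A rough sketch of intended usage, based on memory of the project’s early README; since Denominator was announced with this deck, the exact class and method names below are assumptions:

import java.util.Iterator;
import denominator.DNSApiManager;
import denominator.Denominator;
import static denominator.CredentialsConfiguration.credentials;

public class ListZones {
    public static void main(String[] args) {
        // Assumption: provider key and credentials helper as in the early README;
        // swapping "ultradns" for "route53" or "dynect" should be the only change
        DNSApiManager manager = Denominator.create("ultradns",
                credentials("username", "password"));
        // Assumption: the zone api exposes an iterator of zone names
        Iterator<String> zones = manager.getApi().getZoneApi().list();
        while (zones.hasNext()) {
            System.out.println(zones.next());
        }
    }
}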
Highly Available Storage

A highly scalable, available and
 durable deployment pattern
Micro-Service Pattern
One keyspace, replaces a single table or materialized view

[Diagram: AppDynamics service flow visualization] Many different single-function REST clients call a stateless data access REST service built on the Astyanax Cassandra client, which talks to a single-function Cassandra cluster managed by Priam (between 6 and 72 nodes), with an optional datacenter update flow. Each icon represents a horizontally scaled service of three to hundreds of instances deployed over three availability zones.
Stateless Micro-Service Architecture

Linux Base AMI (CentOS or Ubuntu)
• Optional Apache frontend, memcached, non-java apps
• Java (JDK 6 or 7) running Tomcat
  – Application war file, base servlet, platform, client interface jars, Astyanax
  – Healthcheck, status servlets, JMX interface, Servo autoscale
  – AppDynamics appagent monitoring
• Monitoring: Log rotation to S3, AppDynamics machineagent, GC and thread dump logging, Epic/Atlas
Astyanax
               Available at http://github.com/netflix

• Features
  –   Complete abstraction of connection pool from RPC protocol
  –   Fluent Style API
  –   Operation retry with backoff
  –   Token aware
• Recipes
  –   Distributed row lock (without zookeeper)
  –   Multi-DC row lock
  –   Uniqueness constraint
  –   Multi-row uniqueness constraint
  –   Chunked and multi-threaded large file storage
Astyanax Query Example
Paginate through all columns in a row
// imports: com.netflix.astyanax.model.*, com.netflix.astyanax.query.RowQuery,
// com.netflix.astyanax.util.RangeBuilder, com.netflix.astyanax.connectionpool.exceptions.ConnectionException
ColumnList<String> columns;
int pageSize = 10;
try {
  RowQuery<String, String> query = keyspace
      .prepareQuery(CF_STANDARD1)
      .getKey("A")
      .setIsPaginating()                 // each execute() returns the next page
      .withColumnRange(new RangeBuilder().setMaxSize(pageSize).build());

  while (!(columns = query.execute().getResult()).isEmpty()) {
    for (Column<String> c : columns) {
      // process column c
    }
  }
} catch (ConnectionException e) {
  // connection-level failure; Astyanax has already retried with backoff
}
Astyanax - Cassandra Write Data Flows
Single Region, Multiple Availability Zone, Token Aware

1. Client writes to local coordinator
2. Coordinator writes to other zones
3. Nodes return ack
4. Data written to internal commit log disks (no more than 10 seconds later)

If a node goes offline, hinted handoff completes the write when the node comes back up. Requests can choose to wait for one node, a quorum, or all nodes to ack the write. SSTable disk writes and compactions occur asynchronously.

[Diagram] Token-aware clients in the middle of a ring of Cassandra nodes (disks in Zones A, B, and C), with the numbered steps overlaid.
Data Flows for Multi-Region Writes
Token Aware, Consistency Level = Local Quorum

1. Client writes to local replicas
2. Local write acks returned to client, which continues when 2 of 3 local nodes are committed
3. Local coordinator writes to remote coordinator
4. When data arrives, remote coordinator node acks and copies to other remote zones
5. Remote nodes ack to local coordinator
6. Data flushed to internal commit log disks (no more than 10 seconds later)

If a node or region goes offline, hinted handoff completes the write when the node comes back up. Nightly global compare and repair jobs ensure everything stays consistent.

[Diagram] US clients write into the US ring; the local coordinator forwards across 100+ms of latency to a coordinator in the EU ring, which fans the write out to the EU zones.
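The client side of this flow is a one-line consistency-level choice in Astyanax; a minimal sketch reusing the Standard1 column family from the query example (row and column values are illustrative):

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ConsistencyLevel;
import com.netflix.astyanax.serializers.StringSerializer;

public class LocalQuorumWrite {
    private static final ColumnFamily<String, String> CF_STANDARD1 =
        new ColumnFamily<String, String>("Standard1",
            StringSerializer.get(), StringSerializer.get());

    public void write(Keyspace keyspace) throws ConnectionException {
        MutationBatch m = keyspace.prepareMutationBatch()
            .setConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM);
        m.withRow(CF_STANDARD1, "A")            // token-aware routing picks the coordinator
         .putColumn("column1", "value1", null); // null TTL = never expires
        // Returns once 2 of 3 local replicas ack (step 2); the remote
        // region is updated asynchronously (steps 3-5)
        m.execute();
    }
}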
Cassandra Instance Architecture

Linux Base AMI (CentOS or Ubuntu)
• Tomcat and Priam on JDK (Java 7) – healthcheck, status
• AppDynamics appagent monitoring
• Cassandra Server
• Monitoring: AppDynamics machineagent, GC and thread dump logging, Epic/Atlas
• Local Ephemeral Disk Space – 2TB of SSD or 1.6TB disk holding Commit log and SSTables
Priam – Cassandra Automation
           Available at http://github.com/netflix

•   Netflix Platform Tomcat Code
•   Zero touch auto-configuration
•   State management for Cassandra JVM
•   Token allocation and assignment
•   Broken node auto-replacement
•   Full and incremental backup to S3
•   Restore sequencing from S3
•   Grow/Shrink Cassandra “ring”
ETL for Cassandra
•   Data is de-normalized over many clusters!
•   Too many to restore from backups for ETL
•   Solution – read backup files using Hadoop
•   Aegisthus
    – http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html

    – High throughput raw SSTable processing
    – Re-normalizes many clusters to a consistent view
    – Extract, Transform, then Load into Teradata
Build Your Own Highly Available
            Platform
          @NetflixOSS
Assembling the Puzzle
2013 Roadmap - highlights

Build and Deploy
• Bakeries
• Workflow orchestration
• Push-button Launcher

Recipes
• Sample applications
• Karyon - Base server

Availability
• More Monkeys, Error and Latency Injection
• Denominator (more later…)
• Atlas - monitoring

Analytics
• Genie – Hadoop PaaS
• Explorers / visualization

Persistence
• EvCache – persister and memcached service
• More Astyanax Recipes
Recipes & Launcher
Three Questions


 Why is Netflix doing this?

How does it all fit together?

   What is coming next?
Netflix Deconstructed
Content as a Service on a Platform

[Diagram] Exclusive and Extensive Content is the long-term strategic barrier to competition. It is delivered as an easy to use, Personalized Service, running on an agile, reliable, scalable, secure, low cost, global Platform. The platform enables the business, but doesn’t differentiate against large competitors.
Platform Evolution

2009-2010: Bleeding Edge Innovation
2011-2012: Common Pattern
2013-2014: Shared Pattern

Netflix ended up several years ahead of the industry, but it’s not a sustainable position
Making it easy to follow
Exploring the wild west each time vs. laying down a shared route

Goals
• Establish our solutions as Best Practices / Standards
• Hire, Retain and Engage Top Engineers
• Build up Netflix Technology Brand
• Benefit from a shared ecosystem

Progress during 2012
From pushing the platform uphill to runaway success
How does it all fit together?
Our Current Catalog of Releases
Free code available at http://netflix.github.com
Open Source Projects
(The original slide color-codes each project by status: Github / Techblog, Apache Contributions, Techblog Post, Coming Soon)

• Priam – Cassandra as a Service
• Astyanax – Cassandra client for Java
• CassJMeter – Cassandra test suite
• Cassandra – Multi-region EC2 datastore support
• Aegisthus – Hadoop ETL for Cassandra
• Exhibitor – Zookeeper as a Service
• Curator – Zookeeper Patterns
• EVCache – Memcached as a Service
• Eureka / Discovery – Service Directory
• Archaius – Dynamic Properties Service
• Edda – Config state with history
• Explorers
• Governator – Library lifecycle and dependency injection
• Odin – Orchestration for Asgard
• Blitz4j – Async logging
• Servo and Autoscaling Scripts
• Genie – Hadoop PaaS
• Hystrix – Robust service pattern
• RxJava – Reactive Patterns
• Asgard – AutoScaleGroup based AWS console
• Chaos Monkey – Robustness verification
• Latency Monkey
• Janitor Monkey
• Denominator (announced today)
• Ribbon – REST Client + mid-tier LB
• Karyon – Instrumented REST Base Server
• Bakeries and AMI
NetflixOSS Continuous Build and Deployment

[Diagram] Github NetflixOSS source is published to Maven Central and combined with the AWS Base AMI. The Jenkins Bakery, running on Dynaslave AWS build slaves, produces baked AMIs in AWS. The Odin orchestration API drives the Asgard (+ Frigga) console, which deploys into the AWS account.
NetflixOSS Services Scope

[Diagram] One AWS account, managed through the Asgard console, spans multiple AWS regions. Account-wide services: the Archaius config service, cross-region Priam C*, the Eureka registry, Explorers and dashboards, Exhibitor ZK, Atlas monitoring, Edda history, the Simian Army, and Genie Hadoop services. Within each region, application clusters run as autoscale groups of instances across 3 AWS zones, with Priam-managed Cassandra for persistent storage and Evcache memcached for ephemeral storage.
NetflixOSS Instance Libraries

Initialization
• Baked AMI – Tomcat, Apache, your code
• Governator – Guice based dependency injection
• Archaius – dynamic configuration properties client
• Eureka - service registration client

Service Requests
• Karyon - Base Server for inbound requests
• RxJava – Reactive pattern
• Hystrix/Turbine – dependencies and real-time status
• Ribbon - REST Client for outbound calls

Data Access
• Astyanax – Cassandra client and pattern library
• Evcache – Zone aware Memcached client
• Curator – Zookeeper patterns

Logging
• Blitz4j – non-blocking logging
• Servo – metrics export for autoscaling
• Atlas – high volume instrumentation
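For the Zookeeper line above, a minimal Curator sketch (the connection string is a placeholder; in this stack Exhibitor supplies the real ensemble, and the original Netflix releases shipped under com.netflix.curator rather than the later Apache packages shown here):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorExample {
    public static void main(String[] args) throws Exception {
        // Retry-wrapped client; all Curator recipes (locks, leader election) build on this
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk-host:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        client.create().creatingParentsIfNeeded()
              .forPath("/example/app", "hello".getBytes());
        client.close();
    }
}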
NetflixOSS Testing and Automation

Test Tools
• CassJmeter – load testing for C*
• Circus Monkey – test rebalancing

Maintenance
• Janitor Monkey
• Efficiency Monkey
• Doctor Monkey
• Howler Monkey

Availability
• Chaos Monkey - Instances
• Chaos Gorilla – Availability Zones
• Chaos Kong - Regions
• Latency Monkey – latency and error injection

Security
• Security Monkey
• Conformity Monkey
What’s Coming Next?

More Features
• Better portability
• Higher availability
• Easier to deploy
• Contributions from end users
• Contributions from vendors

More Use Cases
• Functionality and scale now, portability coming
• Moving from parts to a platform in 2013
• Netflix is fostering an ecosystem
• Rapid Evolution - Low MTBIAMSH (Mean Time Between Idea And Making Stuff Happen)
Takeaway

Netflix has built and deployed a scalable, global, highly available Platform as a Service.

We encourage you to adopt or extend the NetflixOSS platform ecosystem.

http://netflix.github.com
http://techblog.netflix.com
http://slideshare.net/Netflix
http://www.linkedin.com/in/adriancockcroft

@adrianco #netflixcloud @NetflixOSS

Contenu connexe

En vedette

Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction
Gluecon 2013 - NetflixOSS Cloud Native Tutorial IntroductionGluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction
Gluecon 2013 - NetflixOSS Cloud Native Tutorial IntroductionAdrian Cockcroft
 
Netflix in the Cloud at SV Forum
Netflix in the Cloud at SV ForumNetflix in the Cloud at SV Forum
Netflix in the Cloud at SV ForumAdrian Cockcroft
 
Bottleneck analysis - Devopsdays Silicon Valley 2013
Bottleneck analysis - Devopsdays Silicon Valley 2013Bottleneck analysis - Devopsdays Silicon Valley 2013
Bottleneck analysis - Devopsdays Silicon Valley 2013Adrian Cockcroft
 
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...Adrian Cockcroft
 
Cassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWSCassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWSAdrian Cockcroft
 
Netflix on Cloud - combined slides for Dev and Ops
Netflix on Cloud - combined slides for Dev and OpsNetflix on Cloud - combined slides for Dev and Ops
Netflix on Cloud - combined slides for Dev and OpsAdrian Cockcroft
 
Netflix Global Cloud Architecture
Netflix Global Cloud ArchitectureNetflix Global Cloud Architecture
Netflix Global Cloud ArchitectureAdrian Cockcroft
 
Microservices Workshop All Topics Deck 2016
Microservices Workshop All Topics Deck 2016Microservices Workshop All Topics Deck 2016
Microservices Workshop All Topics Deck 2016Adrian Cockcroft
 
MicroServices at Netflix - challenges of scale
MicroServices at Netflix - challenges of scaleMicroServices at Netflix - challenges of scale
MicroServices at Netflix - challenges of scaleSudhir Tonse
 
YOLO: You Only Launch Once
YOLO: You Only Launch OnceYOLO: You Only Launch Once
YOLO: You Only Launch OnceDanny Boice
 
Введение в Apache Cassandra
Введение в Apache CassandraВведение в Apache Cassandra
Введение в Apache CassandraAlexander Tivelkov
 
When Developers Operate and Operators Develop
When Developers Operate and Operators DevelopWhen Developers Operate and Operators Develop
When Developers Operate and Operators DevelopAdrian Cockcroft
 
Openstack Silicon Valley - Vendor Lock In
Openstack Silicon Valley - Vendor Lock InOpenstack Silicon Valley - Vendor Lock In
Openstack Silicon Valley - Vendor Lock InAdrian Cockcroft
 

En vedette (18)

Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction
Gluecon 2013 - NetflixOSS Cloud Native Tutorial IntroductionGluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction
Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction
 
Netflix in the Cloud at SV Forum
Netflix in the Cloud at SV ForumNetflix in the Cloud at SV Forum
Netflix in the Cloud at SV Forum
 
Bottleneck analysis - Devopsdays Silicon Valley 2013
Bottleneck analysis - Devopsdays Silicon Valley 2013Bottleneck analysis - Devopsdays Silicon Valley 2013
Bottleneck analysis - Devopsdays Silicon Valley 2013
 
NetflixOSS Meetup
NetflixOSS MeetupNetflixOSS Meetup
NetflixOSS Meetup
 
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
 
Netflix and Open Source
Netflix and Open SourceNetflix and Open Source
Netflix and Open Source
 
Cassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWSCassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWS
 
Netflix on Cloud - combined slides for Dev and Ops
Netflix on Cloud - combined slides for Dev and OpsNetflix on Cloud - combined slides for Dev and Ops
Netflix on Cloud - combined slides for Dev and Ops
 
Netflix Global Cloud Architecture
Netflix Global Cloud ArchitectureNetflix Global Cloud Architecture
Netflix Global Cloud Architecture
 
Microservices Workshop All Topics Deck 2016
Microservices Workshop All Topics Deck 2016Microservices Workshop All Topics Deck 2016
Microservices Workshop All Topics Deck 2016
 
MicroServices at Netflix - challenges of scale
MicroServices at Netflix - challenges of scaleMicroServices at Netflix - challenges of scale
MicroServices at Netflix - challenges of scale
 
#Yolo
#Yolo#Yolo
#Yolo
 
YOLO
YOLOYOLO
YOLO
 
YOLO: You Only Launch Once
YOLO: You Only Launch OnceYOLO: You Only Launch Once
YOLO: You Only Launch Once
 
Введение в Apache Cassandra
Введение в Apache CassandraВведение в Apache Cassandra
Введение в Apache Cassandra
 
AWS Webcast - Disaster Recovery
AWS Webcast - Disaster RecoveryAWS Webcast - Disaster Recovery
AWS Webcast - Disaster Recovery
 
When Developers Operate and Operators Develop
When Developers Operate and Operators DevelopWhen Developers Operate and Operators Develop
When Developers Operate and Operators Develop
 
Openstack Silicon Valley - Vendor Lock In
Openstack Silicon Valley - Vendor Lock InOpenstack Silicon Valley - Vendor Lock In
Openstack Silicon Valley - Vendor Lock In
 

Plus de Adrian Cockcroft

Netflix Global Applications - NoSQL Search Roadshow
Netflix Global Applications - NoSQL Search RoadshowNetflix Global Applications - NoSQL Search Roadshow
Netflix Global Applications - NoSQL Search RoadshowAdrian Cockcroft
 
SV Forum Platform Architecture SIG - Netflix Open Source Platform
SV Forum Platform Architecture SIG - Netflix Open Source PlatformSV Forum Platform Architecture SIG - Netflix Open Source Platform
SV Forum Platform Architecture SIG - Netflix Open Source PlatformAdrian Cockcroft
 
Netflix Architecture Tutorial at Gluecon
Netflix Architecture Tutorial at GlueconNetflix Architecture Tutorial at Gluecon
Netflix Architecture Tutorial at GlueconAdrian Cockcroft
 
Cloud Architecture Tutorial - Why and What (1of 3)
Cloud Architecture Tutorial - Why and What (1of 3) Cloud Architecture Tutorial - Why and What (1of 3)
Cloud Architecture Tutorial - Why and What (1of 3) Adrian Cockcroft
 
Cloud Architecture Tutorial - Platform Component Architecture (2of3)
Cloud Architecture Tutorial - Platform Component Architecture (2of3)Cloud Architecture Tutorial - Platform Component Architecture (2of3)
Cloud Architecture Tutorial - Platform Component Architecture (2of3)Adrian Cockcroft
 
Cloud Architecture Tutorial - Running in the Cloud (3of3)
Cloud Architecture Tutorial - Running in the Cloud (3of3)Cloud Architecture Tutorial - Running in the Cloud (3of3)
Cloud Architecture Tutorial - Running in the Cloud (3of3)Adrian Cockcroft
 
Global Netflix - HPTS Workshop - Scaling Cassandra benchmark to over 1M write...
Global Netflix - HPTS Workshop - Scaling Cassandra benchmark to over 1M write...Global Netflix - HPTS Workshop - Scaling Cassandra benchmark to over 1M write...
Global Netflix - HPTS Workshop - Scaling Cassandra benchmark to over 1M write...Adrian Cockcroft
 
Migrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraMigrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraAdrian Cockcroft
 
Netflix Velocity Conference 2011
Netflix Velocity Conference 2011Netflix Velocity Conference 2011
Netflix Velocity Conference 2011Adrian Cockcroft
 
Performance architecture for cloud connect
Performance architecture for cloud connectPerformance architecture for cloud connect
Performance architecture for cloud connectAdrian Cockcroft
 
Cmg06 utilization is useless
Cmg06 utilization is uselessCmg06 utilization is useless
Cmg06 utilization is uselessAdrian Cockcroft
 

Plus de Adrian Cockcroft (14)

Netflix Global Applications - NoSQL Search Roadshow
Netflix Global Applications - NoSQL Search RoadshowNetflix Global Applications - NoSQL Search Roadshow
Netflix Global Applications - NoSQL Search Roadshow
 
SV Forum Platform Architecture SIG - Netflix Open Source Platform
SV Forum Platform Architecture SIG - Netflix Open Source PlatformSV Forum Platform Architecture SIG - Netflix Open Source Platform
SV Forum Platform Architecture SIG - Netflix Open Source Platform
 
Netflix Architecture Tutorial at Gluecon
Netflix Architecture Tutorial at GlueconNetflix Architecture Tutorial at Gluecon
Netflix Architecture Tutorial at Gluecon
 
Cloud Architecture Tutorial - Why and What (1of 3)
Cloud Architecture Tutorial - Why and What (1of 3) Cloud Architecture Tutorial - Why and What (1of 3)
Cloud Architecture Tutorial - Why and What (1of 3)
 
Cloud Architecture Tutorial - Platform Component Architecture (2of3)
Cloud Architecture Tutorial - Platform Component Architecture (2of3)Cloud Architecture Tutorial - Platform Component Architecture (2of3)
Cloud Architecture Tutorial - Platform Component Architecture (2of3)
 
Cloud Architecture Tutorial - Running in the Cloud (3of3)
Cloud Architecture Tutorial - Running in the Cloud (3of3)Cloud Architecture Tutorial - Running in the Cloud (3of3)
Cloud Architecture Tutorial - Running in the Cloud (3of3)
 
Global Netflix Platform
Global Netflix PlatformGlobal Netflix Platform
Global Netflix Platform
 
Global Netflix - HPTS Workshop - Scaling Cassandra benchmark to over 1M write...
Global Netflix - HPTS Workshop - Scaling Cassandra benchmark to over 1M write...Global Netflix - HPTS Workshop - Scaling Cassandra benchmark to over 1M write...
Global Netflix - HPTS Workshop - Scaling Cassandra benchmark to over 1M write...
 
Migrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraMigrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global Cassandra
 
Netflix Velocity Conference 2011
Netflix Velocity Conference 2011Netflix Velocity Conference 2011
Netflix Velocity Conference 2011
 
Migrating to Public Cloud
Migrating to Public CloudMigrating to Public Cloud
Migrating to Public Cloud
 
Performance architecture for cloud connect
Performance architecture for cloud connectPerformance architecture for cloud connect
Performance architecture for cloud connect
 
Netflix in the cloud 2011
Netflix in the cloud 2011Netflix in the cloud 2011
Netflix in the cloud 2011
 
Cmg06 utilization is useless
Cmg06 utilization is uselessCmg06 utilization is useless
Cmg06 utilization is useless
 

Dernier

What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 

Dernier (20)

What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 

High Availability Architecture and NetflixOSS

  • 1. Highly Available Architecture at Netflix JPL - February 2013 Adrian Cockcroft @adrianco #netflixcloud @NetflixOSS http://www.linkedin.com/in/adriancockcroft
  • 2. Netflix Inc. Netflix is the world’s leading Internet television network with more than 33 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series. Source: http://ir.netflix.com
  • 3. Abstract • Highly Available Architecture • Taxonomy of Outage Failure Modes • Real World Effects and Mitigation • Architecture Components from @NetflixOSS
  • 4. Blah Blah Blah (I’m skipping all the cloud intro etc. Netflix runs in the cloud, if you hadn’t figured that out already you aren’t paying attention and should read slideshare.net/netflix)
  • 6. Things We Do Do… In production at Netflix • Big Data/Hadoop 2009 • AWS Cloud 2009 • Application Performance Management 2010 • Integrated DevOps Practices 2010 • Continuous Integration/Delivery 2010 • NoSQL, Globally Distributed 2010 • Platform as a Service; Micro-Services 2010 • Social coding, open development/github 2011
  • 7. How Netflix Streaming Works Consumer Electronics User Data Browse Web Site or AWS Cloud Discovery API Services Personalization CDN Edge Locations DRM Customer Device Play Streaming API (PC, PS3, TV…) QoS Logging CDN Management and Steering OpenConnect Watch CDN Boxes Content Encoding
  • 8. Web Server Dependencies Flow (Home page business transaction as seen by AppDynamics) Each icon is three to a few hundred instances across three Cassandra AWS zones memcached Web service Start Here S3 bucket Personalization movie group chooser
  • 9. Component Micro-Services Test With Chaos Monkey, Latency Monkey
  • 10. Three Balanced Availability Zones Test with Chaos Gorilla Load Balancers Zone A Zone B Zone C Cassandra and Evcache Cassandra and Evcache Cassandra and Evcache Replicas Replicas Replicas
  • 11. Triple Replicated Persistence Cassandra maintenance affects individual replicas Load Balancers Zone A Zone B Zone C Cassandra and Evcache Cassandra and Evcache Cassandra and Evcache Replicas Replicas Replicas
  • 12. Isolated Regions US-East Load Balancers EU-West Load Balancers Zone A Zone B Zone C Zone A Zone B Zone C Cassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas
  • 13. Failure Modes and Effects
     Failure Mode          Probability   Current Mitigation Plan
     Application Failure   High          Automatic degraded response
     AWS Region Failure    Low           Wait for region to recover
     AWS Zone Failure      Medium        Continue to run on 2 out of 3 zones
     Datacenter Failure    Medium        Migrate more functions to cloud
     Data store failure    Low           Restore from S3 backups
     S3 failure            Low           Restore from remote archive
     Until we got really good at mitigating high- and medium-probability failures, the ROI for mitigating regional failures didn’t make sense. Getting there…
  • 14. Application Resilience: run what you wrote, rapid detection, rapid response.
  • 15. Run What You Wrote • Make developers responsible for failures – Then they learn and write code that doesn’t fail • Use Incident Reviews to find gaps to fix – Make sure it’s not about finding “who to blame” • Keep timeouts short, fail fast – Don’t let cascading timeouts stack up • Dynamic configuration options - Archaius – http://techblog.netflix.com/2012/06/annoucing-archaius-dynamic-properties.html
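To make the Archaius point concrete, here is a minimal sketch of a dynamic property (the property name, default, and class are invented for illustration); the value is re-read on every get(), so operators can shorten a timeout at runtime without a redeploy:

    import com.netflix.config.DynamicIntProperty;
    import com.netflix.config.DynamicPropertyFactory;

    public class RemoteCallSettings {
        // Hypothetical property name; the 500ms default applies until
        // someone overrides it through the configuration source.
        private static final DynamicIntProperty TIMEOUT_MS =
            DynamicPropertyFactory.getInstance()
                .getIntProperty("movieservice.client.timeoutMs", 500);

        public static int timeoutMs() {
            return TIMEOUT_MS.get(); // re-evaluated on every call
        }
    }

Keeping the default short matches the fail-fast advice above: a property change takes effect on the next call, with no restart.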
  • 16. Resilient Design – Hystrix, RxJava http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
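For flavor, a minimal Hystrix command sketch, assuming a hypothetical fetchPersonalizedTitles() remote call; the techblog post above describes the real pattern in depth:

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    // Wraps one remote dependency; Hystrix adds thread isolation,
    // a timeout, and a circuit breaker around run().
    public class MovieTitlesCommand extends HystrixCommand<String> {
        private final String userId;

        public MovieTitlesCommand(String userId) {
            super(HystrixCommandGroupKey.Factory.asKey("MovieService"));
            this.userId = userId;
        }

        @Override
        protected String run() throws Exception {
            return fetchPersonalizedTitles(userId); // may fail or time out
        }

        @Override
        protected String getFallback() {
            // Automatic degraded response when the dependency is unhealthy.
            return "popular-titles";
        }

        private String fetchPersonalizedTitles(String id) throws Exception {
            throw new UnsupportedOperationException("call the real service here");
        }
    }

Calling new MovieTitlesCommand("user123").execute() returns the fallback instead of propagating the failure when the circuit is open.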
  • 17. Chaos Monkey http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html • Computers (Datacenter or AWS) randomly die – Fact of life, but too infrequent to test resiliency • Test to make sure systems are resilient – Kill individual instances without customer impact • Latency Monkey (coming soon) – Inject extra latency and error return codes
  • 18. Edda – Configuration History http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html — diagram: Eureka (services metadata) and AWS (instances, ASGs, etc.) feed Edda; AppDynamics traces request flow; the Monkeys query Edda.
  • 19. Edda Query Examples
     Find any instances that have ever had a specific public IP address:
     $ curl "http://edda/api/v2/view/instances;publicIpAddress=1.2.3.4;_since=0"
     ["i-0123456789","i-012345678a","i-012345678b"]
     Show the most recent change to a security group:
     $ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2"
     --- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810
     +++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504
     @@ -1,33 +1,33 @@
     {
       …
       "ipRanges" : [
         "10.10.1.1/32",
         "10.10.1.2/32",
     +   "10.10.1.3/32",
     -   "10.10.1.4/32"
       …
     }
  • 20. Distributed Operational Model • Developers – Provision and run their own code in production – Take turns to be on call if it breaks (pagerduty) – Configure autoscalers to handle capacity needs • DevOps and PaaS (aka NoOps) – DevOps is used to build and run the PaaS – PaaS constrains Dev to use automation instead – PaaS puts more responsibility on Dev, with tools
  • 21. Rapid Rollback • Use a new Autoscale Group to push code • Leave existing ASG in place, switch traffic • If OK, auto-delete old ASG a few hours later • If “whoops”, switch traffic back in seconds
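At Netflix this red/black push is driven by Asgard, but the underlying AWS calls look roughly like this sketch using the AWS SDK for Java (class, group names, zones, and sizes are invented for illustration):

    import com.amazonaws.services.autoscaling.AmazonAutoScalingClient;
    import com.amazonaws.services.autoscaling.model.CreateAutoScalingGroupRequest;
    import com.amazonaws.services.autoscaling.model.DeleteAutoScalingGroupRequest;

    public class RedBlackPush {
        private final AmazonAutoScalingClient autoscaling = new AmazonAutoScalingClient();

        // Push new code as a brand-new ASG behind the same load balancer;
        // the old ASG stays in place so traffic can be switched back in seconds.
        public void pushNewVersion(String newAsg, String launchConfig, String elb) {
            autoscaling.createAutoScalingGroup(new CreateAutoScalingGroupRequest()
                .withAutoScalingGroupName(newAsg)          // e.g. "api-v043"
                .withLaunchConfigurationName(launchConfig)
                .withLoadBalancerNames(elb)
                .withAvailabilityZones("us-east-1a", "us-east-1b", "us-east-1c")
                .withMinSize(3)
                .withMaxSize(30));
        }

        // Only after the new version has soaked for a few hours.
        public void deleteOldVersion(String oldAsg) {
            autoscaling.deleteAutoScalingGroup(new DeleteAutoScalingGroupRequest()
                .withAutoScalingGroupName(oldAsg)
                .withForceDelete(true));
        }
    }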
  • 23. Platform Outage Taxonomy Classify and name the different types of things that can go wrong
  • 24. YOLO
  • 25. Zone Failure Modes • Power Outage – Instances lost, ephemeral state lost – Clean break and recovery, fail fast, “no route to host” • Network Outage – Instances isolated, state inconsistent – More complex symptoms, recovery issues, transients • Dependent Service Outage – Cascading failures, misbehaving instances, human errors – Confusing symptoms, recovery issues, byzantine effects
  • 26. Zone Power Failure • June 29, 2012 AWS US-East - The Big Storm – http://aws.amazon.com/message/67457/ – http://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.html • Highlights – One of 10+ US-East datacenters failed generator startup – UPS depleted -> 10min power outage for 7% of instances • Result – Netflix lost power to most of a zone, evacuated the zone – Small/brief user impact due to errors and retries
  • 27. Zone Failure Modes — diagram: US-East and EU-West regions with Zones A/B/C and Cassandra replicas, annotated with a zone network outage, a zone power outage, and a zone dependent service outage.
  • 28. Regional Failure Modes • Network Failure Takes Region Offline – DNS configuration errors – Bugs and configuration errors in routers – Network capacity overload • Control Plane Overload Affecting Entire Region – Consequence of other outages – Lose control of remaining zones infrastructure – Cascading service failure, hard to diagnose
  • 29. Regional Control Plane Overload • April 2011 – “The big EBS Outage” – http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html – Human error during network upgrade triggered cascading failure – Zone level failure, with brief regional control plane overload • Netflix Infrastructure Impact – Instances in one zone hung and could not launch replacements – Overload prevented other zones from launching instances – Some MySQL slaves offline for a few days • Netflix Customer Visible Impact – Higher latencies for a short time – Higher error rates for a short time – Outage was at a low traffic level time, so no capacity issues
  • 30. Regional Failure Modes — diagram: US-East and EU-West regions with Zones A/B/C and Cassandra replicas, annotated with a regional network outage and control plane overload.
  • 31. Dependent Services Failure • June 29, 2012 AWS US-East - The Big Storm – Power failure recovery overloaded EBS storage service – Backlog of instance startups using EBS root volumes • ELB (Load Balancer) Impacted – ELB instances couldn’t scale because EBS was backlogged – ELB control plane also became backlogged • Mitigation Plans Mentioned – Multiple control plane request queues to isolate backlog – Rapid DNS based traffic shifting between zones
  • 32. Application Routing Failure — June 29, 2012 AWS US-East, The Big Storm. The Eureka service directory failed to mark down dead instances due to a configuration error, so applications not using zone-aware routing kept trying to talk to dead instances and timing out. Effect: higher latency and errors. Mitigation: fixed the configuration, and made zone-aware routing the default. (Diagram: zone power outage in US-East; US-East and EU-West regions with Cassandra replicas.)
  • 33. Dec 24th 2012 Partial Regional ELB Outage (diagram: US-East and EU-West regions with load balancers, zones, and Cassandra replicas) • ELB (Load Balancer) Impacted – ELB control plane database state accidentally corrupted – Hours to detect, hours to restore from backups • Mitigation Plans Mentioned – Tighter process for access to control plane – Better zone isolation
  • 34. Global Failure Modes • Software Bugs – Externally triggered (e.g. leap year/leap second) – Memory leaks and other delayed action failures • Global configuration errors – Usually human error – Both infrastructure and application level • Cascading capacity overload – Customers migrating away from a failure – Lack of cross region service isolation
  • 35. Global Software Bug Outages • AWS S3 Global Outage in 2008 – Gossip protocol propagated errors worldwide – No data loss, but service offline for up to 9hrs – Extra error detection fixes, no big issues since • Microsoft Azure Leap Day Outage in 2012 – Bug failed to generate certificates ending 2/29/13 – Failure to launch new instances for up to 13hrs – One line code fix. • Netflix Configuration Error in 2012 – Global property updated to broken value – Streaming stopped worldwide for ~1hr until we changed back – Fix planned to keep history of properties for quick rollback
  • 36. Global Failure Modes — diagram: cascading capacity overload as capacity demand migrates from US-East to EU-West; software bugs and global configuration errors (“Oops…”).
  • 38. Managing Multi-Region Availability — diagram: a DNS layer (AWS Route53, DynECT, UltraDNS) directing traffic to regional load balancers, zones, and Cassandra replicas in each region. What we need is a portable way to manage multiple DNS providers….
  • 39. Denominator — “The next version is more portable…” A common model for DNS with vendor plug-ins. Use cases: Edda, multi-region failover. Plug-ins: AWS Route53 (IAM key auth, REST), DynECT (user/pwd, REST), UltraDNS (user/pwd, SOAP), etc. — the vendor APIs are varied and mostly broken. Currently being built by Adrian Cole (the jClouds guy, he works for Netflix now…)
  • 40. Highly Available Storage A highly scalable, available and durable deployment pattern
  • 41. Micro-Service Pattern — many different single-function REST clients call a stateless data access REST service, which uses the Astyanax Cassandra client to reach a single-function Cassandra cluster (one keyspace, replacing a single table or materialized view), managed by Priam, between 6 and 72 nodes, with an optional second-datacenter update flow. Each icon represents a horizontally scaled service of three to hundreds of instances deployed over three availability zones. AppDynamics provides service flow visualization.
  • 42. Stateless Micro-Service Architecture — instance stack: Linux base AMI (CentOS or Ubuntu); optional Apache frontend, memcached, non-Java apps; Java (JDK 6 or 7); Tomcat; application war file with base servlet, platform and client interface jars, Astyanax; healthcheck, status servlets, JMX interface, Servo autoscale; log rotation to S3; GC and thread dump logging; monitoring via AppDynamics appagent and machineagent, Epic/Atlas.
  • 43. Astyanax Available at http://github.com/netflix • Features – Complete abstraction of connection pool from RPC protocol – Fluent Style API – Operation retry with backoff – Token aware • Recipes – Distributed row lock (without zookeeper) – Multi-DC row lock – Uniqueness constraint – Multi-row uniqueness constraint – Chunked and multi-threaded large file storage
  • 44. Astyanax Query Example — paginate through all columns in a row:

    ColumnList<String> columns;
    int pageSize = 10;
    try {
        RowQuery<String, String> query = keyspace
            .prepareQuery(CF_STANDARD1)
            .getKey("A")
            .setIsPaginating()
            .withColumnRange(new RangeBuilder().setMaxSize(pageSize).build());
        while (!(columns = query.execute().getResult()).isEmpty()) {
            for (Column<String> c : columns) {
                // process each column in the current page here
            }
        }
    } catch (ConnectionException e) {
        // the connection pool could not reach any Cassandra node
    }
  • 45. Astyanax - Cassandra Write Data Flows — single region, multiple availability zones, token aware:
     1. Client writes to local Cassandra coordinator
     2. Coordinator writes to other zones
     3. Nodes return ack
     4. Data written to internal commit log disks (no more than 10 seconds later)
     If a node goes offline, hinted handoff completes the write when the node comes back up. Requests can choose to wait for one node, a quorum, or all nodes to ack the write. SSTable disk writes and compactions occur asynchronously. (Diagram: token-aware clients writing to Cassandra nodes with disks across Zones A, B, and C.)
  • 46. Data Flows for Multi-Region Writes — token aware, consistency level = local quorum:
     1. Client writes to local replicas
     2. Local write acks returned to client, which continues when 2 of 3 local nodes are committed
     3. Local coordinator writes to remote coordinator (100+ms latency)
     4. When data arrives, remote coordinator node acks and copies to other remote zones
     5. Remote nodes ack to local coordinator
     6. Data flushed to internal commit log disks (no more than 10 seconds later)
     If a node or region goes offline, hinted handoff completes the write when the node comes back up. Nightly global compare and repair jobs ensure everything stays consistent. (Diagram: US and EU clients writing to Cassandra nodes across Zones A, B, and C in each region.)
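In Astyanax, the local-quorum behavior described above is a single setting on the mutation. A sketch, with the column family and class names invented for illustration:

    import com.netflix.astyanax.Keyspace;
    import com.netflix.astyanax.MutationBatch;
    import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
    import com.netflix.astyanax.model.ColumnFamily;
    import com.netflix.astyanax.model.ConsistencyLevel;
    import com.netflix.astyanax.serializers.StringSerializer;

    public class ViewingHistoryWriter {
        // Hypothetical column family: row key = user, column = movie.
        private static final ColumnFamily<String, String> CF_VIEWS =
            new ColumnFamily<String, String>("viewing_history",
                StringSerializer.get(), StringSerializer.get());

        private final Keyspace keyspace;

        public ViewingHistoryWriter(Keyspace keyspace) {
            this.keyspace = keyspace;
        }

        // CL_LOCAL_QUORUM returns once 2 of 3 local replicas ack (step 2);
        // the remote-region copy (steps 3-5) completes asynchronously.
        public void recordView(String userId, String movieId) throws ConnectionException {
            MutationBatch m = keyspace.prepareMutationBatch()
                .setConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM);
            m.withRow(CF_VIEWS, userId)
             .putColumn(movieId, System.currentTimeMillis(), null);
            m.execute();
        }
    }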
  • 47. Cassandra Instance Architecture — instance stack: Linux base AMI (CentOS or Ubuntu); Tomcat and Priam on the JDK (Java 7); healthcheck and status; Cassandra server; local ephemeral disk space – 2TB of SSD or 1.6TB disk holding commit log and SSTables; GC and thread dump logging; monitoring via AppDynamics appagent and machineagent, Epic/Atlas.
  • 48. Priam – Cassandra Automation Available at http://github.com/netflix • Netflix Platform Tomcat Code • Zero touch auto-configuration • State management for Cassandra JVM • Token allocation and assignment • Broken node auto-replacement • Full and incremental backup to S3 • Restore sequencing from S3 • Grow/Shrink Cassandra “ring”
  • 49. ETL for Cassandra • Data is de-normalized over many clusters! • Too many to restore from backups for ETL • Solution – read backup files using Hadoop • Aegisthus – http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html – High throughput raw SSTable processing – Re-normalizes many clusters to a consistent view – Extract, Transform, then Load into Teradata
  • 50. Build Your Own Highly Available Platform @NetflixOSS
  • 52. 2013 Roadmap - highlights
     Build and Deploy: Bakeries; Workflow orchestration; Push-button launcher
     Recipes: Sample applications; Karyon - base server
     Availability: More Monkeys, error and latency injection; Denominator (more later…)
     Analytics: Atlas - monitoring; Genie – Hadoop PaaS; Explorers / visualization
     Persistence: EvCache – persister and memcached service; More Astyanax recipes
  • 54. Three Questions Why is Netflix doing this? How does it all fit together? What is coming next?
  • 55. Netflix Deconstructed — content as a service on a platform. The service (easy to use, personalized, with exclusive and extensive content) is the long-term strategic barrier to competition. The agile, reliable, scalable, secure, low-cost, global platform enables the business, but doesn’t differentiate Netflix against large competitors.
  • 56. Platform Evolution — 2009-2010: bleeding edge innovation; 2011-2012: common pattern; 2013-2014: shared pattern. Netflix ended up several years ahead of the industry, but it’s not a sustainable position.
  • 57. Making it easy to follow Exploring the wild west each time vs. laying down a shared route
  • 58. Goals: establish our solutions as best practices / standards; hire, retain and engage top engineers; build up the Netflix technology brand; benefit from a shared ecosystem.
  • 59. Progress during 2012 From pushing the platform uphill to runaway success
  • 60. How does it all fit together?
  • 61. Our Current Catalog of Releases Free code available at http://netflix.github.com
  • 62. Open Source Projects (legend: Github / Techblog, Apache Contributions, Techblog Post, Coming Soon)
     Priam – Cassandra as a Service; Exhibitor – Zookeeper as a Service; Servo and Autoscaling Scripts
     Astyanax – Cassandra client for Java; Curator – Zookeeper patterns; Genie – Hadoop PaaS
     CassJMeter – Cassandra test suite; EVCache – Memcached as a Service; Hystrix – robust service pattern
     Cassandra – multi-region EC2 datastore support; Eureka / Discovery – service directory; RxJava – reactive patterns
     Asgard – AutoScaleGroup based AWS console; Aegisthus – Hadoop ETL for Cassandra; Archaius – dynamic properties service
     Edda – config state with history; Chaos Monkey – robustness verification; Explorers
     Governator – library lifecycle and dependency injection; Latency Monkey; Denominator (announced today)
     Odin – orchestration for Asgard; Ribbon – REST client + mid-tier LB; Janitor Monkey
     Karyon – instrumented REST base server; Blitz4j – async logging; Bakeries and AMI
  • 63. NetflixOSS Continuous Build and Deployment — diagram: Github NetflixOSS source and Maven Central feed Jenkins with Dynaslave AWS build slaves; builds go through the AWS bakery, producing baked AMIs from the base AMI; Odin orchestration drives the Asgard (+ Frigga) console and API against the AWS account.
  • 64. NetflixOSS Services Scope — diagram: an AWS account contains the Asgard console, Archaius config service, cross-region Priam C*, Eureka registry, Explorers dashboards, and Exhibitor ZK across multiple AWS regions; within three AWS zones sit application clusters in autoscale groups, Cassandra via Priam (persistent storage), memcached via Evcache (ephemeral storage), Atlas monitoring, Edda history, instances, the Simian Army, and Genie Hadoop services.
  • 65. NetflixOSS Instance Libraries
     Initialization: Baked AMI – Tomcat, Apache, your code; Governator – Guice based dependency injection; Archaius – dynamic configuration properties client; Eureka – service registration client
     Service Requests: Karyon – base server for inbound requests; RxJava – reactive pattern; Hystrix/Turbine – dependencies and real-time status; Ribbon – REST client for outbound calls
     Data Access: Astyanax – Cassandra client and pattern library; Evcache – zone aware memcached client; Curator – Zookeeper patterns
     Logging: Blitz4j – non-blocking logging; Servo – metrics export for autoscaling; Atlas – high volume instrumentation
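As one example from the logging layer above, a minimal Servo sketch (class and metric names are invented); annotated counters like this are what gets exported for autoscaling policies and dashboards:

    import com.netflix.servo.annotations.DataSourceType;
    import com.netflix.servo.annotations.Monitor;
    import com.netflix.servo.monitor.Monitors;
    import java.util.concurrent.atomic.AtomicInteger;

    public class RequestStats {
        @Monitor(name = "requestCount", type = DataSourceType.COUNTER)
        private final AtomicInteger requestCount = new AtomicInteger(0);

        public RequestStats() {
            // Register the annotated fields so a poller can publish them
            // (e.g. to CloudWatch) for autoscaling and monitoring.
            Monitors.registerObject("requestStats", this);
        }

        public void onRequest() {
            requestCount.incrementAndGet();
        }
    }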
  • 66. NetflixOSS Testing and Automation
     Test Tools: CassJmeter – load testing for C*; Circus Monkey – test rebalancing
     Maintenance: Janitor Monkey; Efficiency Monkey; Doctor Monkey; Howler Monkey
     Availability: Chaos Monkey – instances; Chaos Gorilla – availability zones; Chaos Kong – regions; Latency Monkey – latency and error injection
     Security: Security Monkey; Conformity Monkey
  • 67. What’s Coming Next? Better portability, higher availability, more features, easier to deploy, contributions from end users, contributions from vendors, more use cases.
  • 68. Functionality and scale now, portability coming. Moving from parts to a platform in 2013. Netflix is fostering an ecosystem. Rapid evolution – low MTBIAMSH (Mean Time Between Idea And Making Stuff Happen).
  • 69. Takeaway Netflix has built and deployed a scalable, global, and highly available Platform as a Service. We encourage you to adopt or extend the NetflixOSS platform ecosystem. http://netflix.github.com http://techblog.netflix.com http://slideshare.net/Netflix http://www.linkedin.com/in/adriancockcroft @adrianco #netflixcloud @NetflixOSS

Editor's Notes

  1. Content, delivered by a service running on a platform. However, our much larger competitors also have the same platform advantages.
  2. When Netflix first moved to the cloud it was bleeding edge innovation; we figured stuff out and made stuff up from first principles. Over the last two years more large companies have moved to the cloud, and the principles, practices and patterns have become better understood and adopted. At this point there is intense interest in how Netflix runs in the cloud, and several forward-looking organizations are adopting our architectures and starting to use some of the code we have shared. Over the coming years, we want to make it easier for people to share the patterns we use.
  3. The railroad made it possible for California to be developed quickly; by creating an easy-to-follow path we can create a much bigger ecosystem around the Netflix platform.
  4. We have shared parts of our platform bit by bit through the year; it's starting to get traction now.
  5. The genre box shots were chosen because we have the rights to use them; we are starting to make specific logos for each project going forward.