Performance and Fault Tolerance
for the Netflix API
Ben Christensen
Software Engineer – API Platform at Netflix
@benjchristensen
http://www.linkedin.com/in/benjchristensen




http://techblog.netflix.com/
Netflix API

  [Diagram: the Netflix API fanning out to 18 dependencies, Dependency A through Dependency R]
Dozens of dependencies.

    One going bad takes everything down.


99.99%^30         = 99.7% uptime
     0.3% of 1 billion = 3,000,000 failures

            2+ hours downtime/month
even if all dependencies have excellent uptime.

          Reality is generally worse.
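The arithmetic on this slide can be reproduced with a short sketch. It assumes 30 dependencies that fail independently, each with 99.99% availability, and a 30-day month; the function and variable names are illustrative only.

```python
def compound_availability(per_dependency: float, dependency_count: int) -> float:
    """Availability of a system that needs every dependency up at the same time."""
    return per_dependency ** dependency_count


availability = compound_availability(0.9999, 30)
failure_rate = 1.0 - availability

# Expected failed requests out of one billion
failures_per_billion = failure_rate * 1_000_000_000

# Expected downtime in an assumed 30-day month (720 hours)
downtime_hours = failure_rate * 30 * 24

print(f"availability: {availability:.4%}")
print(f"failures per billion requests: {failures_per_billion:,.0f}")
print(f"downtime hours per month: {downtime_hours:.2f}")
```

The results land close to the rounded figures on the slide: roughly 99.7% uptime, about 3 million failures per billion requests, and a little over 2 hours of downtime per month.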
No single dependency should
 take down the entire app.

         Fallback.
         Fail silent.
          Fail fast.

        Shed load.
Options

Aggressive Network Timeouts

   Semaphores (Tryable)

     Separate Threads

      Circuit Breaker
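As an illustration of the "tryable" semaphore option, here is a minimal sketch, assuming a non-blocking acquire is what "tryable" means: a caller that cannot get a permit is rejected immediately instead of waiting. `TryableSemaphore`, `run`, `call` and `fallback` are hypothetical names, not the Netflix implementation.

```python
import threading


class TryableSemaphore:
    """Caps concurrent calls and rejects immediately instead of queueing."""

    def __init__(self, max_concurrent: int) -> None:
        self._semaphore = threading.BoundedSemaphore(max_concurrent)

    def run(self, call, fallback):
        # blocking=False makes the acquire "tryable": never wait for a permit
        if not self._semaphore.acquire(blocking=False):
            return fallback()  # shed load, fail fast
        try:
            return call()
        except Exception:
            return fallback()  # the dependency failed, answer with the fallback
        finally:
            # only reached when the permit was acquired above
            self._semaphore.release()


limiter = TryableSemaphore(max_concurrent=1)
print(limiter.run(lambda: "live result", lambda: "fallback"))
```

A semaphore limits concurrency without a thread hand-off, but it cannot interrupt a call that hangs, so this sketch only bounds how many callers can be stuck at once.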
Tryable semaphores for “trusted” clients and fallbacks

       Separate threads for “untrusted” clients

 Aggressive timeouts on threads and network calls
             to “give up and move on”

       Circuit breakers as the “release valve”
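The combination of separate threads, aggressive timeouts and a circuit breaker might look roughly like the sketch below. It is not the DependencyCommand implementation; the class name, parameters, thresholds and the consecutive-failure rule are all assumptions chosen for illustration, and the counters are not thread-safe.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class IsolatedCommand:
    """Runs a dependency call on its own thread pool with a timeout and a breaker."""

    def __init__(self, call, fallback, pool_size=10, timeout_s=0.2,
                 failure_threshold=3, cooldown_s=5.0, clock=time.monotonic):
        self._call = call
        self._fallback = fallback
        self._pool = ThreadPoolExecutor(max_workers=pool_size)
        self._timeout_s = timeout_s
        self._failure_threshold = failure_threshold
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at = None  # None means the circuit is closed

    def _circuit_open(self) -> bool:
        if self._opened_at is None:
            return False
        # Once the cooldown has passed, calls are allowed through again as a trial
        return self._clock() - self._opened_at < self._cooldown_s

    def execute(self, *args):
        if self._circuit_open():
            return self._fallback()  # release valve: skip the dependency entirely
        future = self._pool.submit(self._call, *args)
        try:
            # "Give up and move on": stop waiting after timeout_s.
            # The worker thread itself keeps running until the call returns.
            result = future.result(timeout=self._timeout_s)
        except Exception:
            # Covers timeouts and errors raised by the dependency
            self._consecutive_failures += 1
            if self._consecutive_failures >= self._failure_threshold:
                self._opened_at = self._clock()
            return self._fallback()
        self._consecutive_failures = 0
        self._opened_at = None
        return result


command = IsolatedCommand(lambda user_id: {"user": user_id}, lambda: {})
print(command.execute(42))
```

The injected `clock` exists only so the cooldown can be exercised without sleeping; a production breaker would more likely trip on an error percentage over a rolling window than on a simple consecutive count.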
30 rps x 0.2 seconds = 6 + breathing room = 10 threads

Thread-pool Queue size: 5-10 (0 doesn't work but get close to it)

                Thread-pool Size + Queue Size

                      Queuing is Not Free
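The sizing rule on this slide is arrival rate multiplied by time in the system, plus headroom. A minimal sketch follows; the function names are hypothetical, and `breathing_room=4` is an assumption picked only so the result matches the slide's 10 threads.

```python
import math


def required_threads(requests_per_second: int, latency_ms: int) -> int:
    """Average number of in-flight calls: arrival rate x time each call takes."""
    return math.ceil(requests_per_second * latency_ms / 1000)


def pool_size(requests_per_second: int, latency_ms: int, breathing_room: int) -> int:
    """Busy threads plus spare capacity for bursts."""
    return required_threads(requests_per_second, latency_ms) + breathing_room


busy = required_threads(30, 200)             # 30 rps x 0.2 seconds
size = pool_size(30, 200, breathing_room=4)  # headroom value is an assumption
print(busy, size)  # prints "6 10"
```

Keeping the queue small means excess requests are rejected quickly rather than waiting behind work that is already slow.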
Cost of Thread @ 75rps
  median - 90th - 99th (time in ms)

  [Chart: time for thread to execute vs. time user thread waited]
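The per-call cost of the thread hand-off is the time the calling thread waited minus the time the worker thread spent executing. The sketch below uses the median and 90th percentile values quoted in the speaker notes for this slide; the 99th percentile values are not given there, so they are left out. Variable names are illustrative.

```python
# (thread execution time, calling-thread wait time) in ms, from the speaker notes
samples_ms = {
    "median": (1.57, 1.62),
    "90th": (2.05, 4.57),
}

# Overhead of running on a separate thread = time waited - time executing
overhead_ms = {
    name: round(waited - executed, 2)
    for name, (executed, waited) in samples_ms.items()
}

for name, cost in overhead_ms.items():
    print(f"{name}: {cost:.2f} ms")
```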
Netflix DependencyCommand Implementation

              Fallbacks

               Cache
         Eventual Consistency
            Stubbed Data
           Empty Response
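One way to read the fallback list is as an ordered chain: try the live call, then progressively cheaper answers, ending with an empty response. The sketch below assumes that ordering; `with_fallbacks` and the example functions are hypothetical, and the eventual-consistency option (for example, deferring a write for later) is not modeled.

```python
def with_fallbacks(primary, *fallbacks):
    """Return the first successful result from primary, then each fallback in order."""
    for attempt in (primary, *fallbacks):
        try:
            return attempt()
        except Exception:
            continue  # move on to the next, cheaper option
    return []  # empty response as the last resort


cache = {"user-1": ["cached title"]}  # hypothetical cache of earlier responses


def live_call():
    raise TimeoutError("dependency unavailable")


def from_cache():
    return cache["user-2"]  # raises KeyError: nothing cached for this user


def stubbed():
    return ["default title"]


print(with_fallbacks(live_call, from_cache, stubbed))  # prints ['default title']
```

Fallbacks are meant to be cheap and local, since a fallback that calls another remote dependency can fail in the same way as the primary call.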
So, how does it work in the real world?
Visualizing Circuits in Near-Realtime
    (latency is single-digit seconds, generally 1-2)




        Video available at
  https://vimeo.com/33576628
Rolling 10 second counters


1 minute latency percentiles




  2 minute rate change



circle color and size represent
   health and traffic volume
API Daily Incoming vs Outgoing

Weekend                                        Weekend               Weekend




              8-10 Billion DependencyCommand Executions (threaded)




                        1.2 - 1.6 Billion Incoming Requests
API Hourly Incoming vs Outgoing

 Peak at 700M+ threaded DependencyCommand executions (200k+/second)




              Peak at 100M+ incoming requests (30k+/second)
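A quick check of the hourly figures, assuming the "700M+" and "100M+" peaks are totals for one hour. Dividing by 3,600 gives the average rate across that hour, which sits slightly below the rounded per-second figures on the slide because the totals are lower bounds.

```python
SECONDS_PER_HOUR = 3600

peak_hour_executions = 700_000_000  # threaded DependencyCommand executions
peak_hour_requests = 100_000_000    # incoming requests

executions_per_second = peak_hour_executions / SECONDS_PER_HOUR
requests_per_second = peak_hour_requests / SECONDS_PER_HOUR
fan_out = peak_hour_executions / peak_hour_requests

print(f"executions/second: {executions_per_second:,.0f}")
print(f"requests/second: {requests_per_second:,.0f}")
print(f"dependency calls per incoming request: {fan_out:.1f}")
```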
Fallback.
Fail silent.
 Fail fast.

Shed load.
Single Network Request from Clients
     (use LAN instead of WAN)

  Send Only The Bytes That Matter
 (optimize responses for each client)

       Leverage Concurrency
  (but abstract away its complexity)
Single Network Request from Clients
     (use LAN instead of WAN)

  [Diagram: Device, Server, Netflix API]

  landing page requires ~dozen API requests

  some clients are limited in the number of concurrent network connections

  network latency makes this even worse
  (mobile, home, wifi, geographic distance, etc)

  push call pattern to server ... and eliminate redundant calls
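Pushing the call pattern to the server can be pictured as a single endpoint that performs the dozen calls itself, concurrently and without duplicates, and returns one response. This is a sketch under assumptions: `fetch` stands in for a backend call over the LAN, and `landing_page` and the resource names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(resource: str) -> dict:
    # Stand-in for a call from the API server to a backend service over the LAN
    return {"resource": resource, "items": [f"{resource}-item"]}


def landing_page(resources: list) -> dict:
    """Serve what used to be many client requests as one server-side fan-out."""
    unique = list(dict.fromkeys(resources))  # drop redundant calls, keep order
    with ThreadPoolExecutor(max_workers=max(1, len(unique))) as pool:
        results = list(pool.map(fetch, unique))  # calls run concurrently
    return {result["resource"]: result["items"] for result in results}


page = landing_page(["user", "lists", "user", "ratings"])
print(page)
```

The device then pays the WAN latency once, while the fan-out happens inside the data center where round trips are cheap.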
Send Only The Bytes That Matter
(optimize responses for each client)

  [Diagram: Device, Server, Netflix API, Service Layer, Client]

  part of client now on server

  client retrieves and delivers exactly what their device needs in its optimal format

  interface is now a Java API that client interacts with at a granular level
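A minimal sketch of per-device response shaping, assuming the server-side part of each client declares which fields its device needs. The field names, device names and `shape_response` are invented for illustration.

```python
def shape_response(video: dict, fields: tuple) -> dict:
    """Keep only the fields a given device asked for."""
    return {name: video[name] for name in fields if name in video}


# Hypothetical full record as returned by the service layer
video = {
    "id": 123456,
    "title": "Example Title",
    "synopsis": "A longer description that small screens never show.",
    "box_art_hd": "hd.jpg",
    "box_art_sd": "sd.jpg",
}

# Hypothetical per-device field lists maintained by each client team
DEVICE_FIELDS = {
    "tv": ("id", "title", "synopsis", "box_art_hd"),
    "phone": ("id", "title", "box_art_sd"),
}

print(shape_response(video, DEVICE_FIELDS["phone"]))
```

Each device receives only the bytes it will render, instead of a one-size-fits-all payload that it has to download and then discard.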
Leverage Concurrency
(but abstract away its complexity)

  [Diagram: Device, Server, Netflix API, Service Layer, Client]

  no synchronized, volatile, locks, Futures or Atomic*/Concurrent* classes in client-server code
Service calls are all asynchronous.
Functional programming with higher-order functions.

    def video1Call = api.getVideos(api.getUser(), 123456, 7891234);
    def video2Call = api.getVideos(api.getUser(), 6789543);

    // higher-order functions used to compose asynchronous calls together
    wx.merge(video1Call, video2Call).toList().subscribe([
        onNext: {
            listOfVideos ->
            for(video in listOfVideos) {
                response.getWriter().println("video: " + video.id + " " + video.title);
            }
        },
        onError: {
            exception ->
            response.setStatus(500);
            response.getWriter().println("Error: " + exception.getMessage());
        }
    ])

Fully asynchronous API - Clients can’t block
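For comparison, the same shape of composition can be sketched with Python's asyncio. This is an analogy rather than a port: `get_videos`, `handle_request` and `respond` are hypothetical, and `asyncio.gather` returns results in argument order, whereas a merge would interleave them as they arrive.

```python
import asyncio


async def get_videos(user: str, *video_ids: int) -> list:
    await asyncio.sleep(0)  # stand-in for non-blocking I/O to a backend service
    return [{"id": vid, "title": f"title-{vid}"} for vid in video_ids]


async def handle_request() -> list:
    # Both calls are started together; gather resolves once both have completed
    first, second = await asyncio.gather(
        get_videos("user", 123456, 7891234),
        get_videos("user", 6789543),
    )
    return [f"video: {v['id']} {v['title']}" for v in first + second]


async def respond() -> tuple:
    # Plays the role of the onNext / onError pair: a status plus response lines
    try:
        return 200, await handle_request()
    except Exception as exc:
        return 500, [f"Error: {exc}"]


status, lines = asyncio.run(respond())
print(status)
for line in lines:
    print(line)
```

The calling code contains no locks or thread management; concurrency is expressed only through composition of the asynchronous calls.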
Device
                             Server

                                      Netflix API




Optimize for each device. Leverage the server.
Netflix is Hiring
                              http://jobs.netflix.com




Fault Tolerance in a High Volume, Distributed System
     http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html



          Making the Netflix API More Resilient
    http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html



              Why REST Keeps Me Up At Night
 http://blog.programmableweb.com/2012/05/15/why-rest-keeps-me-up-at-night/



                         Ben Christensen
                           @benjchristensen
                  http://www.linkedin.com/in/benjchristensen

Editor's Notes

  2. The Netflix API serves all streaming devices and acts as the broker between backend Netflix systems and the user interfaces running on the 800+ devices that support Netflix streaming. It receives more than 1 billion incoming calls per day, which in turn fan out to several billion outgoing calls (an average ratio of 1:7) to dozens of underlying subsystems, with peaks of over 200k dependency requests per second.
  3. The first half of the presentation discusses the resilience engineering implemented to handle failure and latency at the integration points with the various dependencies.
  4. Even when all dependencies are performing well, the aggregate impact of just 0.01% downtime on each of dozens of services can add up to hours of downtime a month if the system is not engineered for resilience.
  8. High-volume, high-availability applications must build fault and latency tolerance into their architecture rather than expect infrastructure to solve it for them.
  17. Sample of one dependency circuit over 12 hours from a production cluster, at a rate of 75 rps on a single server. Each execution occurs in a separate thread; the first three legend values show its median, 90th and 99th percentile latencies. The last three legend values show the same percentiles for the calling thread. Thus the median cost of the thread is 1.62 ms - 1.57 ms = 0.05 ms, and at the 90th percentile it is 4.57 ms - 2.05 ms = 2.52 ms.
  28. The second half of the presentation discusses architectural changes that allow the API to be optimized for each Netflix device, as opposed to a generic one-size-fits-all API that treats all devices the same.
  29. Netflix has over 800 unique devices that fall into several dozen classes, each with its own user experience, calling patterns, capabilities and needs from the data, and thus from the API.
  30. The one-size-fits-all API results in chatty clients, some requiring about a dozen requests to render a page.
  33. The client should make a single request and push the 'chatty' part to the server, where low-latency networks and multi-core servers can perform the work far more efficiently.
  35-37. The client now extends over the network barrier and runs a portion of itself in the server. The client sends requests over HTTP to its other half running in the server, which can then call a Java API at a very granular level to retrieve exactly what it needs and return an optimized response suited to the device's exact requirements and user experience.
  38-40. Concurrency is abstracted away behind an asynchronous API, and data is retrieved, transformed and composed using higher-order functions (such as map, mapMany, merge, zip, take, toList, etc). Groovy is used for its closure support, which lends itself well to the functional programming style.
  41. The Netflix API is becoming a platform that empowers user-interface teams to build their own API endpoints that are optimized to their client applications and devices.
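
The availability arithmetic behind note 4 can be checked with a few lines of Java. This is a back-of-the-envelope sketch under stated assumptions: dependencies fail independently, every request needs all of them, and a month has 30 days. The figure of 30 dependencies at 99.99% comes from the deck's uptime slide and the 1 billion calls per day from note 2; the class and method names are hypothetical.

```java
// Back-of-the-envelope check of the aggregate uptime figures.
// Assumes independent failures and that every request needs all dependencies.
class AggregateUptimeSketch {

    // Probability that all dependencies are up at the same time.
    static double aggregateAvailability(double perDependency, int dependencies) {
        return Math.pow(perDependency, dependencies);
    }

    // Expected downtime in hours over a month of the given length.
    static double downtimeHoursPerMonth(double availability, int daysInMonth) {
        return (1.0 - availability) * daysInMonth * 24.0;
    }

    // Expected number of failed calls out of a daily call volume.
    static double failedCalls(double availability, double callsPerDay) {
        return (1.0 - availability) * callsPerDay;
    }

    public static void main(String[] args) {
        double availability = aggregateAvailability(0.9999, 30);
        System.out.printf("aggregate availability: %.4f%%%n", availability * 100.0);
        System.out.printf("downtime per 30-day month: %.2f hours%n",
                downtimeHoursPerMonth(availability, 30));
        System.out.printf("failed calls per 1 billion: %.0f%n",
                failedCalls(availability, 1_000_000_000.0));
    }
}
```

Under these assumptions the result lands at roughly 99.7% availability, a little over two hours of downtime a month and about three million failed calls a day, in line with the figures quoted in the deck.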