Performance and Fault Tolerance
for the Netflix API
Silicon Valley Cloud Computing Group - July 18 2012

Ben Christensen
Software Engineer – API Platform at Netflix
@benjchristensen
http://www.linkedin.com/in/benjchristensen




http://techblog.netflix.com/


                                                      1
Netflix API

[Slide 2 diagram: the Netflix API fanning out to backend Dependencies A through R]
The Netflix API serves all streaming devices and acts as the broker between backend Netflix systems and the user interfaces running on the 800+ devices that support Netflix streaming.

More than 1 billion incoming calls per day are received, which in turn fan out to several billion outgoing calls (an average ratio of 1:6) to dozens of underlying subsystems, with peaks of over 200k dependency requests per second.
Netflix API

[Slide 3 diagram: the Netflix API fanning out to backend Dependencies A through R]
The first half of the presentation discusses the resilience engineering implemented to handle failure and latency at the integration points with the various dependencies.
Dozens of dependencies.

                                  One going bad takes everything down.


                   99.99%^30 = 99.7% uptime
                   0.3% of 1 billion = 3,000,000 failures

                                 2+ hours downtime/month
                     even if all dependencies have excellent uptime.

                                                     Reality is generally worse.


                                                                                                                                                                                   4
Even when all dependencies are performing well, the aggregate impact of just 0.01% downtime on each of dozens of services adds up to potentially hours of downtime per month if the system is not engineered for resilience.
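
The arithmetic behind the slide, assuming 30 dependencies that are each available 99.99% of the time and that a failure in any one of them fails the request:

      0.9999^30 ≈ 0.997             →  ~99.7% aggregate uptime
      1 − 0.997 = 0.003             →  0.3% of 1 billion requests ≈ 3,000,000 failures
      0.003 × 30 days × 24 hours    ≈  2+ hours of downtime per month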
5
6
7
Latency is far worse for system resilience than failure. Failures naturally “fail fast” and shed load, whereas latency backs up queues, threads and system resources; if isolation techniques are not used, it can cause an entire system to fail.
> 80% of requests rejected




                         Median Latency

                           [Sat Jun 30 04:01:37 2012] [error] proxy: HTTP: disabled connection for (127.0.0.1)


"Timeout guard" daemon prio=10 tid=0x00002aaacd5e5000 nid=0x3aac runnable [0x00002aaac388f000] java.lang.Thread.State: RUNNABLE
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
- locked <0x000000055c7e8bd8> (a java.net.SocksSocketImpl)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
at java.net.Socket.connect(Socket.java:579)
at java.net.Socket.connect(Socket.java:528)
at java.net.Socket.<init>(Socket.java:425)
at java.net.Socket.<init>(Socket.java:280)
at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$1.doit(ControllerThreadSocketFactory.java:91)
at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$SocketTask.run(ControllerThreadSocketFactory.java:158)
at java.lang.Thread.run(Thread.java:722)

                                                                                                                                                                                        8
This is an example of what a system looks like when high latency occurs without load shedding and isolation. Backend latency spiked (from <100ms to >1000ms at the median, >10,000 at
the 90th percentile) and saturated all available resources resulting in the HTTP layer rejecting over 80% of requests.
No single dependency should
                                               take down the entire app.

                                                                                  Fallback.
                                                                                  Fail silent.
                                                                                   Fail fast.

                                                                                Shed load.



                                                                                                                                                                                                           9
It is a requirement of high volume, high availability applications to build fault and latency tolerance into their architecture. Infrastructure is an aspect of resilience engineering, but it cannot be relied upon by itself; the software itself must be resilient.
10
Netflix uses a combination of aggressive network timeouts, tryable semaphores and thread pools to isolate dependencies and limit the impact of both failure and latency.
Tryable semaphores for “trusted” clients and fallbacks

       Separate threads for “untrusted” clients

 Aggressive timeouts on threads and network calls
             to “give up and move on”

       Circuit breakers as the “release valve”



                                                     11
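
As a minimal sketch of the "separate threads plus aggressive timeout" technique listed above (illustrative only; the class and method names are hypothetical, and this is not the actual DependencyCommand implementation described later):

      import java.util.concurrent.*;

      // Illustrative only: run a dependency call on its own small, bounded pool and
      // give up after an aggressive timeout instead of letting the caller block.
      class IsolatedDependencyCall {
          // one dedicated pool per dependency, never shared across dependencies
          private static final ExecutorService DEPENDENCY_POOL = new ThreadPoolExecutor(
                  10, 10, 1, TimeUnit.MINUTES,
                  new LinkedBlockingQueue<Runnable>(5)); // small bounded queue: reject rather than back up

          public String fetch() {
              Future<String> future;
              try {
                  future = DEPENDENCY_POOL.submit(this::callRemoteService);
              } catch (RejectedExecutionException poolFull) {
                  return fallback();                       // pool and queue full: shed load immediately
              }
              try {
                  return future.get(200, TimeUnit.MILLISECONDS); // "give up and move on"
              } catch (TimeoutException | InterruptedException | ExecutionException e) {
                  future.cancel(true);                     // abandon the latent or failed call
                  return fallback();
              }
          }

          private String callRemoteService() { return "live response"; }  // stand-in for the network call
          private String fallback()          { return "static fallback"; }
      }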
12
With isolation techniques the application container is now segmented according to how it uses its underlying dependencies instead of using a single shared resource pool to communicate
with all of them.
13
A single failing dependency will no longer be permitted to take more resources than it was allocated, and its impact can be controlled.
14
In this case the backend service has become latent and saturated all the threads allocated to it, so further requests to it are rejected (the orange line) instead of blocking or consuming all available system threads.
15
30 rps × 0.2 seconds = 6 concurrent requests + breathing room = 10 threads

Thread-pool Queue size: 5-10 (0 doesn't work but get close to it)

                Thread-pool Size + Queue Size

                      Queuing is Not Free




                                                                    16
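
A hedged sketch of how that sizing might be expressed in code. The 30 rps, 200 ms, 10 threads and queue of 5 come from the slide; the executor setup itself is illustrative, not the actual Netflix configuration:

      import java.util.concurrent.*;

      // Sizing rule from the slide: peak rps × p99 latency in seconds, plus breathing room.
      // 30 rps × 0.2 s = 6 calls in flight  →  10 threads with headroom, queue of 5.
      class DependencyPoolSizing {
          static ThreadPoolExecutor newDependencyPool() {
              int threads = 10;
              int queueSize = 5;   // keep the queue small: queuing is not free
              return new ThreadPoolExecutor(
                      threads, threads,
                      1, TimeUnit.MINUTES,
                      new LinkedBlockingQueue<Runnable>(queueSize),
                      new ThreadPoolExecutor.AbortPolicy()); // reject (shed load) once threads + queue are full
          }
      }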
Cost of Thread @ ~60rps
                                                    mean - median - 90th - 99th (time in ms)




                                                                                   Time for thread to execute                    Time user thread waited


                                                                                                                                                                                        17
The Netflix API has ~30 thread pools with 5-20 threads in each. A common question and concern is what impact this has on performance.

Here is a sample of a dependency circuit for 24 hours from the Netflix API production cluster with a rate of 60rps per server.

Each execution occurs in a separate thread with mean, median, 90th and 99th percentile latencies shown in the first 4 legend values. The second group of 4 values is the user thread
waiting on the dependency thread and shows the total time including queuing, scheduling, execution and waiting for the return value from the Future.

This example was chosen since it is relatively high volume and low latency so the cost of a separate thread is potentially more of a concern than if the backend network latency was 100ms
or higher.
Cost of Thread @ ~60rps
                                                    mean - median - 90th - 99th (time in ms)




                   Cost: 0ms                                       Time for thread to execute   Time user thread waited


                                                                                                                          18
At the median (and lower) there is no cost to having a separate
thread.
Cost of Thread @ ~60rps
                                                     mean - median - 90th - 99th (time in ms)




                   Cost: 3ms                                          Time for thread to execute   Time user thread waited


                                                                                                                             19
At the 90th percentile there is a cost of 3ms for having a separate
thread.
Cost of Thread @ ~60rps
                                                    mean - median - 90th - 99th (time in ms)




                   Cost: 9ms                                                       Time for thread to execute                        Time user thread waited


                                                                                                                                                                                               20
At the 99th percentile there is a cost of 9ms for having a separate thread. Note, however, that the increase in cost is far smaller than the increase in execution time of the separate thread, which jumped from 2ms to 28ms, whereas the cost went from 0ms to 9ms.

This overhead at the 90th percentile and higher for circuits such as these has been deemed acceptable for the benefits of resilience achieved.

For circuits that wrap very low latency requests (such as those primarily hitting in-memory caches) the overhead can be too high, and in those cases we choose to use tryable semaphores, which do not allow for timeouts but provide most of the resilience benefits without the overhead. In general, though, the overhead is small enough that we prefer the isolation benefits of a separate thread.
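
A minimal sketch of the tryable-semaphore alternative mentioned above, assuming the point is simply to bound concurrency on the calling thread without a separate thread or timeout (names and the permit count are illustrative):

      import java.util.concurrent.Semaphore;

      // Illustrative tryable-semaphore isolation: bound concurrent executions on the
      // calling thread itself (no separate thread, so no timeout) and reject the rest.
      class SemaphoreIsolatedCall {
          private final Semaphore permits = new Semaphore(50); // max concurrent executions (example value)

          public String fetch() {
              if (!permits.tryAcquire()) {       // "tryable": never block waiting for a permit
                  return fallbackValue();        // over the limit: shed load / fall back immediately
              }
              try {
                  return callInMemoryCache();    // runs on the caller's thread; assumed to be very low latency
              } finally {
                  permits.release();
              }
          }

          private String callInMemoryCache() { return "cached value"; }
          private String fallbackValue()     { return "static fallback"; }
      }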
Cost of Thread @ ~75rps
                                                    mean - median - 90th - 99th (time in ms)




                                                                                  Time for thread to execute                       Time user thread waited


                                                                                                                                                                                            21
This is a second sample of a dependency circuit for 24 hours from the Netflix API production cluster with a rate of 75rps per server.

As with the first example this was chosen since it is relatively high volume and low latency so the cost of a separate thread is potentially more of a concern than if the backend network
latency was 100ms or higher.

Each execution occurs in a separate thread with mean, median, 90th and 99th percentile latencies shown in the first 4 legend values. The second group of 4 values is the user thread
waiting on the dependency thread and shows the total time including queuing, scheduling, execution and waiting for the return value from the Future.
Cost of Thread @ ~75rps
                                                    mean - median - 90th - 99th (time in ms)




                   Cost: 0ms                                       Time for thread to execute   Time user thread waited


                                                                                                                          22
At the median (and lower) there is no cost to having a separate
thread.
Cost of Thread @ ~75rps
                                                     mean - median - 90th - 99th (time in ms)




                   Cost: 2ms                                          Time for thread to execute   Time user thread waited


                                                                                                                             23
At the 90th percentile there is a cost of 2ms for having a separate
thread.
Cost of Thread @ ~75rps
                                                     mean - median - 90th - 99th (time in ms)




                   Cost: 2ms                                          Time for thread to execute   Time user thread waited


                                                                                                                             24
At the 99th percentile there is a cost of 2ms for having a separate
thread.
Netflix DependencyCommand Implementation




                                          25
Netflix DependencyCommand Implementation
(1) Construct DependencyCommand Object
On each dependency invocation, a DependencyCommand object is constructed with the arguments necessary to make the call to the server.
For example:
      DependencyCommand command = new DependencyCommand(arg1, arg2)
(2) Execution Synchronously or Asynchronously
Execution of the command can then be performed synchronously or asynchronously:
      K value = command.execute()
      Future<K> value = command.queue()
The synchronous call execute() invokes queue().get() unless the command is specified to not run in a thread.
(3) Is Circuit Open?
Upon execution of the command it first checks with the circuit-breaker to ask "is the circuit open?".
If the circuit is open (tripped), the command will not be executed and flow is routed to (8) DependencyCommand.getFallback().
If the circuit is closed, the command will be executed and flow continues to (5) DependencyCommand.run().
(4) Is Thread Pool/Queue Full?
If the thread-pool and queue associated with the command is full then the execution will be rejected and immediately routed through fallback (8).
If the command does not run within a thread then this logic will be skipped.
(5) DependencyCommand.run()
The concrete implementation run() method is executed.
(5a) Command Timeout
The run() method executes within a thread with a timeout; if it takes too long, the thread will throw a TimeoutException. In that case the response is routed
through fallback (8) and the eventual run() method response is discarded.
If the command does not run within a thread then this logic is not applicable.




                                                                                                                                                                26
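
A hedged sketch of what a concrete command might look like from the caller's side, following steps (1), (2) and (5) above. The exact DependencyCommand base-class signature (generic type parameter, protected run()/getFallback() hooks) is an assumption, and the user-service names are hypothetical:

      // Hypothetical subclass; assumes a base class shaped like DependencyCommand<K>
      // with protected run() and getFallback() hooks as described in the flow above.
      class GetUserNameCommand extends DependencyCommand<String> {
          private final long userId;

          GetUserNameCommand(long userId) {            // (1) constructed with the call arguments
              this.userId = userId;
          }

          @Override
          protected String run() throws Exception {    // (5) the network call, executed in an isolated thread
              return UserServiceClient.getName(userId);    // hypothetical remote client
          }

          @Override
          protected String getFallback() {             // (8) fallback, covered on the next slide
              return "Unknown User";                   // static response, no network access
          }
      }

      // (2) execution, synchronously or asynchronously:
      //   String name = new GetUserNameCommand(42).execute();
      //   Future<String> nameFuture = new GetUserNameCommand(42).queue();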
Netflix DependencyCommand Implementation
(6) Is Command Successful?
Application flow is routed based on the response from the run() method.
(6a) Successful Response
If no exceptions are thrown and a response is returned (including a null value) then it proceeds to return the response after some logging and a performance
check.
(6b) Failed Response
When the run() method throws an exception, the command is marked as "failed", which contributes to potentially tripping the circuit open, and application flow is routed to (8)
DependencyCommand.getFallback().
(7) Calculate Circuit Health
Successes, failures, rejections and timeouts are all reported to the circuit breaker to maintain a rolling set of counters which calculate statistics.
These stats are then used to determine when the circuit should "trip" and become open, at which point subsequent requests are short-circuited until a period of
time passes and requests are permitted again once health checks succeed.
(8) DependencyCommand.getFallback()
The fallback is performed whenever a command execution fails (an exception is thrown by (5) DependencyCommand.run()) or when it is (3) short-circuited
because the circuit is open.
The intent of the fallback is to provide a generic response without any network dependency from an in-memory cache or other static logic.
(8a) Fallback Not Implemented
If DependencyCommand.getFallback() is not implemented then an exception will be thrown and the caller is left to deal with it.
(8b) Fallback Successful
If the fallback returns a response then it will be returned to the caller.
(8c) Fallback Failed
If DependencyCommand.getFallback() fails and throws an exception then the caller is left to deal with it.
It is considered poor practice to have a fallback implementation that can fail; a fallback should not perform any logic that could fail. Semaphores are wrapped
around fallback execution to protect against software bugs that do not comply with this principle, particularly if the fallback itself attempts a network call that can
be latent.
(9) Return Successful Response
If (6a) occurred the successful response will be returned to the caller regardless of whether it was latent or not.


                                                                                                                                                                           27
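
A rough sketch of the health calculation in step (7). The real implementation uses rolling, bucketed counters; this simplification uses plain totals, and the 50% threshold and 5 second sleep window are illustrative values, not the actual configuration:

      import java.util.concurrent.atomic.AtomicLong;

      // Simplified circuit breaker: trip open when the error percentage crosses a threshold,
      // then allow a single health-check request after a sleep window.
      class SimpleCircuitBreaker {
          private final AtomicLong successes = new AtomicLong();
          private final AtomicLong errors = new AtomicLong();     // failures + timeouts + rejections
          private volatile long openedAtMillis = -1;              // -1 means the circuit is closed

          private static final int ERROR_THRESHOLD_PERCENT = 50;  // illustrative value
          private static final long SLEEP_WINDOW_MILLIS = 5000;   // illustrative value

          boolean allowRequest() {
              if (openedAtMillis < 0) return true;                 // closed: allow everything
              // open: short-circuit, but let one request through after the sleep window as a health check
              return System.currentTimeMillis() - openedAtMillis > SLEEP_WINDOW_MILLIS;
          }

          void markSuccess() {
              successes.incrementAndGet();
              openedAtMillis = -1;                                 // a successful health check closes the circuit
          }

          void markError() {
              long errorCount = errors.incrementAndGet();
              long total = errorCount + successes.get();
              if ((errorCount * 100) / total >= ERROR_THRESHOLD_PERCENT) {
                  openedAtMillis = System.currentTimeMillis();     // trip the circuit open
              }
          }
      }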
Netflix DependencyCommand Implementation

              Fallbacks

               Cache
         Eventual Consistency
            Stubbed Data
           Empty Response




                                          28
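
A hedged sketch of the four fallback styles listed above, with hypothetical types and values; each would be the body of a getFallback() implementation for some command:

      import java.util.Collections;
      import java.util.List;
      import java.util.Map;
      import java.util.Queue;

      // Illustrative only: one example of each fallback style from the slide.
      class FallbackStyles {
          // Cache: return the last value seen for this key, possibly stale.
          static String fromCache(Map<Long, String> localCache, long videoId) {
              return localCache.getOrDefault(videoId, "unknown");
          }

          // Eventual consistency: accept the work locally and let it be replayed against the backend later.
          static boolean acceptForLaterReplay(Queue<Runnable> retryQueue, Runnable pendingWrite) {
              return retryQueue.offer(pendingWrite);
          }

          // Stubbed data: fill in sensible defaults for fields that normally come from the dependency.
          static String stubbedTitle() {
              return "Title unavailable";
          }

          // Empty response: return nothing and let the UI degrade gracefully.
          static List<String> emptyList() {
              return Collections.emptyList();
          }
      }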
Netflix DependencyCommand Implementation




                                          29
So, how does it work in the real world?




                                          30
Visualizing Circuits in Near-Realtime
                                                  (latency is single-digit seconds, generally 1-2)




                                                                                                                                                                                  31
This is an example of our monitoring system which provides low-latency (1-2 seconds typically) visibility into the traffic and health of all DependencyCommand circuits across a
cluster.
[Dashboard annotations]
    Circle color and size represent health and traffic volume
    Error percentage over the last 10 seconds
    Request rate
    2 minutes of request rate to show relative changes in traffic
    Circuit-breaker status
    Hosts reporting from the cluster
    Last-minute latency percentiles
    Rolling 10 second counters with 1 second granularity:
        Successes, Thread timeouts, Short-circuited (rejected), Thread-pool Rejections, Failures/Exceptions

                                                                                                         32
API Daily Incoming vs Outgoing

[Chart annotation: weekends marked]




              8-10 Billion DependencyCommand Executions (threaded)




                        1.2 - 1.6 Billion Incoming Requests




                                                                               33
API Hourly Incoming vs Outgoing

  Peak at 200k+ threaded DependencyCommand executions/second




               Peak at 30k+ incoming requests/second




                                                               34
35
This view of the dashboard was captured during a latency monkey simulation to test resilience against latency (http://techblog.netflix.com/2011/07/netflix-simian-army.html) and shows
how several of the DependencyCommands degraded in health, showing timeouts, thread-pool rejections, short-circuiting and failures.

The DependencyCommands of dependencies not affected by latency were unaffected.

During this test no users were prevented from using Netflix on any device. Instead, fallbacks and graceful degradation occurred, and as soon as latency was removed all systems returned
to health within seconds.
36
This was another latency monkey simulation that affected a single
DependencyCommand.
Latency spikes from a ~30ms median to 2,000+ ms and then 10,000+ ms




    Success drops off, Timeouts and Short Circuiting shed load



                                                               Peak at 100M+ incoming requests (30k+/second)




                                                                                                                                                                                  37
These graphs show the full duration of a latency monkey simulation (which looks similar to real production events): latency occurred and the DependencyCommand timed out, short-circuited the requests and returned fallbacks.
38
Fallback.
Fail silent.
 Fail fast.

Shed load.




               39
Netflix API

[Slide 40 diagram: the Netflix API fanning out to backend Dependencies A through R]
The second half of the presentation discusses architectural changes that enable optimizing the API for each Netflix device, as opposed to a generic one-size-fits-all API that treats all devices the same.
Single Network Request from Clients
                                                       (use LAN instead of WAN)




                                                                                                       Device
                                                                                                                Server


                                                                                                                         Netflix API
                                                          landing page requires
                                                           ~dozen API requests

                                                                                                                                      41
The one-size-fits-all API results in chatty clients, some requiring a dozen or so requests to render a page.
Single Network Request from Clients
        (use LAN instead of WAN)




some clients are limited in the number of
   concurrent network connections

                                            42
Single Network Request from Clients
         (use LAN instead of WAN)




network latency makes this even worse
(mobile, home, wifi, geographic distance, etc)

                                                43
Single Network Request from Clients
                                                      (use LAN instead of WAN)




                                                                                         Device
                                                                                         Server

                                                                                                  Netflix API




                                            push call pattern to server ...

                                                                                                                                                                   44
The client should make a single request and push the 'chatty' part to the server where low-latency networks and multi-core servers can perform the work far more
efficiently.
Single Network Request from Clients
     (use LAN instead of WAN)




                Device
                Server

                         Netflix API




 ... and eliminate redundant calls

                                      45
46
Send Only The Bytes That Matter
                                           (optimize responses for each client)




                                                                                                            Netflix API
                                                                                     Device
                                                                                              Server
                                  Client                                                               Client




                                                part of client now on server
                                                                                                                                                                                               47
The client now extends over the network barrier and runs a portion of itself in the server. The client sends requests over HTTP to its other half running in the server, which can then access a
Java API at a very granular level to retrieve exactly what it needs and return an optimized response suited to the device's exact requirements and user experience.
Send Only The Bytes That Matter
              (optimize responses for each client)




                                                   Netflix API
                            Device
                                     Server
     Client                                   Client




the client retrieves and delivers exactly what its
       device needs in its optimal format
                                                                48
Send Only The Bytes That Matter
            (optimize responses for each client)




                    Device
                             Server
                                               Netflix API

                                               Service Layer

Client                                Client




         interface is now a Java API that client
            interacts with at a granular level

                                                               49
Leverage Concurrency
         (but abstract away its complexity)




                Device
                         Server
                                           Netflix API

                                           Service Layer

Client                            Client




                                                           50
Leverage Concurrency
                                         (but abstract away its complexity)




                                                              Device
                                                                       Server
                                                                                                                   Netflix API

                                                                                                                  Service Layer

         Client                                                                 Client




            no synchronized, volatile, locks, Futures or
         Atomic*/Concurrent* classes in client-server code
                                                                                                                                                                                51
Concurrency is abstracted away behind an asynchronous API and data is retrieved, transformed and composed using higher-order functions (such as map, mapMany, merge, zip, take,
toList, etc.). Groovy is used for its closure support, which lends itself well to the functional programming style.
Functional Reactive Programming
                      composable asynchronous functions

Service calls are all asynchronous. Functional programming with higher-order functions.

                     def video1Call = api.getVideos(api.getUser(), 123456, 7891234);
                     def video2Call = api.getVideos(api.getUser(), 6789543);

                     // higher-order functions used to compose asynchronous calls together
                     wx.merge(video1Call, video2Call).toList().subscribe([
                         onNext: { listOfVideos ->
                             for (video in listOfVideos) {
                                 response.getWriter().println("video: " + video.id + " " + video.title);
                             }
                         },
                         onError: { exception ->
                             response.setStatus(500);
                             response.getWriter().println("Error: " + exception.getMessage());
                         }
                     ])

              Fully asynchronous API - Clients can’t block
                                                                                                        52
Request Collapsing
                                                                    batch don’t burst




                                                                                                                                                                                           53
The DependencyCommand resilience layer is leveraged for concurrency, including optimizations such as request collapsing (automated batching), which bundles bursts of calls to the same
service into batches without the client code needing to understand or manually optimize for batching. This is particularly important when client code becomes highly concurrent and data is
requested in multiple different code paths, sometimes written by different engineers. Request collapsing automatically captures and batches the calls together. The collapsing functionality
also supports sharded architectures, so a batch of requests can be split into sub-batches if the client-server relationship requires requests to be routed to a sharded backend.
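
A rough sketch of the request-collapsing idea (not the actual Netflix implementation): callers that arrive within a short window share a Future, and a single batch call is made to the backend when the window closes. All names, the 10 ms window and the rating example are assumptions for illustration:

      import java.util.*;
      import java.util.concurrent.*;

      // Illustrative collapser: requests arriving within a short window are batched
      // into one backend call; each caller gets a Future for its own result.
      class VideoRatingCollapser {
          private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
          private final Object lock = new Object();
          private Map<Long, CompletableFuture<Integer>> pending = new HashMap<>();

          Future<Integer> getRating(long videoId) {
              synchronized (lock) {
                  boolean firstInWindow = pending.isEmpty();
                  CompletableFuture<Integer> future =
                          pending.computeIfAbsent(videoId, id -> new CompletableFuture<>());
                  if (firstInWindow) {
                      timer.schedule(this::flush, 10, TimeUnit.MILLISECONDS); // batch window
                  }
                  return future;
              }
          }

          private void flush() {
              Map<Long, CompletableFuture<Integer>> batch;
              synchronized (lock) {
                  batch = pending;
                  pending = new HashMap<>();
              }
              // one backend request for the whole batch instead of one per caller
              Map<Long, Integer> ratings = fetchRatingsFromBackend(batch.keySet());
              batch.forEach((id, future) -> future.complete(ratings.getOrDefault(id, -1)));
          }

          private Map<Long, Integer> fetchRatingsFromBackend(Set<Long> videoIds) {
              Map<Long, Integer> result = new HashMap<>();   // stand-in for the real batch API call
              for (long id : videoIds) {
                  result.put(id, 5);
              }
              return result;
          }
      }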
Request Collapsing
                                                                batch don’t burst




                      100:1 collapsing ratio (batch size of ~100)
                                                                                               54
This graph shows an extreme example of a dependency where we collapse requests at a ratio of
100:1
Request Collapsing
                                                                    batch don’t burst




                                       4000 rps instead of 400,000 rps




                       100:1 collapsing ratio (batch size of ~100)
                                                                                                                 55
This is the same graph but on a power scale instead of linear so the blue line (actual network requests) shows
up.
Request Scoped Caching
                                            short-lived and concurrency aware




                                                                                                                                                                                               56
Another use of the DependencyCommand layer is to allow client code to perform requests without concern for duplicate network calls due to concurrency.

The Future is atomically cached using "putIfAbsent" in a request scope shared via the ThreadLocals of each thread, so clients can request data in multiple code paths without inefficiency concerns.
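
A very rough sketch of the putIfAbsent pattern described above. Here a ThreadLocal stands in for the per-request context that the real implementation shares; the names and generic wiring are assumptions:

      import java.util.concurrent.*;

      // Illustrative request-scoped de-duplication: the first code path to ask for a key
      // installs a Future with putIfAbsent; everyone else reuses the same Future.
      class RequestScopedCache {
          // stands in for the per-request context; would be cleared at the end of each request (not shown)
          private static final ThreadLocal<ConcurrentHashMap<String, Future<Object>>> REQUEST_CACHE =
                  ThreadLocal.withInitial(ConcurrentHashMap::new);

          @SuppressWarnings("unchecked")
          static <T> Future<T> getOrExecute(String cacheKey, Callable<T> call, ExecutorService pool) {
              ConcurrentHashMap<String, Future<Object>> cache = REQUEST_CACHE.get();
              FutureTask<Object> candidate = new FutureTask<>(() -> (Object) call.call());
              Future<Object> existing = cache.putIfAbsent(cacheKey, candidate);
              if (existing == null) {
                  pool.execute(candidate);   // this caller won the race: actually run the backend call
                  existing = candidate;
              }
              return (Future<T>) existing;   // duplicate requests share the single in-flight call
          }
      }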
Request Caching
                                                                               stateless




                                                                                                                                                                                    57
Some examples of request caching de-duplicating backend calls. On some the impact is reasonably high, while on most it is a small percentage or none at all, but overall it provided a
measurable drop in network calls and, in some client-code use cases, significantly improved latency by eliminating unnecessary network calls.
Device
                                                                                               Server

                                                                                                        Netflix API




               Optimize for each device. Leverage the server.
                                                                                                                                                                58
The Netflix API is becoming a platform that empowers user-interface teams to build their own API endpoints that are optimized to their client applications and
devices.
Netflix is Hiring
                                 http://jobs.netflix.com




   Fault Tolerance in a High Volume, Distributed System
       http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html



             Making the Netflix API More Resilient
      http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html



Embracing the Differences : Inside the Netflix API Redesign
     http://techblog.netflix.com/2012/07/embracing-differences-inside-netflix.html



                            Ben Christensen
                             @benjchristensen
                     http://www.linkedin.com/in/benjchristensen
                                                                                   59

Contenu connexe

Dernier

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 

Dernier (20)

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 

En vedette

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by HubspotMarius Sescu
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTExpeed Software
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 

En vedette (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Performance and Fault Tolerance for the Netflix API - July 18 2012

  • 1. Performance and Fault Tolerance for the Netflix API Silicon Valley Cloud Computing Group - July 18 2012 Ben Christensen Software Engineer – API Platform at Netflix @benjchristensen http://www.linkedin.com/in/benjchristensen http://techblog.netflix.com/ 1
  • 2. Netflix API Dependency A Dependency B Dependency C Dependency D Dependency E Dependency F Dependency G Dependency H Dependency I Dependency J Dependency K Dependency L Dependency M Dependency N Dependency O Dependency P Dependency Q Dependency R 2 The Netflix API serves all streaming devices and acts as the broker between backend Netflix systems and the user interfaces running on the 800+ devices that support Netflix streaming. More than 1 billion incoming calls per day are received which in turn fans out to several billion outgoing calls (averaging a ratio of 1:6) to dozens of underlying subsystems with peaks of over 200k dependency requests per second.
  • 3. Netflix API Dependency A Dependency B Dependency C Dependency D Dependency E Dependency F Dependency G Dependency H Dependency I Dependency J Dependency K Dependency L Dependency M Dependency N Dependency O Dependency P Dependency Q Dependency R 3 First half of the presentation discusses resilience engineering implemented to handle failure and latency at the integration points with the various dependencies.
  • 4. Dozens of dependencies. One going bad takes everything down. 99.99%30 = 99.7% uptime 0.3% of 1 billion = 3,000,000 failures 2+ hours downtime/month even if all dependencies have excellent uptime. Reality is generally worse. 4 Even when all dependencies are performing well the aggregate impact of even 0.01% downtime on each of dozens of services equates to potentially hours a month of downtime if not engineered for resilience.
  • 5. 5
  • 6. 6
  • 7. 7 Latency is far worse for system resilience than failure. Failures naturally “fail fast” and shed load whereas latency backs up queues, threads and system resources and if isolation techniques are not used it can cause an entire system to fail.
  • 8. > 80% of requests rejected Median Latency [Sat Jun 30 04:01:37 2012] [error] proxy: HTTP: disabled connection for (127.0.0.1) "Timeout guard" daemon prio=10 tid=0x00002aaacd5e5000 nid=0x3aac runnable [0x00002aaac388f000] java.lang.Thread.State: RUNNABLE at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) - locked <0x000000055c7e8bd8> (a java.net.SocksSocketImpl) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391) at java.net.Socket.connect(Socket.java:579) at java.net.Socket.connect(Socket.java:528) at java.net.Socket.(Socket.java:425) at java.net.Socket.(Socket.java:280) at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80) at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$1.doit(ControllerThreadSocketFactory.java:91) at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$SocketTask.run(ControllerThreadSocketFactory.java:158) at java.lang.Thread.run(Thread.java:722) 8 This is an example of what a system looks like when high latency occurs without load shedding and isolation. Backend latency spiked (from <100ms to >1000ms at the median, >10,000 at the 90th percentile) and saturated all available resources resulting in the HTTP layer rejecting over 80% of requests.
  • 9. No single dependency should take down the entire app. Fallback. Fail silent. Fail fast. Shed load. 9 It is a requirement of high volume, high availability applications to build fault and latency tolerance into their architecture. Infrastructure is an aspect of resilience engineering but it can not be relied upon by itself - software must be resilient.
  • 10. 10 Netflix uses a combination of aggressive network timeouts, tryable semaphores and thread pools to isolate dependencies and limit impact of both failure and latency.
  • 11. Tryable semaphores for “trusted” clients and fallbacks Separate threads for “untrusted” clients Aggressive timeouts on threads and network calls to “give up and move on” Circuit breakers as the “release valve” 11
  • 12. 12 With isolation techniques the application container is now segmented according to how it uses its underlying dependencies instead of using a single shared resource pool to communicate with all of them.
  • 13. 13 A single dependency failing will no longer be permitted to take more resources than it was allocated and can have its impact controlled.
  • 14. 14 In this case the backend service has become latent and saturates all available threads allocated to it so further requests to it are rejected (the orange line) instead of blocking or using up all available system threads.
• 16. Thread-pool Size + Queue Size
   30 rps x 0.2 seconds = 6, + breathing room = 10 threads
   Thread-pool queue size: 5-10 (0 doesn't work but get close to it)
   Queuing is Not Free 16
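The sizing on this slide (peak requests per second times typical latency, plus breathing room, with a deliberately small queue) could be expressed as a bounded pool like the sketch below; the numbers mirror the slide and the class name is illustrative.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    // Sizing sketch: 30 peak rps * 0.2 s latency = ~6 calls in flight; round up to 10 threads
    // for breathing room. Keep the queue small (5-10): queuing is not free, it only hides latency.
    public class DependencyPoolSizing {
        static ThreadPoolExecutor newDependencyPool() {
            int poolSize = 10;
            int queueSize = 5;
            return new ThreadPoolExecutor(
                    poolSize, poolSize,
                    1, TimeUnit.MINUTES,
                    new ArrayBlockingQueue<>(queueSize),
                    new ThreadPoolExecutor.AbortPolicy()); // reject (shed load) when pool and queue are full
        }
    }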
  • 17. Cost of Thread @ ~60rps mean - median - 90th - 99th (time in ms) Time for thread to execute Time user thread waited 17 The Netflix API has ~30 thread pools with 5-20 threads in each. A common question and concern is what impact this has on performance. Here is a sample of a dependency circuit for 24 hours from the Netflix API production cluster with a rate of 60rps per server. Each execution occurs in a separate thread with mean, median, 90th and 99th percentile latencies shown in the first 4 legend values. The second group of 4 values is the user thread waiting on the dependency thread and shows the total time including queuing, scheduling, execution and waiting for the return value from the Future. This example was chosen since it is relatively high volume and low latency so the cost of a separate thread is potentially more of a concern than if the backend network latency was 100ms or higher.
  • 18. Cost of Thread @ ~60rps mean - median - 90th - 99th (time in ms) Cost: 0ms Time for thread to execute Time user thread waited 18 At the median (and lower) there is no cost to having a separate thread.
  • 19. Cost of Thread @ ~60rps mean - median - 90th - 99th (time in ms) Cost: 3ms Time for thread to execute Time user thread waited 19 At the 90th percentile there is a cost of 3ms for having a separate thread.
  • 20. Cost of Thread @ ~60rps mean - median - 90th - 99th (time in ms) Cost: 9ms Time for thread to execute Time user thread waited 20 At the 99th percentile there is a cost of 9ms for having a separate thread. Note however that the increase in cost is far smaller than the increase in execution time of the separate thread which jumped from 2 to 28 whereas the cost jumped from 0 to 9. This overhead at the 90th percentile and higher for circuits such as these has been deemed acceptable for the benefits of resilience achieved. For circuits that wrap very low latency requests (such as those primarily hitting in-memory caches) the overhead can be too high and in those cases we choose to use tryable semaphores which do not allow for timeouts but provide most of the resilience benefits without the overhead. The overhead in general is small enough that we prefer the isolation benefits of a separate thread.
  • 21. Cost of Thread @ ~75rps mean - median - 90th - 99th (time in ms) Time for thread to execute Time user thread waited 21 This is a second sample of a dependency circuit for 24 hours from the Netflix API production cluster with a rate of 75rps per server. As with the first example this was chosen since it is relatively high volume and low latency so the cost of a separate thread is potentially more of a concern than if the backend network latency was 100ms or higher. Each execution occurs in a separate thread with mean, median, 90th and 99th percentile latencies shown in the first 4 legend values. The second group of 4 values is the user thread waiting on the dependency thread and shows the total time including queuing, scheduling, execution and waiting for the return value from the Future.
  • 22. Cost of Thread @ ~75rps mean - median - 90th - 99th (time in ms) Cost: 0ms Time for thread to execute Time user thread waited 22 At the median (and lower) there is no cost to having a separate thread.
  • 23. Cost of Thread @ ~75rps mean - median - 90th - 99th (time in ms) Cost: 2ms Time for thread to execute Time user thread waited 23 At the 90th percentile there is a cost of 2ms for having a separate thread.
  • 24. Cost of Thread @ ~75rps mean - median - 90th - 99th (time in ms) Cost: 2ms Time for thread to execute Time user thread waited 24 At the 99th percentile there is a cost of 2ms for having a separate thread.
• 26. Netflix DependencyCommand Implementation
   (1) Construct DependencyCommand Object. On each dependency invocation its DependencyCommand object is constructed with the arguments necessary to make the call to the server. For example: DependencyCommand command = new DependencyCommand(arg1, arg2)
   (2) Execution, Synchronously or Asynchronously. Execution of the command can then be performed synchronously or asynchronously: K value = command.execute() or Future<K> value = command.queue(). The synchronous call execute() invokes queue().get() unless the command is specified to not run in a thread.
   (3) Is Circuit Open? Upon execution the command first checks with the circuit breaker: is the circuit open? If the circuit is open (tripped) the command is not executed and flow is routed to (8) DependencyCommand.getFallback(). If the circuit is closed the command is executed and flow continues to (5) DependencyCommand.run().
   (4) Is Thread Pool/Queue Full? If the thread pool and queue associated with the command are full the execution is rejected and immediately routed through fallback (8). If the command does not run within a thread this logic is skipped.
   (5) DependencyCommand.run(). The concrete run() implementation is executed.
   (5a) Command Timeout. The run() method occurs within a thread with a timeout; if it takes too long the thread throws a TimeoutException. In that case the response is routed through fallback (8) and the eventual run() response is discarded. If the command does not run within a thread this logic is not applicable. 26
• 27. Netflix DependencyCommand Implementation
   (6) Is Command Successful? Application flow is routed based on the response from the run() method.
   (6a) Successful Response. If no exceptions are thrown and a response is returned (including a null value) then it proceeds to return the response after some logging and a performance check.
   (6b) Failed Response. When a response throws an exception it is marked as "failed", which contributes to potentially tripping the circuit open, and application flow is routed to (8) DependencyCommand.getFallback().
   (7) Calculate Circuit Health. Successes, failures, rejections and timeouts are all reported to the circuit breaker, which maintains a rolling set of counters used to calculate statistics. These stats then determine when the circuit should "trip" and become open, at which point subsequent requests are short-circuited until a period of time passes and requests are permitted again after health checks succeed.
   (8) DependencyCommand.getFallback(). The fallback is performed whenever a command execution fails (an exception is thrown by (5) DependencyCommand.run()) or when it is (3) short-circuited because the circuit is open. The intent of the fallback is to provide a generic response, without any network dependency, from an in-memory cache or other static logic.
   (8a) Fallback Not Implemented. If DependencyCommand.getFallback() is not implemented then an exception will be thrown and the caller is left to deal with it.
   (8b) Fallback Successful. If the fallback returns a response then it is returned to the caller.
   (8c) Fallback Failed. If DependencyCommand.getFallback() fails and throws an exception then the caller is left to deal with it. It is considered poor practice to have a fallback implementation that can fail; a fallback should not perform any logic that could fail. Semaphores are wrapped around fallback execution to protect against software bugs that do not comply with this principle, particularly if the fallback itself tries to perform a network call that can be latent.
   (9) Return Successful Response. If (6a) occurred the successful response is returned to the caller regardless of whether it was latent or not. 27
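A highly simplified sketch of the flow in steps (1)-(9) follows. It is not the Netflix implementation; the CircuitBreaker interface and the class below are stand-ins for the behaviour the slides describe (circuit check, bounded thread pool, timeout, fallback), and the names are assumptions.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Future;
    import java.util.concurrent.RejectedExecutionException;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Simplified sketch of the execution flow in steps (1)-(9); names are illustrative.
    public abstract class SimpleDependencyCommand<K> {

        /** Minimal stand-in for the circuit breaker of steps (3) and (7). */
        public interface CircuitBreaker {
            boolean allowRequest();   // false when the circuit is open (tripped)
            void markSuccess();
            void markFailure();
        }

        private final ExecutorService threadPool;  // bounded pool; rejects when saturated (step 4)
        private final CircuitBreaker circuit;
        private final long timeoutMillis;          // aggressive timeout (step 5a)

        protected SimpleDependencyCommand(ExecutorService threadPool, CircuitBreaker circuit, long timeoutMillis) {
            this.threadPool = threadPool;
            this.circuit = circuit;
            this.timeoutMillis = timeoutMillis;
        }

        /** (5) The actual dependency call, implemented per command. */
        protected abstract K run() throws Exception;

        /** (8) Fallback: in-memory cache, stubbed data or empty response; no network calls. */
        protected abstract K getFallback();

        /** (2) Synchronous execution; an asynchronous variant would return the Future directly. */
        public K execute() {
            if (!circuit.allowRequest()) {                  // (3) circuit open: short-circuit to fallback
                return getFallback();
            }
            try {
                Future<K> future = threadPool.submit(this::run);              // (4) rejected if pool/queue full
                K result = future.get(timeoutMillis, TimeUnit.MILLISECONDS);  // (5a) timeout guard
                circuit.markSuccess();                      // (6a)/(7) report success to the breaker
                return result;                              // (9)
            } catch (TimeoutException | RejectedExecutionException e) {
                circuit.markFailure();                      // (7) timeouts and rejections count against health
                return getFallback();                       // (8)
            } catch (Exception e) {
                circuit.markFailure();                      // (6b) failed response
                return getFallback();                       // (8)
            }
        }
    }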
• 28. Netflix DependencyCommand Implementation. Fallbacks: Cache, Eventual Consistency, Stubbed Data, Empty Response. 28
  • 30. So, how does it work in the real world? 30
  • 31. Visualizing Circuits in Near-Realtime (latency is single-digit seconds, generally 1-2) 31 This is an example of our monitoring system which provides low-latency (1-2 seconds typically) visibility into the traffic and health of all DependencyCommand circuits across a cluster.
• 32. Circle color and size represent health and traffic volume
   Error percentage of last 10 seconds
   Request rate (2 minutes of request rate to show relative changes in traffic)
   Circuit-breaker status
   Hosts reporting from cluster
   Last minute latency percentiles
   Rolling 10 second counters with 1 second granularity: Successes, Thread timeouts, Short-circuited (rejected), Thread-pool Rejections, Failures/Exceptions 32
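The rolling 10 second counters with 1 second granularity mentioned above could be backed by a ring of per-second buckets, roughly as in this sketch (illustrative only, not the production implementation):

    // Sketch of a rolling 10-second counter with 1-second buckets, of the kind backing the
    // success/timeout/short-circuit/rejection/failure counts on the dashboard.
    public class RollingCounter {
        private static final int BUCKETS = 10;                   // 10 seconds of history
        private final long[] counts = new long[BUCKETS];
        private final long[] bucketSecond = new long[BUCKETS];   // which second each bucket holds

        public synchronized void increment() {
            counts[currentBucket()]++;
        }

        public synchronized long sum() {
            long nowSec = System.currentTimeMillis() / 1000;
            long total = 0;
            for (int i = 0; i < BUCKETS; i++) {
                if (nowSec - bucketSecond[i] < BUCKETS) {        // ignore buckets older than the window
                    total += counts[i];
                }
            }
            return total;
        }

        private int currentBucket() {
            long nowSec = System.currentTimeMillis() / 1000;
            int i = (int) (nowSec % BUCKETS);
            if (bucketSecond[i] != nowSec) {                     // stale bucket: reuse it for the new second
                bucketSecond[i] = nowSec;
                counts[i] = 0;
            }
            return i;
        }
    }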
• 33. API Daily Incoming vs Outgoing: 8-10 Billion DependencyCommand Executions (threaded) vs 1.2-1.6 Billion Incoming Requests (weekends are labeled on the chart). 33
  • 34. API Hourly Incoming vs Outgoing Peak at 200k+ threaded DependencyCommand executions/second Peak at 30k+ incoming requests/second 34
  • 35. 35 This view of the dashboard was captured during a latency monkey simulation to test resilience against latency (http://techblog.netflix.com/2011/07/netflix-simian-army.html) and shows how several of the DependencyCommands degraded in health and showed timeouts, threadpool rejections, short-circuiting and failures. The DependencyCommands of dependencies not affected by latency were unaffected. During this test no users were prevented from using Netflix on any devices. Instead fallbacks and graceful degradation occurred and as soon as latency was removed all systems returned to health within seconds.
  • 36. 36 This was another latency monkey simulation that affected a single DependencyCommand.
• 37. Latency spikes from ~30ms median first to 2,000+ then 10,000+ ms. Success drops off; Timeouts and Short-Circuiting shed load. Peak at 100M+ incoming requests (30k+/second). 37 These graphs show the full duration of a latency monkey simulation (and look similar to real production events) during which latency occurred and the DependencyCommand timed out, short-circuited the requests and returned fallbacks.
  • 39. Fallback. Fail silent. Fail fast. Shed load. 39
  • 40. Netflix API Dependency A Dependency B Dependency C Dependency D Dependency E Dependency F Dependency G Dependency H Dependency I Dependency J Dependency K Dependency L Dependency M Dependency N Dependency O Dependency P Dependency Q Dependency R 40 Second half of the presentation discusses architectural changes to enable optimizing the API for each Netflix device as opposed to a generic one-size-fits-all API which treats all devices the same.
  • 41. Single Network Request from Clients (use LAN instead of WAN) Device Server Netflix API landing page requires ~dozen API requests 41 The one-size-fits-all API results in chatty clients, some requiring ~dozen requests to render a page.
  • 42. Single Network Request from Clients (use LAN instead of WAN) some clients are limited in the number of concurrent network connections 42
  • 43. Single Network Request from Clients (use LAN instead of WAN) network latency makes this even worse (mobile, home, wifi, geographic distance, etc) 43
  • 44. Single Network Request from Clients (use LAN instead of WAN) Device Server Netflix API push call pattern to server ... 44 The client should make a single request and push the 'chatty' part to the server where low-latency networks and multi-core servers can perform the work far more efficiently.
  • 45. Single Network Request from Clients (use LAN instead of WAN) Device Server Netflix API ... and eliminate redundant calls 45
• 47. Send Only The Bytes That Matter (optimize responses for each client) Netflix API Device Server Client Client part of client now on server 47 The client now extends over the network barrier and runs a portion in the server itself. The client sends requests over HTTP to its other half running in the server, which can then access a Java API at a very granular level to retrieve exactly what it needs and return an optimized response suited to the device's exact requirements and user experience.
  • 48. Send Only The Bytes That Matter (optimize responses for each client) Netflix API Device Server Client Client client retrieves and delivers exactly what their device needs in its optimal format 48
  • 49. Send Only The Bytes That Matter (optimize responses for each client) Device Server Netflix API Service Layer Client Client interface is now a Java API that client interacts with at a granular level 49
  • 50. Leverage Concurrency (but abstract away its complexity) Device Server Netflix API Service Layer Client Client 50
• 51. Leverage Concurrency (but abstract away its complexity) Device Server Netflix API Service Layer Client Client no synchronized, volatile, locks, Futures or Atomic*/Concurrent* classes in client-server code 51 Concurrency is abstracted away behind an asynchronous API and data is retrieved, transformed and composed using higher-order functions (such as map, mapMany, merge, zip, take, toList, etc). Groovy is used for its closure support, which lends itself well to the functional programming style.
• 52. Functional Reactive Programming: composable asynchronous functions
   Service calls are all asynchronous. Functional programming with higher-order functions. Fully asynchronous API - Clients can’t block.

       def video1Call = api.getVideos(api.getUser(), 123456, 7891234);
       def video2Call = api.getVideos(api.getUser(), 6789543);

       // higher-order functions used to compose asynchronous calls together
       wx.merge(video1Call, video2Call).toList().subscribe([
           onNext: { listOfVideos ->
               for (video in listOfVideos) {
                   response.getWriter().println("video: " + video.id + " " + video.title);
               }
           },
           onError: { exception ->
               response.setStatus(500);
               response.getWriter().println("Error: " + exception.getMessage());
           }
       ])

   52
  • 53. Request Collapsing batch don’t burst 53 The DependencyCommand resilience layer is leveraged for concurrency including optimizations such as request collapsing (automated batching) which bundles bursts of calls to the same service into batches without the client code needing to understand or manually optimize for batching. This is particularly important when client code becomes highly concurrent and data is requested in multiple different code paths sometimes written by different engineers. Request collapsing automatically captures and batches the calls together. The collapsing functionality also supports sharded architectures so a batch of requests can be sharded into sub-batches if the client-server relationship requires requests to be routed to a sharded backend.
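Request collapsing could be sketched roughly as below (illustrative only, not the Netflix API): callers submit individual keys, a short timer window flushes the accumulated keys as one batch call to the backend, and each caller receives a Future completed from the batch response.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.function.Function;

    // Sketch of request collapsing: single-key submissions within a window are dispatched
    // to the backend as one batch call; each caller's Future is completed from the batch response.
    public class RequestCollapser<K, V> {
        private final Function<List<K>, List<V>> batchCall;   // one network call for many keys
        private final List<K> pendingKeys = new ArrayList<>();
        private final List<CompletableFuture<V>> pendingFutures = new ArrayList<>();
        private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

        public RequestCollapser(Function<List<K>, List<V>> batchCall, long windowMillis) {
            this.batchCall = batchCall;
            timer.scheduleAtFixedRate(this::flush, windowMillis, windowMillis, TimeUnit.MILLISECONDS);
        }

        /** Callers submit one key at a time; bursts within the window become a single batch. */
        public synchronized Future<V> submit(K key) {
            CompletableFuture<V> future = new CompletableFuture<>();
            pendingKeys.add(key);
            pendingFutures.add(future);
            return future;
        }

        private void flush() {
            List<K> keys;
            List<CompletableFuture<V>> futures;
            synchronized (this) {
                if (pendingKeys.isEmpty()) return;
                keys = new ArrayList<>(pendingKeys);
                futures = new ArrayList<>(pendingFutures);
                pendingKeys.clear();
                pendingFutures.clear();
            }
            try {
                List<V> results = batchCall.apply(keys);       // single backend request for the whole batch
                for (int i = 0; i < futures.size(); i++) {
                    futures.get(i).complete(results.get(i));   // assumes results are ordered like the keys
                }
            } catch (RuntimeException e) {
                futures.forEach(f -> f.completeExceptionally(e));
            }
        }
    }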
  • 54. Request Collapsing batch don’t burst 100:1 collapsing ratio (batch size of ~100) 54 This graph shows an extreme example of a dependency where we collapse requests at a ratio of 100:1
  • 55. Request Collapsing batch don’t burst 4000 rps instead of 400,000 rps 100:1 collapsing ratio (batch size of ~100) 55 This is the same graph but on a power scale instead of linear so the blue line (actual network requests) shows up.
• 56. Request Scoped Caching short-lived and concurrency aware 56 Another use of the DependencyCommand layer is to allow client code to perform requests without concern for duplicate network calls due to concurrency. The Future is atomically cached using “putIfAbsent” in the request scope, shared via ThreadLocals of each thread, so clients can request data in multiple code paths without inefficiency concerns.
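The de-duplication described here can be sketched with a per-request ConcurrentMap and putIfAbsent (illustrative; the class name and inline execution are assumptions, and in practice the cached Future would be backed by the dependency thread pool and the map shared through the request's ThreadLocal scope):

    import java.util.concurrent.Callable;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.Future;
    import java.util.concurrent.FutureTask;

    // Sketch of request-scoped caching: the first caller for a key installs a FutureTask
    // atomically via putIfAbsent and runs it; concurrent callers for the same key share the
    // same Future, so only one backend call is made per request.
    public class RequestScopedCache {
        private final ConcurrentMap<String, Future<Object>> cache = new ConcurrentHashMap<>();

        @SuppressWarnings("unchecked")
        public <T> Future<T> get(String key, Callable<T> loader) {
            Future<Object> existing = cache.get(key);
            if (existing == null) {
                FutureTask<Object> task = new FutureTask<>(() -> (Object) loader.call());
                existing = cache.putIfAbsent(key, task);
                if (existing == null) {
                    task.run();          // this caller won the race and performs the work inline
                    existing = task;
                }
            }
            return (Future<T>) existing;
        }
    }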
• 57. Request Caching stateless 57 Some examples of request caching de-duplicating backend calls. On some circuits the impact is reasonably high, while on most it is a small percentage or none at all, but overall it provided a measurable drop in network calls and in some client use cases significantly improved latency by eliminating unnecessary network calls.
  • 58. Device Server Netflix API Optimize for each device. Leverage the server. 58 The Netflix API is becoming a platform that empowers user-interface teams to build their own API endpoints that are optimized to their client applications and devices.
  • 59. Netflix is Hiring http://jobs.netflix.com Fault Tolerance in a High Volume, Distributed System http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html Making the Netflix API More Resilient http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html Embracing the Differences : Inside the Netflix API Redesign http://techblog.netflix.com/2012/07/embracing-differences-inside-netflix.html Ben Christensen @benjchristensen http://www.linkedin.com/in/benjchristensen 59