SlideShare une entreprise Scribd logo
1  sur  49
Télécharger pour lire hors ligne
Fault Tolerance in a
High Volume, Distributed System
Ben Christensen
Software Engineer – API Platform at Netflix
@benjchristensen
http://www.linkedin.com/in/benjchristensen




                                             1
Dozens of dependencies.

   One going down takes everything down.


99.99%30          = 99.7% uptime
     0.3% of 1 billion = 3,000,000 failures

            2+ hours downtime/month
even if all dependencies have excellent uptime.

          Reality is generally worse.


                                                  2
3
4
5
No single dependency should
 take down the entire app.

          Fail fast.
         Fail silent.
         Fallback.

        Shed load.



                              6
Options

Aggressive Network Timeouts

   Semaphores (Tryable)

     Separate Threads

      Circuit Breaker



                              7
Options

Aggressive Network Timeouts

   Semaphores (Tryable)

     Separate Threads

      Circuit Breaker



                              8
Options

Aggressive Network Timeouts

   Semaphores (Tryable)

     Separate Threads

      Circuit Breaker



                              9
Semaphores (Tryable): Limited Concurrency


TryableSemaphore executionSemaphore = getExecutionSemaphore();
// acquire a permit
if (executionSemaphore.tryAcquire()) {
    try {
         return executeCommand();
    } finally {
         executionSemaphore.release();
    }
} else {
    circuitBreaker.markSemaphoreRejection();
    // permit not available so return fallback
    return getFallback();
}




                                                                 10
Semaphores (Tryable): Limited Concurrency


TryableSemaphore executionSemaphore = getExecutionSemaphore();
// acquire a permit
if (executionSemaphore.tryAcquire()) {
    try {
         return executeCommand();
    } finally {
         executionSemaphore.release();
    }
} else {
    circuitBreaker.markSemaphoreRejection();
    // permit not available so return fallback
    return getFallback();
}




                                                                 11
Semaphores (Tryable): Limited Concurrency


TryableSemaphore executionSemaphore = getExecutionSemaphore();
// acquire a permit
if (executionSemaphore.tryAcquire()) {
    try {
         return executeCommand();
    } finally {
         executionSemaphore.release();
    }
} else {
    circuitBreaker.markSemaphoreRejection();
    // permit not available so return fallback
    return getFallback();
}




                                                                 12
Options

Aggressive Network Timeouts

   Semaphores (Tryable)

     Separate Threads

      Circuit Breaker



                              13
Separate Threads: Limited Concurrency


try {
    if (!threadPool.isQueueSpaceAvailable()) {
         // we are at the property defined max so want to throw the RejectedExecutionException to simulate
         // reaching the real max and go through the same codepath and behavior

        throw new RejectedExecutionException("Rejected command
          because thread-pool queueSize is at rejection threshold.");
    }
        ... define Callable that performs executeCommand() ...
    // submit the work to the thread-pool
    return threadPool.submit(command);
} catch (RejectedExecutionException e) {
    circuitBreaker.markThreadPoolRejection();
    // rejected so return fallback
    return getFallback();
}



                                                                                                             14
Separate Threads: Limited Concurrency


try {
    if (!threadPool.isQueueSpaceAvailable()) {
         // we are at the property defined max so want to throw the RejectedExecutionException to simulate
         // reaching the real max and go through the same codepath and behavior

        throw new RejectedExecutionException("Rejected command
                  RejectedExecutionException
          because thread-pool queueSize is at rejection threshold.");
    }
        ... define Callable that performs executeCommand() ...
    // submit the work to the thread-pool
    return threadPool.submit(command);
} catch (RejectedExecutionException e) {
    circuitBreaker.markThreadPoolRejection();
    // rejected so return fallback
    return getFallback();
}



                                                                                                             15
Separate Threads: Limited Concurrency


try {
    if (!threadPool.isQueueSpaceAvailable()) {
         // we are at the property defined max so want to throw the RejectedExecutionException to simulate
         // reaching the real max and go through the same codepath and behavior

        throw new RejectedExecutionException("Rejected command
                  RejectedExecutionException
          because thread-pool queueSize is at rejection threshold.");
    }
        ... define Callable that performs executeCommand() ...
    // submit the work to the thread-pool
    return threadPool.submit(command);
} catch (RejectedExecutionException e) {
    circuitBreaker.markThreadPoolRejection();
    // rejected so return fallback
    return getFallback();
}



                                                                                                             16
Separate Threads: Timeout

                                  Override of Future.get()

public K get() throws CancellationException, InterruptedException, ExecutionException {
    try {
        long timeout =
            getCircuitBreaker().getCommandTimeoutInMilliseconds();
        return get(timeout, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
        // report timeout failure
        circuitBreaker.markTimeout(
                    System.currentTimeMillis() - startTime);

          // retrieve the fallback
          return getFallback();
     }
}




                                                                                          17
Separate Threads: Timeout

                                  Override of Future.get()

public K get() throws CancellationException, InterruptedException, ExecutionException {
    try {
        long timeout =
            getCircuitBreaker().getCommandTimeoutInMilliseconds();
        return get(timeout, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
        // report timeout failure
        circuitBreaker.markTimeout(
                    System.currentTimeMillis() - startTime);

          // retrieve the fallback
          return getFallback();
     }
}




                                                                                          18
Separate Threads: Timeout

                                  Override of Future.get()

public K get() throws CancellationException, InterruptedException, ExecutionException {
    try {
        long timeout =
            getCircuitBreaker().getCommandTimeoutInMilliseconds();
        return get(timeout, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
        // report timeout failure
        circuitBreaker.markTimeout(
                    System.currentTimeMillis() - startTime);

          // retrieve the fallback
          return getFallback();
     }
}




                                                                                          19
Options

Aggressive Network Timeouts

   Semaphores (Tryable)

     Separate Threads

      Circuit Breaker



                              20
Circuit Breaker




if (circuitBreaker.allowRequest()) {
         return executeCommand();
} else {
    // short-circuit and go directly to fallback
    circuitBreaker.markShortCircuited();
    return getFallback();
}




                                                   21
Circuit Breaker




if (circuitBreaker.allowRequest()) {
         return executeCommand();
} else {
    // short-circuit and go directly to fallback
    circuitBreaker.markShortCircuited();
    return getFallback();
}




                                                   22
Circuit Breaker




if (circuitBreaker.allowRequest()) {
         return executeCommand();
} else {
    // short-circuit and go directly to fallback
    circuitBreaker.markShortCircuited();
    return getFallback();
}




                                                   23
Netflix uses all 4 in combination




                                   24
25
Tryable semaphores for “trusted” clients and fallbacks

       Separate threads for “untrusted” clients

 Aggressive timeouts on threads and network calls
             to “give up and move on”

       Circuit breakers as the “release valve”



                                                     26
27
28
29
Benefits of Separate Threads

         Protection from client libraries

    Lower risk to accept new/updated clients

           Quick recovery from failure

             Client misconfiguration

Client service performance characteristic changes

              Built-in concurrency
                                                    30
Drawbacks of Separate Threads

    Some computational overhead

Load on machine can be pushed too far

                 ...

    Benefits outweigh drawbacks
    when clients are “untrusted”


                                        31
32
Visualizing Circuits in Realtime
      (generally sub-second latency)




      Video available at
https://vimeo.com/33576628




                                       33
Rolling 10 second counter – 1 second granularity

      Median Mean 90th 99th 99.5th

       Latent Error Timeout Rejected

                 Error Percentage
            (error+timeout+rejected)/
(success+latent success+error+timeout+rejected).

                                                   34
Netflix DependencyCommand Implementation




                                          35
Netflix DependencyCommand Implementation




                                          36
Netflix DependencyCommand Implementation




                                          37
Netflix DependencyCommand Implementation




                                          38
Netflix DependencyCommand Implementation




                                          39
Netflix DependencyCommand Implementation




                                          40
Netflix DependencyCommand Implementation

              Fallbacks

               Cache
         Eventual Consistency
            Stubbed Data
           Empty Response




                                          41
Netflix DependencyCommand Implementation




                                          42
Netflix DependencyCommand Implementation




                                          43
Rolling Number
Realtime Stats and
 Decision Making




                     44
Request Collapsing
Take advantage of resiliency to improve efficiency




                                                    45
Request Collapsing
Take advantage of resiliency to improve efficiency




                                                    46
47
Fail fast.
Fail silent.
Fallback.

Shed load.




               48
Questions & More Information


Fault Tolerance in a High Volume, Distributed System
    http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html



         Making the Netflix API More Resilient
   http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html




                        Ben Christensen
                          @benjchristensen
                 http://www.linkedin.com/in/benjchristensen



                                                                              49

Contenu connexe

Tendances

Writing Plugged-in Java EE Apps: Jason Lee
Writing Plugged-in Java EE Apps: Jason LeeWriting Plugged-in Java EE Apps: Jason Lee
Writing Plugged-in Java EE Apps: Jason Leejaxconf
 
你所不知道的Oracle后台进程Smon功能
你所不知道的Oracle后台进程Smon功能你所不知道的Oracle后台进程Smon功能
你所不知道的Oracle后台进程Smon功能maclean liu
 
Xen server 6.0 xe command reference (1.1)
Xen server 6.0 xe command reference (1.1)Xen server 6.0 xe command reference (1.1)
Xen server 6.0 xe command reference (1.1)Timote Lima
 
pstack, truss etc to understand deeper issues in Oracle database
pstack, truss etc to understand deeper issues in Oracle databasepstack, truss etc to understand deeper issues in Oracle database
pstack, truss etc to understand deeper issues in Oracle databaseRiyaj Shamsudeen
 
A close encounter_with_real_world_and_odd_perf_issues
A close encounter_with_real_world_and_odd_perf_issuesA close encounter_with_real_world_and_odd_perf_issues
A close encounter_with_real_world_and_odd_perf_issuesRiyaj Shamsudeen
 
Performance tuning a quick intoduction
Performance tuning   a quick intoductionPerformance tuning   a quick intoduction
Performance tuning a quick intoductionRiyaj Shamsudeen
 
The Next Step in AS3 Framework Evolution
The Next Step in AS3 Framework EvolutionThe Next Step in AS3 Framework Evolution
The Next Step in AS3 Framework EvolutionFITC
 
Tracing Parallel Execution (UKOUG 2006)
Tracing Parallel Execution (UKOUG 2006)Tracing Parallel Execution (UKOUG 2006)
Tracing Parallel Execution (UKOUG 2006)Doug Burns
 
370410176 moshell-commands
370410176 moshell-commands370410176 moshell-commands
370410176 moshell-commandsnanker phelge
 
【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321
【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321
【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321maclean liu
 
Introduction to Parallel Execution
Introduction to Parallel ExecutionIntroduction to Parallel Execution
Introduction to Parallel ExecutionDoug Burns
 
Riyaj real world performance issues rac focus
Riyaj real world performance issues rac focusRiyaj real world performance issues rac focus
Riyaj real world performance issues rac focusRiyaj Shamsudeen
 
Groovy 2.0 - Devoxx France 2012
Groovy 2.0 - Devoxx France 2012Groovy 2.0 - Devoxx France 2012
Groovy 2.0 - Devoxx France 2012Guillaume Laforge
 
Percona XtraDB 集群文档
Percona XtraDB 集群文档Percona XtraDB 集群文档
Percona XtraDB 集群文档YUCHENG HU
 
Groovy update - S2GForum London 2011 - Guillaume Laforge
Groovy update - S2GForum London 2011 - Guillaume LaforgeGroovy update - S2GForum London 2011 - Guillaume Laforge
Groovy update - S2GForum London 2011 - Guillaume LaforgeGuillaume Laforge
 
A deep dive about VIP,HAIP, and SCAN
A deep dive about VIP,HAIP, and SCAN A deep dive about VIP,HAIP, and SCAN
A deep dive about VIP,HAIP, and SCAN Riyaj Shamsudeen
 
JavaPerformanceChapter_3
JavaPerformanceChapter_3JavaPerformanceChapter_3
JavaPerformanceChapter_3Saurav Basu
 
Varnish presentation for the Symfony Zaragoza user group
Varnish presentation for the Symfony Zaragoza user groupVarnish presentation for the Symfony Zaragoza user group
Varnish presentation for the Symfony Zaragoza user groupJorge Nerín
 

Tendances (20)

Virtual machine re building
Virtual machine re buildingVirtual machine re building
Virtual machine re building
 
Writing Plugged-in Java EE Apps: Jason Lee
Writing Plugged-in Java EE Apps: Jason LeeWriting Plugged-in Java EE Apps: Jason Lee
Writing Plugged-in Java EE Apps: Jason Lee
 
Rac introduction
Rac introductionRac introduction
Rac introduction
 
你所不知道的Oracle后台进程Smon功能
你所不知道的Oracle后台进程Smon功能你所不知道的Oracle后台进程Smon功能
你所不知道的Oracle后台进程Smon功能
 
Xen server 6.0 xe command reference (1.1)
Xen server 6.0 xe command reference (1.1)Xen server 6.0 xe command reference (1.1)
Xen server 6.0 xe command reference (1.1)
 
pstack, truss etc to understand deeper issues in Oracle database
pstack, truss etc to understand deeper issues in Oracle databasepstack, truss etc to understand deeper issues in Oracle database
pstack, truss etc to understand deeper issues in Oracle database
 
A close encounter_with_real_world_and_odd_perf_issues
A close encounter_with_real_world_and_odd_perf_issuesA close encounter_with_real_world_and_odd_perf_issues
A close encounter_with_real_world_and_odd_perf_issues
 
Performance tuning a quick intoduction
Performance tuning   a quick intoductionPerformance tuning   a quick intoduction
Performance tuning a quick intoduction
 
The Next Step in AS3 Framework Evolution
The Next Step in AS3 Framework EvolutionThe Next Step in AS3 Framework Evolution
The Next Step in AS3 Framework Evolution
 
Tracing Parallel Execution (UKOUG 2006)
Tracing Parallel Execution (UKOUG 2006)Tracing Parallel Execution (UKOUG 2006)
Tracing Parallel Execution (UKOUG 2006)
 
370410176 moshell-commands
370410176 moshell-commands370410176 moshell-commands
370410176 moshell-commands
 
【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321
【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321
【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321
 
Introduction to Parallel Execution
Introduction to Parallel ExecutionIntroduction to Parallel Execution
Introduction to Parallel Execution
 
Riyaj real world performance issues rac focus
Riyaj real world performance issues rac focusRiyaj real world performance issues rac focus
Riyaj real world performance issues rac focus
 
Groovy 2.0 - Devoxx France 2012
Groovy 2.0 - Devoxx France 2012Groovy 2.0 - Devoxx France 2012
Groovy 2.0 - Devoxx France 2012
 
Percona XtraDB 集群文档
Percona XtraDB 集群文档Percona XtraDB 集群文档
Percona XtraDB 集群文档
 
Groovy update - S2GForum London 2011 - Guillaume Laforge
Groovy update - S2GForum London 2011 - Guillaume LaforgeGroovy update - S2GForum London 2011 - Guillaume Laforge
Groovy update - S2GForum London 2011 - Guillaume Laforge
 
A deep dive about VIP,HAIP, and SCAN
A deep dive about VIP,HAIP, and SCAN A deep dive about VIP,HAIP, and SCAN
A deep dive about VIP,HAIP, and SCAN
 
JavaPerformanceChapter_3
JavaPerformanceChapter_3JavaPerformanceChapter_3
JavaPerformanceChapter_3
 
Varnish presentation for the Symfony Zaragoza user group
Varnish presentation for the Symfony Zaragoza user groupVarnish presentation for the Symfony Zaragoza user group
Varnish presentation for the Symfony Zaragoza user group
 

Similaire à Fault Tolerance in a High Volume, Distributed System

Fork and join framework
Fork and join frameworkFork and join framework
Fork and join frameworkMinh Tran
 
Java Concurrency, Memory Model, and Trends
Java Concurrency, Memory Model, and TrendsJava Concurrency, Memory Model, and Trends
Java Concurrency, Memory Model, and TrendsCarol McDonald
 
Concurrent Programming in Java
Concurrent Programming in JavaConcurrent Programming in Java
Concurrent Programming in JavaRuben Inoto Soto
 
Repetition is bad, repetition is bad.
Repetition is bad, repetition is bad.Repetition is bad, repetition is bad.
Repetition is bad, repetition is bad.wellD
 
Repetition is bad, repetition is bad.
Repetition is bad, repetition is bad.Repetition is bad, repetition is bad.
Repetition is bad, repetition is bad.Michele Giacobazzi
 
Java - Concurrent programming - Thread's advanced concepts
Java - Concurrent programming - Thread's advanced conceptsJava - Concurrent programming - Thread's advanced concepts
Java - Concurrent programming - Thread's advanced conceptsRiccardo Cardin
 
Virtualizing Java in Java (jug.ru)
Virtualizing Java in Java (jug.ru)Virtualizing Java in Java (jug.ru)
Virtualizing Java in Java (jug.ru)aragozin
 
Java synchronizers
Java synchronizersJava synchronizers
Java synchronizersts_v_murthy
 
Se lancer dans l'aventure microservices avec Spring Cloud - Julien Roy
Se lancer dans l'aventure microservices avec Spring Cloud - Julien RoySe lancer dans l'aventure microservices avec Spring Cloud - Julien Roy
Se lancer dans l'aventure microservices avec Spring Cloud - Julien Royekino
 
Nanocloud cloud scale jvm
Nanocloud   cloud scale jvmNanocloud   cloud scale jvm
Nanocloud cloud scale jvmaragozin
 
10 Typical Problems in Enterprise Java Applications
10 Typical Problems in Enterprise Java Applications10 Typical Problems in Enterprise Java Applications
10 Typical Problems in Enterprise Java ApplicationsEberhard Wolff
 
Java 5 concurrency
Java 5 concurrencyJava 5 concurrency
Java 5 concurrencypriyank09
 

Similaire à Fault Tolerance in a High Volume, Distributed System (20)

JavaCro'15 - Spring @Async - Dragan Juričić
JavaCro'15 - Spring @Async - Dragan JuričićJavaCro'15 - Spring @Async - Dragan Juričić
JavaCro'15 - Spring @Async - Dragan Juričić
 
Fork and join framework
Fork and join frameworkFork and join framework
Fork and join framework
 
Curator intro
Curator introCurator intro
Curator intro
 
Java Concurrency, Memory Model, and Trends
Java Concurrency, Memory Model, and TrendsJava Concurrency, Memory Model, and Trends
Java Concurrency, Memory Model, and Trends
 
Concurrent Programming in Java
Concurrent Programming in JavaConcurrent Programming in Java
Concurrent Programming in Java
 
Resilience mit Hystrix
Resilience mit HystrixResilience mit Hystrix
Resilience mit Hystrix
 
Repetition is bad, repetition is bad.
Repetition is bad, repetition is bad.Repetition is bad, repetition is bad.
Repetition is bad, repetition is bad.
 
Repetition is bad, repetition is bad.
Repetition is bad, repetition is bad.Repetition is bad, repetition is bad.
Repetition is bad, repetition is bad.
 
Concurrency-5.pdf
Concurrency-5.pdfConcurrency-5.pdf
Concurrency-5.pdf
 
Java - Concurrent programming - Thread's advanced concepts
Java - Concurrent programming - Thread's advanced conceptsJava - Concurrent programming - Thread's advanced concepts
Java - Concurrent programming - Thread's advanced concepts
 
Virtualizing Java in Java (jug.ru)
Virtualizing Java in Java (jug.ru)Virtualizing Java in Java (jug.ru)
Virtualizing Java in Java (jug.ru)
 
Java Concurrency
Java ConcurrencyJava Concurrency
Java Concurrency
 
Java synchronizers
Java synchronizersJava synchronizers
Java synchronizers
 
Resilience with Hystrix
Resilience with HystrixResilience with Hystrix
Resilience with Hystrix
 
Se lancer dans l'aventure microservices avec Spring Cloud - Julien Roy
Se lancer dans l'aventure microservices avec Spring Cloud - Julien RoySe lancer dans l'aventure microservices avec Spring Cloud - Julien Roy
Se lancer dans l'aventure microservices avec Spring Cloud - Julien Roy
 
Nanocloud cloud scale jvm
Nanocloud   cloud scale jvmNanocloud   cloud scale jvm
Nanocloud cloud scale jvm
 
Fault tolerance made easy
Fault tolerance made easyFault tolerance made easy
Fault tolerance made easy
 
10 Typical Problems in Enterprise Java Applications
10 Typical Problems in Enterprise Java Applications10 Typical Problems in Enterprise Java Applications
10 Typical Problems in Enterprise Java Applications
 
Celery
CeleryCelery
Celery
 
Java 5 concurrency
Java 5 concurrencyJava 5 concurrency
Java 5 concurrency
 

Dernier

Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Karmanjay Verma
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessWSO2
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Mark Simos
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Jeffrey Haguewood
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Nikki Chapple
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...itnewsafrica
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 

Dernier (20)

Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with Platformless
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 

Fault Tolerance in a High Volume, Distributed System

  • 1. Fault Tolerance in a High Volume, Distributed System Ben Christensen Software Engineer – API Platform at Netflix @benjchristensen http://www.linkedin.com/in/benjchristensen 1
  • 2. Dozens of dependencies. One going down takes everything down. 99.99%30 = 99.7% uptime 0.3% of 1 billion = 3,000,000 failures 2+ hours downtime/month even if all dependencies have excellent uptime. Reality is generally worse. 2
  • 3. 3
  • 4. 4
  • 5. 5
  • 6. No single dependency should take down the entire app. Fail fast. Fail silent. Fallback. Shed load. 6
  • 7. Options Aggressive Network Timeouts Semaphores (Tryable) Separate Threads Circuit Breaker 7
  • 8. Options Aggressive Network Timeouts Semaphores (Tryable) Separate Threads Circuit Breaker 8
  • 9. Options Aggressive Network Timeouts Semaphores (Tryable) Separate Threads Circuit Breaker 9
  • 10. Semaphores (Tryable): Limited Concurrency TryableSemaphore executionSemaphore = getExecutionSemaphore(); // acquire a permit if (executionSemaphore.tryAcquire()) { try { return executeCommand(); } finally { executionSemaphore.release(); } } else { circuitBreaker.markSemaphoreRejection(); // permit not available so return fallback return getFallback(); } 10
  • 11. Semaphores (Tryable): Limited Concurrency TryableSemaphore executionSemaphore = getExecutionSemaphore(); // acquire a permit if (executionSemaphore.tryAcquire()) { try { return executeCommand(); } finally { executionSemaphore.release(); } } else { circuitBreaker.markSemaphoreRejection(); // permit not available so return fallback return getFallback(); } 11
  • 12. Semaphores (Tryable): Limited Concurrency TryableSemaphore executionSemaphore = getExecutionSemaphore(); // acquire a permit if (executionSemaphore.tryAcquire()) { try { return executeCommand(); } finally { executionSemaphore.release(); } } else { circuitBreaker.markSemaphoreRejection(); // permit not available so return fallback return getFallback(); } 12
  • 13. Options Aggressive Network Timeouts Semaphores (Tryable) Separate Threads Circuit Breaker 13
  • 14. Separate Threads: Limited Concurrency try { if (!threadPool.isQueueSpaceAvailable()) { // we are at the property defined max so want to throw the RejectedExecutionException to simulate // reaching the real max and go through the same codepath and behavior throw new RejectedExecutionException("Rejected command because thread-pool queueSize is at rejection threshold."); } ... define Callable that performs executeCommand() ... // submit the work to the thread-pool return threadPool.submit(command); } catch (RejectedExecutionException e) { circuitBreaker.markThreadPoolRejection(); // rejected so return fallback return getFallback(); } 14
  • 15. Separate Threads: Limited Concurrency try { if (!threadPool.isQueueSpaceAvailable()) { // we are at the property defined max so want to throw the RejectedExecutionException to simulate // reaching the real max and go through the same codepath and behavior throw new RejectedExecutionException("Rejected command RejectedExecutionException because thread-pool queueSize is at rejection threshold."); } ... define Callable that performs executeCommand() ... // submit the work to the thread-pool return threadPool.submit(command); } catch (RejectedExecutionException e) { circuitBreaker.markThreadPoolRejection(); // rejected so return fallback return getFallback(); } 15
  • 16. Separate Threads: Limited Concurrency try { if (!threadPool.isQueueSpaceAvailable()) { // we are at the property defined max so want to throw the RejectedExecutionException to simulate // reaching the real max and go through the same codepath and behavior throw new RejectedExecutionException("Rejected command RejectedExecutionException because thread-pool queueSize is at rejection threshold."); } ... define Callable that performs executeCommand() ... // submit the work to the thread-pool return threadPool.submit(command); } catch (RejectedExecutionException e) { circuitBreaker.markThreadPoolRejection(); // rejected so return fallback return getFallback(); } 16
  • 17. Separate Threads: Timeout Override of Future.get() public K get() throws CancellationException, InterruptedException, ExecutionException { try { long timeout = getCircuitBreaker().getCommandTimeoutInMilliseconds(); return get(timeout, TimeUnit.MILLISECONDS); } catch (TimeoutException e) { // report timeout failure circuitBreaker.markTimeout( System.currentTimeMillis() - startTime); // retrieve the fallback return getFallback(); } } 17
  • 18. Separate Threads: Timeout Override of Future.get() public K get() throws CancellationException, InterruptedException, ExecutionException { try { long timeout = getCircuitBreaker().getCommandTimeoutInMilliseconds(); return get(timeout, TimeUnit.MILLISECONDS); } catch (TimeoutException e) { // report timeout failure circuitBreaker.markTimeout( System.currentTimeMillis() - startTime); // retrieve the fallback return getFallback(); } } 18
  • 19. Separate Threads: Timeout Override of Future.get() public K get() throws CancellationException, InterruptedException, ExecutionException { try { long timeout = getCircuitBreaker().getCommandTimeoutInMilliseconds(); return get(timeout, TimeUnit.MILLISECONDS); } catch (TimeoutException e) { // report timeout failure circuitBreaker.markTimeout( System.currentTimeMillis() - startTime); // retrieve the fallback return getFallback(); } } 19
  • 20. Options Aggressive Network Timeouts Semaphores (Tryable) Separate Threads Circuit Breaker 20
  • 21. Circuit Breaker if (circuitBreaker.allowRequest()) { return executeCommand(); } else { // short-circuit and go directly to fallback circuitBreaker.markShortCircuited(); return getFallback(); } 21
  • 22. Circuit Breaker if (circuitBreaker.allowRequest()) { return executeCommand(); } else { // short-circuit and go directly to fallback circuitBreaker.markShortCircuited(); return getFallback(); } 22
  • 23. Circuit Breaker if (circuitBreaker.allowRequest()) { return executeCommand(); } else { // short-circuit and go directly to fallback circuitBreaker.markShortCircuited(); return getFallback(); } 23
  • 24. Netflix uses all 4 in combination 24
  • 25. 25
  • 26. Tryable semaphores for “trusted” clients and fallbacks Separate threads for “untrusted” clients Aggressive timeouts on threads and network calls to “give up and move on” Circuit breakers as the “release valve” 26
  • 27. 27
  • 28. 28
  • 29. 29
  • 30. Benefits of Separate Threads Protection from client libraries Lower risk to accept new/updated clients Quick recovery from failure Client misconfiguration Client service performance characteristic changes Built-in concurrency 30
  • 31. Drawbacks of Separate Threads Some computational overhead Load on machine can be pushed too far ... Benefits outweigh drawbacks when clients are “untrusted” 31
  • 32. 32
  • 33. Visualizing Circuits in Realtime (generally sub-second latency) Video available at https://vimeo.com/33576628 33
  • 34. Rolling 10 second counter – 1 second granularity Median Mean 90th 99th 99.5th Latent Error Timeout Rejected Error Percentage (error+timeout+rejected)/ (success+latent success+error+timeout+rejected). 34
  • 41. Netflix DependencyCommand Implementation Fallbacks Cache Eventual Consistency Stubbed Data Empty Response 41
  • 44. Rolling Number Realtime Stats and Decision Making 44
  • 45. Request Collapsing Take advantage of resiliency to improve efficiency 45
  • 46. Request Collapsing Take advantage of resiliency to improve efficiency 46
  • 47. 47
  • 49. Questions & More Information Fault Tolerance in a High Volume, Distributed System http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html Making the Netflix API More Resilient http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html Ben Christensen @benjchristensen http://www.linkedin.com/in/benjchristensen 49