Jeremy Edberg




Tweet @jedberg with feedback!
We live in an unreliable world




Maintaining reliability in an unreliable world


Netflix




reddit




A quick rant

  • I hate the term “Cloud Computing”
  • I use it because everyone else does and no
      one would understand me otherwise
  • One alternative is “Virtualized Computing”
  • I don’t have a better alternative

Agenda

  • Money
  • Architecture
  • Process (or lack thereof)

Reliability and $$




Monthly Page Views and Costs for reddit
  [Chart: monthly page views (axis from 200M to 1,300M) and monthly costs (axis from $20,000 to $130,000), plotted from March through March of the following year]
The control and complexity spectrum
  [Spectrum running from “Lots” to “Little”]
Why did Netflix move out of the datacenter?
  • They didn’t know how fast the streaming
      service would grow and needed something
      that could grow with them
  • They wanted more redundancy than just
      the two datacenters they had
  • Autoscaling helps a lot too
Netflix autoscaling
  [Chart; surviving labels: 1, 2, Traffic Peak]
Agenda

  • Money
  • Architecture
  • Process (or lack thereof)

Making sure we are building for survival



Building for redundancy




1>2>3
           Going from two to three is hard




1>2>3
           Going from one to two is harder




Build for Three
   If possible, plan for 3 or more from the beginning.




“Build for three” is the secret to success in a virtual environment




Every system choice assumes some part will fail at some point.



Database Resiliency with Sharding




Sharding
  •   reddit split writes across four master databases

  •   Links/Accounts/Subreddits, Comments, Votes, and Misc

  •   Each has at least one slave in another zone

  •   Avoid reading from the master if possible

  •   Wrote their own database access layer, called
      the “thing” layer
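
The routing idea in the bullets above can be sketched roughly as follows: each kind of data maps to one master, writes go to that master, and reads prefer a replica. This is only an illustration and not reddit's actual “thing” layer; the host names, the `SHARDS` table, and the function names are all hypothetical.

```python
import random

# Hypothetical mapping from data type to shard. The four shard groupings
# follow the slide; everything else here is made up for illustration.
SHARD_FOR_TYPE = {
    "link": "main",
    "account": "main",
    "subreddit": "main",
    "comment": "comments",
    "vote": "votes",
}

# Hypothetical hosts: one master per shard plus at least one replica,
# which the slide says lives in another zone.
SHARDS = {
    "main": {"master": "main-master", "replicas": ["main-replica-1"]},
    "comments": {"master": "comments-master", "replicas": ["comments-replica-1"]},
    "votes": {"master": "votes-master", "replicas": ["votes-replica-1"]},
    "misc": {"master": "misc-master", "replicas": ["misc-replica-1"]},
}


def shard_for(data_type):
    """Return the shard name for a data type; unknown types go to 'misc'."""
    return SHARD_FOR_TYPE.get(data_type, "misc")


def host_for_write(data_type):
    """Writes always go to the shard's master."""
    return SHARDS[shard_for(data_type)]["master"]


def host_for_read(data_type, rng=random):
    """Reads prefer a replica and fall back to the master if none exist."""
    shard = SHARDS[shard_for(data_type)]
    if shard["replicas"]:
        return rng.choice(shard["replicas"])
    return shard["master"]
```

Keeping the routing in one small layer means the rest of the application never needs to know which host holds which table.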



Cassandra

  • 20 node cluster
  • ~40GB per node
  • Dynamo model

How it works
  • Replication factor
  • Quorum reads / writes
  • Bloom Filter for fast negative lookups
  • Immutable files for fast writes
  • Seed nodes
  • Multi-region
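
How replication factor and quorum reads/writes fit together can be shown with a little arithmetic. This is the general Dynamo-style rule, not anything specific to the clusters described here; the replication factor of 3 in the usage note is an assumption, since the slides do not give one.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def overlaps(replication_factor, read_replicas, write_replicas):
    """True when every read set shares at least one replica with every write set."""
    return read_replicas + write_replicas > replication_factor


def tolerated_failures(replication_factor):
    """Replicas that can be down while quorum operations still succeed."""
    return replication_factor - quorum(replication_factor)
```

With an assumed replication factor of 3, `quorum(3)` is 2, quorum reads and quorum writes always overlap on at least one replica, and one replica can be down without blocking either.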
Cassandra Benefits

  • Fast writes
  • Fast negative lookups
  • Easy incremental scalability
  • Distributed -- No SPoF
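
The “fast negative lookups” point comes from the Bloom filter: if the filter says a key is absent, it is definitely absent, so the data file never has to be read. Below is a minimal, generic sketch of the data structure. It is not Cassandra's implementation; the sizes, the hashing scheme, and the class name are assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        """False means definitely absent; True means 'possibly present'."""
        return all(self.bits[pos] for pos in self._positions(key))
```

A reader would check `might_contain(key)` first and only touch the on-disk file when it returns `True`.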

I love memcache
            I make heavy use of memcached




Second class users

  • Logged out users always get cached
      content.
  • Akamai bears the brunt of reddit’s traffic
  • Logged out users are about 80% of the
      traffic



  [Diagram; surviving labels: 1, 2, 3 and A, B, C]

  [Diagram; the same labels with a new D added]
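
The two diagrams above (numbers 1 to 3 placed among letters A to C, then the same picture with a D added) look like the usual illustration of consistent hashing for cache servers. That reading is an assumption, since the slides carry no caption. Under that assumption, here is a minimal sketch of a hash ring; the class and function names are hypothetical, and MD5 is used only as a convenient stable hash.

```python
import bisect
import hashlib


def _point(label):
    """Map a node name or key to a position on the ring."""
    digest = hashlib.md5(label.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    def __init__(self, nodes=()):
        self._points = []  # sorted ring positions of the nodes
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        p = _point(node)
        bisect.insort(self._points, p)
        self._owner[p] = node

    def node_for(self, key):
        """Walk clockwise from the key's position to the next node."""
        i = bisect.bisect_right(self._points, _point(key)) % len(self._points)
        return self._owner[self._points[i]]
```

The useful property is that adding a node only moves the keys that now fall to the new node; every other key stays where it was, so most of the cache remains warm.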
Get - Set - Get
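
One common reading of “Get - Set - Get” is the cache-aside pattern: try the cache, fill it on a miss, and let later reads hit the cache. The slide itself does not spell this out, so treat the sketch below as an interpretation. `DictCache` is a hypothetical in-memory stand-in for a memcached client, and `load_from_db` is a placeholder for the slow lookup.

```python
class DictCache:
    """In-memory stand-in for a memcached client (hypothetical)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def cached_fetch(cache, key, load_from_db):
    value = cache.get(key)         # get: try the cache first
    if value is None:
        value = load_from_db(key)  # miss: go to the slower source
        cache.set(key, value)      # set: populate for later readers
    return value                   # the next get for this key is a hit
```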



Caching is a good way to hide your failures



reddit’s Caching
  • Render cache (5GB)
   • Partially and fully rendered items
  • Data cache (15GB)
   • Chunks of data from the database
  • Permacache (10GB)
   • Precomputed queries
Sometimes users notice your data inconsistency




Queues are your friend
  • Votes
  • Comments
  • Thumbnail scraper
  • Precomputed queries
  • Spam
   • processing
   • corrections
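
The point of the list above is that the request path only records work, and a separate worker applies it later. A minimal sketch for votes follows, using Python's standard `queue` module as a stand-in for a real message broker; the function names and the shape of the vote record are hypothetical.

```python
import queue


def enqueue_vote(q, user, link, direction):
    """Request path: record the intent and return immediately."""
    q.put({"user": user, "link": link, "direction": direction})


def drain(q, scores):
    """Worker path: apply every queued vote to the score table."""
    processed = 0
    while True:
        try:
            vote = q.get_nowait()
        except queue.Empty:
            return processed
        scores[vote["link"]] = scores.get(vote["link"], 0) + vote["direction"]
        q.task_done()
        processed += 1
```

If the worker or the database behind it is slow or down, votes pile up in the queue instead of failing the user's request.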
Going multi-zone




Benefits of Amazon’s Zones

  • Loosely connected
  • Low latency between zones
  • 99.95% uptime guarantee per zone

Going Multi-region




Leveraging Multi-region

  • 100% uptime is theoretically possible.
  • You have to replicate your data
  • This will cost money
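
Why redundancy pushes uptime toward (but never quite to) 100% can be seen with a back-of-the-envelope calculation. The sketch below assumes failures are fully independent, which real zones and regions only approximate, and borrows the 99.95% per-zone figure quoted earlier purely as an input; the function names are hypothetical.

```python
def combined_availability(per_site, sites):
    """Chance that at least one of several independent sites is up."""
    return 1 - (1 - per_site) ** sites


def downtime_minutes_per_year(availability):
    """Expected minutes of downtime in a 365-day year."""
    return (1 - availability) * 365 * 24 * 60
```

Under these assumptions a single site at 99.95% allows roughly 263 minutes of downtime a year, while two independent sites give about 99.999975%. Correlated failures make the real number worse, which is one reason the replicated data and the extra cost still matter.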

Other options


  • Backup datacenter
  • Backup provider


Example ---------




Example 2 ---------




Agenda

  • Money
  • Architecture
  • Process (or lack thereof)

Monitoring

  • reddit uses Ganglia
  • Backed by RRD
  • Netflix uses RRD too
  • Makes good rollup graphs
  • Gives a great way to visually and
      programmatically detect errors

Alert Systems
  [Diagram; surviving labels: alerting, api (twice, each beside a CORE Agent), Other Team’s Agent, CORE Event Gateway, Paging Service, Amazon SES]
Anatomy of an outage
Life cycle of the common Streaming Production Problem (Productio Problematis genus)

  [Timeline, starting at time 0, in order:]
  • Something bad happens
  • Customer impact
  • Someone notices (probably CS, hopefully our alerts)
  • Prod alert
  • Determine impact
  • Determine (nonroot) cause
  • Figure out fix
  • Deploy fix
  • Recover service
  • Go back to sleep

  • Phases marked on the time axis: TTD(etect), TTC(ausation), TTr(epair), TTR(ecover)
  • Axis label: outage time

Slide courtesy of @royrapoport
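
One way to use the four phases on the time axis is to record how long each took and see which one dominates. The sketch below assumes outage time is simply the sum of the four phases, which the slide suggests but does not state outright; the class name, field names, and any numbers fed into it are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """Phase durations in minutes; labels mirror the slide's time axis."""

    ttd: float          # time to detect
    ttc: float          # time to determine causation
    ttr_repair: float   # time to repair (TTr)
    ttr_recover: float  # time to recover (TTR)

    def phases(self):
        return {
            "TTD": self.ttd,
            "TTC": self.ttc,
            "TTr": self.ttr_repair,
            "TTR": self.ttr_recover,
        }

    def total_minutes(self):
        """Assumed outage time: the sum of all four phases."""
        return sum(self.phases().values())

    def longest_phase(self):
        """Label of the phase that took the most time."""
        phases = self.phases()
        return max(phases, key=phases.get)
```

Tracking the longest phase across incidents shows where to invest: better alerting if detection dominates, better tooling if repair does.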
Automate all the things!





  • Application startup
  • Configuration
  • Code deployment
  • System deployment

Automation

  • Standard base image
  • Tools to manage all the systems
  • Automated code deployment

The best ops folks of course already know this.



Netflix has moved the granularity from the instance to the cluster


The next level

  • Everything is “built for three”
  • Fully automated build tools to test and
      make packages
  • Fully automated machine image bakery
  • Fully automated image deployment

The Monkey Theory


  • Simulate things that go wrong
  • Find things that are different


The simian army
  • Chaos -- Kills random instances
  • Latency -- Slows the network down
  • Conformity -- Looks for outliers
  • Doctor -- Looks for passing health checks
  • Janitor -- Cleans up unused resources
  • Howler -- Yells about bad things
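
The Chaos idea, killing random instances to prove the system survives, can be sketched against a plain dictionary of instance groups. This is not Netflix's tool and it calls no cloud API. The group layout, the function names, and the rule of sparing single-instance groups are all assumptions made for the example.

```python
import random


def pick_victims(groups, rng):
    """Pick one random instance from every group that has more than one."""
    victims = {}
    for name, instances in groups.items():
        if len(instances) > 1:  # assumed guard: spare a group's only instance
            victims[name] = rng.choice(sorted(instances))
    return victims


def terminate(groups, victims):
    """Return a copy of the groups with each chosen victim removed."""
    return {
        name: [inst for inst in instances if inst != victims.get(name)]
        for name, instances in groups.items()
    }
```

Passing in a seeded `random.Random` makes a run repeatable, which helps when a kill exposes a problem that needs to be reproduced.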
War Stories



April EBS outage




August region failure




Hurricane Irene
             The outage that never was




Questions?




Getting in touch
  Email: jedberg@gmail.com

  Twitter: @jedberg

  Web: www.jedberg.net

  Facebook: facebook.com/jedberg

  Linkedin: www.linkedin.com/in/jedberg

  reddit: www.reddit.com/user/jedberg

Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 

Dernier (20)

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 

Maintaining reliability in an unreliable world

  • 2. We live in an unreliable world Tweet @jedberg with feedback!
  • 6. Maintaining reliability in an unreliable world
  • 9. A quick rant • I hate the term “Cloud Computing” • I use it because everyone else does and no one would understand me otherwise • One alternative is “Virtualized Computing” • I don’t have a better alternative
  • 10. Agenda • Money • Architecture • Process (or lack thereof)
  • 11. Reliability and $$
  • 12. Monthly Page Views and Costs for reddit [chart: monthly page views (200M to 1,300M) and monthly costs ($20,000 to $130,000), March through the following March]
  • 13. The control and complexity spectrum [diagram: a scale running from “Lots” to “Little”]
  • 15. Why did Netflix move out of the datacenter? • They didn’t know how fast the streaming service would grow and needed something that could grow with them • They wanted more redundancy than just the two datacenters they had • Autoscaling helps a lot too
  • 16. Netflix autoscaling [autoscaling chart, annotated “Traffic Peak”]
  • 17. Agenda • Money • Architecture • Process (or lack thereof)
  • 18. Making sure we are building for survival
  • 19. Building for redundancy
  • 20. 1>2>3 Going from two to three is hard
  • 22. 1>2>3 Going from one to two is harder
  • 24. Build for Three If possible, plan for 3 or more from the beginning.
  • 26. “Build for three” is the secret to success in a virtual environment
  • 28. All system choices assume some part will fail at some point.
  • 29. Database Resiliency with Sharding
  • 30. Sharding • reddit split writes across four master databases • Links/Accounts/Subreddits, Comments, Votes and Misc • Each has at least one slave in another zone • Avoid reading from the master if possible • Wrote their own database access layer, called the “thing” layer
  • 32. Cassandra • 20 node cluster • ~40GB per node • Dynamo model
  • 34. How it works • Replication factor • Quorum reads / writes • Bloom Filter for fast negative lookups • Immutable files for fast writes • Seed nodes • Multi-region
  • 35. Cassandra Benefits • Fast writes • Fast negative lookups • Easy incremental scalability • Distributed -- No SPoF
  • 36. I love memcache I make heavy use of memcached
  • 37. Second class users • Logged out users always get cached content. • Akamai bears the brunt of reddit’s traffic • Logged out users are about 80% of the traffic
  • 38. [diagram: ring labeled 1, 2, 3 and A, B, C]
  • 39. [diagram: the same ring with D added]
  • 40. Get - Set - Get
  • 41. Caching is a good way to hide your failures
  • 43. reddit’s Caching • Render cache (5GB) • Partially and fully rendered items • Data cache (15GB) • Chunks of data from the database • Permacache (10GB) • Precomputed queries
  • 46. Sometimes users notice your data inconsistency
  • 47. Queues are your friend • Votes • Comments • Thumbnail scraper • Precomputed queries • Spam (processing, corrections)
  • 49. Benefits of Amazon’s Zones • Loosely connected • Low latency between zones • 99.95% uptime guarantee per zone
  • 51. Leveraging Multi-region • 100% uptime is theoretically possible. • You have to replicate your data • This will cost money
  • 52. Other options • Backup datacenter • Backup provider
  • 54. Example 2
  • 55. Agenda • Money • Architecture • Process (or lack thereof)
  • 56. Monitoring • reddit uses Ganglia • Backed by RRD • Netflix uses RRD too • Makes good rollup graphs • Gives a great way to visually and programmatically detect errors
  • 57. Alert Systems [diagram labels: CORE Event Gateway, Paging Service, Amazon SES, alerting, api, CORE Agent, Other Team’s Agent]
  • 58-71. Anatomy of an outage: life cycle of the common Streaming Production Problem (Productio Problematis genus). [timeline, built up one step per slide, starting at time 0: Something bad happens → Customer impact → Someone notices (probably CS, hopefully our alerts) → Prod alert → Determine impact → Determine (nonroot) cause → Figure out fix → Deploy fix → Recover service → Go back to sleep. Phases along the time axis: TTD(etect), TTC(ausation), TTr(epair), TTR(ecover). Final label: “outage time”.] Slide courtesy of @royrapoport
  • 72. Automate all the things!
  • 73. Automate all the things • Application startup • Configuration • Code deployment • System deployment
  • 74. Automation • Standard base image • Tools to manage all the systems • Automated code deployment
  • 75. The best ops folks of course already know this.
  • 76. Netflix has moved the granularity from the instance to the cluster
  • 77. The next level • Everything is “built for three” • Fully automated build tools to test and make packages • Fully automated machine image bakery • Fully automated image deployment
  • 79. The Monkey Theory • Simulate things that go wrong • Find things that are different
  • 80. The simian army • Chaos -- Kills random instances • Latency -- Slows the network down • Conformity -- Looks for outliers • Doctor -- Looks for passing health checks • Janitor -- Cleans up unused resources • Howler -- Yells about bad things
  • 81. War Stories
  • 82. April EBS outage
  • 83. August region failure
  • 84. Hurricane Irene The outage that never was
  • 86. Getting in touch Email: jedberg@gmail.com Twitter: @jedberg Web: www.jedberg.net Facebook: facebook.com/jedberg Linkedin: www.linkedin.com/in/jedberg reddit: www.reddit.com/user/jedberg

Editor's notes

  1. My name is Jeremy. Thanks for coming to see me this evening.
  2. We live in an unreliable world. Things never go as we expect.
  3. As the great Murphy said, if it can go wrong, it will.
  4. The world of cloud computing was supposed to be our shining light and solve all of our problems.
  5. But it isn’t all fluffy and white. Sometimes the clouds break too.
  6. Which is why I’m here today to tell you about how we maintain reliability in this unreliable world. So who am I?
  7. I work for Netflix as the lead site reliability engineer. Public studies say that we are responsible for as much as 30% of all internet traffic on a Saturday night. We run almost the entire streaming service from the cloud, save for a few legacy systems that we haven’t moved from the DC yet.
  8. I used to work for reddit. reddit is a community where people come together to share and discuss interesting things on the internet, such as links to other stuff, or create their own content. It does more than 2 billion pageviews a month, all from EC2.
  11. Uptime and money go hand in hand. With infinite money, you can probably get perfect uptime. But if you don’t have infinite money, you have to find the right uptime for your budget.
  12. Luckily, the cloud makes it pretty easy to find the right balance, because you can leverage their economies of scale and the ease with which you can start and stop instances. Here are reddit’s costs and pageviews from when I was there, which is the last data I have.
  13. At one end you have lots of control and lots of complexity. With App Engine and Heroku you have little, and Amazon (and Rackspace, etc.) is in the middle.
  14. This was reddit’s server farm in 2008. Mostly I just like to brag about how clean it was.
  15. With autoscaling, we pay only for the resources that we actually need.
  16. At Netflix we use autoscaling to help manage reliability and cost. Here is one of our clusters scaling up and down. We are tuning for the holidays, so you can see parts where we are doing squeeze tests and adjusting the scaling speed and values.
  18. We need to make sure that we are doing everything we can to ensure our survival.
  19. The key to surviving any outage is redundancy in your systems, be it the cloud or a datacenter. My teammate points out that this is recursion, not redundancy.
  20. Going from two (keypress) to three is hard.
  21. Going from one (keypress) to two is harder. What do I mean by that? Anywhere you will need more than one of something (application process, database, cache, queue, whatever), it will be harder to go from one to two than from two to three, and so on. This is especially relevant in a cloud setting, since getting more resources is so easy.
  22. (wait for animation) If possible, plan for three or more from the beginning. Sometimes your development cycle doesn’t allow it, but at least keep it in mind.
  32. By building for three, you can reasonably lose one of your instances and still be stable.
  33. And now some database scaling.
  34. We use 4 master databases. They are split up as Links/Accounts/Subreddits on the main db, then separate dbs for the comments, votes, and everything else. Each has at least one slave -- the comments have 4 slaves. When they get busy, we add a slave. Thanks to EC2, we don’t have to requisition servers! Keep writes fast by not reading from the master when safe. And lastly our data access layer, the thing layer -- creative, isn’t it? We wrote it because there was no good database ORM at the time.
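A minimal sketch of the routing idea in this note, not reddit's actual “thing” layer: each kind of object maps to one of four master databases, writes go to the master, and reads go to a slave when one exists. Every name here (`SHARDS`, `KIND_TO_SHARD`, `pick_host`, the host names) is hypothetical.

```python
import random

# Hypothetical host names; the four-way split follows the note above.
SHARDS = {
    "main": {"master": "main-master", "slaves": ["main-slave-1"]},
    "comments": {"master": "comments-master",
                 "slaves": ["comments-slave-%d" % i for i in range(1, 5)]},
    "votes": {"master": "votes-master", "slaves": ["votes-slave-1"]},
    "misc": {"master": "misc-master", "slaves": ["misc-slave-1"]},
}

# Links, accounts and subreddits share the main database.
KIND_TO_SHARD = {
    "link": "main", "account": "main", "subreddit": "main",
    "comment": "comments", "vote": "votes",
}


def pick_host(kind, write=False, rng=random):
    """Return the database host that should serve this operation."""
    shard = SHARDS[KIND_TO_SHARD.get(kind, "misc")]
    if write or not shard["slaves"]:
        return shard["master"]          # writes always hit the master
    return rng.choice(shard["slaves"])  # reads stay off the master
```

A real layer would also need a way to force a read from the master when replication lag matters (the “when safe” part of the note). That is left out here.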
  35. Dynamo is the name of a database at Amazon that divides data in a way that gives high fault tolerance and durability.
  36. Replication factor. Quorum reads / writes.
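To make the two terms concrete: with a replication factor N, a read that contacts R replicas is guaranteed to see the latest acknowledged write to W replicas whenever R + W > N. The sketch below only illustrates that arithmetic. The function names are made up and are not part of Cassandra's API.

```python
def quorum(replication_factor):
    """Smallest majority of the replicas: floor(N / 2) + 1."""
    return replication_factor // 2 + 1


def overlaps(n, r, w):
    """True when every read set must share a replica with every write set."""
    return r + w > n
```

With N = 3, `quorum(3)` is 2, so quorum reads and quorum writes overlap (2 + 2 > 3) and either operation can still succeed with one replica down.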
  39. Did I mention caching is good?
  40. Another kind of caching is CDN caching. Since a logged out user isn’t voting or commenting, the page looks the same for all of them, so we render and cache the full page every 30 seconds. Then Akamai grabs it and caches it for 30 seconds as well. Akamai also accelerates the connection for our logged in users.
  41. Use consistent key hashing.
  42. Only one chunk of data changes places.
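A small sketch of consistent key hashing, assuming an MD5-based ring with virtual points per node. The class name `HashRing`, the `replicas` parameter and the hashing choice are illustrative, not taken from any particular memcached client.

```python
import bisect
import hashlib


def _point(value):
    """Map a string to a position on the ring (a large integer)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas   # virtual points per node, smooths the split
        self._points = []          # sorted ring positions
        self._owner = {}           # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.replicas):
            p = _point("%s#%d" % (node, i))
            self._owner[p] = node
            bisect.insort(self._points, p)

    def node_for(self, key):
        """Walk clockwise from the key's position to the next node point."""
        i = bisect.bisect(self._points, _point(key)) % len(self._points)
        return self._owner[self._points[i]]
```

Starting with nodes A, B and C and then calling `add("D")` moves only the keys that now fall to D. Everything else stays on the server it was already on, whereas plain `hash(key) % number_of_nodes` would reshuffle most keys.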
  43. Here’s a memcache tip (or Cassandra, or whatever eventually consistent system you want) for locking. To get reasonable locking, get the data, set the data, wait whatever the 95th percentile latency is, then do another get. If it is still your lock, then you can reasonably say you have the lock.
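The get - set - get steps can be sketched as follows. `FakeCache` is an in-process stand-in for a memcached-style client, and `try_lock` and `settle_seconds` are hypothetical names. As the note says, this gives a “reasonable” lock, not a strict mutex.

```python
import time
import uuid


class FakeCache:
    """In-process stand-in for a memcached-style client (get/set only)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def try_lock(cache, key, settle_seconds):
    """Get - set - get: return a token if we appear to hold the lock, else None."""
    if cache.get(key) is not None:      # get: someone else already holds it
        return None
    token = uuid.uuid4().hex
    cache.set(key, token)               # set: claim it with a unique token
    time.sleep(settle_seconds)          # wait roughly the p95 latency
    if cache.get(key) == token:         # get: did our claim survive?
        return token
    return None
```

The unique token is what makes the final get meaningful: if another writer raced us during the wait, the stored value is no longer ours and we back off.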
  45. (step through) Render cache (5GB): we store bits of html in here, like the html for a single link in a listing with holes for filling in the custom info like the arrow direction and points, as well as fully rendered pages for non-logged-in users. Data cache (15GB): any time we need data from the database, we check the data cache first, and if we don’t find it, we put it there. That means that the data for all of the popular items will always be in the cache, which is great for a site like ours, where certain data is usually popular for a day, and then falls out of favor for newer data. We also put memoized results in the data cache. Permacache (10GB): this is where we store the results of long database queries, such as all of the listings, profile pages and such. At Netflix we have far more cache than that, and use it heavily.
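The data cache behaviour described above (check the cache first, and on a miss fetch from the database and put the result there) can be sketched like this. `DataCache` and `load_from_db` are hypothetical names, and a plain dict stands in for memcached.

```python
class DataCache:
    """Check the cache first; on a miss, load from the database and store it."""

    def __init__(self, load_from_db):
        self._load = load_from_db   # hypothetical callable: key -> value
        self._store = {}            # stand-in for a memcached client
        self.misses = 0

    def get(self, key):
        if key in self._store:
            return self._store[key]
        self.misses += 1
        value = self._load(key)
        self._store[key] = value
        return value
```

The dict never evicts anything. In a real memcached deployment, eviction is what lets yesterday's popular items fall out in favor of newer data.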
  48. But usually they don’t, which makes hiding the failures easier.
  49. Every vote generates a queue item for later processing. We just update the rendercache with the vote until it is processed. It also puts a job in the precomputer queue to recalculate that listing. That is why it sometimes seems like the vote totals aren’t accurate. They are consistent in the database, it is just a rendering issue. Every time a comment is written, it generates a job to have the comments tree for that page recalculated, which is then stored in the cache. Every time a link is submitted, it generates a job to go out and get a thumbnail, and it generates a job to recalculate the listing. It also generates a job for the spam filter. When a moderator bans or unbans a link, it generates a job to train the spam filter on that data. Queues are extra important in a virtualized environment, because if you lose an instance you don’t have to worry about the lost work, as long as your queue is redundant.
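A toy version of the vote path, assuming an in-process `queue.Queue` in place of a real, redundant queue service. `cast_vote`, `drain_votes`, `render_cache` and `database` are illustrative names only.

```python
import queue

vote_queue = queue.Queue()
render_cache = {}   # link id -> score shown to users
database = {}       # link id -> authoritative score


def cast_vote(link_id, direction):
    """Show the vote right away, and defer the real write to a worker."""
    render_cache[link_id] = render_cache.get(link_id, 0) + direction
    vote_queue.put((link_id, direction))


def drain_votes():
    """Worker step: apply queued votes to the database."""
    while not vote_queue.empty():
        link_id, direction = vote_queue.get()
        database[link_id] = database.get(link_id, 0) + direction
        vote_queue.task_done()
```

Between `cast_vote` and `drain_votes` the rendered score and the stored score can differ, which is the brief mismatch the note describes. An in-memory queue loses its contents when the process dies, so the point about surviving a lost instance only holds with a durable, replicated queue.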
  50. Amazon will help you as well. One way they do this is by providing zones. Each zone is like an island that is loosely connected to the other zones, but mostly distinct.
  51. So how do you get better than 99.95% uptime? Multiple zones! By spreading your systems out across multiple zones, you should be able to withstand the failure of one zone. In a little bit, I’ll go over how reddit and Netflix used a multizone strategy to survive outages.
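A back-of-the-envelope calculation of why multiple zones help, under the optimistic assumption that zones fail independently and that your system stays up as long as any one zone is up. Both function names are illustrative.

```python
def combined_availability(per_zone, zones):
    """Chance that at least one zone is up, assuming independent failures."""
    return 1.0 - (1.0 - per_zone) ** zones


HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability):
    """Expected hours of downtime in a 365-day year."""
    return (1.0 - availability) * HOURS_PER_YEAR
```

At 99.95%, a single zone allows about 4.4 hours of downtime a year. Two independent zones would give 1 - 0.0005² = 99.999975%, which is roughly 8 seconds a year. Real failures are often correlated, so treat that figure as an upper bound, not a promise.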
  52. Amazon, as well as other providers, offers multiple regions as well. Regions are essentially like separate providers with the same feature set. Your data does not get shared across regions.
  55. Multiple zones. Some db slaves are in different zones from the master for redundancy. Monolithic, highly cached.
  56. Service oriented architecture is basically just a bunch of small reddits all talking to each other. It is easier for larger groups of devs. It is more scalable -- just scale out the services that are overloaded. Optimizations are more focused -- just the ones that are biggest. It is easier to scale down.
  59. Gateway classifies and routes events based on severity and the systems involved. The gateway currently processes around 48K events a day.
  96. Automate as much as you can.
  97. The more automated things are, the easier it is to be a sysadmin: application startup, configuration, code deployment, full system deployment. The more automated things are, the easier it is to scale, especially in a virtualized environment with auto-scaling. And virtualized computing added the last bit, the ability to automate system deployment. (Ok, that’s not entirely true, but watch me wave my hands and say it is.)
  98. In most places, you have this: a standard image with tools to manage the systems and the deployment.
  99. It’s manifested every day by writing scripts and programs to do your repetitive tasks for you. People basically figured out how to do this with whole computers instead. In case you’re wondering, that picture is what came up when I googled for “lazy sysadmin”.
  100. In most systems, you worry about the software and installing it on an OS. At Netflix, the smallest thing we worry about is the instance image, which lives in a cluster. We’ve essentially built a platform for doing automated deployment of Java code (and some Python too!).
  101. My friends Joe and Carl already told you about NAC and our build system. This allows the devs to take control of their deployment. Each team is responsible for their own deployments and uptime. When something breaks, we have a system that lets us page a team, who then gets on and fixes their stuff. Each team is responsible for their own destiny. So how do we stay reliable when we have no control? Information.
  102. This is an example of the task list from NAC. It shows me and everyone else all of the actions people have taken in our Amazon infrastructure. This lets everyone know when it is safe to deploy, what is going on, etc. Right now my team is building additional tools to provide information so other teams can make good decisions.
  108. In mid August, a hurricane slammed the east coast. Amazon warned of a possible zone outage.
  109. I’m here for you to learn. If you have any questions, please jump in. This will be boring for both of us otherwise. And I’ve got a couple of hours to fill anyway.
  110. You can contact me in one of these ways, or ask your question now. Thank you.
  112. Speaking of BttF, here is a DeLorean towing a DeLorean.