9. A quick rant
• I hate the term “Cloud Computing”
• I use it because everyone else does and no
one would understand me otherwise
• One alternative is “Virtualized Computing”
• I don’t have a better alternative
10. Agenda
• Money
• Architecture
• Process (or lack thereof)
12. Monthly Page Views and Costs for reddit
[Chart: monthly page views rising from roughly 200M to 1,300M, with costs rising in step from roughly $20,000 to $130,000, March through the following March]
13. The control and complexity spectrum
[Diagram: a spectrum running from lots of control and complexity to little]
15. Why did Netflix move
out of the datacenter?
• They didn’t know how fast the streaming
service would grow and needed something
that could grow with them
• They wanted more redundancy than just
the two datacenters they had
• Autoscaling helps a lot too
30. Sharding
• reddit split writes across four master databases
• Links/Accounts/Subreddits, Comments, Votes,
and Misc
• Each has at least one slave in another zone
• Avoid reading from the master if possible
• Wrote their own database access layer, called
the “thing” layer
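A rough sketch of that split, purely illustrative -- the hostnames and routing helper here are hypothetical, not reddit's actual "thing" layer:

import random

# Hypothetical hostnames illustrating reddit's four-master split.
MASTERS = {
    "main":     "db-main",      # Links / Accounts / Subreddits
    "comments": "db-comments",
    "votes":    "db-votes",
    "misc":     "db-misc",
}
SLAVES = {
    "main":     ["db-main-s1"],
    "comments": ["db-comments-s1", "db-comments-s2"],  # busiest shard gets more
    "votes":    ["db-votes-s1"],
    "misc":     ["db-misc-s1"],
}
SHARD_FOR = {"link": "main", "account": "main", "subreddit": "main",
             "comment": "comments", "vote": "votes"}

def host_for(thing_type, write=False):
    """Route writes to the shard's master; spread reads over its slaves."""
    shard = SHARD_FOR.get(thing_type, "misc")
    if write or not SLAVES[shard]:
        return MASTERS[shard]
    return random.choice(SLAVES[shard])   # avoid reading from the master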
34. How it works
• Replication factor
• Quorum reads / writes
• Bloom Filter for fast negative lookups
• Immutable files for fast writes
• Seed nodes
• Multi-region
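Cassandra's real implementation is Java and considerably more involved, but a minimal Bloom filter sketch shows why the negative lookups above are fast: a few in-memory bit tests can prove a key is definitely absent without ever touching disk.

import hashlib

class BloomFilter:
    """Minimal Bloom filter: it can answer "definitely not present" from
    a few in-memory bit tests, which is what makes negative lookups fast."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # Derive several independent positions by salting one hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False is authoritative; True can be a false positive.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

# A miss here means the key is definitely not in the file, so the read
# path can skip that file without any disk I/O.
sstable_keys = BloomFilter()
sstable_keys.add("row-123")
assert sstable_keys.might_contain("row-123")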
35. Cassandra Benefits
• Fast writes
• Fast negative lookups
• Easy incremental scalability
• Distributed -- no single point of failure (SPoF)
36. I love memcache
I make heavy use of memcached
37. Second-class users
• Logged out users always get cached
content.
• Akamai bears the brunt of reddit’s traffic
• Logged out users are about 80% of the
traffic
45. reddit’s Caching
• Render cache (5GB)
• Partially and fully rendered items
• Data cache (15GB)
• Chunks of data from the database
• Permacache (10GB)
• Precomputed queries
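The data cache works cache-aside style: check the cache, fall back to the database on a miss, and repopulate for the next reader. A minimal sketch using the python-memcached client; the server address, key, and TTL are illustrative:

import memcache  # python-memcached client, assumed installed

mc = memcache.Client(["127.0.0.1:11211"])  # illustrative server

def cached_query(key, fetch_from_db, ttl=300):
    """Cache-aside: check the data cache first; on a miss, hit the
    database and populate the cache for the next reader."""
    value = mc.get(key)
    if value is None:
        value = fetch_from_db()        # the slow path
        mc.set(key, value, time=ttl)
    return value

# Usage (db.fetch_link is a hypothetical database call):
# link = cached_query("link:42", lambda: db.fetch_link(42))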
51. Leveraging Multi-region
• 100% uptime is theoretically possible.
• You have to replicate your data
• This will cost money
52. Other options
• Backup datacenter
• Backup provider
55. Agenda
• Money
• Architecture
• Process (or lack thereof)
56. Monitoring
• reddit uses Ganglia
• Backed by RRD
• Netflix uses RRD too
• Makes good rollup graphs
• Gives a great way to visually and
programmatically detect errors
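The programmatic detection can be as simple as comparing the newest sample to recent history. A stdlib-only sketch; the window and threshold are illustrative, not what reddit or Netflix actually used:

from statistics import mean, stdev

def looks_anomalous(history, current, sigmas=3.0):
    """Flag a sample sitting more than `sigmas` standard deviations from
    recent history -- a simple programmatic check over the same time
    series the rollup graphs show."""
    if len(history) < 10:
        return False                   # not enough history to judge
    mu, sd = mean(history), stdev(history)
    return sd > 0 and abs(current - mu) > sigmas * sd

# e.g. looks_anomalous(last_hour_error_rates, latest_error_rate)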
57. Alert Systems
[Diagram: CORE agents and other teams' agents send events through the CORE API to an alerting gateway, which routes them to the paging service and to Amazon SES]
58-71. Anatomy of an outage
Life cycle of the common Streaming Production Problem (Productio Problematis genus)
[Timeline diagram, built up step by step across these slides: something bad happens → customer impact → someone notices (probably CS, hopefully our alerts) → prod alert → determine impact → determine (non-root) cause → figure out fix → deploy fix → recover service → go back to sleep. The intervals along the time axis are labeled TTD(etect), TTC(ausation), TTr(epair), and TTR(ecover); together they span the outage time.]
Slide courtesy of @royrapoport
73. Automate all the things
• Application startup
• Configuration
• Code deployment
• System deployment
74. Automation
• Standard base image
• Tools to manage all the systems
• Automated code deployment
75. The best ops folks of
course already know
this.
76. Netflix has moved the
granularity from the
instance to the cluster
77. The next level
• Everything is “built for three”
• Fully automated build tools to test and
make packages
• Fully automated machine image bakery
• Fully automated image deployment
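In AWS terms, "built for three" falls out of an autoscaling group spread across three zones. A hedged sketch using today's boto3 purely for illustration; every name here is hypothetical:

import boto3  # assumes AWS credentials are configured in the environment

autoscaling = boto3.client("autoscaling")

# All names are hypothetical; the point is the shape: a minimum of
# three instances spread over three zones, so losing any one instance
# (or zone) still leaves a working pair.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="myservice-asg",
    LaunchConfigurationName="myservice-launch-config",
    MinSize=3,
    MaxSize=12,
    DesiredCapacity=3,
    AvailabilityZones=["us-east-1a", "us-east-1b", "us-east-1c"],
)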
79. The Monkey Theory
• Simulate things that go wrong
• Find things that are different
80. The simian army
• Chaos -- Kills random instances
• Latency -- Slows the network down
• Conformity -- Looks for outliers
• Doctor -- Looks for failing health checks
• Janitor -- Cleans up unused resources
• Howler -- Yells about bad things
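The real Chaos Monkey is a Netflix OSS service; a toy Python equivalent gives the flavor, assuming instances carry a hypothetical "cluster" tag:

import random

import boto3  # assumes AWS credentials are configured

ec2 = boto3.client("ec2")

def chaos_strike(cluster_tag):
    """Terminate one random running instance in a cluster. Surviving
    this routinely is the point of building for three."""
    reservations = ec2.describe_instances(
        Filters=[{"Name": "tag:cluster", "Values": [cluster_tag]},
                 {"Name": "instance-state-name", "Values": ["running"]}],
    )["Reservations"]
    instances = [i["InstanceId"]
                 for r in reservations for i in r["Instances"]]
    if instances:
        victim = random.choice(instances)
        ec2.terminate_instances(InstanceIds=[victim])
        return victim
    return None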
My name is Jeremy. Thanks for coming to see me this evening.

We live in an unreliable world. Things never go as we expect.

As the great Murphy said, if it can go wrong, it will.

The world of cloud computing was supposed to be our shining light and solve all of our problems.

But it isn't all fluffy and white. Sometimes the clouds break too.

Which is why I'm here today to tell you about how we maintain reliability in this unreliable world. So who am I?

I work for Netflix as the lead site reliability engineer. Public studies say that we are responsible for as much as 30% of all internet traffic on a Saturday night. We run almost the entire streaming service from the cloud, save for a few legacy systems that we haven't moved from the DC yet.

I used to work for reddit. reddit is a community where people come together to share and discuss interesting things on the internet, such as links to other stuff, or to create their own content. It does more than 2 billion page views a month, all from EC2.
Uptime and money go hand in hand. With infinite money, you can probably get perfect uptime. But if you don't have infinite money, you have to find the right uptime for your budget.

Luckily, the cloud makes it pretty easy to find the right balance, because you can leverage the providers' economies of scale and the ease with which you can start and stop instances. Here are reddit's costs and page views from when I was there, which is the last data I have.

At one end of the spectrum you have lots of control and lots of complexity; with App Engine and Heroku you have little; and Amazon (and Rackspace, etc.) is in the middle.

This was reddit's server farm in 2008. Mostly I just like to brag about how clean it was.

With autoscaling, we pay only for resources that we actually need.

At Netflix we use autoscaling to help manage reliability and cost. Here is one of our clusters scaling up and down. We are tuning for the holidays, so you can see parts where we are doing squeeze tests and adjusting the scaling speed and values.
We need to make sure that we are doing everything we can to ensure our survival.

The key to surviving any outage is redundancy in your systems, be it the cloud or a datacenter. My teammate points out that this is recursion, not redundancy.

Going from two (keypress) to three is hard.

Going from one (keypress) to two is harder. What do I mean by that? Anywhere you will need more than one of something (application process, database, cache, queue, whatever), it will be harder to go from one to two than from two to three, and so on. This is especially relevant in a cloud setting, since getting more resources is so easy.
(wait for animation) If possible, plan for three or more from the beginning. Sometimes your development cycle doesn't allow it, but at least keep it in mind.
By building for three, you can reasonably lose one of your instances and still be stable.

And now some database scaling.

We use four master databases. They are split up as Links/Accounts/Subreddits on the main db, then separate dbs for the comments, votes, and everything else. Each has at least one slave; the comments have four slaves. When they get busy, we add a slave. Thanks to EC2, we don't have to requisition servers! Keep writes fast by not reading from the master when safe. And lastly, our data access layer, the thing layer. Creative, isn't it? We wrote it because there was no good database ORM at the time.

Dynamo is the name of a database at Amazon that divides data in a way that gives high fault tolerance and durability.

Replication factor; quorum reads/writes.
Did I mention caching is good?

Another kind of caching is CDN caching. Since a logged-out user isn't voting or commenting, the pages look the same for all of them, so we render and cache the full page every 30 seconds. Then Akamai grabs it and caches it for 30 seconds as well. Akamai also accelerates the connection for our logged-in users.

Use consistent key hashing, so that only one chunk of data changes places when a server is added or removed.
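A minimal consistent-hash ring sketch showing that property; the server names are illustrative:

import bisect
import hashlib

def _hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent hashing: each server owns arcs of a ring, so adding or
    removing one server only relocates the keys on its arcs."""

    def __init__(self, servers, replicas=100):
        # Multiple virtual points per server smooth out the distribution.
        self.ring = sorted((_hash(f"{s}#{i}"), s)
                           for s in servers for i in range(replicas))
        self.points = [p for p, _ in self.ring]

    def server_for(self, key):
        # The first ring point clockwise from the key's hash owns the key.
        idx = bisect.bisect(self.points, _hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["mc1:11211", "mc2:11211", "mc3:11211"])
print(ring.server_for("user:1234"))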
Here's a memcache tip (or Cassandra, or whatever eventually consistent system you want) for locking. To get reasonable locking: get the data, set the data, wait for whatever the 95th-percentile latency is, then do another get. If it is still your lock, then you can reasonably say you have the lock.
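A sketch of that recipe with the python-memcached client. Note that memcached's add and CAS operations do this more directly; this follows the get/set/wait/get version above, and it is probabilistic, not a true mutex:

import time
import uuid

import memcache  # python-memcached client, assumed installed

mc = memcache.Client(["127.0.0.1:11211"])

def soft_lock(name, p95_latency=0.05, ttl=30):
    """Get / set / wait out the 95th-percentile latency / get again.
    If our token survived the wait, no racing writer overwrote it, so we
    can "reasonably say" we hold the lock."""
    token = uuid.uuid4().hex
    if mc.get(name) is not None:
        return None                    # someone else appears to hold it
    mc.set(name, token, time=ttl)
    time.sleep(p95_latency)            # let any racing set() land
    return token if mc.get(name) == token else None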
(step through)

Render cache (5GB): we store bits of HTML in here, like the HTML for a single link in a listing, with holes for filling in the custom info like the arrow direction and points, as well as fully rendered pages for non-logged-in users.

Data cache (15GB): any time we need data from the database, we check the data cache first, and if we don't find it, we put it there. That means the data for all of the popular items will always be in the cache, which is great for a site like ours, where certain data is usually popular for a day and then falls out of favor for newer data. We also put memoized results in the data cache.

Permacache (10GB): this is where we store the results of long database queries, such as all of the listings, profile pages, and such.

At Netflix we have far more cache than that, and we use it heavily.
But usually they don't, which makes hiding the failures easier.

Every vote generates a queue item for later processing. We just update the render cache with the vote until it is processed, and it also puts a job in the precompute queue to recalculate that listing. That is why it sometimes seems like the vote totals aren't accurate: they are consistent in the database; it is just a rendering issue. Every time a comment is written, it generates a job to have the comment tree for that page recalculated, which is then stored in the cache. Every time a link is submitted, it generates a job to go out and get a thumbnail, a job to recalculate the listing, and a job for the spam filter. When a moderator bans or unbans a link, it generates a job to train the spam filter on that data. Queues are extra important in a virtualized environment, because if you lose an instance you don't have to worry about the lost work, as long as your queue is redundant.
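The vote path might look roughly like this sketch; queue.publish, the queue names, and the cache key are hypothetical stand-ins for the real clients:

import json

def cast_vote(queue, cache, link_id, user_id, direction):
    """Acknowledge the vote instantly in the render cache, then queue the
    real work. If an app server dies, nothing is lost as long as the
    queue itself is redundant."""
    # Patch the cached rendering so the user sees their vote right away.
    cache.set(f"votehint:{link_id}:{user_id}", direction, time=3600)
    # Durable jobs for the actual write and the listing recompute.
    queue.publish("votes", json.dumps(
        {"link": link_id, "user": user_id, "dir": direction}))
    queue.publish("precompute", json.dumps(
        {"recalculate_listing": link_id}))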
Amazon will help you as well. One way they do this is by providing zones. Each zone is like an island that is loosely connected to the other zones, but mostly distinct.

So how do you get better than 99.95% uptime? Multiple zones! By spreading your systems out across multiple zones, you should be able to withstand the failure of one zone. In a little bit, I'll go over how reddit and Netflix used a multizone strategy to survive outages.

Amazon, as well as other providers, offers multiple regions as well. Regions are essentially like separate providers with the same feature set. Your data does not get shared across regions.
Multiple zones: some db slaves are in different zones from the master for redundancy. Monolithic, highly cached.

Service-oriented architecture is basically just a bunch of small reddits all talking to each other. It is easier for larger groups of devs, more scalable (just scale out the services that are overloaded), allows more focused optimizations (just the ones that are biggest), and is easier to scale down.
The gateway classifies and routes events based on severity and the systems involved. The gateway currently processes around 48K events a day.
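A toy sketch of severity-based routing in that spirit; the severities and targets are made up, not CORE's actual configuration:

# Made-up severities and targets, purely to show the shape of the idea.
ROUTES = {
    "critical": ["paging", "email"],   # wake someone up AND leave a trail
    "warning":  ["email"],             # e.g. via Amazon SES
    "info":     ["dashboard"],
}

def route_event(event, transports):
    """Dispatch an event to every transport its severity calls for."""
    for target in ROUTES.get(event.get("severity", "info"), ["dashboard"]):
        transports[target](event)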
Automate as much as you can.

The more automated things are, the easier it is to be a sysadmin: application startup, configuration, code deployment, full system deployment. The more automated things are, the easier it is to scale, especially in a virtualized environment with autoscaling. And virtualized computing added the last bit: the ability to automate system deployment. (OK, that's not entirely true, but watch me wave my hands and say it is.)

In most places, you have this: a standard image with tools to manage the systems and the deployment.

It's manifested every day by writing scripts and programs to do your repetitive tasks for you. People basically figured out how to do this with whole computers instead. In case you're wondering, that picture is what came up when I googled for "lazy sysadmin".

In most systems, you worry about the software and installing it on an OS. At Netflix, the smallest thing we worry about is the instance image, which lives in a cluster. We've essentially built a platform for doing automated deployment of Java code (and some Python too!).

My friends Joe and Carl already told you about NAC and our build system. This allows the devs to take control of their deployment. Each team is responsible for their own deployments and uptime. When something breaks, we have a system that lets us page a team, who then gets on and fixes their stuff. Each team is responsible for their own destiny. So how do we stay reliable when we have no control? Information.

This is an example of the task list from NAC. It shows me and everyone else all of the actions people have taken in our Amazon infrastructure. This lets everyone know when it is safe to deploy, what is going on, etc. Right now my team is building additional tools to provide information so other teams can make good decisions.
In mid August, a hurricane slammed the east coast. Amazon warned of a possible zone outage.

I'm here for you to learn. If you have any questions, please jump in. This will be boring for both of us otherwise. And I've got a couple of hours to fill anyway.

You can contact me in one of these ways, or ask your question now. Thank you.
Speaking of BttF, here is a DeLorean towing a DeLorean.