9. A quick rant
• I hate the term “Cloud Computing”
• I use it because everyone else does and no
one would understand me otherwise
• One alternative is “Virtualized Computing”
• I don’t have a better alternative
10. Agenda
• Money
• Architecture
• Process (or lack thereof)
12. Monthly Page Views and Costs for reddit
[Chart: monthly page views rising from roughly 200M to 1,300M, with costs rising in step from roughly $20,000 to $130,000, March through the following March]
13. The control and complexity spectrum
[Diagram: a spectrum running from lots of control and complexity to little]
15. Why did Netflix move
out of the datacenter?
• They didn’t know how fast the streaming
service would grow and needed something
that could grow with them
• They wanted more redundancy than just
the two datacenters they had
• Autoscaling helps a lot too
30. Sharding
• reddit split writes across four master databases
• Links/Accounts/Subreddits, Comments, Votes,
and Misc
• Each has at least one slave in another zone
• Avoid reading from the master if possible
• Wrote their own database access layer, called
the “thing” layer
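A rough sketch of that split, purely illustrative -- the hostnames and routing helper here are hypothetical, not reddit's actual "thing" layer:

import random

# Hypothetical hostnames illustrating reddit's four-master split.
MASTERS = {
    "main":     "db-main",      # Links / Accounts / Subreddits
    "comments": "db-comments",
    "votes":    "db-votes",
    "misc":     "db-misc",
}
SLAVES = {
    "main":     ["db-main-s1"],
    "comments": ["db-comments-s1", "db-comments-s2"],  # busiest shard gets more
    "votes":    ["db-votes-s1"],
    "misc":     ["db-misc-s1"],
}
SHARD_FOR = {"link": "main", "account": "main", "subreddit": "main",
             "comment": "comments", "vote": "votes"}

def host_for(thing_type, write=False):
    """Route writes to the shard's master; spread reads over its slaves."""
    shard = SHARD_FOR.get(thing_type, "misc")
    if write or not SLAVES[shard]:
        return MASTERS[shard]
    return random.choice(SLAVES[shard])   # avoid reading from the master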
34. How it works
• Replication factor
• Quorum reads / writes
• Bloom Filter for fast negative lookups
• Immutable files for fast writes
• Seed nodes
• Multi-region
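Cassandra's real implementation is Java and considerably more involved, but a minimal Bloom filter sketch shows why the negative lookups above are fast: a few in-memory bit tests can prove a key is definitely absent without ever touching disk.

import hashlib

class BloomFilter:
    """Minimal Bloom filter: it can answer "definitely not present" from
    a few in-memory bit tests, which is what makes negative lookups fast."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # Derive several independent positions by salting one hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False is authoritative; True can be a false positive.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

# A miss here means the key is definitely not in the file, so the read
# path can skip that file without any disk I/O.
sstable_keys = BloomFilter()
sstable_keys.add("row-123")
assert sstable_keys.might_contain("row-123")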
35. Cassandra Benefits
• Fast writes
• Fast negative lookups
• Easy incremental scalability
• Distributed -- no single point of failure (SPoF)
36. I love memcache
I make heavy use of memcached
37. Second-class users
• Logged out users always get cached
content.
• Akamai bears the brunt of reddit’s traffic
• Logged out users are about 80% of the
traffic
45. reddit’s Caching
• Render cache (5GB)
• Partially and fully rendered items
• Data cache (15GB)
• Chunks of data from the database
• Permacache (10GB)
• Precomputed queries
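The data cache works cache-aside style: check the cache, fall back to the database on a miss, and repopulate for the next reader. A minimal sketch using the python-memcached client; the server address, key, and TTL are illustrative:

import memcache  # python-memcached client, assumed installed

mc = memcache.Client(["127.0.0.1:11211"])  # illustrative server

def cached_query(key, fetch_from_db, ttl=300):
    """Cache-aside: check the data cache first; on a miss, hit the
    database and populate the cache for the next reader."""
    value = mc.get(key)
    if value is None:
        value = fetch_from_db()        # the slow path
        mc.set(key, value, time=ttl)
    return value

# Usage (db.fetch_link is a hypothetical database call):
# link = cached_query("link:42", lambda: db.fetch_link(42))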
51. Leveraging Multi-region
• 100% uptime is theoretically possible.
• You have to replicate your data
• This will cost money
52. Other options
• Backup datacenter
• Backup provider
55. Agenda
• Money
• Architecture
• Process (or lack thereof)
56. Monitoring
• reddit uses Ganglia
• Backed by RRD
• Netflix uses RRD too
• Makes good rollup graphs
• Gives a great way to visually and
programmatically detect errors
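The programmatic detection can be as simple as comparing the newest sample to recent history. A stdlib-only sketch; the window and threshold are illustrative, not what reddit or Netflix actually used:

from statistics import mean, stdev

def looks_anomalous(history, current, sigmas=3.0):
    """Flag a sample sitting more than `sigmas` standard deviations from
    recent history -- a simple programmatic check over the same time
    series the rollup graphs show."""
    if len(history) < 10:
        return False                   # not enough history to judge
    mu, sd = mean(history), stdev(history)
    return sd > 0 and abs(current - mu) > sigmas * sd

# e.g. looks_anomalous(last_hour_error_rates, latest_error_rate)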
57. Alert Systems
[Diagram: CORE agents and other teams' agents send events through the CORE API to an alerting gateway, which routes them to the paging service and to Amazon SES]
58-71. Anatomy of an outage
Life cycle of the common Streaming Production Problem (Productio Problematis genus)
[Timeline diagram, built up step by step across these slides: something bad happens → customer impact → someone notices (probably CS, hopefully our alerts) → prod alert → determine impact → determine (non-root) cause → figure out fix → deploy fix → recover service → go back to sleep. The intervals along the time axis are labeled TTD(etect), TTC(ausation), TTr(epair), and TTR(ecover); together they span the outage time.]
Slide courtesy of @royrapoport
73. Automate all the things
• Application startup
• Configuration
• Code deployment
• System deployment
74. Automation
• Standard base image
• Tools to manage all the systems
• Automated code deployment
75. The best ops folks of
course already know
this.
76. Netflix has moved the
granularity from the
instance to the cluster
77. The next level
• Everything is “built for three”
• Fully automated build tools to test and
make packages
• Fully automated machine image bakery
• Fully automated image deployment
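In AWS terms, "built for three" falls out of an autoscaling group spread across three zones. A hedged sketch using today's boto3 purely for illustration; every name here is hypothetical:

import boto3  # assumes AWS credentials are configured in the environment

autoscaling = boto3.client("autoscaling")

# All names are hypothetical; the point is the shape: a minimum of
# three instances spread over three zones, so losing any one instance
# (or zone) still leaves a working pair.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="myservice-asg",
    LaunchConfigurationName="myservice-launch-config",
    MinSize=3,
    MaxSize=12,
    DesiredCapacity=3,
    AvailabilityZones=["us-east-1a", "us-east-1b", "us-east-1c"],
)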
79. The Monkey Theory
• Simulate things that go wrong
• Find things that are different
80. The simian army
• Chaos -- Kills random instances
• Latency -- Slows the network down
• Conformity -- Looks for outliers
• Doctor -- Looks for failing health checks
• Janitor -- Cleans up unused resources
• Howler -- Yells about bad things
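The real Chaos Monkey is a Netflix OSS service; a toy Python equivalent gives the flavor, assuming instances carry a hypothetical "cluster" tag:

import random

import boto3  # assumes AWS credentials are configured

ec2 = boto3.client("ec2")

def chaos_strike(cluster_tag):
    """Terminate one random running instance in a cluster. Surviving
    this routinely is the point of building for three."""
    reservations = ec2.describe_instances(
        Filters=[{"Name": "tag:cluster", "Values": [cluster_tag]},
                 {"Name": "instance-state-name", "Values": ["running"]}],
    )["Reservations"]
    instances = [i["InstanceId"]
                 for r in reservations for i in r["Instances"]]
    if instances:
        victim = random.choice(instances)
        ec2.terminate_instances(InstanceIds=[victim])
        return victim
    return None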
My name is Jeremy. Thanks for coming to see me this evening.

We live in an unreliable world. Things never go as we expect.

As the great Murphy said, if it can go wrong, it will.

The world of cloud computing was supposed to be our shining light and solve all of our problems.

But it isn't all fluffy and white. Sometimes the clouds break too.

Which is why I'm here today to tell you about how we maintain reliability in this unreliable world. So who am I?

I work for Netflix as the lead site reliability engineer. Public studies say that we are responsible for as much as 30% of all internet traffic on a Saturday night. We run almost the entire streaming service from the cloud, save for a few legacy systems that we haven't moved from the DC yet.

I used to work for reddit. reddit is a community where people come together to share and discuss interesting things on the internet, such as links to other stuff, or to create their own content. It does more than 2 billion page views a month, all from EC2.
Uptime and money go hand in hand. With infinite money, you can probably get perfect uptime. But if you don't have infinite money, you have to find the right uptime for your budget.

Luckily, the cloud makes it pretty easy to find the right balance, because you can leverage the providers' economies of scale and the ease with which you can start and stop instances. Here are reddit's costs and page views from when I was there, which is the last data I have.

At one end of the spectrum you have lots of control and lots of complexity; with App Engine and Heroku you have little; and Amazon (and Rackspace, etc.) is in the middle.

This was reddit's server farm in 2008. Mostly I just like to brag about how clean it was.

With autoscaling, we pay only for resources that we actually need.

At Netflix we use autoscaling to help manage reliability and cost. Here is one of our clusters scaling up and down. We are tuning for the holidays, so you can see parts where we are doing squeeze tests and adjusting the scaling speed and values.
We need to make sure that we are doing everything we can to ensure our survival.

The key to surviving any outage is redundancy in your systems, be it the cloud or a datacenter. My teammate points out that this is recursion, not redundancy.

Going from two (keypress) to three is hard.

Going from one (keypress) to two is harder. What do I mean by that? Anywhere you will need more than one of something (application process, database, cache, queue, whatever), it will be harder to go from one to two than from two to three, and so on. This is especially relevant in a cloud setting, since getting more resources is so easy.
(wait for animation) If possible, plan for three or more from the beginning. Sometimes your development cycle doesn't allow it, but at least keep it in mind.
By building for three, you can reasonably lose one of your instances and still be stable.

And now some database scaling.

We use four master databases. They are split up as Links/Accounts/Subreddits on the main db, then separate dbs for the comments, votes, and everything else. Each has at least one slave; the comments have four slaves. When they get busy, we add a slave. Thanks to EC2, we don't have to requisition servers! Keep writes fast by not reading from the master when safe. And lastly, our data access layer, the thing layer. Creative, isn't it? We wrote it because there was no good database ORM at the time.

Dynamo is the name of a database at Amazon that divides data in a way that gives high fault tolerance and durability.

Replication factor; quorum reads/writes.
Did I mention caching is good?

Another kind of caching is CDN caching. Since a logged-out user isn't voting or commenting, the pages look the same for all of them, so we render and cache the full page every 30 seconds. Then Akamai grabs it and caches it for 30 seconds as well. Akamai also accelerates the connection for our logged-in users.

Use consistent key hashing, so that only one chunk of data changes places when a server is added or removed.
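A minimal consistent-hash ring sketch showing that property; the server names are illustrative:

import bisect
import hashlib

def _hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent hashing: each server owns arcs of a ring, so adding or
    removing one server only relocates the keys on its arcs."""

    def __init__(self, servers, replicas=100):
        # Multiple virtual points per server smooth out the distribution.
        self.ring = sorted((_hash(f"{s}#{i}"), s)
                           for s in servers for i in range(replicas))
        self.points = [p for p, _ in self.ring]

    def server_for(self, key):
        # The first ring point clockwise from the key's hash owns the key.
        idx = bisect.bisect(self.points, _hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["mc1:11211", "mc2:11211", "mc3:11211"])
print(ring.server_for("user:1234"))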
Here's a memcache tip (or Cassandra, or whatever eventually consistent system you want) for locking. To get reasonable locking: get the data, set the data, wait for whatever the 95th-percentile latency is, then do another get. If it is still your lock, then you can reasonably say you have the lock.
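A sketch of that recipe with the python-memcached client. Note that memcached's add and CAS operations do this more directly; this follows the get/set/wait/get version above, and it is probabilistic, not a true mutex:

import time
import uuid

import memcache  # python-memcached client, assumed installed

mc = memcache.Client(["127.0.0.1:11211"])

def soft_lock(name, p95_latency=0.05, ttl=30):
    """Get / set / wait out the 95th-percentile latency / get again.
    If our token survived the wait, no racing writer overwrote it, so we
    can "reasonably say" we hold the lock."""
    token = uuid.uuid4().hex
    if mc.get(name) is not None:
        return None                    # someone else appears to hold it
    mc.set(name, token, time=ttl)
    time.sleep(p95_latency)            # let any racing set() land
    return token if mc.get(name) == token else None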
(step through)

Render cache (5GB): we store bits of HTML in here, like the HTML for a single link in a listing, with holes for filling in the custom info like the arrow direction and points, as well as fully rendered pages for non-logged-in users.

Data cache (15GB): any time we need data from the database, we check the data cache first, and if we don't find it, we put it there. That means the data for all of the popular items will always be in the cache, which is great for a site like ours, where certain data is usually popular for a day and then falls out of favor for newer data. We also put memoized results in the data cache.

Permacache (10GB): this is where we store the results of long database queries, such as all of the listings, profile pages, and such.

At Netflix we have far more cache than that, and we use it heavily.
But usually they don't, which makes hiding the failures easier.

Every vote generates a queue item for later processing. We just update the render cache with the vote until it is processed, and it also puts a job in the precompute queue to recalculate that listing. That is why it sometimes seems like the vote totals aren't accurate: they are consistent in the database; it is just a rendering issue. Every time a comment is written, it generates a job to have the comment tree for that page recalculated, which is then stored in the cache. Every time a link is submitted, it generates a job to go out and get a thumbnail, a job to recalculate the listing, and a job for the spam filter. When a moderator bans or unbans a link, it generates a job to train the spam filter on that data. Queues are extra important in a virtualized environment, because if you lose an instance you don't have to worry about the lost work, as long as your queue is redundant.
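The vote path might look roughly like this sketch; queue.publish, the queue names, and the cache key are hypothetical stand-ins for the real clients:

import json

def cast_vote(queue, cache, link_id, user_id, direction):
    """Acknowledge the vote instantly in the render cache, then queue the
    real work. If an app server dies, nothing is lost as long as the
    queue itself is redundant."""
    # Patch the cached rendering so the user sees their vote right away.
    cache.set(f"votehint:{link_id}:{user_id}", direction, time=3600)
    # Durable jobs for the actual write and the listing recompute.
    queue.publish("votes", json.dumps(
        {"link": link_id, "user": user_id, "dir": direction}))
    queue.publish("precompute", json.dumps(
        {"recalculate_listing": link_id}))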
Amazon will help you as well. One way they do this is by providing zones. Each zone is like an island that is loosely connected to the other zones, but mostly distinct.

So how do you get better than 99.95% uptime? Multiple zones! By spreading your systems out across multiple zones, you should be able to withstand the failure of one zone. In a little bit, I'll go over how reddit and Netflix used a multizone strategy to survive outages.

Amazon, as well as other providers, offers multiple regions as well. Regions are essentially like separate providers with the same feature set. Your data does not get shared across regions.
Multiple zones: some db slaves are in different zones from the master for redundancy. Monolithic, highly cached.

Service-oriented architecture is basically just a bunch of small reddits all talking to each other. It is easier for larger groups of devs, more scalable (just scale out the services that are overloaded), allows more focused optimizations (just the ones that are biggest), and is easier to scale down.
The gateway classifies and routes events based on severity and the systems involved. The gateway currently processes around 48K events a day.
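A toy sketch of severity-based routing in that spirit; the severities and targets are made up, not CORE's actual configuration:

# Made-up severities and targets, purely to show the shape of the idea.
ROUTES = {
    "critical": ["paging", "email"],   # wake someone up AND leave a trail
    "warning":  ["email"],             # e.g. via Amazon SES
    "info":     ["dashboard"],
}

def route_event(event, transports):
    """Dispatch an event to every transport its severity calls for."""
    for target in ROUTES.get(event.get("severity", "info"), ["dashboard"]):
        transports[target](event)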
Automate as much as you can.

The more automated things are, the easier it is to be a sysadmin: application startup, configuration, code deployment, full system deployment. The more automated things are, the easier it is to scale, especially in a virtualized environment with autoscaling. And virtualized computing added the last bit: the ability to automate system deployment. (OK, that's not entirely true, but watch me wave my hands and say it is.)

In most places, you have this: a standard image with tools to manage the systems and the deployment.

It's manifested every day by writing scripts and programs to do your repetitive tasks for you. People basically figured out how to do this with whole computers instead. In case you're wondering, that picture is what came up when I googled for "lazy sysadmin".

In most systems, you worry about the software and installing it on an OS. At Netflix, the smallest thing we worry about is the instance image, which lives in a cluster. We've essentially built a platform for doing automated deployment of Java code (and some Python too!).

My friends Joe and Carl already told you about NAC and our build system. This allows the devs to take control of their deployment. Each team is responsible for their own deployments and uptime. When something breaks, we have a system that lets us page a team, who then gets on and fixes their stuff. Each team is responsible for their own destiny. So how do we stay reliable when we have no control? Information.

This is an example of the task list from NAC. It shows me and everyone else all of the actions people have taken in our Amazon infrastructure. This lets everyone know when it is safe to deploy, what is going on, etc. Right now my team is building additional tools to provide information so other teams can make good decisions.
In mid August, a hurricane slammed the east coast. Amazon warned of a possible zone outage.

I'm here for you to learn. If you have any questions, please jump in. This will be boring for both of us otherwise. And I've got a couple of hours to fill anyway.

You can contact me in one of these ways, or ask your question now. Thank you.
Speaking of BttF, here is a DeLorean towing a DeLorean.